NVIDIA GPU Host Setup
Use this setup skill before TAO workflows run on the docker, local-docker,
or kubernetes backend. It standardizes the host GPU runtime on:
- NVIDIA driver branch 580 (open kernel module preferred)
- CUDA Toolkit package cuda-toolkit-13-0
- NVIDIA Container Toolkit 1.19.0
- Docker engine: only installed for docker/local-docker backends and only when Docker is missing. The package picked depends on the distro family (docker.io on Debian-family by default, moby-engine / docker-ce from download.docker.com on RHEL-family, docker on SUSE-family). Pass --skip-docker-install to opt out.
The check is safe and read-only by default — it works on any Linux
distribution because it only probes nvidia-smi, the CUDA toolkit path,
the installed container-toolkit package version (via dpkg/rpm/the
nvidia-ctk binary version), and the Docker daemon's NVIDIA runtime.
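
As a rough illustration of the read-only probing described above, the sketch below looks up the Container Toolkit version through dpkg, then rpm, then the nvidia-ctk binary. The function name toolkit_version is hypothetical and the real script may probe differently; the sketch only queries package databases and binaries and changes nothing on the host.

```shell
#!/usr/bin/env bash
# Hypothetical read-only probe: print the installed NVIDIA Container Toolkit
# version, or "unknown" when no source can report it.
toolkit_version() {
    local v=""
    if command -v dpkg-query >/dev/null 2>&1; then
        # Prints nothing on stdout when the package is not installed.
        v="$(dpkg-query -W -f='${Version}' nvidia-container-toolkit 2>/dev/null || true)"
    fi
    if [ -z "$v" ] && command -v rpm >/dev/null 2>&1; then
        # rpm -q exits non-zero (and prints a message) when the package is absent.
        v="$(rpm -q --qf '%{VERSION}' nvidia-container-toolkit 2>/dev/null)" || v=""
    fi
    if [ -z "$v" ] && command -v nvidia-ctk >/dev/null 2>&1; then
        # Keep only the first dotted version number from the version banner.
        v="$(nvidia-ctk --version 2>/dev/null | grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -n 1)"
    fi
    printf '%s\n' "${v:-unknown}"
}

toolkit_version
```

On a host without the toolkit this prints unknown rather than failing, which keeps the probe safe to run anywhere.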
Installation must be explicitly authorized by the user: once approved, rerun
the script with --install. The install path is automated for these distro families:
| Family | Tested distros | Manager | Notes |
|---|---|---|---|
| debian | Ubuntu 22.04 / 24.04, Debian 12 (and derivatives Pop!_OS, Mint, Zorin, Raspbian, KDE Neon, etc. via UBUNTU_CODENAME / VERSION_CODENAME) | apt-get | Adds NVIDIA cuda-keyring + Container Toolkit .list. Docker via docker.io (override $DOCKER_PACKAGE_DEBIAN). |
| rhel | Fedora 39+, RHEL / Rocky / AlmaLinux 9 and 10 | dnf (or yum) | Adds NVIDIA cuda-<distro>.repo + Container Toolkit .repo. Docker via Fedora moby-engine when available, otherwise docker-ce from download.docker.com. |
| suse | openSUSE Leap 15, SLES 15 | zypper | Adds the same NVIDIA .repo files. Docker via the distribution docker package. |
| other (Arch, Alpine, Gentoo, NixOS, FreeBSD, …) | n/a | n/a | --install exits with a clear error listing the version targets and the NVIDIA install-guide URLs. Install manually, then rerun --check-only. |
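
The family dispatch in the table above could be approximated from the ID and ID_LIKE fields of /etc/os-release. This is a minimal sketch under that assumption: detect_family is a hypothetical name, and the list of IDs is illustrative rather than the script's actual matching logic.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map os-release ID / ID_LIKE values to a family name.
detect_family() {
    local id="$1" id_like="${2:-}"
    local token
    # Intentionally unquoted: ID_LIKE is a space-separated list.
    for token in $id $id_like; do
        case "$token" in
            debian|ubuntu)                      echo debian; return 0 ;;
            rhel|fedora|centos|rocky|almalinux) echo rhel;   return 0 ;;
            suse|opensuse*|sles)                echo suse;   return 0 ;;
        esac
    done
    echo other
}

# Example: read the values from /etc/os-release when the file exists.
if [ -r /etc/os-release ]; then
    # Subshell so the sourced variables do not leak into the caller.
    ( . /etc/os-release; detect_family "${ID:-}" "${ID_LIKE:-}" )
else
    detect_family "" ""
fi
```

Checking ID first and ID_LIKE second lets derivatives such as Pop!_OS (ID=pop, ID_LIKE="ubuntu debian") resolve to the debian family without being listed by name.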
Quick Start
From the skill bank root:
# Check the local Docker backend host.
bash skills/platform/tao-setup-nvidia-gpu-host/scripts/setup-nvidia-gpu-host.sh --backend docker --check-only
# Install or repair after user approval (prompts for confirmation; see the note below for non-interactive runs).
bash skills/platform/tao-setup-nvidia-gpu-host/scripts/setup-nvidia-gpu-host.sh --backend docker --install
# Check a Kubernetes GPU worker host.
bash skills/platform/tao-setup-nvidia-gpu-host/scripts/setup-nvidia-gpu-host.sh --backend kubernetes --check-only
⚠️ Note on running non-interactively (agent / skill runs): a skill run has no terminal, so the installer's Continue? [y/N] confirmation cannot be answered. After running --check-only to preview what is missing and getting the user's explicit approval, append the assume-yes flag (--yes) to the --install command so it proceeds without a prompt. That auto-confirms installation of system packages (NVIDIA driver branch 580, CUDA Toolkit 13.0, NVIDIA Container Toolkit, and, for Docker backends, Docker) and modifies the host: it adds NVIDIA package repositories, may restart Docker, and adds the invoking user to the docker group. Only do this on a host you control and have the privileges to change. When a person runs --install directly at a terminal, the script instead prompts with the exact package list before making any changes.
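
One way an agent wrapper might apply this rule is sketched below. The helper install_command and its approved argument are hypothetical and not part of the setup script; only the --backend, --install, and --yes flags come from this document. The helper refuses to build a command without recorded approval and appends --yes only when stdin is not a terminal.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper helper: print the install command for a given script
# path and backend. $3 must be "yes" to signal that the user approved.
install_command() {
    local script="$1" backend="$2" approved="$3"
    local cmd="bash $script --backend $backend --install"
    if [ "$approved" != "yes" ]; then
        echo "refusing: user approval not recorded" >&2
        return 1
    fi
    # No terminal on stdin means the Continue? [y/N] prompt cannot be answered.
    if [ ! -t 0 ]; then
        cmd="$cmd --yes"
    fi
    printf '%s\n' "$cmd"
}

# Usage (stdin redirected to simulate an agent run):
install_command ./setup-nvidia-gpu-host.sh docker yes </dev/null
```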
In an installed plugin copy that exposes skills/, use:
bash skills/tao-setup-nvidia-gpu-host/scripts/setup-nvidia-gpu-host.sh --backend docker --check-only
Workflow Contract
Docker and Kubernetes workflows must run the check before submitting GPU work:
SETUP_SCRIPT="${TAO_SKILL_BANK_ROOT:-$PWD}/skills/tao-setup-nvidia-gpu-host/scripts/setup-nvidia-gpu-host.sh"
[ -x "$SETUP_SCRIPT" ] || SETUP_SCRIPT="${TAO_SKILL_BANK_ROOT:-$PWD}/platform/tao-setup-nvidia-gpu-host/scripts/setup-nvidia-gpu-host.sh"
bash "$SETUP_SCRIPT" --backend docker --check-only || {
    echo "MISSING: TAO GPU host runtime is not ready."
    echo "After user approval, run (append --yes for non-interactive agent runs):"
    echo " bash \"$SETUP_SCRIPT\" --backend docker --install"
    exit 1
}
Never install silently. If the check fails, explain what is missing, ask the user to authorize the fix, then run the install command and rerun the check.
What The Installer Does
The installer dispatches on the detected distribution family. On every
supported family it adds NVIDIA's CUDA and Container Toolkit repositories
(if missing), installs the pinned runtime packages, optionally installs
Docker, wires the NVIDIA Docker runtime, and adds the invoking user to
the docker group.
Common steps (all families):
- Adds NVIDIA's CUDA repository if missing (apt cuda-keyring deb, cuda-<distro>.repo for dnf/zypper).
- Adds NVIDIA's Container Toolkit repository if missing (.list for apt, .repo for dnf/zypper).
- Installs the matching kernel header / devel package for the running kernel.
- Installs the driver branch 580 packages, cuda-toolkit-13-0, and the Container Toolkit pinned to 1.19.0 (the dpkg-suffixed 1.19.0-1 is the same upstream version expressed for apt).
- For Docker backends and when Docker is missing, installs Docker (override / opt-out flags below), enables/starts the daemon, then runs nvidia-ctk runtime configure --runtime=docker and restarts Docker when systemctl is available.
- Adds the invoking user ($SUDO_USER if available, else $USER) to the docker group so subsequent shells can run docker without sudo; opt out with --skip-docker-group. The new group membership does not take effect in the current shell: log out and back in, or run newgrp docker in each new shell.
- Attempts modprobe nvidia so verification can pass before reboot.
Family-specific package selections:
| Step | debian-family | rhel-family | suse-family |
|---|---|---|---|
| Kernel headers | linux-headers-$(uname -r) | kernel-devel-$(uname -r), kernel-headers-$(uname -r) | kernel-default-devel |
| Driver | nvidia-driver-pinning-580, nvidia-open-580 (override: $NVIDIA_DRIVER_PACKAGE_DEBIAN) | nvidia-driver-cuda, kmod-nvidia-open-dkms (override: $NVIDIA_DRIVER_PACKAGE_RHEL, $NVIDIA_DRIVER_KMOD_RHEL) | nvidia-open-driver-G06-signed-kmp-default (override: $NVIDIA_DRIVER_PACKAGE_SUSE) |
| CUDA toolkit | cuda-toolkit-13-0 | cuda-toolkit-13-0 | cuda-toolkit-13-0 |
| Container Toolkit | nvidia-container-toolkit=1.19.0-1 + base/tools/libs | nvidia-container-toolkit-1.19.0 + base/tools/libs | same as rhel |
| Docker | docker.io (override: $DOCKER_PACKAGE_DEBIAN) | moby-engine+moby-cli on Fedora when available, else docker-ce docker-ce-cli containerd.io from download.docker.com | docker |
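
To make the per-family differences in the table concrete, the sketch below builds the kernel-header and Container Toolkit package names for a given family. The function names are hypothetical; the package strings are copied from the table, and the real script may assemble them differently.

```shell
#!/usr/bin/env bash
# Hypothetical helper: Container Toolkit package spec for a family.
# apt pins with "=<version>-1", dnf/zypper pin with "-<version>".
toolkit_package_spec() {
    local family="$1" version="${2:-1.19.0}"
    case "$family" in
        debian)    printf 'nvidia-container-toolkit=%s-1\n' "$version" ;;
        rhel|suse) printf 'nvidia-container-toolkit-%s\n' "$version" ;;
        *)         echo "unsupported family: $family" >&2; return 1 ;;
    esac
}

# Hypothetical helper: kernel header / devel packages for a family.
# The kernel release defaults to the running kernel.
kernel_header_packages() {
    local family="$1" kver="${2:-$(uname -r)}"
    case "$family" in
        debian) echo "linux-headers-$kver" ;;
        rhel)   echo "kernel-devel-$kver kernel-headers-$kver" ;;
        suse)   echo "kernel-default-devel" ;;
        *)      echo "unsupported family: $family" >&2; return 1 ;;
    esac
}

toolkit_package_spec debian
kernel_header_packages rhel
```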
Verification
After installation, verify:
nvidia-smi
/usr/local/cuda-13.0/bin/nvcc --version
docker info --format '{{json .Runtimes}}' | grep nvidia
sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
Expected nvidia-smi output includes driver 580.x and CUDA Version 13.0.
Expected nvcc output includes release 13.0.
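
These expectations can also be checked mechanically. The sketch below is one possible approach and not part of the setup script: driver_on_branch and nvcc_release are hypothetical helpers, and the version strings in the comments are illustrative examples of the output format only.

```shell
#!/usr/bin/env bash
# Succeeds when a driver version such as "580.65.06" is on the given branch,
# i.e. the part before the first dot equals the branch number.
driver_on_branch() {
    local version="$1" branch="$2"
    [ "${version%%.*}" = "$branch" ]
}

# Reads nvcc --version output on stdin and prints the release, for example
# "13.0" from "Cuda compilation tools, release 13.0, V13.0.48".
nvcc_release() {
    sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' | head -n 1
}

# Only runs when nvidia-smi is present; otherwise this block is a no-op.
if command -v nvidia-smi >/dev/null 2>&1; then
    drv="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader 2>/dev/null | head -n 1)"
    if driver_on_branch "$drv" 580; then
        echo "driver OK: $drv"
    else
        echo "driver mismatch: ${drv:-none}"
    fi
fi
```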
Kubernetes Notes
For self-managed Kubernetes clusters, run the host installer on every GPU worker node or bake the same package set into the node image before installing the NVIDIA GPU Operator or device plugin.
The workflow check also warns if kubectl is available but the cluster reports
no nvidia.com/gpu allocatable capacity. In that case, install/configure the
NVIDIA GPU Operator after the worker host runtime is ready:
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install --wait gpu-operator -n gpu-operator --create-namespace nvidia/gpu-operator
Managed Kubernetes providers may own driver installation through node images or GPU Operator policy. Do not overwrite a provider-managed GPU node without user approval and a rollback plan.
Failure Modes
Unsupported distribution family: --install automates debian-, rhel-,
and suse-family hosts. On Arch, Alpine, Gentoo, NixOS, FreeBSD, or anything
without /etc/os-release (e.g. macOS), the script exits with a clear error
that lists the four version targets and the upstream NVIDIA install-guide
URLs:
- https://docs.nvidia.com/cuda/cuda-installation-guide-linux/
- https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
- https://docs.docker.com/engine/install/
Install those four pieces using your distribution's package manager and
rerun the script with --check-only to verify. The check is universally
portable — it only queries the binaries / package databases — so once the
runtime is in place the workflow contract is satisfied regardless of the
underlying distro.
Unsupported Ubuntu/Debian derivative: When ID is e.g. pop, mint,
zorin, raspbian, or another debian-family derivative, the script maps
the host onto the upstream Ubuntu/Debian CUDA repo via UBUNTU_CODENAME /
VERSION_CODENAME (focal/jammy/noble → Ubuntu 20.04/22.04/24.04;
bullseye/bookworm/trixie → Debian 11/12/12). If the host's codename
doesn't match a known upstream release, --install exits with the same
manual-install guidance described above.
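
The codename mapping described above might look like the following sketch. upstream_release is a hypothetical name, the pairs are taken from the list in the previous paragraph exactly as written, and the output format is illustrative rather than what the script emits.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a release codename to the upstream release whose
# CUDA repository is used. Returns non-zero for unknown codenames.
upstream_release() {
    case "$1" in
        focal)    echo "ubuntu 20.04" ;;
        jammy)    echo "ubuntu 22.04" ;;
        noble)    echo "ubuntu 24.04" ;;
        bullseye) echo "debian 11" ;;
        bookworm) echo "debian 12" ;;
        trixie)   echo "debian 12" ;;  # mapped to 12, as listed in the text above
        *)        return 1 ;;
    esac
}

# Example: derivatives expose the upstream codename via UBUNTU_CODENAME,
# plain Ubuntu/Debian via VERSION_CODENAME.
if [ -r /etc/os-release ]; then
    codename="$( . /etc/os-release; printf '%s' "${UBUNTU_CODENAME:-${VERSION_CODENAME:-}}" )"
    upstream_release "$codename" || echo "no known upstream release for codename '${codename}'"
fi
```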
Docker not installed: --check-only reports "MISSING: Docker is not installed" and prints the exact rerun command appropriate to the detected
distro family. The default --install path installs Docker (docker.io /
moby-engine / docker-ce / docker depending on family), enables/starts
the daemon, configures the NVIDIA runtime, and adds the invoking user to
the docker group. If you prefer to manage Docker yourself, install it
before rerunning the script or pass --skip-docker-install.
Docker installed but docker run still needs sudo: The script adds the
invoking user to the docker group, but Linux only refreshes group
membership on a new login session. Log out and back in, or run
newgrp docker in each new shell, until the new membership is active.
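
A quick way to tell "not in the group" apart from "in the group but session not refreshed" is to compare the groups of the current session with those recorded for the account. The sketch below assumes the $SUDO_USER / $USER selection described earlier; target_user and has_group are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Prefer the user who invoked sudo, then $USER, then the effective user.
target_user() {
    printf '%s\n' "${SUDO_USER:-${USER:-$(id -un)}}"
}

# Succeeds when the space-separated group list in $1 contains group $2.
has_group() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

user="$(target_user)"
# "id -nG" without a name reports the groups of the current session;
# "id -nG <user>" reports what the account database records.
if has_group "$(id -nG 2>/dev/null)" docker; then
    echo "docker group is active in this session"
elif has_group "$(id -nG "$user" 2>/dev/null)" docker; then
    echo "docker group granted but not active yet: log out and back in, or run newgrp docker"
else
    echo "user $user is not in the docker group"
fi
```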
Docker runtime still missing: Rerun
nvidia-ctk runtime configure --runtime=docker, then restart Docker so the daemon loads the updated runtime configuration.
Driver branch detected != 580: The driver-branch pin is exact on
debian-family (nvidia-open-580). On rhel-/suse-family the script
installs the latest open driver shipped in NVIDIA's CUDA 13.0 repo for
the detected distro, which is always ≥ 580. If your host needs a stricter
pin, set $NVIDIA_DRIVER_PACKAGE_RHEL / $NVIDIA_DRIVER_KMOD_RHEL /
$NVIDIA_DRIVER_PACKAGE_SUSE to the exact package names you want before
running --install.
Driver installed but nvidia-smi fails: Load the module with
sudo modprobe nvidia or reboot. Secure Boot may require MOK enrollment on
systems where it is enabled.
Kubernetes still has no GPU capacity: Confirm the driver works on each GPU
node with nvidia-smi, then check the GPU Operator/device plugin pods and node
labels.
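
To see which nodes currently advertise GPU capacity, a filter like the one below can be applied to kubectl output. This is a sketch, not the check the setup script performs: gpu_nodes is a hypothetical helper that expects two columns per line, node name and allocatable nvidia.com/gpu count.

```shell
#!/usr/bin/env bash
# Reads "<node> <gpu-count>" lines on stdin and prints the names of nodes
# whose count is a positive integer. kubectl prints <none> for nodes that
# do not advertise the resource, so those lines are skipped.
gpu_nodes() {
    awk '$2 ~ /^[0-9]+$/ && $2 > 0 { print $1 }'
}

# Only runs when kubectl is available; errors (no cluster, no access) are ignored.
if command -v kubectl >/dev/null 2>&1; then
    kubectl get nodes --no-headers \
        -o custom-columns='NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu' \
        2>/dev/null | gpu_nodes || true
fi
```

An empty result means no node reports allocatable GPUs, which matches the condition the workflow check warns about.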