TAO Setup NVIDIA GPU Host

Host setup for TAO GPU backends. Checks and, after user approval, installs NVIDIA driver branch 580, CUDA Toolkit 13.0, and NVIDIA Container Toolkit 1.19.0 for Docker/local-Docker and Kubernetes GPU worker hosts. The `--check-only` path works on any Linux distribution; `--install` automates debian-family (Ubuntu/Debian/Pop!_OS/Mint/Zorin/Raspbian), rhel-family (Fedora/RHEL/Rocky/AlmaLinux), and suse-family (openSUSE/SLES) hosts, and prints actionable manual-install steps for everything else.

Published by @NVIDIA

NVIDIA GPU Host Setup

Use this setup skill before TAO workflows run on the docker, local-docker, or kubernetes backend. It standardizes the host GPU runtime on:

  • NVIDIA driver branch 580 (open kernel module preferred)
  • CUDA Toolkit package cuda-toolkit-13-0
  • NVIDIA Container Toolkit 1.19.0
  • Docker engine — only installed for docker / local-docker backends and only when Docker is missing. The package picked depends on the distro family (docker.io on Debian-family by default, moby-engine / docker-ce from download.docker.com on RHEL-family, docker on SUSE-family). Pass --skip-docker-install to opt out.

The check is read-only and safe to run by default. It works on any Linux distribution because it only probes nvidia-smi, the CUDA toolkit path, the installed Container Toolkit package version (via dpkg, rpm, or the version reported by the nvidia-ctk binary), and the Docker daemon's NVIDIA runtime.
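As a rough illustration of what such a read-only check can look like, the sketch below wraps each probe in a small helper that reports OK or MISSING. The helper names `probe` and `check_gpu_host` are hypothetical and are not taken from setup-nvidia-gpu-host.sh; the real script's logic and messages may differ.

```shell
# Hypothetical helper (not from the real script): run a read-only command
# and report OK / MISSING without changing anything on the host.
probe() {
  label=$1
  shift
  if "$@" >/dev/null 2>&1; then
    echo "OK: $label"
  else
    echo "MISSING: $label"
  fi
}

# The probes described above, wrapped in a function so that nothing runs
# until it is called explicitly.
check_gpu_host() {
  probe "NVIDIA driver (nvidia-smi)" nvidia-smi
  probe "CUDA toolkit (nvcc)" test -x /usr/local/cuda-13.0/bin/nvcc
  probe "NVIDIA Container Toolkit (nvidia-ctk)" nvidia-ctk --version
  probe "Docker NVIDIA runtime" sh -c 'docker info --format "{{json .Runtimes}}" | grep -q nvidia'
}
```

Calling `check_gpu_host` prints one line per component and never installs or modifies anything.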

Installation requires explicit user authorization: once the user approves, rerun the script with --install. The install path is automated for these distro families:

| Family | Tested distros | Manager | Notes |
| --- | --- | --- | --- |
| debian | Ubuntu 22.04 / 24.04, Debian 12 (and derivatives Pop!_OS, Mint, Zorin, Raspbian, KDE Neon, etc. via UBUNTU_CODENAME / VERSION_CODENAME) | apt-get | Adds NVIDIA cuda-keyring + Container Toolkit .list. Docker via docker.io (override $DOCKER_PACKAGE_DEBIAN). |
| rhel | Fedora 39+, RHEL / Rocky / AlmaLinux 9 and 10 | dnf (or yum) | Adds NVIDIA cuda-<distro>.repo + Container Toolkit .repo. Docker via Fedora moby-engine when available, otherwise docker-ce from download.docker.com. |
| suse | openSUSE Leap 15, SLES 15 | zypper | Adds the same NVIDIA .repo files. Docker via the distribution docker package. |
| other (Arch, Alpine, Gentoo, NixOS, FreeBSD, …) | n/a | n/a | --install exits with a clear error listing the version targets and the NVIDIA install-guide URLs. Install manually, then rerun --check-only. |
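To make the family dispatch concrete, here is a minimal sketch that classifies a host from the ID and ID_LIKE fields of /etc/os-release. The function name `detect_family` and its matching rules are assumptions for illustration; the actual script may match on different tokens.

```shell
# Hypothetical helper: classify a host from the ID and ID_LIKE values
# found in /etc/os-release. The real script's rules may differ.
detect_family() {
  id=$1
  id_like=$2
  # ID_LIKE is a space-separated list, so word splitting is intentional.
  for token in $id $id_like; do
    case "$token" in
      debian|ubuntu) echo debian; return 0 ;;
      rhel|fedora|centos) echo rhel; return 0 ;;
      suse|opensuse*|sles) echo suse; return 0 ;;
    esac
  done
  echo other
}

# Usage on a real host (commented out; /etc/os-release is absent on e.g. macOS):
# ( . /etc/os-release; detect_family "${ID:-}" "${ID_LIKE:-}" )
```

Checking ID first and then ID_LIKE lets derivatives such as Pop!_OS (ID_LIKE="ubuntu debian") land in the debian family without being listed by name.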

Quick Start

From the skill bank root:

# Check the local Docker backend host.
bash skills/platform/tao-setup-nvidia-gpu-host/scripts/setup-nvidia-gpu-host.sh --backend docker --check-only

# Install or repair after user approval (prompts for confirmation; see the note below for non-interactive runs).
bash skills/platform/tao-setup-nvidia-gpu-host/scripts/setup-nvidia-gpu-host.sh --backend docker --install

# Check a Kubernetes GPU worker host.
bash skills/platform/tao-setup-nvidia-gpu-host/scripts/setup-nvidia-gpu-host.sh --backend kubernetes --check-only

⚠️ Note — running non-interactively (agent / skill runs): a skill run has no terminal, so the installer's Continue? [y/N] confirmation cannot be answered. After running --check-only to preview what is missing and getting the user's explicit approval, append the assume-yes flag (--yes) to the --install command so it proceeds without a prompt. That auto-confirms installation of system packages (NVIDIA driver branch 580, CUDA Toolkit 13.0, NVIDIA Container Toolkit, and — for Docker backends — Docker) and modifies the host: it adds NVIDIA package repositories, may restart Docker, and adds the invoking user to the docker group, so only do this on a host you control and have the privileges to change. When a person runs --install directly at a terminal, the script instead prompts with the exact package list before making any changes.
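The confirmation behavior described in the note can be pictured with a small gate like the one below. This is only a sketch: `ASSUME_YES` and `confirm` are hypothetical names chosen for illustration, and the real script may implement its --yes handling differently.

```shell
# Hypothetical sketch of an assume-yes gate; ASSUME_YES and confirm()
# are illustrative names, not identifiers from the real script.
confirm() {
  prompt=$1
  if [ "${ASSUME_YES:-0}" = "1" ]; then
    return 0    # --yes was passed: proceed without prompting
  fi
  if [ ! -t 0 ]; then
    # No terminal on stdin (agent / skill run): refuse rather than hang.
    echo "No terminal available; rerun with --yes after user approval." >&2
    return 1
  fi
  printf '%s [y/N] ' "$prompt"
  read -r reply
  case "$reply" in
    y|Y|yes|YES) return 0 ;;
    *) return 1 ;;    # default answer is No
  esac
}
```

Defaulting to "no" when there is no terminal keeps a non-interactive run from installing anything unless approval was given explicitly.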

In an installed plugin copy that exposes skills/, use:

bash skills/tao-setup-nvidia-gpu-host/scripts/setup-nvidia-gpu-host.sh --backend docker --check-only

Workflow Contract

Docker and Kubernetes workflows must run the check before submitting GPU work:

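# Resolve the setup script: try the skills/ layout first, then fall back to the platform/ layout.
SETUP_SCRIPT="${TAO_SKILL_BANK_ROOT:-$PWD}/skills/tao-setup-nvidia-gpu-host/scripts/setup-nvidia-gpu-host.sh"
# Test with -f rather than -x: the script is run through bash, so it only needs to exist.
[ -f "$SETUP_SCRIPT" ] || SETUP_SCRIPT="${TAO_SKILL_BANK_ROOT:-$PWD}/platform/tao-setup-nvidia-gpu-host/scripts/setup-nvidia-gpu-host.sh"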

bash "$SETUP_SCRIPT" --backend docker --check-only || {
  echo "MISSING: TAO GPU host runtime is not ready."
  echo "After user approval, run (append --yes for non-interactive agent runs):"
  echo "  bash \"$SETUP_SCRIPT\" --backend docker --install"
  exit 1
}

Never install silently. If the check fails, explain what is missing, ask the user to authorize the fix, then run the install command and rerun the check.

What The Installer Does

The installer dispatches on the detected distribution family. On every supported family it adds NVIDIA's CUDA and Container Toolkit repositories (if missing), installs the pinned runtime packages, optionally installs Docker, wires the NVIDIA Docker runtime, and adds the invoking user to the docker group.

Common steps (all families):

  1. Adds NVIDIA's CUDA repository if missing (apt cuda-keyring deb, cuda-<distro>.repo for dnf/zypper).
  2. Adds NVIDIA's Container Toolkit repository if missing (.list for apt, .repo for dnf/zypper).
  3. Installs the matching kernel header / devel package for the running kernel.
  4. Installs the driver branch 580 packages, cuda-toolkit-13-0, and the Container Toolkit pinned to 1.19.0 (the dpkg-suffixed 1.19.0-1 is the same upstream version expressed for apt).
  5. For Docker backends and when Docker is missing, installs Docker (override / opt-out flags below), enables/starts the daemon, then runs nvidia-ctk runtime configure --runtime=docker and restarts Docker when systemctl is available.
  6. Adds the invoking user ($SUDO_USER if available, else $USER) to the docker group so subsequent shells can run docker without sudo — opt out with --skip-docker-group. The new group membership does not take effect in the current shell: log out and back in, or run newgrp docker in each new shell.
  7. Attempts modprobe nvidia so verification can pass before reboot.
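Steps 5 and 6 above can be sketched as follows. The helper names are hypothetical, and the commands are printed rather than executed so the sketch is safe to run anywhere; the real installer may order or guard these steps differently.

```shell
# Hypothetical helper: pick the account to add to the docker group.
# Under sudo, $USER is root, so $SUDO_USER (the invoking account) wins.
docker_group_user() {
  echo "${SUDO_USER:-${USER:-}}"
}

# Print (do not run) the commands behind steps 5 and 6.
print_docker_steps() {
  user=$(docker_group_user)
  echo "sudo nvidia-ctk runtime configure --runtime=docker"
  echo "sudo systemctl restart docker"
  echo "sudo usermod -aG docker $user"
}
```

Preferring `$SUDO_USER` matters because a script started with sudo would otherwise add root, not the person who ran it, to the docker group.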

Family-specific package selections:

| Step | debian-family | rhel-family | suse-family |
| --- | --- | --- | --- |
| Kernel headers | linux-headers-$(uname -r) | kernel-devel-$(uname -r), kernel-headers-$(uname -r) | kernel-default-devel |
| Driver | nvidia-driver-pinning-580, nvidia-open-580 (override: $NVIDIA_DRIVER_PACKAGE_DEBIAN) | nvidia-driver-cuda, kmod-nvidia-open-dkms (override: $NVIDIA_DRIVER_PACKAGE_RHEL, $NVIDIA_DRIVER_KMOD_RHEL) | nvidia-open-driver-G06-signed-kmp-default (override: $NVIDIA_DRIVER_PACKAGE_SUSE) |
| CUDA toolkit | cuda-toolkit-13-0 | cuda-toolkit-13-0 | cuda-toolkit-13-0 |
| Container Toolkit | nvidia-container-toolkit=1.19.0-1 + base/tools/libs | nvidia-container-toolkit-1.19.0 + base/tools/libs | same as rhel |
| Docker | docker.io (override: $DOCKER_PACKAGE_DEBIAN) | moby-engine+moby-cli on Fedora when available, else docker-ce docker-ce-cli containerd.io from download.docker.com | docker |
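The Container Toolkit row shows that the same 1.19.0 pin is spelled differently per package manager: apt uses `name=version-revision`, while dnf and zypper accept `name-version`. A minimal sketch of that difference, using the hypothetical helper name `container_toolkit_spec`:

```shell
# Hypothetical helper: build the pinned Container Toolkit package spec
# for a distro family, following the table above.
container_toolkit_spec() {
  family=$1
  version=${2:-1.19.0}
  case "$family" in
    debian) echo "nvidia-container-toolkit=${version}-1" ;;    # apt: name=version-revision
    rhel|suse) echo "nvidia-container-toolkit-${version}" ;;   # dnf / zypper: name-version
    *) echo "unsupported family: $family" >&2; return 1 ;;
  esac
}
```

On a debian-family host this could be used as `sudo apt-get install -y "$(container_toolkit_spec debian)"`; the companion base/tools/libs packages would need the same treatment.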

Verification

After installation, verify:

nvidia-smi
/usr/local/cuda-13.0/bin/nvcc --version
docker info --format '{{json .Runtimes}}' | grep nvidia
sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi

Expected nvidia-smi output includes driver 580.x and CUDA Version 13.0. Expected nvcc output includes release 13.0.
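If you want to turn the expected-output check into something scriptable, a sketch along these lines could work. The function name `smi_header_ok` is hypothetical, and it assumes the usual nvidia-smi header line containing "Driver Version:" and "CUDA Version:".

```shell
# Hypothetical helper: read nvidia-smi output on stdin and succeed only
# if the header reports a 580.x driver and CUDA Version 13.0.
smi_header_ok() {
  header=$(grep 'Driver Version:' | head -n 1)
  echo "$header" | grep -Eq 'Driver Version: 580\.[0-9.]+' &&
    echo "$header" | grep -Eq 'CUDA Version: 13\.0'
}

# Usage on a GPU host (commented out; requires a working driver):
# nvidia-smi | smi_header_ok && echo "driver and CUDA versions match"
```

Reading from stdin keeps the helper testable on machines without a GPU, since saved nvidia-smi output can be piped in.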

Kubernetes Notes

For self-managed Kubernetes clusters, run the host installer on every GPU worker node or bake the same package set into the node image before installing the NVIDIA GPU Operator or device plugin.

The workflow check also warns if kubectl is available but the cluster reports no nvidia.com/gpu allocatable capacity. In that case, install/configure the NVIDIA GPU Operator after the worker host runtime is ready:

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install --wait gpu-operator -n gpu-operator --create-namespace nvidia/gpu-operator
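To confirm by hand whether the cluster advertises nvidia.com/gpu capacity, one option is to list each node's allocatable GPU count and count the nodes that report at least one. The sketch below is an assumption about how to do this with kubectl's JSONPath output; `count_gpu_nodes` is a hypothetical helper and not part of the setup script.

```shell
# Hypothetical helper: read "<node>\t<gpu-count>" lines on stdin and
# print how many nodes advertise at least one allocatable GPU.
count_gpu_nodes() {
  awk -F '\t' '$2 + 0 > 0 { n++ } END { print n + 0 }'
}

# Usage against a live cluster (commented out; requires kubectl access).
# The backslash escapes the dots inside the nvidia.com/gpu resource name.
# kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}' | count_gpu_nodes
```

A result of 0 corresponds to the warning case described above: the host runtime may be ready, but no node is exposing GPUs to the scheduler yet.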

Managed Kubernetes providers may own driver installation through node images or GPU Operator policy. Do not overwrite a provider-managed GPU node without user approval and a rollback plan.

Failure Modes

Unsupported distribution family: --install automates debian-, rhel-, and suse-family hosts. On Arch, Alpine, Gentoo, NixOS, FreeBSD, or anything without /etc/os-release (e.g. macOS), the script exits with a clear error that lists the four version targets and the upstream NVIDIA install-guide URLs:

  • https://docs.nvidia.com/cuda/cuda-installation-guide-linux/
  • https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
  • https://docs.docker.com/engine/install/

Install those four pieces using your distribution's package manager and rerun the script with --check-only to verify. The check is universally portable — it only queries the binaries / package databases — so once the runtime is in place the workflow contract is satisfied regardless of the underlying distro.

Unsupported Ubuntu/Debian derivative: When ID is e.g. pop, mint, zorin, raspbian, or another debian-family derivative, the script maps the host onto the upstream Ubuntu/Debian CUDA repo via UBUNTU_CODENAME / VERSION_CODENAME (focal/jammy/noble → Ubuntu 20.04/22.04/24.04; bullseye/bookworm/trixie → Debian 11/12/12). If the host's codename doesn't match a known upstream release, --install exits with the same manual-install guidance described above.
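The codename mapping can be pictured as a simple lookup. In the sketch below, the function name `cuda_repo_for_codename` is hypothetical, and the output values assume NVIDIA's CUDA repository directory naming (for example ubuntu2204). It covers only the unambiguous pairs listed above and leaves trixie out, so treat it as an illustration rather than the script's actual table.

```shell
# Hypothetical helper: map an upstream codename to a CUDA repo id.
# Repo ids assume NVIDIA's directory naming; trixie is intentionally omitted.
cuda_repo_for_codename() {
  case "$1" in
    focal) echo ubuntu2004 ;;
    jammy) echo ubuntu2204 ;;
    noble) echo ubuntu2404 ;;
    bullseye) echo debian11 ;;
    bookworm) echo debian12 ;;
    *) return 1 ;;    # unknown codename: caller falls back to manual-install guidance
  esac
}

# Usage on a real host (commented out). UBUNTU_CODENAME is tried first because
# derivatives often set VERSION_CODENAME to their own release name.
# ( . /etc/os-release; cuda_repo_for_codename "${UBUNTU_CODENAME:-${VERSION_CODENAME:-}}" )
```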

Docker not installed: --check-only reports MISSING: Docker is not installed and prints the exact rerun command appropriate to the detected distro family. The default --install path installs Docker (docker.io / moby-engine / docker-ce / docker depending on family), enables/starts the daemon, configures the NVIDIA runtime, and adds the invoking user to the docker group. If you prefer to manage Docker yourself, install it before rerunning the script or pass --skip-docker-install.

Docker installed but docker run still needs sudo: The script adds the invoking user to the docker group, but Linux only refreshes group membership on a new login session. Log out and back in, or run newgrp docker in each new shell, until the new membership is active.
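A quick way to tell whether the current shell has already picked up the new membership is to inspect the groups of the running session. The helper `has_group` below is a hypothetical sketch, not part of the setup script.

```shell
# Hypothetical helper: succeed if the first argument appears among the
# remaining arguments (a list of group names).
has_group() {
  wanted=$1
  shift
  for g in "$@"; do
    [ "$g" = "$wanted" ] && return 0
  done
  return 1
}

# Usage (commented out). `id -nG` with no user argument reports the groups of
# the current process, so it reflects this session, not the group database.
# has_group docker $(id -nG) || echo "docker group not active yet: log out and back in, or run newgrp docker"
```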

Docker runtime still missing: Rerun nvidia-ctk runtime configure --runtime=docker, then restart Docker so the daemon reloads its configuration.

Driver branch detected != 580: The driver-branch pin is exact on debian-family (nvidia-open-580). On rhel-/suse-family the script installs the latest open driver shipped in NVIDIA's CUDA 13.0 repo for the detected distro, which is always ≥ 580. If your host needs a stricter pin, set $NVIDIA_DRIVER_PACKAGE_RHEL / $NVIDIA_DRIVER_KMOD_RHEL / $NVIDIA_DRIVER_PACKAGE_SUSE to the exact package names you want before running --install.

Driver installed but nvidia-smi fails: Load the module with sudo modprobe nvidia or reboot. Secure Boot may require MOK enrollment on systems where it is enabled.
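When diagnosing this case, it can help to check whether the kernel module is actually loaded and whether Secure Boot is on. The sketch below uses a hypothetical helper, `module_loaded`, that reads text in the /proc/modules format; it is an illustration and not part of the setup script.

```shell
# Hypothetical helper: read /proc/modules-style text on stdin and succeed
# if a module with exactly the given name is listed in the first column.
module_loaded() {
  awk -v m="$1" '$1 == m { found = 1 } END { exit !found }'
}

# Usage on a real host (commented out):
# module_loaded nvidia < /proc/modules || sudo modprobe nvidia
# mokutil --sb-state    # reports whether Secure Boot is enabled
```

Matching the first column exactly avoids mistaking related modules such as nvidia_uvm for the main nvidia module.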

Kubernetes still has no GPU capacity: Confirm the driver works on each GPU node with nvidia-smi, then check the GPU Operator/device plugin pods and node labels.

Bundled with this artifact

6 files

Reference files that ship alongside this artifact. Agents pull these in only when the task needs them.
