Local Docker
Local Docker is a single-node execution platform that runs TAO jobs as named Docker containers on the local Docker daemon. It is useful for development, debugging, small runs, and hosts that already have the required GPUs, NVIDIA driver, Docker, and NVIDIA Container Toolkit.
Use local Docker when the data is local to the Docker host or accessible through mounted volumes/cloud credentials. Do not use it for remote cluster scheduling, multi-node training, or jobs that need SLURM queueing.
Preflight
The workflow must verify the host GPU runtime before starting Docker jobs. If the check fails, prompt the user to approve the install, run the printed install command, and rerun the preflight.
# Host GPU runtime: NVIDIA driver 580, CUDA 13.0, NVIDIA Container Toolkit 1.19.0.
TAO_SKILL_BANK_ROOT="${TAO_SKILL_BANK_ROOT:-$PWD}"
SETUP_SCRIPT="${TAO_SKILL_BANK_ROOT}/skills/tao-setup-nvidia-gpu-host/scripts/setup-nvidia-gpu-host.sh"
[ -x "$SETUP_SCRIPT" ] || SETUP_SCRIPT="${TAO_SKILL_BANK_ROOT}/platform/tao-setup-nvidia-gpu-host/scripts/setup-nvidia-gpu-host.sh"
bash "$SETUP_SCRIPT" --backend docker --check-only || {
echo "MISSING: TAO GPU host runtime is not ready."
echo "After user approval, run:"
echo " bash \"$SETUP_SCRIPT\" --backend docker --install --yes"
exit 1
}
# Mode 1 — direct docker (no Python). All you need is docker + the GPU runtime.
docker info >/dev/null 2>&1 || { echo "MISSING: docker daemon not reachable. Start Docker."; exit 1; }
docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi >/dev/null 2>&1 || {
echo "MISSING: NVIDIA Container Toolkit not installed/configured. See:"
echo " bash \"$SETUP_SCRIPT\" --backend docker --install --yes"
exit 1
}
# Mode 2 — TAO SDK wrapper. Adds Job handles, S3 I/O wrapping, ActionWorkflow.
# Skip this block if Mode 1 is sufficient for the user's request.
# When Mode 2 is in scope, read `tao-skill-bank:tao-run-platform` for the DockerSDK
# kwarg contract, build_entrypoint, and monitoring patterns.
# nvidia-tao-sdk is on public PyPI; pin lives in versions.yaml (wheels.tao_sdk_docker).
PIN=$("${TAO_SKILL_BANK_PATH:?}/scripts/resolve_versions_key.py" wheels.tao_sdk_docker)
python -c "import tao_sdk" 2>/dev/null || {
echo "MISSING: nvidia-tao-sdk not installed. Run:"
echo " pip install \"$PIN\""
exit 1
}
python -c "import docker" 2>/dev/null || {
echo "MISSING: docker Python client not installed. Run:"
echo " pip install \"$PIN\""
exit 1
}
# DockerSDK attaches every job container to ${DOCKER_NETWORK:-tao_default}. If
# the network does not exist, container start fails instantly with
# `network <name> not found` for every create_job.
DOCKER_NETWORK_NAME="${DOCKER_NETWORK:-tao_default}"
docker network ls --format '{{.Name}}' | grep -qxF -- "$DOCKER_NETWORK_NAME" || {
echo "MISSING: docker network '$DOCKER_NETWORK_NAME' not found. After user approval, run:"
echo " docker network create $DOCKER_NETWORK_NAME"
exit 1
}
If a check fails, the agent prompts the user to authorize the install/fix via Bash before proceeding.
Credentials
There are no platform credentials required beyond access to the Docker daemon.
Optional environment:
- DOCKER_HOST: Optional Docker daemon URL. If unset, the SDK uses the Docker Python client's normal environment/default socket resolution.
- DOCKER_NETWORK: Docker network for job containers. Default is tao_default.
- DOCKER_USERNAME: Registry username. Default is $oauthtoken for NGC.
- NGC_KEY: Used when pulling private images from nvcr.io.
- HOST_SSH_PATH: Mounted into AutoML brain containers when they need SSH keys to monitor remote SLURM child jobs.
- ACCESS_KEY, SECRET_KEY, S3_ENDPOINT_URL, S3_BUCKET_NAME: Optional S3-compatible storage settings for jobs that still read/write cloud storage from a local container.
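As an illustration of the defaults listed above, the following minimal sketch collects the optional settings from an environment mapping. The function name and the returned keys are hypothetical and are not part of the TAO SDK; only the variable names and default values come from this document.

```python
import os


def resolve_docker_settings(env=None):
    """Collect the optional settings, applying the documented defaults.

    Hypothetical helper for illustration; not an SDK function.
    Empty values are treated like unset ones, mirroring ${VAR:-default}.
    """
    env = os.environ if env is None else env
    return {
        # None means: let the Docker client resolve its default socket.
        "docker_host": env.get("DOCKER_HOST") or None,
        "network": env.get("DOCKER_NETWORK") or "tao_default",
        "registry_username": env.get("DOCKER_USERNAME") or "$oauthtoken",
        "ngc_key": env.get("NGC_KEY") or None,
        "host_ssh_path": env.get("HOST_SSH_PATH") or None,
    }
```

Passing an explicit dictionary instead of reading os.environ makes the resolution easy to check in isolation.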
Launch Preflight
Before generating scripts or starting containers:
- Verify the Docker daemon is reachable and the NVIDIA runtime can see GPUs.
- Verify every local/file dataset annotation and media path exists on the Docker host.
- For s3:// datasets and results, verify ACCESS_KEY and SECRET_KEY are set and the exact paths are readable with aws s3 ls.
- Verify model-specific credentials such as HF_TOKEN before launch.
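The path and credential checks above could be sketched as follows. This is an illustrative helper under stated assumptions, not the workflow's own code: the function name is hypothetical, it only confirms that ACCESS_KEY and SECRET_KEY are present (it does not run aws s3 ls), and it does not contact the Docker daemon.

```python
import os


def launch_preflight_problems(paths, env=None):
    """Return human-readable problems; an empty list means the checks passed.

    Hypothetical helper: checks local/file:// paths on the Docker host and
    the presence of S3 credentials when any s3:// path is used.
    """
    env = os.environ if env is None else env
    problems = []
    needs_s3 = False
    for path in paths:
        if path.startswith("s3://"):
            needs_s3 = True
            continue
        # Strip the file:// prefix so file:///data/x becomes /data/x.
        local = path[len("file://"):] if path.startswith("file://") else path
        if not os.path.exists(local):
            problems.append(f"missing on Docker host: {local}")
    if needs_s3:
        for key in ("ACCESS_KEY", "SECRET_KEY"):
            if not env.get(key):
                problems.append(f"{key} is not set")
    return problems
```

Returning a list rather than raising on the first failure lets the agent report every missing path and credential in one pass.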
Multi-GPU and multi-node
Multi-node is not supported on local Docker. One job runs on the local Docker daemon's host with no cross-host coordination.
Multi-GPU on the local host is supported via the NVIDIA Container Toolkit's --gpus flag (--gpus all or --gpus '"device=0,1,2,3"'). DockerSDK.create_job(gpu_count=N) plumbs through to --gpus. Single-host distributed init uses localhost; torchrun --nproc-per-node=N or PyTorch DDP work as usual.
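For direct docker run invocations, the two --gpus forms mentioned above can be built as in this minimal sketch. The helper names are hypothetical; the flag values follow the forms quoted in this section.

```python
def gpus_flag_value(device_ids=None):
    """Return the value for `docker run --gpus`: 'all' or a device list."""
    if not device_ids:
        return "all"
    ids = ",".join(str(i) for i in device_ids)
    # The inner double quotes are part of the value itself, so the
    # comma-separated device list reaches Docker as a single field.
    return f'"device={ids}"'


def docker_gpu_argv(image, command, device_ids=None):
    """Build an argv list (no shell quoting needed) for a GPU container."""
    return ["docker", "run", "--rm", "--gpus", gpus_flag_value(device_ids),
            image, *command]
```

For example, docker_gpu_argv("ubuntu", ["nvidia-smi"], [0, 1]) produces an argv list that can be passed to subprocess.run without going through a shell.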
Backend Details
Use the SDK backend value local-docker. The local backend schema has no extra
backend details, so most routing is controlled by environment and job
parameters:
{
"backend_type": "local-docker",
"num_gpu": 1
}
Following the Lepton/Brev SDK design, platform/control-plane values stay in SDK
state and Docker labels. The SDK does not inject BACKEND, HOST_PLATFORM,
MONGOSECRET, DOCKER_HOST, or DOCKER_NETWORK into the training container.
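One way to picture this separation is a filter that drops the control-plane keys before an environment is handed to the training container. This is a hypothetical sketch, not the SDK's implementation; only the key names come from the paragraph above.

```python
# Keys named in this section as control-plane values.
CONTROL_PLANE_KEYS = frozenset(
    {"BACKEND", "HOST_PLATFORM", "MONGOSECRET", "DOCKER_HOST", "DOCKER_NETWORK"}
)


def container_environment(env):
    """Return a copy of env without control-plane keys (illustrative only)."""
    return {k: v for k, v in env.items() if k not in CONTROL_PLANE_KEYS}
```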
Container Execution
The TAO SDK local Docker handler starts containers through the Docker Python client:
- Backend job name uses the tao-job-<job_id> form used by SDK handlers.
- Command is usually ["/bin/bash", "-c", "<job command>"].
- Containers run detached. The SDK keeps containers by default so status and logs remain inspectable, unless DOCKER_AUTO_REMOVE=true.
- /dev/shm is mounted as tmpfs.
- The configured Docker network is applied by the Docker daemon for the job container; it is not passed through as a process environment variable.
- Existing containers with the same job id are stopped and removed before a replacement starts.
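The properties in this list map naturally onto keyword arguments of the Docker Python client's containers.run call. The sketch below only assembles such a dictionary; the function name is hypothetical, the handler's real argument set may differ, and stopping or removing an existing container with the same name is not shown.

```python
def build_run_kwargs(image, job_id, job_command, env=None):
    """Assemble containers.run keyword arguments for one job (illustrative)."""
    env = {} if env is None else env
    return {
        "image": image,
        "name": f"tao-job-{job_id}",
        "command": ["/bin/bash", "-c", job_command],
        "detach": True,
        # Containers are kept unless DOCKER_AUTO_REMOVE=true.
        "auto_remove": env.get("DOCKER_AUTO_REMOVE", "").lower() == "true",
        # Mount /dev/shm as tmpfs with default options.
        "tmpfs": {"/dev/shm": ""},
        # The network is a container setting, not a process env variable.
        "network": env.get("DOCKER_NETWORK") or "tao_default",
    }
```

With the docker package installed, the result could be used as docker.from_env().containers.run(**build_run_kwargs(...)).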
For GPU access, the handler auto-detects the host type:
- Tegra or Jetson hosts use runtime="nvidia" plus NVIDIA_VISIBLE_DEVICES and NVIDIA_DRIVER_CAPABILITIES=all.
- Standard x86 hosts use Docker device_requests with GPU capabilities.
If num_gpus is 0, no GPUs are assigned. If num_gpus is -1, all visible
GPUs are requested. Prefer explicit GPU counts for shared development machines.
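The GPU rules above can be sketched as a small function that returns extra keyword arguments for the Docker Python client. Several details are assumptions made for illustration: the function name, the is_tegra flag (host detection is not shown), and the choice to expose devices 0..N-1 through NVIDIA_VISIBLE_DEVICES on Tegra hosts. Each device_requests entry holds the keyword arguments for docker.types.DeviceRequest.

```python
def gpu_run_kwargs(num_gpus, is_tegra=False):
    """Translate a GPU count into container run arguments (illustrative)."""
    if num_gpus == 0:
        return {}  # no GPUs assigned
    if num_gpus < -1:
        raise ValueError("num_gpus must be -1, 0, or a positive count")
    if is_tegra:
        # Assumption: -1 maps to "all", N maps to device indices 0..N-1.
        visible = "all" if num_gpus == -1 else ",".join(
            str(i) for i in range(num_gpus)
        )
        return {
            "runtime": "nvidia",
            "environment": {
                "NVIDIA_VISIBLE_DEVICES": visible,
                "NVIDIA_DRIVER_CAPABILITIES": "all",
            },
        }
    # Standard hosts: a count of -1 requests every visible GPU.
    return {"device_requests": [{"count": num_gpus, "capabilities": [["gpu"]]}]}
```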
Storage
Local Docker accepts local and file:// paths because the container runs on the
same Docker host. Make sure every path in the spec is either:
- mounted into the container by the handler or surrounding service,
- reachable from inside the container already, or
- a cloud URI with matching credentials.
For remote/shared filesystems, prefer the platform that owns that filesystem.
For example, use SLURM plus lustre:///... for Lustre paths on a cluster.
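A spec path can be sorted into these categories before launch, as in the following minimal sketch. The function name and the category labels are hypothetical, and the set of cloud schemes is an assumption limited to s3://, the only cloud scheme this document mentions.

```python
from urllib.parse import urlparse

# Assumption: only s3:// is treated as a cloud URI here.
CLOUD_SCHEMES = {"s3"}


def classify_spec_path(path):
    """Classify a path as 'local', 'cloud', or 'other' (illustrative)."""
    scheme = urlparse(path).scheme
    if scheme in ("", "file"):
        return "local"   # must be mounted or already reachable in the container
    if scheme in CLOUD_SCHEMES:
        return "cloud"   # needs matching credentials
    return "other"       # e.g. lustre:///..., better served by its own platform
```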
Monitoring
- The SDK handler maps Docker container state directly: created -> Pending, running/restarting -> Running, paused -> Paused, exit code 0 -> Complete, nonzero exit -> Error.
- Logs come directly from the named container through the Docker Python client (docker logs tao-job-<job_id>).
If the container has exited, died, is being removed, or cannot be found, status reconciliation treats the backend process as terminated.
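The state mapping described above can be written as a small lookup. This is a sketch rather than the handler's code: the function name is hypothetical, and the "Terminated" label for dead, removing, or missing containers is an assumed name for what the text calls a terminated backend process.

```python
def map_container_status(state, exit_code=None):
    """Map a Docker container state to a job status (illustrative)."""
    if state == "created":
        return "Pending"
    if state in ("running", "restarting"):
        return "Running"
    if state == "paused":
        return "Paused"
    if state == "exited":
        # Exit code 0 is success; anything else is an error.
        return "Complete" if exit_code == 0 else "Error"
    # dead, removing, or not found (state is None): treated as terminated.
    return "Terminated"
```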
Cancellation
Cancellation stops the named container. GPU ownership is managed by Docker / the NVIDIA runtime, not by TAO Core's local GPU manager.
Optional: via the TAO SDK
If you want Job handles, S3 I/O wrapping via the SDK's script_runner, or
durability across sessions:
from tao_sdk.platforms.docker import DockerSDK
sdk = DockerSDK() # reads DOCKER_HOST, NGC_KEY, S3 creds from env
job = sdk.create_job(
image='nvcr.io/nvidia/tao/tao-toolkit:6.26.3-pyt',
command='dino train -e /tmp/spec.yaml',
gpu_count=1,
inputs={'/data/train.json': 's3://bucket/coco/train.json'},
outputs=['/results/'],
)
status = sdk.get_job_status(job.id)
logs = sdk.get_job_logs(job.id, tail=200)
This wraps the same docker run invocation under a Job handle and routes
the entrypoint through script_runner so inputs/outputs get downloaded
from / uploaded to S3 automatically. If you don't need those, just use
docker run directly — no SDK install required.
Failure Modes
Docker client not initialized: Verify the Docker Python package is installed,
set DOCKER_HOST if you are not using the default local socket, and confirm the
process can talk to the daemon.
GPU assignment failed: Requested GPUs are unavailable, the NVIDIA Container
Toolkit is not configured, or the Docker daemon cannot create GPU device
requests. Use fewer GPUs, wait for another job to finish, or verify
docker run --gpus ... works on the host.
Image pull auth failed: Set a valid NGC_KEY for private nvcr.io images
or run docker login nvcr.io -u '$oauthtoken' on the Docker host.
Container exited unexpectedly: Check docker logs tao-job-<job_id>, the
configured DOCKER_NETWORK, and the command produced by the SDK action runner.
Path missing inside container: A local path on the host is not necessarily mounted into the job container. Use a path convention supported by the action runner or configure an explicit volume through the surrounding service.