Lepton
Managed GPU compute platform on DGX Cloud. Jobs are submitted as container workloads that run on dedicated or shared GPU node groups. Lepton handles scheduling, image pulling, log collection, and job lifecycle.
Use Lepton when you need cloud-based GPU compute without managing Kubernetes or SLURM infrastructure directly.
Preflight
Lepton is API-first — no docker-run alternative. This skill needs the TAO SDK with the Lepton extra. nvidia-tao-sdk is on public PyPI; the pinned version lives in versions.yaml (wheels.tao_sdk_lepton), resolved via scripts/resolve_versions_key.py:
PIN=$("${TAO_SKILL_BANK_PATH:?}/scripts/resolve_versions_key.py" wheels.tao_sdk_lepton)
python -c "import tao_sdk" 2>/dev/null || {
  echo "MISSING: nvidia-tao-sdk not installed. Run:"
  echo " pip install \"$PIN\""
  exit 1
}
python -c "import leptonai" 2>/dev/null || {
  echo "MISSING: lepton extra not installed. Run:"
  echo " pip install \"$PIN\""
  exit 1
}
If missing, the agent prompts the user to authorize the install via Bash, then re-runs the preflight before continuing.
Credentials
- LEPTON_WORKSPACE_ID (required): Determines which cluster and billing account the job runs under.
- LEPTON_AUTH_TOKEN (required): API token for authenticating with the Lepton control plane.
- NGC_KEY (optional): Used to create image pull secrets for pulling TAO container images from nvcr.io.
- ACCESS_KEY / SECRET_KEY (optional): S3-compatible storage keys for dataset and checkpoint URIs.
- S3_ENDPOINT_URL (optional): Custom S3 endpoint (e.g., for MinIO or non-AWS S3).
- S3_BUCKET_NAME (optional): Bucket for job output artifacts.
- CLOUD_REGION (optional): Storage region (e.g., us-east-1).
Launch Preflight
Before generating scripts or submitting jobs:
- Verify LEPTON_WORKSPACE_ID and LEPTON_AUTH_TOKEN are set.
- Verify the workspace API is reachable with the packaged helper: scripts/check_tao_launch_preflight.py --platform lepton ...
- For s3:// datasets/results, verify ACCESS_KEY and SECRET_KEY are set and the exact paths are readable with aws s3 ls.
- For NFS/Lustre mounted paths, require proof from Lepton volume/storage permissions that the path will be mounted into the job. Do not treat a local filesystem test -e on the agent host as proof for Lepton jobs.
- Verify model-specific credentials such as HF_TOKEN before launch.
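The credential checks in the list above can be sketched in Python. This is a minimal illustration, not the packaged helper: missing_credentials is a hypothetical function name, and it only checks that the variables are non-empty, not that they are valid.

```python
REQUIRED = ("LEPTON_WORKSPACE_ID", "LEPTON_AUTH_TOKEN")
S3_KEYS = ("ACCESS_KEY", "SECRET_KEY")


def missing_credentials(env, uris=()):
    """Return credential names that are unset or empty for this launch.

    Hypothetical helper: S3 keys are only required when at least one
    dataset/results URI uses the s3:// protocol.
    """
    needed = list(REQUIRED)
    if any(str(u).startswith("s3://") for u in uris):
        needed.extend(S3_KEYS)
    return [name for name in needed if not env.get(name)]


# Example with a plain dict; pass os.environ in real use.
example_env = {"LEPTON_WORKSPACE_ID": "ws-123", "LEPTON_AUTH_TOKEN": "token"}
print(missing_credentials(example_env, uris=["s3://bucket/coco/train.json"]))
# prints ['ACCESS_KEY', 'SECRET_KEY']
```

An empty result only shows that the variables are present; reachability and path readability still need the checks listed above.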
Backend Details
LeptonSDK.create_job accepts these Lepton-specific kwargs (in addition to the platform-agnostic ones — image, command, gpu_count, env_vars, inputs, outputs, hooks):
- resource_shape: explicit GPU resource shape ID (e.g., "gpu.8xh100-sxm"). When set, skips the auto-resolution from gpu_count. The format is opaque (whatever Lepton's API returns as instance metadata.id); discover valid IDs via sdk.list_resource_shapes().
- dedicated_node_group: node group ID for guaranteed GPU allocation (no preemption). Omit for shared resources.
- num_nodes: number of nodes for distributed training. Default 1. When > 1, enables intra-job communication and PyTorch distributed initialization (see Multi-node training).
- mounts: pre-built Mount objects for NFS / Lustre. Auto-detected from the node group when not set.
Discovering the workspace's shapes / volumes
shapes = sdk.list_resource_shapes()
# {<platform_id>: {"cluster": ..., "gpu_type": "gpu.8xh100-sxm",
# "gpu_count": 8, "instance_type": ..., ...}, ...}
volumes = sdk.get_volumes(node_group_id="my-h100-pool")
# [{"name": "lustre", "from_path": "/lustre", "type": "Lustre"}, ...]
prefixes = sdk.get_storage_permissions("lustre", "my-h100-pool")
# ["/lustre/fsw/portfolios/edgeai/...", ...]
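The prefixes returned by sdk.get_storage_permissions can be used to check a mounted dataset path before launch, instead of testing the local filesystem. A minimal sketch, assuming the prefixes are absolute POSIX paths; is_path_permitted and the example prefix are hypothetical and not part of the SDK:

```python
from pathlib import PurePosixPath


def is_path_permitted(path, allowed_prefixes):
    """True if `path` equals or sits below one of the allowed prefixes.

    Comparison is by whole path components, so "/lustre/team" does not
    match "/lustre/teammate". ".." segments are not resolved here.
    """
    target = PurePosixPath(path)
    for prefix in allowed_prefixes:
        base = PurePosixPath(prefix)
        if target == base or base in target.parents:
            return True
    return False


# Hypothetical prefix list, in the shape get_storage_permissions returns.
prefixes = ["/lustre/team-a/datasets"]
print(is_path_permitted("/lustre/team-a/datasets/coco/train.json", prefixes))
print(is_path_permitted("/lustre/team-b/datasets/coco/train.json", prefixes))
```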
Multi-node training (distributed)
Pass num_nodes > 1 to create_job for multi-node distributed training. The Lepton handler (tao_sdk/platforms/lepton/handler.py) configures the underlying LeptonJob by setting intra_job_communication=True (opens pod-to-pod networking), parallelism=num_nodes and completions=num_nodes (Lepton schedules N replicas), and exports WORLD_SIZE=num_nodes as a container env var.
Lepton's native per-replica env vars use Lepton-specific names (LEPTON_JOB_WORKER_INDEX, LEPTON_JOB_TOTAL_WORKERS, LEPTON_JOB_WORKER_PREFIX, LEPTON_SUBDOMAIN), so the handler prepends a bootstrap that sources Lepton's official translation script:
wget -O init.sh https://raw.githubusercontent.com/leptonai/scripts/main/lepton_env_to_pytorch.sh
chmod +x init.sh
source init.sh
# user command runs here
After sourcing, the following env vars are set:
| Env var | Source | Value |
|---|---|---|
| MASTER_ADDR | script | ${LEPTON_JOB_WORKER_PREFIX}-0.${LEPTON_SUBDOMAIN} |
| MASTER_PORT | script | 29400 |
| NNODES | script | ${LEPTON_JOB_TOTAL_WORKERS} |
| NODE_RANK | script | ${LEPTON_JOB_WORKER_INDEX} |
| WORKER_ADDRS | script | comma-separated list of non-master worker hostnames |
| WORLD_SIZE | TAO SDK handler | num_nodes (TAO container's convention; same value as NNODES) |
| NUM_GPU_PER_NODE | TAO SDK handler | gpu_count (read by TAO container's entrypoint) |
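As a rough illustration of the script-sourced rows in the table, the mapping can be mirrored in Python. This is a sketch, not the translation script itself: pytorch_env_from_lepton is a hypothetical helper, and the WORKER_ADDRS hostname pattern (the MASTER_ADDR form with indices 1..N-1) is an assumption.

```python
def pytorch_env_from_lepton(env, master_port=29400):
    """Derive PyTorch-style rendezvous variables from Lepton's per-replica ones."""
    prefix = env["LEPTON_JOB_WORKER_PREFIX"]
    subdomain = env["LEPTON_SUBDOMAIN"]
    total = int(env["LEPTON_JOB_TOTAL_WORKERS"])
    # Assumption: workers 1..N-1 follow the same naming pattern as worker 0.
    workers = [f"{prefix}-{i}.{subdomain}" for i in range(1, total)]
    return {
        "MASTER_ADDR": f"{prefix}-0.{subdomain}",
        "MASTER_PORT": str(master_port),
        "NNODES": str(total),
        "NODE_RANK": env["LEPTON_JOB_WORKER_INDEX"],
        "WORKER_ADDRS": ",".join(workers),
    }


# Hypothetical values for a 3-node job, seen from replica 1.
sample = {
    "LEPTON_JOB_WORKER_PREFIX": "train-job",
    "LEPTON_SUBDOMAIN": "train-job-svc",
    "LEPTON_JOB_TOTAL_WORKERS": "3",
    "LEPTON_JOB_WORKER_INDEX": "1",
}
for name, value in pytorch_env_from_lepton(sample).items():
    print(f"{name}={value}")
```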
job = sdk.create_job(
    image='nvcr.io/nvidia/tao/tao-toolkit:6.26.3-pyt',
    command='dino train -e /tmp/spec.yaml',  # TAO entrypoint reads WORLD_SIZE + NUM_GPU_PER_NODE
    gpu_count=8,  # GPUs per node
    num_nodes=4,  # 4 × 8 = 32 GPUs total
    dedicated_node_group='my-h100-pool',
    inputs={'/data/train.json': 's3://bucket/coco/train.json'},
    outputs=['/results/'],
)
For raw torchrun-based commands (non-TAO containers):
command='torchrun --nnodes=$NNODES --nproc-per-node=8 --node-rank=$NODE_RANK '
'--master-addr=$MASTER_ADDR --master-port=$MASTER_PORT train.py'
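To keep --nproc-per-node in step with gpu_count rather than hard-coding it, the command string can be built programmatically. A small sketch; torchrun_command is a hypothetical helper, and the $VARS are deliberately left unexpanded so the job's shell resolves them after the bootstrap script has run:

```python
def torchrun_command(script, gpus_per_node):
    """Build a torchrun command string for a multi-node Lepton job."""
    if gpus_per_node < 1:
        raise ValueError("gpus_per_node must be at least 1")
    flags = [
        "--nnodes=$NNODES",
        f"--nproc-per-node={gpus_per_node}",
        "--node-rank=$NODE_RANK",
        "--master-addr=$MASTER_ADDR",
        "--master-port=$MASTER_PORT",
    ]
    return " ".join(["torchrun", *flags, script])


gpu_count = 8
print(torchrun_command("train.py", gpu_count))
```

The resulting string can be passed as command alongside the same gpu_count value.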
Two ways to run distributed jobs on Lepton
| Path | When to use |
|---|---|
| TAO SDK create_job(num_nodes=N) (this skill) | Programmatic submission from agent code; you want the SDK's S3 wrapping, monitoring, failure analysis, and JobStore. |
| Lepton "Torchrun" job type (Lepton UI / lep CLI) | Hand-crafted submission via the Lepton console. Lepton's UI has a first-class "Torchrun" mode that wires up the rendezvous for you, so no bootstrap script is needed. See the official example under Reference reading. |
Reference reading
- NVIDIA's Lepton multi-node PyTorch example (UI / Torchrun mode): https://docs.nvidia.com/dgx-cloud/lepton/examples/batch-job/distributed-training-with-pytorch/
- The translation script the SDK sources: https://github.com/leptonai/scripts/blob/main/lepton_env_to_pytorch.sh
- PyTorch distributed (env-var rendezvous): https://pytorch.org/docs/stable/elastic/run.html
- NCCL networking tuning: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html
Notes
- Prefer dedicated_node_group for multi-node to keep replicas on the same low-latency interconnect (NVLink / InfiniBand).
- If a replica is preempted on a shared node group, the whole job fails, because Lepton doesn't elastically restart in v1. Use a dedicated node group for long runs.
- For Lustre-backed datasets, the same mount is exposed to every replica, so no per-replica I/O wrapping is needed.
Cloud Storage
Even though the platform is Lepton, the storage layer is S3-compatible. Always use aws as the cloud_metadata key and s3:// as the URI protocol for both datasets and results_dir.
- Correct: s3://bucket-name/path
- Incorrect: lepton://bucket-name/path
The container's get_cloud_storage_class_object() parses the URI protocol to look up credentials in CLOUD_METADATA[protocol][bucket].
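The URI rule can be checked before submission. A minimal sketch using the standard library; parse_s3_uri is a hypothetical helper and does not reproduce the container's get_cloud_storage_class_object() logic:

```python
from urllib.parse import urlparse


def parse_s3_uri(uri):
    """Split an s3:// URI into (bucket, key); reject any other protocol."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3":
        raise ValueError(f"expected an s3:// URI, got {uri!r}")
    if not parsed.netloc:
        raise ValueError(f"missing bucket name in {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


print(parse_s3_uri("s3://bucket-name/path"))  # prints ('bucket-name', 'path')

try:
    parse_s3_uri("lepton://bucket-name/path")
except ValueError as exc:
    print("rejected:", exc)
```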
Shared Storage (NFS/Lustre)
Node groups can have NFS or Lustre volumes attached. The SDK auto-detects these and mounts them into containers for persistent cross-job data sharing.
SDK Functions
- sdk.get_volumes(node_group_id=None): returns available volumes (name, from_path, type) from the node group spec
- sdk.get_storage_permissions(volume_name, node_group_id): returns allowed path prefixes for a volume
LeptonSDK.create_job() calls these automatically to detect mounts and build the appropriate Mount objects for job specs.
How the script runner uses mounts
When a Lustre mount is available:
- Inputs: S3 paths are mapped to Lustre (s3://bucket/path → /mnt/lustre/bucket/path). If the file exists on Lustre, it's used directly (zero download). If missing, it's downloaded from S3 to Lustre and persists for future jobs.
- Outputs: Results write to Lustre first (fast, persistent), then upload to S3 (durable). Downstream jobs (e.g., gap analysis) can read results directly from Lustre without an S3 round-trip.
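The input-side behaviour can be sketched as a path mapping plus a download-on-miss step. This is an illustration under assumptions, not the script runner's code: s3_to_lustre_path and resolve_input are hypothetical names, and the existence check and downloader are passed in as callables so the sketch stays independent of any storage client.

```python
import posixpath
from urllib.parse import urlparse


def s3_to_lustre_path(uri, mount_root="/mnt/lustre"):
    """Map s3://bucket/path to <mount_root>/bucket/path."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"expected s3://bucket/path, got {uri!r}")
    return posixpath.join(mount_root, parsed.netloc, parsed.path.lstrip("/"))


def resolve_input(uri, exists, download, mount_root="/mnt/lustre"):
    """Return the Lustre path for `uri`, downloading only on a cache miss."""
    local = s3_to_lustre_path(uri, mount_root)
    if not exists(local):
        download(uri, local)  # the copy persists on Lustre for later jobs
    return local


print(s3_to_lustre_path("s3://bucket/path"))  # prints /mnt/lustre/bucket/path
```

In real use, exists could be os.path.exists and download a wrapper around whichever S3 client is available in the container.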
Volume preference order
lustre > filestore > first available
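The preference order can be expressed as a small selection function over the list returned by sdk.get_volumes. A sketch only: pick_volume is a hypothetical helper, and matching on the lower-cased volume name (rather than the type field) is an assumption.

```python
def pick_volume(volumes, preferred=("lustre", "filestore")):
    """Pick a volume using the order lustre > filestore > first available."""
    if not volumes:
        return None
    for wanted in preferred:
        for volume in volumes:
            # Assumption: preference is matched against the volume name.
            if str(volume.get("name", "")).lower() == wanted:
                return volume
    return volumes[0]


volumes = [
    {"name": "scratch", "from_path": "/scratch", "type": "NFS"},
    {"name": "lustre", "from_path": "/lustre", "type": "Lustre"},
]
print(pick_volume(volumes)["name"])  # prints lustre
```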
Lustre Cache Invalidation
Lustre caches files persistently across jobs. There is no built-in invalidation. If upstream data changes but the S3 path stays the same, Lustre serves the stale cached version. To force a cache miss:
- Rename the file on S3 (e.g., prompt_v2.txt instead of overwriting prompt.txt)
- Use a new storage_root between iterations to avoid cross-iteration staleness
- Use a new path for any regenerated artifacts
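One way to make the renaming systematic is to derive versioned names and per-iteration roots in code. A minimal sketch; versioned_key and iteration_root are hypothetical helpers, and the iter_NNN naming is an illustrative convention rather than anything the SDK requires.

```python
import posixpath


def versioned_key(key, version):
    """Insert a _v<N> suffix before the file extension."""
    stem, ext = posixpath.splitext(key)
    return f"{stem}_v{version}{ext}"


def iteration_root(storage_root, iteration):
    """Give each iteration its own storage root so cached files never collide."""
    return f"{storage_root.rstrip('/')}/iter_{iteration:03d}"


print(versioned_key("prompt.txt", 2))  # prints prompt_v2.txt
print(iteration_root("s3://bucket/experiments", 3))  # prints s3://bucket/experiments/iter_003
```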
Monitoring
Job Status
Use sdk.get_job_status(job_id) for high-level status (Pending, Running, Complete, Error).
Replica Status
Use sdk.get_job_replicas(job_id) during startup for detailed replica-level info. Each replica is a dict:
replicas = sdk.get_job_replicas(job_id)
for r in replicas:
    node = r["status"]["node"]["name"]  # e.g., "node-ip-10-50-111-24"
    node_group = r["status"]["node"]["node_group_id"]
    cpu = r["status"]["cpu"]  # e.g., 2
    memory_mb = r["status"]["memory_in_mb"]  # e.g., 8192
    readiness = r["status"].get("readiness_issue")
    if readiness:
        reason = readiness["reason"]  # "InProgress", "Failed", "ConfigError"
        message = readiness["message"]  # "Pulling image", "Mount point not found", etc.
Key readiness_issue patterns:
- reason="InProgress", message="Pulling image": image pull in progress (normal for large images)
- reason="Failed": image pull failed (check NGC_KEY)
- reason="ConfigError": node issue (mount failure, GPU error)
- No readiness_issue: replica is running
Replica status is especially useful when a job is stuck in Pending — it reveals whether the issue is image pulling, resource scheduling, or node health.
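The readiness patterns above can be folded into a small classifier for use in monitoring loops. A sketch that assumes the replica dict layout shown earlier; classify_replica and its return labels are hypothetical.

```python
def classify_replica(replica):
    """Map a replica dict to a coarse startup state."""
    issue = (replica.get("status") or {}).get("readiness_issue")
    if not issue:
        return "running"
    reason = issue.get("reason")
    if reason == "InProgress":
        return "starting"  # e.g. still pulling the image
    if reason == "Failed":
        return "image_pull_failed"  # check NGC_KEY
    if reason == "ConfigError":
        return "node_issue"  # mount failure, GPU error
    return "unknown"


example = {"status": {"readiness_issue": {"reason": "InProgress", "message": "Pulling image"}}}
print(classify_replica(example))  # prints starting
```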
Job Logs
Use sdk.get_job_logs(job_id, tail=N) for the most recent N log lines. Logs are fetched from Lepton's log collection service.
Parallel Jobs
For workflow stages that run in parallel (e.g., video generation x8):
- Launch: Call execute_step(plan, step_id, extra_args={"split_id": i}) for each split. Each call returns immediately with a job_id.
- Monitor: Poll all jobs with sdk.get_job_status(job_id) for each. Use get_job_replicas(job_id) for startup diagnostics.
- Completion: All jobs are done when every status is Complete or Error.
- Partial failure: Retry only failed splits; successful splits don't need re-running. Pass the same split_id to execute_step.
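The monitor, completion, and partial-failure steps can be sketched as a polling loop. This assumes sdk.get_job_status returns the status as a plain string ("Pending", "Running", "Complete", "Error"); wait_for_all and failed_splits are hypothetical helpers, and the sleep function is injectable so the loop can be exercised without waiting.

```python
import time

TERMINAL = {"Complete", "Error"}


def wait_for_all(sdk, jobs, poll_seconds=30, sleep=time.sleep):
    """Poll until every job is terminal.

    jobs maps split_id -> job_id; the result maps split_id -> final status.
    """
    final = {}
    pending = dict(jobs)
    while pending:
        for split_id, job_id in list(pending.items()):
            status = sdk.get_job_status(job_id)
            if status in TERMINAL:
                final[split_id] = status
                del pending[split_id]
        if pending:
            sleep(poll_seconds)
    return final


def failed_splits(final):
    """Split IDs that should be resubmitted with the same split_id."""
    return sorted(s for s, status in final.items() if status == "Error")
```

A production loop would likely add an overall timeout, which this sketch omits.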
Failure Analysis
When a job fails, use sdk.get_failure_analysis(job_id) for automatic root cause detection:
analysis = sdk.get_failure_analysis(job_id)
if analysis:
    print(analysis["err_class"])  # e.g., "ERR_PROGRAM"
    print(analysis["suggestion"])  # Human-readable fix
    for event in analysis.get("job_failure_by_node_event", []):
        print(event["node_event_name"], event["message"])
        # e.g., "OOM", "OOM encountered, victim process: cosmos-rl-evalu, pid: 3368483"
Returns:
- err_class: Error classification (ERR_PROGRAM, ERR_INFRA, etc.)
- suggestion: What likely went wrong and how to fix it
- job_failure_by_node_event: Node-level events (OOM kills, GPU errors, mount failures)
- log_streams: Relevant log snippets with error context
Always call this on failed jobs before retrying — it distinguishes user errors (bad config, OOM) from infrastructure issues (node failure, eviction).
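A retry policy built on the analysis might look like the sketch below. It rests on assumptions: that ERR_INFRA marks infrastructure issues and other classes mark user errors, and that an OOM node event should be treated as a user error even if the class suggests otherwise. next_action and its return labels are hypothetical.

```python
def next_action(analysis):
    """Suggest what to do with a failed job based on its failure analysis."""
    if not analysis:
        return "inspect_logs"  # nothing detected automatically
    events = [
        event.get("node_event_name")
        for event in analysis.get("job_failure_by_node_event", [])
    ]
    if "OOM" in events:
        return "reduce_memory_then_retry"
    if analysis.get("err_class") == "ERR_INFRA":
        return "retry_unchanged"
    return "fix_config_then_retry"


example = {
    "err_class": "ERR_PROGRAM",
    "suggestion": "reduce batch size",
    "job_failure_by_node_event": [{"node_event_name": "OOM", "message": "OOM encountered"}],
}
print(next_action(example))  # prints reduce_memory_then_retry
```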
Failure Modes
OOM killed: Container exceeded GPU or system memory. Detection: get_failure_analysis() returns node_event_name: "OOM". Common causes: evaluation.batch_size too high, max_length too large for available KV cache. Recovery: reduce batch_size, add GPUs with tensor parallelism, or reduce max_length.
Image pull failure: The TAO container image cannot be pulled from nvcr.io. Usually caused by a missing or expired image pull secret. The SDK auto-provisions the secret from NGC_KEY, but if NGC_KEY is invalid, the job will fail. Detection: check get_job_replicas() — readiness_issue.reason will show InProgress with message = "Pulling image" for extended periods, or Failed if the pull fails. Recovery: verify NGC_KEY is valid.
Resource unavailable: The requested GPU shape is not available, so the job stays queued indefinitely. Detection: status remains Pending for more than 15 minutes and replicas show no node assignment. Recovery: try a different resource_shape or dedicated_node_group, or wait for resources.
Auth failure: Invalid or expired LEPTON_AUTH_TOKEN. All API calls fail with 401/403. Detection: job creation raises an exception immediately. Recovery: refresh the token and reinitialize the SDK.
Unhealthy node: The assigned node has infrastructure issues (mount failures, GPU errors, network problems). Detection: check get_job_replicas() — readiness_issue.reason = "ConfigError" with messages like "Mount point not found". The job stays Pending indefinitely on the bad node. Recovery: cancel the job and resubmit — Lepton will schedule on a different node. If the issue recurs, try a different dedicated_node_group or resource_shape.
Job eviction: On shared node groups, Lepton may evict jobs under resource pressure. Detection: job unexpectedly transitions from Running to Error. Recovery: retry, or use a dedicated_node_group.
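The "Resource unavailable" detection rule can be written as a predicate over the job status, the time spent pending, and the replica list. A sketch under assumptions: looks_resource_starved is a hypothetical helper, the caller tracks the elapsed time itself, and an unassigned replica is taken to be one whose status has no node entry.

```python
def looks_resource_starved(status, seconds_pending, replicas, threshold_seconds=15 * 60):
    """True when a job has been Pending past the threshold with no node assigned."""
    if status != "Pending" or seconds_pending <= threshold_seconds:
        return False
    # Assumption: a scheduled replica carries a non-empty status["node"].
    return not any((r.get("status") or {}).get("node") for r in replicas)


print(looks_resource_starved("Pending", 20 * 60, [{"status": {}}]))  # prints True
```

A Pending job whose replicas do have a node is more likely to fall under the image pull or unhealthy node cases above.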