SLURM
Remote GPU compute platform for clusters managed by SLURM. Jobs are submitted
from the TAO service or SDK host to a login node over SSH, staged on a shared
filesystem, submitted with sbatch, and executed with srun container support.
Use SLURM when the user has access to a managed GPU cluster, shared Lustre storage, and scheduler-owned GPU allocation. Do not use SLURM for local files that exist only on the agent machine; data and outputs must be reachable from the cluster.
Preflight
# 1. SSH to the login node works without a password prompt
SLURM_HOST="${SLURM_HOSTNAME%%,*}"
[ -n "$SLURM_USER" ] && [ -n "$SLURM_HOST" ] || {
  echo "MISSING: set SLURM_USER and SLURM_HOSTNAME (comma-separated for failover) in your env (~/.config/tao/.env)."
  exit 1
}
ssh -o BatchMode=yes -o ConnectTimeout=10 "${SLURM_USER}@${SLURM_HOST}" "true" 2>/dev/null || {
  echo "MISSING: passwordless SSH to ${SLURM_USER}@${SLURM_HOST} not working. See references/ssh-setup.md."
  exit 1
}
# 2. Optional: TAO SDK wrapper for Job handles + S3 wrapping.
# nvidia-tao-sdk is on public PyPI; pin lives in versions.yaml (wheels.tao_sdk_slurm).
PIN=$("${TAO_SKILL_BANK_PATH:?}/scripts/resolve_versions_key.py" wheels.tao_sdk_slurm)
python -c "import tao_sdk" 2>/dev/null || {
  echo "MISSING: nvidia-tao-sdk not installed. Run:"
  echo "  pip install \"$PIN\""
  exit 1
}
If a check fails, the agent prompts the user to authorize the install/fix via Bash.
A third preflight step applies only for private nvcr.io images: Pyxis on
the compute nodes needs persistent enroot credentials in
~/.config/enroot/.credentials on the cluster (it does NOT read NGC_KEY from
the job env). Without them, auth-gated pulls fail with "Could not process JSON
input" at job startup. This runs once per (cluster, user). See
references/ssh-setup.md for the full check and the printf | ssh install
pattern that keeps NGC_KEY out of history, files, and chat output. Skip it for
public images.
Prerequisites
Before any job is submitted, the host running the TAO service or SDK must log in
to at least one host from SLURM_HOSTNAME over SSH without an interactive
password prompt. The handler runs sbatch, squeue, sacct, scancel, and
log tails non-interactively, so password or 2FA prompts will fail the job at
submit or status time.
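The preflight snippet probes only the first entry of SLURM_HOSTNAME. The sketch below shows one way to walk the whole comma-separated list and return the first host that accepts non-interactive SSH. The function names and the PROBE_CMD override are hypothetical (they are not part of tao-core or the SDK); the override exists only so the probe can be stubbed.

```shell
# Sketch only: select the first login host that accepts key-based SSH.
# probe_host, first_reachable_host, and PROBE_CMD are illustrative names.

probe_host() {
    # $1 = user, $2 = host. Succeeds only if login works with no prompt.
    ssh -o BatchMode=yes -o ConnectTimeout=10 "$1@$2" true 2>/dev/null
}

first_reachable_host() {
    # $1 = user, $2 = comma-separated host list (the SLURM_HOSTNAME format)
    local user="$1" hosts="$2" host
    local probe="${PROBE_CMD:-probe_host}"
    local IFS=','
    for host in $hosts; do
        # Drop stray whitespace such as "login1, login2".
        host=$(printf '%s' "$host" | tr -d '[:space:]')
        [ -n "$host" ] || continue
        if "$probe" "$user" "$host"; then
            printf '%s\n' "$host"
            return 0
        fi
    done
    return 1
}

# Usage (requires a real cluster):
#   SLURM_HOST=$(first_reachable_host "$SLURM_USER" "$SLURM_HOSTNAME") || exit 1
```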
Set this up once per (host, login node, user) tuple: create an SSH keypair,
install the public key on each login host, trust the host key, restrict the
private key's permissions with chmod 600, and verify with ssh -o BatchMode=yes .... See
references/ssh-setup.md for the full step-by-step (including the ~/.ssh/config
alias, the container key-mount note, and the 2FA / SSH_AUTH_SOCK fallback). The
same file holds the SSH failure remediation prompt to show the user when
passwordless SSH fails.
Credentials
- SLURM_USER (required): SSH username for the login node. In microservices
  workspace metadata this is cloud_specific_details.slurm_user.
- SLURM_HOSTNAME (required): Comma-separated login hostnames for failover.
  Microservices schema stores this as the list field
  cloud_specific_details.slurm_hostname.
- SLURM_PARTITION (required): Partition list for GPU job submission. Ask
  for this in the mandatory SLURM intake list. The packaged default is
  polar,polar3,polar4,grizzly, which are treated as 4-hour queues.
- SSH_KEY_PATH (preferred and expected before launch): private key path for
  non-interactive public-key auth to the login node. If passwordless SSH fails,
  ask the user for SSH_KEY_PATH=/path/to/private_key and show the setup steps
  in references/ssh-setup.md; do not bury this behind several alternate choices.
- SSH_AUTH_SOCK (advanced fallback): SSH agent socket with an accepted key
  already loaded. Prefer SSH_KEY_PATH in user-facing remediation prompts.
- SLURM_BASE_RESULTS_DIR (optional): Base shared filesystem path. Default
  convention from tao-core is /lustre/fsw/portfolios/edgeai/<your-dir>, where
  <your-dir> is your per-user directory on the cluster.
- SLURM_ACCOUNT (usually required by site policy): Account charged by
  #SBATCH --account.
Do not ask for SLURM_ACCOUNT or SLURM_BASE_RESULTS_DIR in the initial
intake unless the user says their site requires an account, wants a custom
results root, or the workflow cannot proceed without overriding defaults.
Backend Details
Use backend_details.backend_type = "slurm" when routing a job to this
platform. Supported backend details from the microservices schema:
{
  "backend_type": "slurm",
  "partition": "polar,polar3,polar4,grizzly",
  "cluster_name": "optional-name"
}
Runtime metadata is stored under backend_details.slurm_metadata, especially
slurm_job_id and job_dir. Do not invent these values. They are written
after sbatch returns a scheduler job id.
Storage
SLURM jobs run on the cluster, so local paths from the API host are not valid dataset paths. Prefer shared filesystem URIs:
- Use lustre:///absolute/path for user-provided datasets on Lustre.
- slurm:// paths may appear in microservices metadata and are converted to
  actual Lustre paths before the container starts.
- Avoid bare /local/path and file:// dataset URIs for SLURM. Validation in
  tao-core rejects local and file paths for remote backends.
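The URI rules above can be sketched as a small classifier. This is an illustration, not the tao-core validator: it assumes that lustre:///absolute/path maps to /absolute/path by dropping the scheme, that bare paths under /lustre count as shared storage, and that slurm:// URIs are passed through for later conversion. The function name is hypothetical.

```shell
# Sketch only: classify a dataset URI for a SLURM job.
# normalize_dataset_uri is an illustrative name, not a tao-core function.

normalize_dataset_uri() {
    case "$1" in
        lustre:///*)
            # Assumption: strip the scheme to get the absolute Lustre path.
            printf '%s\n' "${1#lustre://}" ;;
        slurm://*)
            # Left untouched; converted to a Lustre path before the container starts.
            printf '%s\n' "$1" ;;
        /lustre/*)
            # Assumption: bare paths under /lustre are on shared storage.
            printf '%s\n' "$1" ;;
        *)
            # file:// URIs and other local paths are not reachable from the cluster.
            echo "rejected: '$1' is not on shared cluster storage" >&2
            return 1 ;;
    esac
}
```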
Accept either dataset roots or direct spec-key paths:
- Root mode: /lustre/.../<model>/train, which model skills map to required
  files such as <root>/annotations.json and <root> as media path.
- Direct spec mode: exact fields such as
  custom.train_dataset.annotation_path=/lustre/.../train.json and
  custom.train_dataset.media_path=/lustre/.../videos.tar.gz.
After passwordless SSH succeeds and before generating scripts, validate each required dataset file/path from the login host:
ssh -o BatchMode=yes <SLURM_USER>@<working-login-host> \
'test -e /lustre/.../annotations.json && test -e /lustre/.../media_or_archive'
If the remote test -e fails, stop and ask for corrected paths or for the data
to be staged onto shared cluster storage. Do not create runner scripts that will
fail inside the first training job.
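When several dataset files must be checked, the remote command can be assembled from a list of paths. The following is a minimal sketch; shell_quote, build_remote_check, and check_remote_paths are hypothetical helper names, and the quoting assumes a POSIX shell on the login node.

```shell
# Sketch only: build and run a chained "test -e" check on the login host.

shell_quote() {
    # Wrap a string in single quotes, escaping embedded single quotes.
    printf "'%s'" "$(printf '%s' "$1" | sed "s/'/'\\\\''/g")"
}

build_remote_check() {
    # Joins one "test -e <path>" per argument with "&&".
    local cmd="" path
    for path in "$@"; do
        cmd="${cmd:+$cmd && }test -e $(shell_quote "$path")"
    done
    printf '%s\n' "$cmd"
}

check_remote_paths() {
    # $1 = user@host, remaining args = paths that must exist on the cluster
    local target="$1"
    shift
    [ "$#" -gt 0 ] || return 1
    ssh -o BatchMode=yes "$target" "$(build_remote_check "$@")"
}
```

Chaining with && keeps the check to a single SSH round trip; to learn which path is missing, run the tests one at a time instead.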
Results default to:
/lustre/fsw/portfolios/edgeai/<your-dir>/results/<job_id>
<your-dir> is your per-user directory on the cluster.
The runner sets TAO_API_RESULTS_DIR to the parent results directory because
container code appends the job id when writing status and artifacts.
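The relationship between the two paths can be sketched as follows. The helper names are hypothetical, and "alice" and "job123" in the usage comment are placeholder values, not real defaults.

```shell
# Sketch only: derive the results paths from a base directory and job id.

api_results_dir() {
    # $1 = base dir; this parent directory is what TAO_API_RESULTS_DIR holds.
    printf '%s/results\n' "${1%/}"
}

job_results_dir() {
    # $1 = base dir, $2 = job id; where the container code ends up writing.
    printf '%s/%s\n' "$(api_results_dir "$1")" "$2"
}

# Usage with placeholder values:
#   export TAO_API_RESULTS_DIR="$(api_results_dir /lustre/fsw/portfolios/edgeai/alice)"
#   job_results_dir /lustre/fsw/portfolios/edgeai/alice job123
```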
Use Lustre, not S3, for SLURM job inputs. SLURM's scheduler enforces a GPU-idle timeout: a long
s3:// download at the top of the script can burn the allocation before training begins, and the scheduler may kill the job. Stage training data onto Lustre first; S3 / HF / NGC pre-fetch is fine only for small auxiliary inputs (checkpoints, configs). See references/sdk-usage.md for the full rationale.
Container Execution
tao-core uses the SLURM handler to run TAO containers through Pyxis/Enroot:
- Stage compact JSON files for specs, environment, and cloud metadata under
  <job_dir>/specs, <job_dir>/env, and <job_dir>/meta.
- Optionally convert the Docker image to a cached SQSH image with
  srun -n1 -p <conversion_partition> enroot import.
- Write an sbatch script under <job_dir>/sbatch/job_<job_id>.sbatch.
- Submit sbatch --export=ALL <script>.
- Run the container with
  srun --container-image=<image> --container-mounts=/lustre.
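For readers driving sbatch by hand, the script-writing step might look roughly like the sketch below. It is not the script tao-core generates: the directive set is a plausible minimum built from standard sbatch options and the Pyxis srun flags named above, and NUM_NODES, NUM_GPUS, and CPUS_PER_TASK are hypothetical environment variable names. Defaults mirror the packaged resource defaults.

```shell
# Sketch only: print a minimal sbatch script for a Pyxis container job.

render_sbatch() {
    # $1 = job name, $2 = container image, remaining args = command to run
    local job_name="$1" image="$2" minutes
    shift 2
    # sbatch accepts --time as a plain number of minutes.
    minutes=$(awk -v h="${SLURM_TIME_HOURS:-4}" 'BEGIN { printf "%d", h * 60 + 0.5 }')
    printf '#!/bin/bash\n'
    printf '#SBATCH --job-name=%s\n' "$job_name"
    printf '#SBATCH --partition=%s\n' "${SLURM_PARTITION:-polar,polar3,polar4,grizzly}"
    printf '#SBATCH --nodes=%s\n' "${NUM_NODES:-1}"
    printf '#SBATCH --gpus-per-node=%s\n' "${NUM_GPUS:-4}"
    printf '#SBATCH --cpus-per-task=%s\n' "${CPUS_PER_TASK:-16}"
    printf '#SBATCH --time=%s\n' "$minutes"
    # The account line is emitted only when the site requires one.
    [ -n "${SLURM_ACCOUNT:-}" ] && printf '#SBATCH --account=%s\n' "$SLURM_ACCOUNT"
    [ "${SLURM_USE_REQUEUE:-true}" = "true" ] && printf '#SBATCH --requeue\n'
    printf '\n'
    printf 'srun --container-image="%s" --container-mounts=/lustre %s\n' "$image" "$*"
}

# Usage: write the script to the job directory, then submit it.
#   render_sbatch my-job '<image>' python train.py > job.sbatch
#   sbatch --export=ALL job.sbatch
```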
Image formats accepted by the handler:
- /path/to/image.sqsh
- registry#image:tag
- docker://registry#image:tag
- ordinary registry/image:tag, which is converted to Pyxis form when needed
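The conversion of an ordinary reference into Pyxis form can be illustrated as below. This is a guess at the rule, not the handler's code: it assumes the first path component is a registry only when it looks like a hostname (contains a dot or a port, or is localhost), and it leaves SQSH paths, docker:// URIs, and references that already contain '#' unchanged. to_pyxis_image is a hypothetical name.

```shell
# Sketch only: rewrite registry/image:tag as registry#image:tag.

to_pyxis_image() {
    local image="$1" first
    case "$image" in
        *.sqsh|*'#'*|docker://*)
            # SQSH file, already in Pyxis form, or an explicit docker:// URI.
            printf '%s\n' "$image"
            return 0 ;;
    esac
    first="${image%%/*}"
    if [ "$first" = "$image" ]; then
        # No slash at all, e.g. "ubuntu:22.04": nothing to split.
        printf '%s\n' "$image"
        return 0
    fi
    case "$first" in
        *.*|*:*|localhost)
            # Hostname-like first component: treat it as the registry.
            printf '%s#%s\n' "$first" "${image#*/}" ;;
        *)
            printf '%s\n' "$image" ;;
    esac
}
```

For example, under these assumptions `to_pyxis_image nvcr.io/nvidia/example:1.0` prints `nvcr.io#nvidia/example:1.0` (the image name is a placeholder).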
SQSH conversion is cached by image name. For :latest images, cached SQSH is
used unless force_reconvert_latest is enabled.
Resource Mapping
Defaults from tao-core:
- num_nodes: 1
- num_gpus: 4
- max_num_gpus_per_node: 8
- cpus_per_task: 16
- time_hours: 4
- timeout_hours: 3.8
- max_time_hours: 4
- container_mounts: /lustre
- use_requeue: true
- use_sqsh: true
When generating launchers or wrapper scripts for SLURM, set the wall-time defaults explicitly from the packaged platform resource defaults:
export SLURM_TIME_HOURS="${SLURM_TIME_HOURS:-4}"
export SLURM_TIMEOUT_HOURS="${SLURM_TIMEOUT_HOURS:-3.8}"
Do not default to 12 hours on SLURM. If the user supplies a longer
SLURM_TIME_HOURS, verify that the selected partition supports it before
submitting. For the packaged default partition list
polar,polar3,polar4,grizzly, reject requests above 4 hours and ask for a
different partition only if the user actually wants a longer wall time.
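The wall-time rule can be enforced before submission with a check like the one below. The helper names are hypothetical; the 4-hour cap is passed as a parameter because it applies to the packaged default partitions and may differ elsewhere. awk handles the fractional hours, since shell arithmetic is integer-only.

```shell
# Sketch only: validate requested hours and format them for sbatch --time.

check_time_hours() {
    # $1 = requested hours, $2 = partition cap in hours (default 4).
    # Succeeds only when 0 < requested <= cap.
    awk -v h="$1" -v m="${2:-4}" 'BEGIN { exit !(h + 0 > 0 && h + 0 <= m + 0) }'
}

hours_to_slurm_time() {
    # Converts fractional hours to HH:MM:SS, rounding to the nearest second.
    awk -v h="$1" 'BEGIN {
        s = int(h * 3600 + 0.5)
        printf "%02d:%02d:%02d\n", int(s / 3600), int((s % 3600) / 60), s % 60
    }'
}

# Usage:
#   check_time_hours "$SLURM_TIME_HOURS" 4 || { echo "wall time exceeds partition limit"; exit 1; }
#   hours_to_slurm_time 3.8    # prints 03:48:00
```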
When num_gpus is greater than or equal to max_num_gpus_per_node, the
handler treats the request as exclusive per node and computes additional nodes
from total GPU count when necessary.
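One plausible reading of that rule is sketched below. It assumes ceiling division of the total GPU count by the per-node maximum; the actual tao-core computation may differ, and both function names are hypothetical.

```shell
# Sketch only: node count and exclusivity from a total GPU request.

is_exclusive() {
    # $1 = num_gpus, $2 = max_num_gpus_per_node (default 8)
    [ "$1" -ge "${2:-8}" ]
}

compute_num_nodes() {
    # $1 = num_gpus, $2 = max_num_gpus_per_node (default 8),
    # $3 = requested num_nodes (default 1)
    local num_gpus="$1" per_node="${2:-8}" requested="${3:-1}" needed
    # Assumption: enough nodes to hold every GPU, rounded up.
    needed=$(( (num_gpus + per_node - 1) / per_node ))
    if [ "$needed" -gt "$requested" ]; then
        printf '%s\n' "$needed"
    else
        printf '%s\n' "$requested"
    fi
}
```

Under these assumptions a 16-GPU request with 8 GPUs per node is exclusive and needs 2 nodes, while the default 4-GPU request stays on 1 shared node.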
For multi-node jobs (num_nodes > 1), the sbatch script exports WORLD_SIZE,
MASTER_ADDR, MASTER_PORT, NODE_RANK, and NUM_GPU_PER_NODE, and Cosmos-RL
has special multi-node role handling for controller, policy, and rollout
workers. See references/multi-node.md for the full sbatch directives, the
rendezvous env-var table and contract, and cluster requirements.
Monitoring
- Scheduler status comes from the stored SLURM job id via squeue or sacct.
- TAO terminal status comes from status.json in the shared results folder.
- If the user enabled chat monitoring, continue polling at the requested
  interval while the job is PENDING, RUNNING, or otherwise non-terminal. Do not
  stop after a fixed elapsed time such as 30 minutes; long queue waits are
  normal on shared GPU partitions.
- Do not send a final response for a non-terminal SLURM job when chat
  monitoring is enabled. A final response is a detach action; use it only if
  the user asked to detach/stop or the job reached terminal state.
- Logs are read over SSH from:
<job_dir>/slurm-logs/<slurm_job_name>-<slurm_job_id>/main.out
<job_dir>/slurm-logs/<slurm_job_name>-<slurm_job_id>/main.err
Status mapping:
- PENDING -> Pending
- RUNNING or COMPLETING -> Running
- COMPLETED -> check status.json
- FAILED, BOOT_FAIL, DEADLINE, OUT_OF_MEMORY, NODE_FAIL -> retry if logs match
  retriable infrastructure patterns, otherwise Error
- CANCELLED, PREEMPTED, REVOKED -> Canceled
- TIMEOUT -> Error
- SUSPENDED, STOPPED -> Paused
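The mapping can be written as a single case statement. In this sketch the labels CheckStatusJson, RetryOrError, and Unknown are hypothetical placeholders for the branches that need a further decision; the function name is also illustrative. The input normalization reflects how sacct can report states such as "CANCELLED by <uid>" or a truncated "CANCELLED+".

```shell
# Sketch only: map a SLURM job state to a TAO-facing status.

map_slurm_state() {
    local state="${1%% *}"   # keep the first word, e.g. "CANCELLED by 1234"
    state="${state%+}"       # drop sacct's truncation marker
    case "$state" in
        PENDING)                     echo "Pending" ;;
        RUNNING|COMPLETING)          echo "Running" ;;
        COMPLETED)                   echo "CheckStatusJson" ;;  # placeholder: read status.json
        FAILED|BOOT_FAIL|DEADLINE|OUT_OF_MEMORY|NODE_FAIL)
                                     echo "RetryOrError" ;;     # placeholder: inspect logs first
        CANCELLED|PREEMPTED|REVOKED) echo "Canceled" ;;
        TIMEOUT)                     echo "Error" ;;
        SUSPENDED|STOPPED)           echo "Paused" ;;
        *)                           echo "Unknown"; return 1 ;;
    esac
}
```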
Cancellation
Cancel by looking up backend_details.slurm_metadata.slurm_job_id and running
scancel <slurm_job_id> over SSH. Treat missing or already terminated SLURM
jobs as successful cancellation.
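A minimal sketch of that cancellation rule follows. The function name and the REMOTE_RUNNER override are hypothetical (the override exists so SSH can be stubbed), and the matched error strings are the messages scancel typically prints for unknown or finished jobs; verify them against your site's Slurm version.

```shell
# Sketch only: cancel a job over SSH, treating "already gone" as success.

cancel_slurm_job() {
    # $1 = user@host, $2 = numeric SLURM job id
    local target="$1" job_id="$2" out
    local runner="${REMOTE_RUNNER:-ssh -o BatchMode=yes}"
    case "$job_id" in
        ''|*[!0-9]*)
            echo "refusing to cancel: invalid job id '$job_id'" >&2
            return 2 ;;
    esac
    # $runner is intentionally unquoted so its options split into words.
    if out="$($runner "$target" "scancel $job_id" 2>&1)"; then
        return 0
    fi
    case "$out" in
        *"Invalid job id"*|*"already completing or completed"*)
            # The job is missing or already finished: nothing left to cancel.
            return 0 ;;
    esac
    printf '%s\n' "$out" >&2
    return 1
}
```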
Multi-node training (distributed)
SLURM is the platform of choice for large multi-node runs — pass num_nodes > 1
and the SDK handles the sbatch directives and PyTorch-distributed env vars
automatically. See references/multi-node.md for a worked create_job example,
the generated sbatch directives, the rendezvous env-var table (WORLD_SIZE,
NUM_GPU_PER_NODE, NODE_RANK, MASTER_ADDR, MASTER_PORT), the Cosmos-RL
role note, cluster requirements (Pyxis/Enroot, InfiniBand/NVLink, Lustre), and
upstream reference links.
Running via the TAO SDK
The SDK install is covered in Preflight — pip install 'nvidia-tao-sdk[slurm]'.
Use it when you want Job handles, the sbatch/squeue/sacct plumbing handled
for you, run-folder durability via ActionWorkflow, or convenient cloud-storage
I/O (s3://, hf_model://, ngc://). Without the SDK, drive sbatch and
srun yourself.
Auto-retry needs no user action: a background monitor polls squeue/sacct
and resubmits the staged script with sbatch on infrastructure-looking failures,
up to MAX_JOB_RETRIES = 10, while plain training failures surface immediately. In
addition, #SBATCH --requeue is set by default (SLURM_USE_REQUEUE, defaults
to true). See references/sdk-usage.md for the SlurmSDK / build_entrypoint
code example, the Lustre-not-S3 rule, the retriable-failure classification, and
the full auto-retry and requeue behavior.
Failure Modes
Common failures: SSH auth failure, local dataset path rejected, SQSH conversion
timeout, Pyxis/Enroot unavailable, and bad-node / transient GPU failures (which
the handler retries up to the configured limit). See
references/troubleshooting.md for the diagnosis and remediation of each.