TAO Inference Microservice
Instructions
To start an inference service:
- Collect required inputs (Section 1) and resolve the container image (Section 2).
- Build the job payload and inner command (Sections 3–4.1); use
references/code-templates.yaml→job_payload_builder. - Read
skills/platform/<platform>/SKILL.mdand start the container (Section 4.2). - Write the service registry and poll readiness (Section 4.3); use
references/code-templates.yaml→registry_write.<platform>andreadiness_check.
To send an inference request:
- Resolve which service receives the request per Section 6.0 (by
job_id, bynetwork_arch, or by explicit user choice when multiple services run — never silently default to"latest"when more than one service exists), then read the endpoint fromreferences/code-templates.yaml→request.registry_readwith the resolvedjob_id. - Before building the request body, prompt the user for the vLLM-style sampling parameters (Section 6.1). Present
max_tokens,top_p,temperature(and any per-arch extras) with their defaults; let the user override or skip each one to accept the default. Never silently use defaults. - Build and send the body per Section 6.2; handle the response per Section 6.3.
To stop a service: Read references/code-templates.yaml → stop.registry_read to resolve the job_id, read skills/platform/<platform>/SKILL.md, then follow Section 5.
Reference data (schemas, mappings, valid values — no instructions):
references/service.yaml— image mappings, validnetwork_archnames, job payload schema, env var names, secrets classification.references/request.yaml— endpoint definition, request field schema, response shapes, code examples.references/code-templates.yaml— Python templates for payload building, registry writes, readiness checks, and stop/request flows.
Secrets rule (applies to every generated code block in this skill)
Never ask the user to type a secret value into a prompt. For every secret value:
- Tell the user which environment variable to set (e.g.
export HF_TOKEN=...). - Generate code that reads it with
os.environ["VAR_NAME"]— never hard-code, interpolate, or prompt for the value.
Secret env vars (full list in references/service.yaml → secrets_handling):
HF_TOKEN, WANDB_API_KEY, CLEARML_API_ACCESS_KEY, CLEARML_API_SECRET_KEY, TAO_API_KEY, TAO_USER_KEY.
Safe to collect in the prompt: network_arch, model_path, num_gpus, prompt text, WANDB_* config URLs, CLEARML_*_HOST URLs.
1. What to collect from the user
| Input | Role |
|---|---|
network_arch | Chooses container image, the per-arch inner command shape (references/service.yaml → container_commands.<network_arch>), and neural_network_name in the job JSON when applicable. Must match a basename in valid_network_arch_config_basenames in references/service.yaml (e.g. cosmos-rl, cosmos-predict2.5). |
model_path | The trained model checkpoint. Valid forms: hf_model://<org>/<model> (HuggingFace Hub — set HF_TOKEN for gated models) or a local container filesystem path. Cloud URIs (s3://, gs://, az://) are NOT supported — the inference service has no cloud-storage dependency. Always ask the user; never substitute a placeholder. See references/service.yaml → model_path_protocols. |
platform | Compute platform: local-docker, brev, lepton, slurm, or kubernetes. |
num_gpus | Defaults to 1; minimum 1 for inference. |
2. Image resolution
Each network_arch has a sidecar config file named {network_arch}.config.json. Resolve the container image as follows:
- Read
{network_arch}.config.jsonand takeapi_params.image(e.g.COSMOS_RL). This is a key intodocker_image_defaults.mappinginreferences/service.yaml. - Look up that key in the mapping. If the host env var
IMAGE_<KEY>is set (e.g.IMAGE_COSMOS_RL), it overrides the mapped default. - The mapped value is normally a dotted key into the repo-root
versions.yamlmanifest (e.g.tao_toolkit.cosmos_rl). Resolve it to a concretenvcr.io/...image URI by looking upversions.yaml→images.<group>.<name>. Absolute URIs pass through unchanged, so anIMAGE_<KEY>env-var override that contains a full URI still works. The Python helper for this lives inreferences/code-templates.yaml. - If the config file is missing or
api_params.imageis empty, fall back to theCOSMOS_RLkey.
The config file also has spec_params.inference.model_path which drives folder vs file path semantics: if the value contains the substring folder, the container treats the path as a directory.
3. Environment variables (no callbacks)
Set these in env_payload before encoding env_json. Do not set TAO_LOGGING_SERVER_URL or TAO_ADMIN_KEY.
TAO_EXECUTION_BACKEND — must match the platform:
| Platform | TAO_EXECUTION_BACKEND value |
|---|---|
| local-docker | local-docker |
| brev | local-docker |
| lepton | lepton |
| slurm | slurm |
| kubernetes | local-k8s |
CLOUD_BASED — always "False" for this skill (disables callback posting to TAO_LOGGING_SERVER_URL).
GPU env vars — only needed when the platform skill does not handle GPU injection automatically:
- Tegra / Jetson:
--runtime=nvidiawithNVIDIA_DRIVER_CAPABILITIES=allandNVIDIA_VISIBLE_DEVICES=<ids>. - Standard x86 + nvidia-container-toolkit: use Docker
device_requests. The platform skill handles this.
4. Executing across platforms
The job payload and inner command (Sections 1–3) are platform-agnostic. For each platform, read skills/platform/<name>/SKILL.md for preflight checks and credentials before generating any execution code.
4.1 Build the inner command (per arch)
The inner-command shape is per network_arch — there is no uniform template. Look up the per-arch entry in references/service.yaml → container_commands.<network_arch>; if not present, the arch is unsupported — stop and ask. Pick the matching sub-block in references/code-templates.yaml → job_payload_builder.<network_arch>. Prefix the command with umask 0 && and keep it identical across platforms (local-docker, brev, lepton, slurm, kubernetes).
Common across arches:
job_id: freshuuid.uuid4()— becomes the container name and registry key.image: resolve per Section 2.- Secrets (
access_key,secret_key,HF_TOKEN, etc.) are read from env vars at runtime — never hard-code, never log or print.
Arch-specific notes (full details in references/service.yaml → container_commands):
cosmos-rl— single--job '<JOB_JSON>' --docker_env_vars '<ENV_JSON>'blob;json.dumps(...)+shlex.quote(...).env_payloadcarriesTAO_EXECUTION_BACKEND(per Section 3 table),TAO_API_JOB_ID,CLOUD_BASED=False. The inference service has no cloud-storage dependency;HF_TOKENis the only cred env var that ever applies (for gated HuggingFace models).cosmos-predict2.5— flag-stylecosmos_predict inference_microservice start ... --port 8080(nosetup.prefix; usestyro.conf.OmitArgPrefixes).--job/--docker_env_varsare not accepted. Translatemodel_pathto--checkpoint-path(local path) or--model <registered_key>(hf_model://); cloud URIs are rejected. The only cred env var that ever applies isHF_TOKENfor gated HuggingFace models. Per-request params (prompt, inference_type, num_output_frames, guidance, seed, num_steps, negative_prompt) go in the request body, not at startup.TAO_EXECUTION_BACKEND/TAO_API_JOB_ID/CLOUD_BASEDare unused and may be omitted.
4.2 Delegate execution to the platform skill
Read skills/platform/<platform>/SKILL.md and follow it to start the container.
Base parameters (all platforms):
| Parameter | Value |
|---|---|
image | resolved container image (Section 2) |
command | inner — the shell string built in Section 4.1 |
gpu_count | num_gpus |
env_vars | env_payload |
| job / container name | job_id — must equal the UUID from 4.1 so the registry can reference it |
host_port (local-docker, brev) | host-side port to bind to container port 8080. Default 8080, but must be unique per concurrent service — see the port-allocation rule below. |
Platform-specific additional inputs:
| Platform | Additional inputs |
|---|---|
| local-docker | None beyond base |
| brev | instance_id (optional — reuse an existing instance); on multi-credential / multi-workspace accounts also cloud_cred_id and workspace_group_id for first-create — see skills/platform/tao-run-on-brev/SKILL.md |
| lepton | resource_shape (GPU shape ID, e.g. gpu.8xh100-sxm); dedicated_node_group (optional) |
| slurm | partition and account — check SLURM_PARTITION/SLURM_ACCOUNT env vars; ask user if unset |
| kubernetes | namespace (default: default); image_pull_secret (required for nvcr.io images) |
Port binding (local-docker and brev): use direct docker run (not DockerSDK) so that -p <host_port>:8080 can be passed and the container name equals job_id exactly.
Port allocation rule (local-docker and brev, REQUIRED for concurrent services): Before starting a service, read the registry (/tmp/tao-inf-ms-state.json) and collect the set of host_port values from every existing entry on the same platform (and, for brev, the same instance_id). Pick the lowest free port starting from 8080 that is not in that set — e.g. host_port = next(p for p in range(8080, 8200) if p not in used_ports). The default 8080 only applies when no other service is running. This is what makes "start 3 services, each reachable at a distinct host_url" work; without it, services 2 and 3 fail with bind: address already in use. Lepton, SLURM, and kubernetes get distinct endpoints from their own platform mechanisms and do not need this step.
4.3 After start: service registry and endpoint
Write the service registry immediately after the platform confirms the container is running. The registry (/tmp/tao-inf-ms-state.json) is keyed by job_id; "latest" always points to the most recently started service.
See references/code-templates.yaml → registry_write.<platform> for the Python template.
| Platform | host_url | platform_job_id | Extra step before writing |
|---|---|---|---|
| local-docker | http://localhost:{host_port} | — | None |
| brev | http://{brev_ip}:{host_port} | — | brev ls → get instance IP (localhost is invalid on remote VM) |
| lepton | Lepton endpoint URL | job.id | Poll sdk.get_job_status until Running; get endpoint from console or lep job get <job.id> |
| slurm | http://localhost:{host_port} | SLURM scheduler job ID | Wait until Running; SSH port-forward localhost:{host_port}→{node}:8080 |
| kubernetes | http://{external_ip}:8080 | k8s job name | kubectl expose job … --type=LoadBalancer; wait for external IP |
After writing the registry, print the job_id and URL:
print(f"Inference service started.")
print(f" Job ID : {job_id}")
print(f" Arch : {network_arch}")
print(f" URL : {state[job_id]['host_url']}/v1/chat/completions")
print(f"Use this Job ID to send requests or stop the service.")
Then poll for readiness — see references/code-templates.yaml → readiness_check. The container loads the model in the background; do not send requests before it returns 200.
5. Stopping the inference service
Ask the user for the job_id to stop. If they don't provide one, default to state["latest"] and confirm which job_id is being stopped. Read the registry using references/code-templates.yaml → stop.registry_read, then read skills/platform/<platform>/SKILL.md and use its cancellation / stop mechanism.
| Platform | Identifier to pass | Extra cleanup |
|---|---|---|
| local-docker | job_id_to_stop — container name | None |
| brev | job_id_to_stop — container name | None |
| lepton | entry["platform_job_id"] — Lepton job ID | None |
| slurm | entry["platform_job_id"] — SLURM job ID | pkill -f "ssh.*-L.*{entry['host_port']}" |
| kubernetes | entry["platform_job_id"] — k8s job name | kubectl delete svc {entry["platform_job_id"]} -n <namespace> |
where entry = state[job_id_to_stop]. After stopping, clean up the registry: references/code-templates.yaml → stop.registry_cleanup.
6. Sending inference requests
6.0 Resolve which service receives this request (REQUIRED)
Each request must be routed to the specific service that runs the matching model. Routing happens by job_id — the registry stores network_arch per entry, so you can resolve a target by arch when the user names a model instead of a job_id. Apply these rules in order:
- User provided an explicit
job_id→ use it. Verify it exists instate. - User named a
network_arch(e.g. "send this to the cosmos-rl service") → look up matching entries:candidates = [j for j, e in state.items() if j != "latest" and isinstance(e, dict) and e["network_arch"] == arch].- Exactly one match → use it.
- Multiple matches → prompt the user with the candidate
job_ids and theirstarted_at; do not auto-pick. - No match → stop and tell the user no service for that arch is running.
- No
job_idand nonetwork_arch→ count non-"latest"entries instate:- Exactly one running service → use it.
- Two or more → do not silently default to
state["latest"]. Prompt the user with the full list (job_id,network_arch,host_url) and require an explicit choice. The"latest"pointer is a convenience for single-service workflows, not a routing fallback when multiple services coexist. - Zero → stop and tell the user to start a service first.
After resolving, read the endpoint from the registry (references/code-templates.yaml → request.registry_read), passing the resolved job_id as user_provided_job_id. Confirm to the user: "Sending to job_id=… arch=… url=…". If the service may still be loading, poll readiness first (references/code-templates.yaml → readiness_check).
Cross-check before sending: if the user-supplied request body contains arch-specific fields (e.g. guidance / num_steps / seed / negative_prompt → cosmos-predict2.5; required image_url/video_url content items → cosmos-rl), verify they are consistent with state[job_id]["network_arch"]. On mismatch, stop and ask — sending a cosmos-predict2.5 body to a cosmos-rl service will fail at the container with a 4xx/5xx that is harder to diagnose than catching it here.
6.1 Sampling parameters — REQUIRED user prompt before each request
Before constructing the request body, you MUST explicitly prompt the user for the vLLM-style sampling parameters. Do not silently apply defaults. Use a structured prompt (e.g. AskUserQuestion in Claude Code, one question per field) that:
- Lists every applicable field with its type and default value.
- Lets the user skip / accept any field to take that field's default — entering a value is never required.
- Collects all fields in one round.
After the prompt, apply each user-entered value verbatim and substitute the default for any skipped field. Do not invent values or silently clamp.
Field list, defaults, and per-arch applicability: references/request.yaml → chat_completions_request_body (base sampling fields: max_tokens, top_p, temperature) and network_arch_constraints.<network_arch> (per-arch overrides and extras such as guidance/num_steps/seed/negative_prompt for cosmos-predict2.5). If a field is marked unsupported for the active arch, do not prompt for it and do not include it in the body.
6.2 Request format
Send a POST to {BASE_URL}/v1/chat/completions with Content-Type: application/json and a timeout of at least 300 s. The body is OpenAI-compatible (vLLM chat completions); see references/request.yaml → chat_completions_request_body for the full field schema and content-item shapes (text / image_url / video_url), and code_examples for ready-to-run Python and curl samples.
Constraints: only the first user message is processed. No secret values in request bodies. Per-network constraints (e.g. cosmos-rl requires every request to include an image or video; cosmos-rl rejects data: URIs) are in references/request.yaml → network_arch_constraints.
6.3 Response handling
| HTTP status | Meaning | Action |
|---|---|---|
| 200 | Success — choices[0].message.content has the generated text | Read result |
| 202 | Server still initializing or model still loading | Retry after a delay |
| 503 | Initialization failed, model load failed, or model not yet ready | Inspect error.type: model_not_ready → retry; initialization_error / model_load_error → give up and check logs |
| 400 | Missing or empty JSON body | Fix request |
| 500 | Unhandled exception during inference | Check container logs |
For 202 and 503, the body contains {"error": {"type": "<error_type>", "message": "<reason>"}}. See container_response_shapes in references/request.yaml for error type strings.