TAO Train Foundation Stereo

Stereo depth estimation using FoundationStereo. Predicts disparity maps from stereo image pairs for 3D reconstruction. Use when training, evaluating, exporting, or running inference for a TAO FoundationStereo model. Trigger phrases include "train stereo depth", "FoundationStereo", "stereo disparity estimation", "3D reconstruction from stereo".

Published by @NVIDIA

Depth Net Stereo

Stereo depth estimation using FoundationStereo architecture. Predicts disparity maps from stereo image pairs for 3D reconstruction.

Uses pretrained Depth Anything v2 and EdgeNeXt encoders. Set model.stereo_backbone.depth_anything_v2_pretrained_path and model.stereo_backbone.edgenext_pretrained_path.

The mono and stereo skills both invoke the unified TAO depth_net CLI inside the container; the mono/stereo family is selected via model.model_type (e.g., FoundationStereo).

For TAO Deploy TensorRT actions (gen_trt_engine, TensorRT evaluate, and TensorRT inference), read references/tao-deploy-foundation-stereo.md first. The deploy spec template lives in this skill's references/spec_template_deploy.yaml.

Train Action Policy

This model is AutoML-enabled at the model layer. Before handling any train-stage request, read references/skill_info.yaml and resolve the run override from either an explicit automl_policy value or the user's workflow request. Treat phrases like "turn off AutoML", "disable AutoML", "no HPO", or "plain training" as automl_policy: off for this run only; otherwise default to auto. When automl_policy: auto, automl_enabled: true, and both schemas/train.schema.json and references/spec_template_train.yaml are packaged, route the train action through tao-skill-bank:tao-run-automl by default with this model's skill_dir. Preserve workflow/application overrides for datasets, specs, output directories, GPU/platform settings, parent checkpoints, and automl_policy. Use direct model training only when automl_policy: off or the packaged train schema/template is missing; in the missing-schema case, report that AutoML is enabled but not runnable for this model until schemas are generated.

Non-train actions such as evaluate, inference, export, and deploy flows stay in this model skill. The per-run automl_policy override does not change model metadata.
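The routing rule above can be sketched as a small decision function. This is an illustration only, not the skill bank's implementation: the names OFF_PHRASES, TrainRoute, resolve_policy, and route_train are hypothetical, and two points are assumptions rather than statements from the policy text: that an explicit automl_policy takes precedence over the request wording, and that a model with automl_enabled false trains directly.

```python
from dataclasses import dataclass

# Phrases the policy text treats as a per-run "AutoML off" request.
OFF_PHRASES = ("turn off automl", "disable automl", "no hpo", "plain training")


@dataclass
class TrainRoute:
    target: str      # "automl" (tao-skill-bank:tao-run-automl) or "direct"
    note: str = ""   # message to report back to the user, if any


def resolve_policy(explicit_policy, user_request):
    """Assumed precedence: an explicit automl_policy wins over request wording."""
    if explicit_policy in ("auto", "off"):
        return explicit_policy
    text = (user_request or "").lower()
    return "off" if any(phrase in text for phrase in OFF_PHRASES) else "auto"


def route_train(explicit_policy, user_request, automl_enabled,
                has_schema, has_template):
    """Decide where a train-stage request goes for this run only."""
    policy = resolve_policy(explicit_policy, user_request)
    if policy == "off" or not automl_enabled:
        return TrainRoute("direct")
    if has_schema and has_template:
        return TrainRoute("automl")
    # AutoML is on, but schemas/train.schema.json or the spec template is missing.
    return TrainRoute(
        "direct",
        "AutoML is enabled but not runnable for this model until schemas are generated",
    )
```

For example, route_train(None, "plain training on my KITTI data", True, True, True) resolves to direct training, while the same call with a neutral request resolves to the AutoML route.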

Workflow

Prerequisites — data accessibility

Your dataset (left + right images + GT disparity) must be reachable from inside the container:

  • SDK runner: place files at the S3 paths the runner resolves (the S3_TRAIN / S3_EVAL placeholders shown in Typical Spec Overrides). The runner handles S3 → container-path mounting transparently.
  • Direct docker run (e.g. local testing): mount the host dataset root read-only at the same in-container path:
docker run ... -v <host_data_root>:<host_data_root>:ro <container> ...

The same accessibility requirement applies to the <output_dir> written by all actions.

Step 1 — Annotation file

Per-line annotation file referenced by data_sources[*].data_file:

| Columns | Format | Use |
|---|---|---|
| 2 | <left> <right> | Stereo inference (no GT) |
| 3 | <left> <right> <disparity> | Stereo with GT |
| 4 | <left> <right> <disparity> <occlusion_mask> | Stereo with GT and occlusion mask |
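A quick pre-flight check of an annotation file can catch column mistakes before a container run. The sketch below is not part of depth_net. It assumes columns are whitespace-separated and paths contain no spaces, and it treats a file that mixes layouts as an error, which is a conservative assumption rather than documented behavior. Both function names are hypothetical.

```python
def classify_annotation_line(line):
    """Return the layout of one annotation line based on its column count."""
    columns = line.split()  # assumption: whitespace-separated, no spaces in paths
    kinds = {2: "inference", 3: "gt", 4: "gt+occlusion"}
    if len(columns) not in kinds:
        raise ValueError(f"expected 2, 3, or 4 columns, got {len(columns)}: {line!r}")
    return kinds[len(columns)]


def check_annotation_text(text):
    """Check that every non-empty line shares one layout and return that layout."""
    kinds = {classify_annotation_line(line)
             for line in text.splitlines() if line.strip()}
    if len(kinds) != 1:
        raise ValueError(f"mixed or empty annotation file: {sorted(kinds)}")
    return kinds.pop()
```

Reading the file with open(path).read() and passing the text to check_annotation_text is enough to use it on a real annotation file.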

If you already have one, point to it. Otherwise generate via depth_net convert:

depth_net convert -e <convert_spec.yaml>

convert_spec.yaml template (stereo):

data_root: <directory whose immediate children are scene folders that contain your image+depth files; convert walks data_root recursively but expects per-scene subdirectories at one level below>
image_dir_pattern: [<substring matching left image paths>]
right_dir_pattern: [<substring matching right image paths>]
depth_dir_pattern: [<substring matching GT disparity paths>]
nocc_dir_pattern: []                 # optional, occlusion mask paths
image_extension: '.png'  # always include the leading dot
depth_extension: '.png'  # form must match image_extension (the swap is a substring replace)
nocc_extension: ''
split_ratio: 0.0        # 0.0/1.0 = test-only; 0.8 = 80/20 train+val

convert walks data_root recursively, selects paths whose path-string contains all substrings in image_dir_pattern (AND-filter), then derives right / depth / mask paths by replacing image_dir_pattern[0] with the corresponding pattern's first element plus extension swap. Inspect your dataset's directory layout and identify the substrings distinguishing left, right, and GT (e.g. im0 vs im1 vs disp0GT for Middlebury).
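The selection and derivation rule described above can be mimicked in a few lines to preview which pairs a given set of patterns would produce. This is a sketch of the described behavior, not the code depth_net convert runs. The function name derive_paths is hypothetical, and applying the extension swap only to the disparity path is an assumption.

```python
def derive_paths(all_paths, image_dir_pattern, right_dir_pattern,
                 depth_dir_pattern, image_extension=".png",
                 depth_extension=".png"):
    """Preview (left, right, disparity) triples for a list of file paths."""
    rows = []
    for left in sorted(all_paths):
        # AND-filter: every substring in image_dir_pattern must occur in the path.
        if not all(sub in left for sub in image_dir_pattern):
            continue
        # Substring replace of the first image pattern by the first right pattern.
        right = left.replace(image_dir_pattern[0], right_dir_pattern[0])
        # Same replace for the disparity path, followed by the extension swap.
        depth = left.replace(image_dir_pattern[0], depth_dir_pattern[0])
        depth = depth.replace(image_extension, depth_extension)
        rows.append((left, right, depth))
    return rows
```

Because str.replace substitutes every occurrence, a pattern that also appears in a parent directory name would be rewritten there too, which is one reason to pick substrings that are distinctive within the whole path.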

Step 2 — Pair model_type and dataset_name based on your data

Prefer the dataset-specific class when your layout matches a supported one — it applies class-specific path conventions, evaluation crops, and (where applicable) occlusion-mask handling. Fall back to GenericDataset only for layouts that do not match any registered class.

| Data category | model_type | dataset_name |
|---|---|---|
| Middlebury data | FoundationStereo | Middlebury |
| KITTI data | FoundationStereo | Kitti |
| ETH3D data | FoundationStereo | Eth3d |
| FSD synthetic data | FoundationStereo | FSD |
| IsaacReal synthetic data | FoundationStereo | IsaacRealDataset |
| Crestereo synthetic data | FoundationStereo | Crestereo |
| Other / non-canonical layout | FoundationStereo | GenericDataset |

See Training Requirements → Formats for the full registered-class list. The same dataset_name value applies to the train and evaluate actions, both of which use 3-column or 4-column annotations with GT disparity. The deploy-side evaluate action follows the same rule (see references/tao-deploy-foundation-stereo.md).

For inference with 2-column annotations (left + right, no GT), use dataset_name: GenericDataset regardless of data layout: the dataset-specific classes (Middlebury / Kitti / Eth3d / FSD / IsaacRealDataset / Crestereo) require 3-column input and reject 2-column annotations at the dataloader level. For inference with 3-column annotations (left + right + GT), the dataset-specific class is fine.
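The pairing rule can be condensed into a helper that picks dataset_name from the action, the annotation column count, and the dataset layout. This is a minimal sketch under the rules stated in this step. The function name pick_dataset_name and the layout argument are hypothetical, and the lookup by lowercase name is only a convenience of the sketch.

```python
# Registered dataset-specific classes, keyed by lowercase name for lookup.
DATASET_CLASSES = {
    "middlebury": "Middlebury",
    "kitti": "Kitti",
    "eth3d": "Eth3d",
    "fsd": "FSD",
    "isaacrealdataset": "IsaacRealDataset",
    "crestereo": "Crestereo",
}


def pick_dataset_name(action, num_columns, layout=None):
    """Choose dataset_name for one data_sources entry."""
    if num_columns == 2:
        # No GT disparity: only inference can use it, and only via GenericDataset.
        if action != "inference":
            raise ValueError(f"{action} needs 3- or 4-column annotations with GT")
        return "GenericDataset"
    if num_columns not in (3, 4):
        raise ValueError(f"unsupported column count: {num_columns}")
    # Prefer the dataset-specific class; fall back for unrecognized layouts.
    return DATASET_CLASSES.get((layout or "").lower(), "GenericDataset")
```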

Step 3 — Write spec yaml from Typical Spec Overrides

Copy the action block from references/foundation-stereo-spec-overrides.md (per-action spec_overrides, mandatory data sources). Replace:

  • model.model_type from Step 2 (typically FoundationStereo)
  • dataset.<...>.data_sources[*].dataset_name from Step 2
  • dataset.<...>.data_sources[*].data_file with the path from Step 1
  • For deploy-side evaluate: enforce dataset.test_dataset.batch_size: 1 (see references/tao-deploy-foundation-stereo.md).
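As an illustration of the replacements listed above, the following sketch assembles the nested override structure for an evaluate run. Only keys named in this document are used. The helper name evaluate_overrides is hypothetical, and the authoritative block remains the one in references/foundation-stereo-spec-overrides.md.

```python
def evaluate_overrides(data_file, dataset_name, deploy=False):
    """Build spec overrides for evaluate as a nested dict (serialize to YAML separately)."""
    test_dataset = {
        # Each data_sources entry needs both data_file and dataset_name.
        "data_sources": [{"data_file": data_file, "dataset_name": dataset_name}],
    }
    if deploy:
        # Deploy-side (TensorRT) evaluate requires batch_size 1.
        test_dataset["batch_size"] = 1
    return {
        "model": {"model_type": "FoundationStereo"},
        "dataset": {"test_dataset": test_dataset},
    }
```

Keeping the overrides as a plain dict makes it easy to inspect them before writing the final spec file with whatever YAML library the runner already uses.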

Shape consistency: the crop_size in dataset.test_dataset.augmentation.crop_size should match export.input_height / input_width so the trained-model evaluator and the deploy-side TensorRT evaluator operate at the same shape — see references/foundation-stereo-troubleshooting.md.

Step 4 — Run

docker run --gpus 'device=0' --shm-size 16G --ipc=host \
  --user $(id -u):$(id -g) \
  -v <data_root>:<data_root>:ro \
  -v <output_dir>:<output_dir> \
  <container> \
  depth_net <action> -e <spec.yaml>

Without --user $(id -u):$(id -g) the container writes outputs as nobody:nogroup, blocking host-side cleanup / retry.
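When the run is launched from a script rather than a shell, the same command can be built as an argument list. The sketch below mirrors the flags shown in this step and nothing more. The function name docker_command is hypothetical, and os.getuid / os.getgid are only available on Unix-like hosts.

```python
import os
import shlex


def docker_command(container, action, spec_path, data_root, output_dir, gpu="0"):
    """Return the docker argv for one depth_net action, mirroring the shell form."""
    return [
        "docker", "run",
        "--gpus", f"device={gpu}",
        "--shm-size", "16G",
        "--ipc=host",
        # Run as the calling user so outputs stay removable on the host.
        "--user", f"{os.getuid()}:{os.getgid()}",
        # Dataset root is mounted read-only at the same in-container path.
        "-v", f"{data_root}:{data_root}:ro",
        "-v", f"{output_dir}:{output_dir}",
        container,
        "depth_net", action, "-e", spec_path,
    ]


if __name__ == "__main__":
    argv = docker_command("<container>", "train", "/out/spec.yaml", "/data", "/out")
    print(shlex.join(argv))  # printable form; pass argv to subprocess.run to execute
```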

Step 5 — Verify

  • Container exit code 0
  • status.json kpi block populated
  • For train: inspect per-step train_loss directly (the entrypoint reports Execution status: PASS even when loss is NaN)
  • For evaluate: rely on epe / bp1 / bp2 / bp3 / d1 / rmse (the evaluator also emits abs_rel / sq_rel / rmse_log which are non-meaningful for stereo — see references/foundation-stereo-parameters.md)
  • For inference: artifacts under results_dir
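Because the entrypoint can report PASS while the loss is NaN, the train checks above are worth automating. The sketch below makes no assumption about how status.json or the training log is laid out: the caller is expected to extract the kpi block and the per-step train_loss values and pass them in. Both function names are hypothetical.

```python
import math


def first_bad_loss(losses):
    """Return (step, value) for the first NaN or infinite loss, or None."""
    for step, value in enumerate(losses):
        if not math.isfinite(value):
            return step, value
    return None


def verify_train(exit_code, kpi, losses):
    """Collect verification problems for a train run; an empty list means OK."""
    problems = []
    if exit_code != 0:
        problems.append(f"container exited with code {exit_code}")
    if not kpi:
        problems.append("status.json kpi block is empty")
    bad = first_bad_loss(losses)
    if bad is not None:
        problems.append(f"non-finite train_loss {bad[1]} at step {bad[0]}")
    return problems
```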

For TAO Deploy TensorRT actions (gen_trt_engine, TensorRT evaluate, and TensorRT inference), read references/tao-deploy-foundation-stereo.md first. Deploy spec templates live in this skill's references/ folder with the spec_template_deploy_*.yaml prefix.

Training Requirements

  • Valid dataset_name values for stereo data_sources (case-insensitive): FSD, IsaacRealDataset, Crestereo, Middlebury, Eth3d, Kitti, GenericDataset
  • Monitoring metric: val/loss

Per-Action Dataset Requirements

| Action | Spec Key | Source | Files | List? |
|---|---|---|---|---|
| evaluate | dataset.test_dataset.data_sources | eval_dataset | data_file: annotations.txt + dataset_name | Yes |
| inference | dataset.infer_dataset.data_sources | inference_dataset | data_file: annotations.txt + dataset_name | Yes |
| quantize | dataset.train_dataset.data_sources | train_datasets | data_file: annotations.txt + dataset_name | Yes |
| quantize | dataset.val_dataset.data_sources | eval_dataset | data_file: annotations.txt + dataset_name | Yes |
| quantize | dataset.quant_calibration_dataset.images_dir | train_datasets | images.tar.gz | No |
| train | dataset.train_dataset.data_sources | train_datasets | data_file: annotations.txt + dataset_name | Yes |
| train | dataset.val_dataset.data_sources | eval_dataset | data_file: annotations.txt + dataset_name | Yes |

Typical Spec Overrides

Data source overrides are mandatory for every action — the agent MUST construct data source paths from the Per-Action Dataset Requirements table above and include them in spec_overrides. Each data_sources entry is a dict with two mandatory fields: data_file and dataset_name.
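A small validator can enforce the two mandatory fields before the spec reaches the container. This is a minimal sketch: the function name check_data_sources is hypothetical, and the accepted names are the ones listed under Training Requirements, compared case-insensitively.

```python
# dataset_name values listed under Training Requirements (case-insensitive).
VALID_DATASET_NAMES = {
    "fsd", "isaacrealdataset", "crestereo",
    "middlebury", "eth3d", "kitti", "genericdataset",
}


def check_data_sources(entries):
    """Return a list of problems found in a data_sources list; empty means OK."""
    problems = []
    for index, entry in enumerate(entries):
        for key in ("data_file", "dataset_name"):
            if not entry.get(key):
                problems.append(f"data_sources[{index}] is missing {key}")
        name = entry.get("dataset_name")
        if name and name.lower() not in VALID_DATASET_NAMES:
            problems.append(f"data_sources[{index}] has unknown dataset_name {name!r}")
    return problems
```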

See references/foundation-stereo-spec-overrides.md for the full per-action spec_overrides blocks (train, evaluate, export, gen_trt_engine, inference, quantize) with S3_TRAIN / S3_EVAL placeholders.

Eval Dataset

Optional. Val dataset configured via dataset.val_dataset.data_sources (each entry needs data_file and dataset_name).

Important Parameters

Key defaults:

  • model.model_type: FoundationStereo (the only selectable type).
  • model.encoder: a top-level model field, not under stereo_backbone. The schema default is vitl, but the FS small NGC checkpoint requires vits, so override it explicitly.
  • model.max_disparity: default 416.
  • train.optim.lr: default 1e-4.
  • train.precision: fp32 (recommended) or fp16; bf16 is not supported.
  • export.batch_size: default -1.
  • The workers field is named workers, not num_workers.
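Several of these defaults correspond to mistakes that are easy to catch mechanically. The sketch below lints a spec held as a nested dict for three of them: encoder placed under stereo_backbone, an unsupported train.precision, and any key spelled num_workers. The function name lint_overrides is hypothetical, and the check is illustrative rather than a replacement for schema validation.

```python
def lint_overrides(spec):
    """Return a list of likely key mistakes in a nested spec dict."""
    problems = []
    model = spec.get("model", {})
    if "encoder" in model.get("stereo_backbone", {}):
        problems.append("model.stereo_backbone.encoder: move to top-level model.encoder")
    precision = spec.get("train", {}).get("precision")
    if precision not in (None, "fp32", "fp16"):
        problems.append(f"train.precision: {precision!r} is not fp32 or fp16")

    def walk(node, path):
        # Search the whole spec, since the document does not pin where workers lives.
        if isinstance(node, dict):
            for key, value in node.items():
                if key == "num_workers":
                    problems.append(f"{path}{key}: the field is named workers")
                walk(value, f"{path}{key}.")
        elif isinstance(node, list):
            for index, value in enumerate(node):
                walk(value, f"{path}{index}.")

    walk(spec, "")
    return problems
```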

See references/foundation-stereo-parameters.md for the full parameter glossary (all model.*, dataset.*, train.*, export.* fields with defaults and ranges) and the Evaluation Metrics reference (which epe / bp* / d1 / rmse to trust and why abs_rel / sq_rel / rmse_log are non-meaningful for stereo).

Multi-GPU / Multi-Node

Launch method: Lightning-managed (single python process, Lightning spawns workers).

| Spec Key | Description | Default |
|---|---|---|
| train.num_gpus | Number of GPUs | 1 |
| train.gpu_ids | GPU device indices | [0] |
| train.num_nodes | Number of nodes | 1 |
| train.distributed_strategy | ddp or fsdp | ddp |

Same DDP/FSDP behavior as depth-net-mono. Multi-node requires WORLD_SIZE, NODE_RANK, MASTER_ADDR, MASTER_PORT env vars.
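A launcher can confirm the multi-node environment before starting the container. This is a minimal sketch: the function name missing_multinode_env is hypothetical, and it only checks that the four variables named above are set, not that their values are consistent across nodes.

```python
import os

REQUIRED_MULTINODE_VARS = ("WORLD_SIZE", "NODE_RANK", "MASTER_ADDR", "MASTER_PORT")


def missing_multinode_env(num_nodes, environ=None):
    """Return the required env vars that are unset when train.num_nodes > 1."""
    environ = os.environ if environ is None else environ
    if num_nodes <= 1:
        return []  # single-node runs do not need these variables
    return [name for name in REQUIRED_MULTINODE_VARS if not environ.get(name)]
```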

Export / TRT Defaults

TRT data types FP32 / FP16. Static-shape ONNX (export.batch_size: 1) and batch-only dynamic ONNX (export.batch_size: -1) both support fp16; height and width are always pinned to the trace shape (H/W-dynamic engines are not supported — build separate engines per (H, W)). For the NGC release (576×960), set export.batch_size: 1, export.opset_version: 17, export.on_cpu: True.
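Since height and width are pinned to the trace shape, a deployment that serves several resolutions needs one export per distinct (H, W). The sketch below derives that list from the shapes in use. The function name export_plan is hypothetical, and reading 576×960 as height 576 by width 960 is an assumption.

```python
def export_plan(shapes, batch_size=1):
    """Return one export override block per distinct (height, width) pair."""
    plan = []
    for height, width in sorted(set(shapes)):
        plan.append({
            "export": {
                "input_height": height,
                "input_width": width,
                # 1 gives static-shape ONNX; -1 gives batch-only dynamic ONNX.
                "batch_size": batch_size,
            }
        })
    return plan


if __name__ == "__main__":
    # Assumption: (576, 960) is the NGC release shape as (height, width).
    for block in export_plan([(576, 960), (576, 960), (480, 640)]):
        print(block)
```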

See references/foundation-stereo-export-trt-hardware.md for the full export / TRT defaults (the opset-vs-on_cpu pairing rules, determinism notes, on_cpu GPU-memory thresholds) and the Hardware requirements. See references/tao-deploy-foundation-stereo.md for the three supported deploy paths and the validation table.

Full TAO Deploy reference: tao-deploy-foundation-stereo.

Error Patterns

Common issues:

  • Disparity overflow (reduce model.max_disparity).
  • Missing pretrained paths (set both model.stereo_backbone.depth_anything_v2_pretrained_path and model.stereo_backbone.edgenext_pretrained_path).
  • Key 'encoder' not in 'StereoBackBone' (encoder is the top-level model.encoder).
  • Key 'dataset_name' is not in struct (each data_sources entry needs both data_file and dataset_name).
  • bash: exec: depth_net_stereo: not found (the entrypoint is depth_net, with no suffix).

See references/foundation-stereo-troubleshooting.md for the full error patterns, the Shape consistency rule, and the pyt-vs-deploy crop_size discussion, which includes the Middlebury resolution guidance: the pyt evaluate path runs at native image resolution and ignores crop_size.

Spec Param / Parent Model Inference

Model-specific inference mappings belong in MD, not in config.json. Generated runners read these mappings and apply them with SDK helpers before create_job() (mirrors the old microservices infer_params.py flow). For parent_model / parent_model_folder, pass the upstream train/export/AutoML child job id as parent_job_id; the SDK lists the parent result folder, filters checkpoint artifacts, and returns the selected model file or folder. Do not add these mappings back to config.json and do not patch generated runner scripts to guess checkpoint paths.

See references/foundation-stereo-spec-param-inference.md for the full per-action inference-mapping table (train / evaluate / inference / export / gen_trt_engine / quantize, including the train pretrained-path link/destination and resume-checkpoint mappings).

Bundled with this artifact

26 files

Reference files that ship alongside this artifact. Agents pull these in only when the task needs them.
