Tao Train Fast Foundation Stereo

Real-time stereo depth estimation using FastFoundationStereo (FFS), the distilled bp2 commercial variant of FoundationStereo. Predicts disparity maps from stereo image pairs with ~10× lower latency than full FoundationStereo. Use when training, evaluating, exporting, or running inference for a TAO FastFoundationStereo (FFS) model. Trigger phrases include "train fast stereo", "real-time stereo disparity", "FastFoundationStereo", "distilled stereo depth".

Published by @NVIDIA, from NVIDIA/skills

Depth Net Fast Stereo

Real-time stereo depth estimation using FastFoundationStereo (FFS) — the bp2 commercial distilled variant of FoundationStereo. Predicts disparity maps from rectified stereo image pairs with per-layer pruned widths for real-time inference.

The mono / stereo / fast-stereo skills share the unified TAO depth_net CLI; FFS is selected via model.model_type: FastFoundationStereo. FFS differs from FoundationStereo only in pruned per-layer widths and a serialized forward path; everything else (entrypoint, action verbs, dataset classes, deploy chain) is identical to depth-net-stereo.

For TAO Deploy TensorRT actions (gen_trt_engine, TensorRT evaluate, TensorRT inference), read references/tao-deploy-fast-foundation-stereo.md first. The deploy spec template lives at references/spec_template_deploy.yaml.

Train Action Policy

This model is AutoML-enabled at the model layer. Before handling any train-stage request, read references/skill_info.yaml and resolve the run override from either an explicit automl_policy value or the user's workflow request:

  • Treat phrases like "turn off AutoML", "disable AutoML", "no HPO", or "plain training" as automl_policy: off for this run only; otherwise default to auto.
  • When automl_policy: auto, automl_enabled: true, and both schemas/train.schema.json and references/spec_template_train.yaml are packaged, route the train action through tao-skill-bank:tao-run-automl by default with this model's skill_dir.
  • Preserve workflow/application overrides for datasets, specs, output directories, GPU/platform settings, parent checkpoints, and automl_policy.
  • Use direct model training only when automl_policy: off or the packaged train schema/template is missing. In the missing-schema case, report that AutoML is enabled but not runnable for this model until schemas are generated.
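The routing rule above can be sketched in a few lines of Python. This is a minimal illustration, not part of the skill: the helper names (resolve_automl_policy, is_automl_packaged, choose_train_route) are hypothetical, and treating automl_enabled: false as "train directly" is an assumption the text does not spell out.

```python
from pathlib import Path

# Phrases the policy text treats as a per-run request to disable AutoML.
OFF_PHRASES = ("turn off automl", "disable automl", "no hpo", "plain training")


def resolve_automl_policy(explicit_policy, user_request):
    """Return the per-run policy ("off" or "auto"). Hypothetical helper."""
    if explicit_policy is not None:
        return explicit_policy
    text = (user_request or "").lower()
    return "off" if any(phrase in text for phrase in OFF_PHRASES) else "auto"


def is_automl_packaged(skill_dir):
    """True when both the train schema and the train spec template exist."""
    root = Path(skill_dir)
    return (
        (root / "schemas" / "train.schema.json").is_file()
        and (root / "references" / "spec_template_train.yaml").is_file()
    )


def choose_train_route(policy, automl_enabled, packaged):
    """Return (route, note) for a train-stage request."""
    # Assumption: a model with automl_enabled false also trains directly.
    if policy == "off" or not automl_enabled:
        return "direct", ""
    if packaged:
        return "tao-skill-bank:tao-run-automl", ""
    return "direct", "AutoML is enabled but not runnable until schemas are generated"
```

A caller would evaluate is_automl_packaged once per run and pass the result in, which keeps the routing decision itself free of filesystem access.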

Non-train actions such as evaluate, inference, export, and deploy flows stay in this model skill. The per-run automl_policy override does not change model metadata.

Two Use Cases

FFS ships with a pre-trained bp2 commercial checkpoint (model_best_bp2_serialize.pth).

  1. Raw deploy — use the bp2 ckpt as-is. Skip train; run inference / evaluate / export / gen_trt_engine directly with the bp2 file as the action's checkpoint.
  2. Finetune on user data — set train.pretrained_model_path to the bp2 file, train on user data, then verify + deploy on the resulting ckpt. The full 7-action sequence (train → evaluate pyt → inference pyt → export → gen_trt_engine → inference deploy → evaluate deploy) is supported.

Workflow

Prerequisites — data accessibility

Your dataset (left + right images + GT disparity for train / evaluate, left + right only for inference) must be reachable from inside the container:

  • SDK runner: place files at the S3 paths the runner resolves (S3_TRAIN / S3_EVAL placeholders shown in the spec overrides).
  • Direct docker run (e.g. local testing): mount the host dataset root read-only at the same in-container path:
docker run ... -v <host_data_root>:<host_data_root>:ro <container> ...

The same accessibility requirement applies to the <output_dir> written by all actions, and to the bp2 checkpoint path.

Step 1 — Annotation file

The annotation file referenced by data_sources[*].data_file lists one sample per line. Its schema is identical to depth-net-stereo:

Columns  Format                                       Use
2        <left> <right>                               Stereo inference (no GT)
3        <left> <right> <disparity>                   Stereo with GT
4        <left> <right> <disparity> <occlusion_mask>  Stereo with GT and occlusion mask

Generate via depth_net convert if needed; see the depth-net-stereo skill for convert_spec.yaml template.
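Before pointing data_file at an annotation file, it can help to check its column counts. The sketch below is a hypothetical pre-flight check, not a TAO utility; it assumes columns are whitespace-separated and that paths contain no spaces.

```python
# Column count -> intended use, following the annotation schema table.
USES = {
    2: "inference",      # <left> <right>
    3: "gt",             # <left> <right> <disparity>
    4: "gt+occlusion",   # <left> <right> <disparity> <occlusion_mask>
}


def classify_annotation_line(line):
    """Return the use implied by one annotation line's column count."""
    fields = line.split()
    if len(fields) not in USES:
        raise ValueError(
            f"expected 2, 3, or 4 columns, got {len(fields)}: {line!r}"
        )
    return USES[len(fields)]


def annotation_kinds(text):
    """Return the sorted set of uses found in an annotation file's text."""
    kinds = {
        classify_annotation_line(line)
        for line in text.splitlines()
        if line.strip()  # skip blank lines
    }
    return sorted(kinds)
```

If annotation_kinds returns more than one entry, the file mixes row types, which is worth a second look before training.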

Step 2 — Pair model_type and dataset_name based on your data

Use model_type: FastFoundationStereo for FFS. The dataset_name choice mirrors the stereo skill — pick the dataset-specific class when your layout matches a registered one, otherwise GenericDataset.

Data category          model_type            dataset_name
Middlebury             FastFoundationStereo  Middlebury
KITTI                  FastFoundationStereo  Kitti
ETH3D                  FastFoundationStereo  Eth3d
FSD synthetic          FastFoundationStereo  FSD
IsaacReal synthetic    FastFoundationStereo  IsaacRealDataset
Crestereo synthetic    FastFoundationStereo  Crestereo
Other / non-canonical  FastFoundationStereo  GenericDataset

For inference with 2-column annotations (left + right, no GT), use dataset_name: GenericDataset regardless of layout.
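The pairing rule can be written as a small lookup. This is an illustrative sketch only: the lowercase category keys are hypothetical labels chosen for the example, and pick_dataset_name is not a TAO function.

```python
# Hypothetical category labels mapped to the registered dataset_name values.
REGISTERED = {
    "middlebury": "Middlebury",
    "kitti": "Kitti",
    "eth3d": "Eth3d",
    "fsd": "FSD",
    "isaacreal": "IsaacRealDataset",
    "crestereo": "Crestereo",
}


def pick_dataset_name(category, has_ground_truth=True):
    """Choose dataset_name for one data_sources entry."""
    # 2-column annotations (no GT) use GenericDataset regardless of layout.
    if not has_ground_truth:
        return "GenericDataset"
    # Unregistered layouts fall back to GenericDataset.
    return REGISTERED.get(category.strip().lower(), "GenericDataset")
```

For example, pick_dataset_name("KITTI") selects Kitti, while the same data used for GT-free inference selects GenericDataset.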

Step 3 — Set the bp2 distilled width overrides

FFS requires 15 model-section width override fields whose values match the bp2 commercial checkpoint exactly. Omitting any field falls back to TAO defaults that do not match the bp2 ckpt and produce shape-mismatch errors at forward time.

model:
  model_type: FastFoundationStereo
  encoder: vitl
  hidden_dims: [128]                    # 1-layer GRU; NOT [128,128,128]
  n_gru_layers: 1                       # bp2 single-GRU
  corr_radius: 4
  corr_levels: 2
  n_downsample: 2
  valid_iters: 8
  max_disparity: 192                    # bp2 commercial; NOT 416 (full FS default)
  volume_dim: 28                        # bp2 ckpt invariant; NOT 32 (full FS default)
  mixed_precision: false                # see references/parameters.md
  gwc_feature_normalize: true           # see references/parameters.md

  # 15 bp2 distilled width overrides — copy as-is
  motion_encoder_widths: [56, 96, 16, 12]
  motion_encoder_final: 48
  gru_hidden: 60
  gru_gating_conv_widths: [100, 168]
  disp_head_input_dim: 60
  disp_head_intermediate: 36
  disp_head_pwconv1_widths: [212, 244]
  mask_widths: [32, 16]
  stem_2_widths: [12, 16]
  spx_2_gru_widths: [16, 12, 16, 24]
  spx_gru_out: 9
  classifier_mid: 14
  cnet_conv04_widths: [60, 48]
  cam_mid_channels: 8
  cost_agg_conv_patch_padding: [0, 0, 0]

The spec templates at references/spec_template_*.yaml carry this block as the canonical source.
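Because a single missing or altered width field surfaces only as a shape mismatch at forward time, a pre-flight comparison can save a container launch. The following is a minimal sketch that assumes the spec's model section has already been loaded into a plain dict; check_bp2_widths is a hypothetical helper, and the expected values are copied from the block above.

```python
# The 15 bp2 distilled width overrides, copied from the model block.
BP2_WIDTHS = {
    "motion_encoder_widths": [56, 96, 16, 12],
    "motion_encoder_final": 48,
    "gru_hidden": 60,
    "gru_gating_conv_widths": [100, 168],
    "disp_head_input_dim": 60,
    "disp_head_intermediate": 36,
    "disp_head_pwconv1_widths": [212, 244],
    "mask_widths": [32, 16],
    "stem_2_widths": [12, 16],
    "spx_2_gru_widths": [16, 12, 16, 24],
    "spx_gru_out": 9,
    "classifier_mid": 14,
    "cnet_conv04_widths": [60, 48],
    "cam_mid_channels": 8,
    "cost_agg_conv_patch_padding": [0, 0, 0],
}


def check_bp2_widths(model_section):
    """Return a list of human-readable problems; empty means all 15 match."""
    problems = []
    for key, expected in BP2_WIDTHS.items():
        if key not in model_section:
            problems.append(f"missing {key}")
        elif model_section[key] != expected:
            problems.append(
                f"{key}: expected {expected}, got {model_section[key]}"
            )
    return problems
```

Running this against the model section of a hand-edited spec flags omissions before the shape mismatch would appear at forward time.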

Step 4 — Write spec yaml from the spec overrides

Copy the action block from references/spec-overrides.md (per-action Python override dicts plus the shared FFS_MODEL_BLOCK), then set or confirm:

  • model.model_type: FastFoundationStereo (already set)
  • dataset.<...>.data_sources[*].dataset_name from Step 2
  • dataset.<...>.data_sources[*].data_file with the path from Step 1
  • For raw deploy use cases (no train): set <action>.checkpoint to the bp2 file path
  • For finetune use cases: set train.pretrained_model_path to the bp2 file path

Chained train → next action checkpoint path: For local Docker chaining (no SDK runner), the trained checkpoint lives at <train.results_dir>/<task>/dn_model_latest.pth — Lightning ModelCheckpoint nests under the task name. Example: train.results_dir: /workspace/results/finetune/train produces /workspace/results/finetune/train/train/dn_model_latest.pth. Use that nested path for the next action's <action>.checkpoint. SDK-runner deploys resolve this automatically via parent_job_id — see references/parent-model-inference.md.
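The nested layout is easy to get wrong by one directory level, so a tiny helper can make it explicit. This sketch assumes in-container POSIX paths; chained_checkpoint is a hypothetical name, and the default task of "train" is taken from the example above rather than from any documented default.

```python
import posixpath


def chained_checkpoint(results_dir, task="train"):
    """Build <train.results_dir>/<task>/dn_model_latest.pth."""
    # Container paths are POSIX, so posixpath keeps this host-independent.
    return posixpath.join(results_dir, task, "dn_model_latest.pth")


if __name__ == "__main__":
    print(chained_checkpoint("/workspace/results/finetune/train"))
```

The returned string is what the next action's <action>.checkpoint would be set to in a local Docker chain.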

Shape consistency: crop_size in dataset.test_dataset.augmentation.crop_size should match export.input_height / input_width for end-to-end pyt-vs-deploy comparability — see references/tao-deploy-fast-foundation-stereo.md's shape table.

Step 5 — Run

docker run --gpus 'device=0' --shm-size 16G --ipc=host \
  --user $(id -u):$(id -g) \
  -v <data_root>:<data_root>:ro \
  -v <output_dir>:<output_dir> \
  -v <bp2_ckpt_dir>:<bp2_ckpt_dir>:ro \
  <container> \
  depth_net <action> -e <spec.yaml>

Without --user $(id -u):$(id -g) the container writes outputs as nobody:nogroup, blocking host-side cleanup / retry.
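When the command is launched from a script rather than typed, building the argument list explicitly avoids shell-quoting mistakes. The sketch below mirrors the flags shown above; build_docker_argv is a hypothetical helper, and uid and gid are passed in so the example does not depend on the host platform.

```python
import shlex


def build_docker_argv(action, spec, data_root, output_dir, ckpt_dir,
                      container, uid, gid, gpu="0"):
    """Assemble the docker run argument list for one depth_net action."""
    return [
        "docker", "run",
        "--gpus", f"device={gpu}",
        "--shm-size", "16G",
        "--ipc=host",
        # Run as the host user so outputs stay removable from the host side.
        "--user", f"{uid}:{gid}",
        # Data and checkpoint mounts are read-only; output is writable.
        "-v", f"{data_root}:{data_root}:ro",
        "-v", f"{output_dir}:{output_dir}",
        "-v", f"{ckpt_dir}:{ckpt_dir}:ro",
        container,
        "depth_net", action, "-e", spec,
    ]


if __name__ == "__main__":
    argv = build_docker_argv(
        "evaluate", "/specs/eval.yaml", "/data", "/out", "/ckpt",
        "<container>", uid=1000, gid=1000,
    )
    # shlex.join renders a copy-pasteable shell command for logging.
    print(shlex.join(argv))
```

The list can be handed to subprocess.run directly; on Linux, os.getuid() and os.getgid() supply the values that $(id -u) and $(id -g) provide in the shell form.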

For the local bind-mount __pycache__ caveat (QA / development only — clearing stale .pyc files that shadow patched source), see references/troubleshooting.md → "Local bind-mount tip".

Step 6 — Verify

  • Container exit code 0
  • status.json kpi block populated
  • For train: inspect per-step train_loss directly (the entrypoint reports Execution status: PASS even when loss is NaN)
  • For evaluate: rely on epe / bp1 / bp2 / bp3 / d1 / rmse (the evaluator also emits abs_rel / sq_rel / rmse_log, which are not meaningful for stereo)
  • For inference: artifacts under results_dir
  • KPI namespace difference between pyt and deploy: pyt evaluate writes the metric set under kpi.val/epe, kpi.val/bp1, etc. (namespaced by Lightning's val/ prefix). Deploy evaluate (TRT engine path) writes the same metric set under kpi.epe, kpi.bp1, etc. (no val/ prefix). Downstream verification scripts that read status.json need to handle both shapes.
  • Validate drift on your own dataset: if you compare TAO FFS deploy (gen_trt_engine + TRT evaluate) against the upstream FFS deploy path on the same input, expect a small residual mean_abs disparity drift (TAO export graph + TRT 10.13 interaction; not improvable at the source-code level). The exact magnitude is dataset and hardware dependent — measure on your own data and decide whether the drift is acceptable for your downstream task.
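A verification script has to cope with both KPI key shapes and with the NaN-loss case noted in the list above. The sketch below assumes the kpi block is a flat mapping whose keys are literally "val/epe" (pyt) or "epe" (deploy); that layout is inferred from the description, and the function names are hypothetical.

```python
import math

# Metrics that are meaningful for stereo evaluation.
STEREO_METRICS = ("epe", "bp1", "bp2", "bp3", "d1", "rmse")


def stereo_kpis(kpi):
    """Extract stereo metrics from a kpi mapping, with or without val/."""
    found = {}
    for name in STEREO_METRICS:
        # Try the pyt (Lightning-prefixed) key first, then the deploy key.
        for key in (f"val/{name}", name):
            if key in kpi:
                found[name] = kpi[key]
                break
    return found


def losses_are_finite(train_losses):
    """False if any per-step train_loss is NaN or infinite."""
    return all(math.isfinite(value) for value in train_losses)
```

Metrics such as abs_rel, sq_rel, and rmse_log are dropped on purpose, and a False result from losses_are_finite should fail the run even when the entrypoint reports PASS.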

7-action deploy flow

train (optional)            → finetuned ckpt
evaluate (pyt)              → PyT eager EPE / bp on val GT
inference (pyt)             → PyT eager disparity samples (visual sanity)
export                      → static fp32 ONNX (recommended at 480×736 or 320×736)
gen_trt_engine              → fp16 TRT engine on static ONNX path
inference (deploy)          → TRT disparity samples
evaluate (deploy)           → TRT EPE / bp drift vs PyT eager fp32

Skip train for raw-bp2 deploy. The remaining 6 actions (or the 4 deploy-only verbs starting from export) cover both use cases.
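The three entry points into the flow can be expressed as slices of one ordered list. This is a hypothetical planning helper for illustration; the labels are descriptive strings, not CLI arguments.

```python
# The 7-action sequence, in execution order.
FLOW = [
    "train",
    "evaluate (pyt)",
    "inference (pyt)",
    "export",
    "gen_trt_engine",
    "inference (deploy)",
    "evaluate (deploy)",
]


def plan_actions(finetune, deploy_only=False):
    """Return the ordered actions for a finetune or raw-bp2 run."""
    if deploy_only:
        # The 4 deploy-only verbs, starting from export.
        return FLOW[FLOW.index("export"):]
    # Raw-bp2 deploy skips train; finetune runs all 7.
    return list(FLOW) if finetune else FLOW[1:]
```

plan_actions(finetune=False) yields the 6 post-train actions, and adding deploy_only=True narrows that to the 4 verbs from export onward.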

Full TAO Deploy reference: tao-deploy-fast-foundation-stereo.

Training Requirements

  • Valid dataset_name values for stereo data_sources (case-insensitive): FSD, IsaacRealDataset, Crestereo, Middlebury, Eth3d, Kitti, GenericDataset
  • Monitoring metric: val/loss

Per-Action Dataset Requirements

Action     Spec Key                            Source             Files                                      List?
evaluate   dataset.test_dataset.data_sources   eval_dataset       data_file: annotations.txt + dataset_name  Yes
inference  dataset.infer_dataset.data_sources  inference_dataset  data_file: annotations.txt + dataset_name  Yes
train      dataset.train_dataset.data_sources  train_datasets     data_file: annotations.txt + dataset_name  Yes
train      dataset.val_dataset.data_sources    eval_dataset       data_file: annotations.txt + dataset_name  Yes

Data source overrides are mandatory for every action. Each data_sources entry needs both data_file and dataset_name. The model.* width fields from Step 3 are also mandatory. See references/spec-overrides.md for the complete per-action override dicts (train finetune, raw-bp2 evaluate / inference / export) and the shared FFS_MODEL_BLOCK.
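Since every data_sources entry needs both keys, and dataset_name must be one of the registered values, a quick structural check catches most spec mistakes early. The sketch assumes data_sources has been loaded as a list of dicts; check_data_sources is a hypothetical helper, not part of the TAO SDK.

```python
# Valid stereo dataset_name values, compared case-insensitively.
VALID_DATASET_NAMES = {
    "fsd", "isaacrealdataset", "crestereo",
    "middlebury", "eth3d", "kitti", "genericdataset",
}


def check_data_sources(data_sources):
    """Return a list of problems found in a data_sources list."""
    problems = []
    for index, entry in enumerate(data_sources):
        for key in ("data_file", "dataset_name"):
            if not entry.get(key):
                problems.append(f"data_sources[{index}]: missing {key}")
        name = entry.get("dataset_name")
        if name and name.lower() not in VALID_DATASET_NAMES:
            problems.append(
                f"data_sources[{index}]: unknown dataset_name {name!r}"
            )
    return problems
```

An empty result means each entry carries a data_file and a recognized dataset_name; it does not confirm that the annotation file exists inside the container.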

Eval Dataset

The validation dataset is optional. Configure it via dataset.val_dataset.data_sources (each entry needs data_file and dataset_name).

Parameters, Metrics, Hardware

See references/parameters.md for the full parameter glossary (model.* / dataset.* / train.* knobs including max_disparity: 192, gwc_feature_normalize: true, mixed_precision: false, volume_dim: 28, valid_iters, save_raw_pfm), the evaluation-metric table (epe / bp1 / bp2 / bp3 / d1 / rmse are meaningful; abs_rel / sq_rel / rmse_log are not), multi-GPU / multi-node spec keys, and hardware requirements.

Export / TRT Defaults

export always emits an fp32 ONNX regardless of model.mixed_precision; the fp16 vs fp32 selection happens at gen_trt_engine via gen_trt_engine.tensorrt.data_type. Recommended TRT precision for FFS-bp2 is fp16 on the static-shape ONNX path (lowest drift). The dynamic-shape path supports both fp32 (default; static-fp32 parity) and fp16 (for latency-critical multi-resolution use; higher drift, and it may produce NaN under some checkpoint states, in which case fall back to fp32).
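The precision guidance reduces to a two-input decision. The sketch below is illustrative only: recommended_trt_data_type is a hypothetical helper, and the lowercase strings "fp16" and "fp32" are assumed spellings for gen_trt_engine.tensorrt.data_type, so check the deploy spec template before using them verbatim.

```python
def recommended_trt_data_type(dynamic_hw, latency_critical=False,
                              saw_nan=False):
    """Suggest a TRT precision for FFS-bp2 from the guidance above."""
    if not dynamic_hw:
        # Static-shape ONNX path: fp16 is the recommended precision.
        return "fp16"
    if latency_critical and not saw_nan:
        # Dynamic path, fp16: faster but higher drift.
        return "fp16"
    # Dynamic path default, and the fallback when fp16 produced NaN.
    return "fp32"
```

The saw_nan flag models the fallback: after a dynamic fp16 engine produces NaN outputs, the same call returns fp32.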

See references/export-trt-defaults.md for the full TRT/ONNX defaults and the four-way export use-case matrix (export.batch_size × export.dynamic_hw; dynamic H/W is FFS-only). See references/tao-deploy-fast-foundation-stereo.md for the deployment matrix and static-vs-dynamic shape guidance.

Troubleshooting

See references/troubleshooting.md for error patterns and fixes, including shape mismatch at forward (missing width override), missing gwc_feature_normalize (TAO Core too old), dynamic_hw: true warning on FS / mono export, Key 'encoder' not in 'StereoBackBone', missing dataset_name in data_sources, negative disparity, larger-than-expected disparity drift (missing max_disparity: 192), depth_net_stereo: not found, decorative pyt-eval crop_size, the cosmetic Failed to import SAM3 warning, and silent dynamic-deploy stride-incompatibility.

Spec Param / Parent Model Inference

Model-specific inference mappings belong in this skill, not in config.json. Generated runners should apply the mappings with SDK helpers before create_job(). See references/parent-model-inference.md for the full per-action spec-field → inference-function mapping table.

For parent_model or parent_model_folder, pass the upstream train / export / AutoML child job id as parent_job_id. The SDK lists the parent result folder, filters checkpoint artifacts, and returns the selected model file or folder. For raw-bp2 use cases without a parent train job, set the <action>.checkpoint field explicitly to the bp2 file path. Do not patch generated runner scripts to guess checkpoint paths.

Bundled with this artifact

17 files

Reference files that ship alongside this artifact. Agents pull these in only when the task needs them.
