Tao Train Depth Anything V2

Monocular depth estimation using Metric Depth Anything v2 or Relative Depth Anything architectures. Predicts per-pixel depth from single RGB images. Use when training, evaluating, exporting, or running inference for a TAO monocular depth model. Trigger phrases include "train monocular depth", "DepthAnything v2", "metric depth from single image", "monocular depth estimation".

Published by @NVIDIA·0 agent reads / 30d·0 saves·

Depth Net Mono

Monocular depth estimation using Metric Depth Anything v2 or Relative Depth Anything architectures. Predicts per-pixel depth from single RGB images.

Pretrained checkpoint loading varies by model variant and use case — see the Pretrained checkpoint loading — use case matrix in references/parameters.md.

The mono and stereo skills both invoke the unified TAO depth_net CLI inside the container; the mono/stereo family is selected via model.model_type (full parameter glossary in references/parameters.md).

For TAO Deploy TensorRT actions (gen_trt_engine, TensorRT evaluate, and TensorRT inference), read references/tao-deploy-depth-anything-v2.md first. The deploy spec template lives in this skill's references/spec_template_deploy.yaml.

Train Action Policy

This model is AutoML-enabled at the model layer. Before handling any train-stage request, read references/skill_info.yaml and resolve the run override from either an explicit automl_policy value or the user's workflow request. Treat phrases like "turn off AutoML", "disable AutoML", "no HPO", or "plain training" as automl_policy: off for this run only; otherwise default to auto. When automl_policy: auto, automl_enabled: true, and both schemas/train.schema.json and references/spec_template_train.yaml are packaged, route the train action through tao-skill-bank:tao-run-automl by default with this model's skill_dir. Preserve workflow/application overrides for datasets, specs, output directories, GPU/platform settings, parent checkpoints, and automl_policy. Use direct model training only when automl_policy: off or the packaged train schema/template is missing; in the missing-schema case, report that AutoML is enabled but not runnable for this model until schemas are generated.

Non-train actions such as evaluate, inference, export, and deploy flows stay in this model skill. The per-run automl_policy override does not change model metadata.

Workflow

Prerequisites — data accessibility

Your dataset (RGB images + GT depth files) must be reachable from inside the container:

  • SDK runner: place files at the S3 paths the runner resolves (the S3_TRAIN / S3_EVAL placeholders shown in the spec overrides). The runner handles S3 → container-path mounting transparently.
  • Direct docker run (e.g. local testing): mount the host dataset root read-only at the same in-container path:
docker run ... -v <host_data_root>:<host_data_root>:ro <container> ...

The same accessibility requirement applies to the <output_dir> written by all actions.

Step 1 — Annotation file

Per-line annotation file referenced by data_sources[*].data_file:

ColumnsFormatUse
1<image>Mono inference (no GT)
2<image> <gt_depth>Mono with GT

If you already have one, point to it. Otherwise generate via depth_net convert:

depth_net convert -e <convert_spec.yaml>

convert_spec.yaml template:

data_root: <directory whose immediate children are scene/sample folders that contain your image+depth files; convert walks data_root recursively but expects per-scene subdirectories at one level below>
image_dir_pattern: [<substring matching left/RGB image paths>]
depth_dir_pattern: [<substring matching GT depth paths>]
image_extension: ''     # optional .endswith filter, e.g. '.jpg'
depth_extension: ''     # optional, swapped during depth derivation, e.g. '.png'
split_ratio: 0.0        # 0.0/1.0 = test-only; 0.8 = 80/20 train+val

convert walks data_root recursively, selects paths whose path-string contains all substrings in image_dir_pattern (AND-filter), then derives the depth path by replacing image_dir_pattern[0] with depth_dir_pattern[0] and image_extension with depth_extension. Inspect your dataset's directory layout and identify the substring distinguishing RGB images from depth files (e.g. rgb_ vs sync_depth_).

data_root must point at the parent that contains the per-scene subdirectories (e.g. for NYU eval, use /data/nyu_v2/eval/test, not /data/nyu_v2/eval/test/bathroom — the latter limits the walk to a single scene). Always include the leading dot in image_extension / depth_extension (e.g. '.jpg' not 'jpg'); the substring swap is form-sensitive and a mismatch silently corrupts derived paths.

Step 2 — Pair model_type and dataset_name based on your data

Default — generic class for each task:

Data categorymodel_typedataset_name
Disparity-encoded data (pixels)RelativeDepthAnythingRelativeMonoDataset
Metric depth (meters)MetricDepthAnythingMetricMonoDataset
Mono inference (no GT, any image)matches train choiceRelativeMonoDataset or MetricMonoDataset

Dataset-specific class — switch when the data needs preprocessing the generic class does not perform:

Special casemodel_typedataset_nameWhat the class adds
NYU sync_depth_*.png (raw uint16 millimetres) — relativeRelativeDepthAnythingNYUDV2Relativemm→m unit conversion + Eigen evaluation crop
NYU sync_depth_*.png (raw uint16 millimetres) — metricMetricDepthAnythingNYUDV2same

Using a generic class on data that requires unit conversion (e.g. raw NYU uint16 PNGs) results in an empty valid mask and silent train_loss = NaN. Match the class to your data's encoding.

Step 3 — Write spec yaml from the spec overrides

Copy the action block from references/spec-overrides.md. Replace:

  • model.model_type from Step 2
  • dataset.<...>.data_sources[*].dataset_name from Step 2
  • data_sources[*].data_file with the path from Step 1 (S3 path under SDK runner, host path for direct docker)
  • For metric finetune: additionally apply the Metric Variant Finetuning Recipe in references/finetuning.md.

For mono training set train.precision: fp32 (recommended) or bf16 (Ampere SM80+, alternative).

Step 4 — Run

docker run --gpus 'device=0' --shm-size 16G --ipc=host \
  --user $(id -u):$(id -g) \
  -v <data_root>:<data_root>:ro \
  -v <output_dir>:<output_dir> \
  <container> \
  depth_net <action> -e <spec.yaml>

Without --user $(id -u):$(id -g) the container writes outputs as nobody:nogroup, blocking host-side cleanup and retry.

Step 5 — Verify

  • Container exit code 0
  • status.json kpi block populated
  • For train: inspect per-step train_loss directly — the entrypoint reports Execution status: PASS even when train_loss = NaN (see the Sanity-run PASS criteria in references/finetuning.md)
  • For evaluate / inference: artifacts under results_dir

For TAO Deploy TensorRT actions (gen_trt_engine, TensorRT evaluate, and TensorRT inference), read references/tao-deploy-depth-anything-v2.md first. Deploy spec templates live in this skill's references/ folder with the spec_template_deploy_*.yaml prefix.

Training Requirements

  • Valid dataset_name values for mono data_sources (case-insensitive): ThreeDVLM, FSD, NvCLIP, IssacStereo, Crestereo, Middlebury, NYUDV2, NYUDV2Relative, RelativeMonoDataset, MetricMonoDataset. NYUDV2 carries metric depth GT (meters) — pair with MetricDepthAnything; NYUDV2Relative is the same data with relative-depth conventions — pair with RelativeDepthAnything.
  • Monitoring metric: val/loss

Per-Action Dataset Requirements

ActionSpec KeySourceFilesList?
evaluatedataset.test_dataset.data_sourceseval_datasetdata_file: annotations.txt + dataset_nameYes
inferencedataset.infer_dataset.data_sourcesinference_datasetdata_file: annotations.txt + dataset_nameYes
quantizedataset.train_dataset.data_sourcestrain_datasetsdata_file: annotations.txt + dataset_nameYes
quantizedataset.val_dataset.data_sourceseval_datasetdata_file: annotations.txt + dataset_nameYes
quantizedataset.quant_calibration_dataset.images_dirtrain_datasetsimages.tar.gzNo
traindataset.train_dataset.data_sourcestrain_datasetsdata_file: annotations.txt + dataset_nameYes
traindataset.val_dataset.data_sourceseval_datasetdata_file: annotations.txt + dataset_nameYes

Spec Overrides

Data source overrides are mandatory for every action — construct the data source paths from the Per-Action Dataset Requirements table above and include them in spec_overrides; each data_sources entry is a dict with the two mandatory fields data_file and dataset_name. See references/spec-overrides.md for the full per-action train / evaluate / export / inference / quantize override blocks and the precision recommendations.

Eval Dataset

Optional. Val dataset configured via dataset.val_dataset.data_sources (each entry needs data_file and dataset_name).

Important Parameters

Full parameter glossary (model.*, train.*, dataset.*, export.*, inference.* fields with options, defaults, and sources) plus the Pretrained checkpoint loading — use case matrix live in references/parameters.md. Key starting points: model.model_type (default MetricDepthAnything), model.encoder (default vitl), train.optim.lr (default 1e-4, AdamW), train.precision (fp32 recommended), dataset.{train,val,test,infer}_dataset.augmentation.crop_size (default [518, 518]).

Finetuning Recipes

Relative and Metric variant finetuning recipes — including required spec keys, the metric dataset.{normalize_depth, min_depth, max_depth} block required in both train AND export specs, trainer-enforced defaults (clip_grad_norm: 0.1, warmup_steps: 20, weight_decay: 1e-4), sanity-run overrides, and the Sanity-run PASS criteria for catching silent train_loss = NaN — are in references/finetuning.md. Both recipes use train.optim.lr: 5e-6 with LambdaLR (the AdamW default 1e-4 is too aggressive when finetuning from a converged/pretrained backbone).

Multi-GPU / Multi-Node

Launch method: Lightning-managed (single python process, Lightning spawns workers).

Spec KeyDescriptionDefault
train.num_gpusNumber of GPUs1
train.gpu_idsGPU device indices[0]
train.num_nodesNumber of nodes1
train.distributed_strategyddp or fsdpddp
  • ddp with activation checkpointing: find_unused_parameters=False
  • ddp without: find_unused_parameters=True
  • fsdp forces precision to FP16

Multi-node env vars (set by orchestrator): WORLD_SIZE, NODE_RANK, MASTER_ADDR, MASTER_PORT, NUM_GPU_PER_NODE.

Export / TRT Defaults

  • TRT data types: FP32, BF16 (Ampere SM80+). FP16 is not supported for the ViT-L mono backbone.
  • Recommended TRT precision: bf16. Use fp32 if BF16 hardware is unavailable.

Full TAO Deploy reference: tao-deploy-depth-anything-v2.

Hardware

Minimum 1 GPU(s), recommended 2 GPU(s). 24GB+ VRAM per GPU. ViT-Large encoder is memory intensive. Use fp32 (recommended) or bf16 (Ampere SM80+, alternative) for training. Activation checkpointing is available for larger inputs.

Error Patterns

Common failure signatures and fixes — depth range mismatch, missing pretrained weights, Key 'encoder' not in 'MonoBackBone', missing dataset_name, depth_net_mono: not found, metric variant hyperparameter sourcing, and the export refuse-to-overwrite ONNX error — are documented in references/troubleshooting.md.

Spec Param / Parent Model Inference

Model-specific inference mappings (the full depth_net_mono.config.json per-action spec-field → inference-function table, plus parent_model / parent_job_id resolution guidance) are in references/spec-param-inference.md. These mappings belong in MD, not in config.json; generated runners should read that reference and apply the mappings with SDK helpers before create_job().

Bundled with this artifact

26 files

Reference files that ship alongside this artifact. Agents pull these in only when the task needs them.

More on the bench

SKILL0

Whisper

OpenAI's general-purpose speech recognition model. Supports 99 languages, transcription, translation to English, and language identification. Six model sizes from tiny (39M params) to large (1550M params). Use for speech-to-text, podcast transcription, or multilingual audio processing. Best for robust, multilingual ASR.

data-science-ml+2
0
SKILL0

Guidance

Control LLM output with regex and grammars, guarantee valid JSON/XML/code generation, enforce structured formats, and build multi-step workflows with Guidance - Microsoft Research's constrained generation framework

ai-prompt-engineering+2
0
SKILL0

Pinecone

Managed vector database for production AI applications. Fully managed, auto-scaling, with hybrid search (dense + sparse), metadata filtering, and namespaces. Low latency (<100ms p95). Use for production RAG, recommendation systems, or semantic search at scale. Best for serverless, managed infrastructure.

data-science-ml+2
0