Tao Finetune Cosmos Embed

Cosmos-Embed1 video-text embedding for text-to-video retrieval, video-to-video search, semantic deduplication, and fine-tuning. Use when the user asks to "fine-tune Cosmos-Embed1", "run cosmos-embed inference", "export Cosmos-Embed1", "embed videos", or "search videos with text".

Published by @NVIDIA

Cosmos-Embed

Cosmos-Embed1 is a joint video-text embedder for text-to-video retrieval, video-to-video search, zero-shot/kNN classification, and semantic deduplication. The packaged CLI is cosmos-embed1 and supports train, evaluate, inference, and export.

Container image and per-action commands are in references/skill_info.yaml. Compact starting specs are in references/spec_template_*.yaml.

Train Action Policy

This model is AutoML-enabled at the model layer. Before handling any train-stage request, read references/skill_info.yaml and resolve the run override from either an explicit automl_policy value or the user's workflow request:

  • Treat phrases like "turn off AutoML", "disable AutoML", "no HPO", or "plain training" as automl_policy: off for this run only; otherwise default to auto.
  • When automl_policy: auto, automl_enabled: true, and both schemas/train.schema.json and references/spec_template_train.yaml are packaged, route the train action through tao-skill-bank:tao-run-automl by default with this model's skill_dir.
  • Preserve workflow/application overrides for datasets, specs, output directories, GPU/platform settings, parent checkpoints, and automl_policy.
  • Use direct model training only when automl_policy: off or the packaged train schema/template is missing. In the missing-schema case, report that AutoML is enabled but not runnable for this model until schemas are generated.

Non-train actions such as evaluate, inference, export, and deploy flows stay in this model skill. The per-run automl_policy override does not change model metadata.
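
The routing rules in this section can be sketched as a pair of shell helpers. This is an illustration under stated assumptions, not part of the skill bank: the function names resolve_automl_policy and choose_train_route are hypothetical, and the schema and template paths are assumed to sit directly under the model's skill_dir.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; neither function ships with the skill bank.

# Resolve the per-run policy from an explicit value or a free-text request.
resolve_automl_policy() {
  local explicit="$1" request="$2" lowered
  if [ -n "$explicit" ]; then
    printf '%s\n' "$explicit"
    return 0
  fi
  # Lowercase the request so phrase matching ignores case.
  lowered="$(printf '%s' "$request" | tr '[:upper:]' '[:lower:]')"
  case "$lowered" in
    *"turn off automl"*|*"disable automl"*|*"no hpo"*|*"plain training"*)
      echo "off" ;;
    *)
      echo "auto" ;;
  esac
}

# Pick the handler for a train request.
# Assumption: schemas/ and references/ live directly under skill_dir.
choose_train_route() {
  local policy="$1" skill_dir="$2"
  if [ "$policy" = "off" ]; then
    echo "direct"
  elif [ -f "$skill_dir/schemas/train.schema.json" ] &&
       [ -f "$skill_dir/references/spec_template_train.yaml" ]; then
    echo "tao-skill-bank:tao-run-automl"
  else
    # Missing schema or template: fall back to direct training and say why.
    echo "direct (AutoML is enabled but not runnable until schemas are generated)"
  fi
}
```

For example, choose_train_route "$(resolve_automl_policy "" "plain training please")" "$SKILL_DIR" prints direct, because the request contains one of the opt-out phrases.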

Quick Start

Use the published Cosmos-Embed container declared by references/skill_info.yaml and resolved through versions.yaml. Do not build from the private Cosmos-Embed1 source tree for normal skill use; build from source only when developing the container itself.

TAO_SKILL_BANK_PATH="${TAO_SKILL_BANK_PATH:-$PWD}"
COSMOS_EMBED_IMAGE="${COSMOS_EMBED_IMAGE:-$(
  python "$TAO_SKILL_BANK_PATH/scripts/resolve_tao_image.py" \
    --skill-bank "$TAO_SKILL_BANK_PATH" \
    --model cosmos-embed \
    --action train \
    --format json |
  python -c 'import json,sys; print(json.load(sys.stdin)["image"])'
)}"
docker pull "$COSMOS_EMBED_IMAGE"

Expected local workspace layout:

workspace/
├── data/
│   ├── msrvtt_test_1k.json
│   └── video/
│       ├── video7020.mp4
│       └── ...
├── model/
│   └── Cosmos-Embed1-224p/        # optional if using HF repo id
├── specs/
│   ├── train.yaml
│   ├── evaluate.yaml
│   ├── inference.yaml
│   ├── export_onnx.yaml
│   └── export_hf.yaml
└── results/

Use these Docker options for all actions unless the local Docker/platform skill gives a stricter environment-specific command:

TAO_SKILL_BANK_PATH="${TAO_SKILL_BANK_PATH:-$PWD}"
COSMOS_EMBED_IMAGE="${COSMOS_EMBED_IMAGE:-$(
  python "$TAO_SKILL_BANK_PATH/scripts/resolve_tao_image.py" \
    --skill-bank "$TAO_SKILL_BANK_PATH" \
    --model cosmos-embed \
    --action train \
    --format json |
  python -c 'import json,sys; print(json.load(sys.stdin)["image"])'
)}"
RUN_ROOT="${RUN_ROOT:-$PWD}"
DOCKER_COMMON=(
  --rm --gpus all --ipc=host --network=host
  --shm-size=64g
  --ulimit memlock=-1
  --ulimit stack=67108864
  -e HF_TOKEN
  -v "$RUN_ROOT/data:/data:ro"
  -v "$RUN_ROOT/model:/model"
  -v "$RUN_ROOT/specs:/specs:ro"
  -v "$RUN_ROOT/results:/results"
)

Train:

docker run "${DOCKER_COMMON[@]}" "$COSMOS_EMBED_IMAGE" \
  cosmos-embed1 train -e /specs/train.yaml results_dir=/results

Evaluate:

docker run "${DOCKER_COMMON[@]}" "$COSMOS_EMBED_IMAGE" \
  cosmos-embed1 evaluate -e /specs/evaluate.yaml results_dir=/results

Inference:

docker run "${DOCKER_COMMON[@]}" "$COSMOS_EMBED_IMAGE" \
  cosmos-embed1 inference -e /specs/inference.yaml \
  'inference.query.input_texts=["a man is singing on stage"]' \
  inference.k=5 \
  results_dir=/results

Export ONNX:

docker run "${DOCKER_COMMON[@]}" "$COSMOS_EMBED_IMAGE" \
  cosmos-embed1 export -e /specs/export_onnx.yaml \
  export.checkpoint=/results/train/cosmos_embed1_model_latest.pth \
  export.onnx_file=/results/export/cosmos_embed1_combined.onnx \
  results_dir=/results

Export HuggingFace format:

docker run "${DOCKER_COMMON[@]}" "$COSMOS_EMBED_IMAGE" \
  cosmos-embed1 export -e /specs/export_hf.yaml \
  export.checkpoint=/results/train/cosmos_embed1_model_latest.pth \
  export.hf_output_dir=/results/export_hf/cosmos_embed1_hf \
  results_dir=/results

Smoke Overrides

For a small functional check, keep the same specs and override the expensive knobs:

train.max_iter=1
train.validation_iter=2
train.checkpoint_iter=1
train.optim.optim=adamw
dataset.train_dataset.batch_size=1
dataset.val_dataset.batch_size=1
dataset.train_dataset.workers=0
dataset.val_dataset.workers=0

If no local Cosmos-Embed1 pretrained checkpoint or HuggingFace token is available, set model.pretrained_model_path=null for a plumbing-only smoke train. The model quality is meaningless in that mode, but the train/evaluate/inference/export action paths can still be exercised.
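
As a sketch, the train overrides above can be collected in an array and appended to the Quick Start train command. The array names SMOKE_TRAIN_OVERRIDES and SMOKE_TRAIN_CMD are illustrative, and the block assumes DOCKER_COMMON and COSMOS_EMBED_IMAGE were already set as in Quick Start. It only prints the command so it can be reviewed before running.

```shell
#!/usr/bin/env bash
# Assumes DOCKER_COMMON and COSMOS_EMBED_IMAGE are set as in Quick Start.
# Array names below are illustrative, not part of the CLI.
SMOKE_TRAIN_OVERRIDES=(
  train.max_iter=1
  train.validation_iter=2
  train.checkpoint_iter=1
  train.optim.optim=adamw
  dataset.train_dataset.batch_size=1
  dataset.val_dataset.batch_size=1
  dataset.train_dataset.workers=0
  dataset.val_dataset.workers=0
  # Uncomment for a plumbing-only run without pretrained weights:
  # model.pretrained_model_path=null
)

SMOKE_TRAIN_CMD=(
  docker run "${DOCKER_COMMON[@]}" "$COSMOS_EMBED_IMAGE"
  cosmos-embed1 train -e /specs/train.yaml
  "${SMOKE_TRAIN_OVERRIDES[@]}"
  results_dir=/results
)

# Print the command for review; execute it with: "${SMOKE_TRAIN_CMD[@]}"
printf '%q ' "${SMOKE_TRAIN_CMD[@]}"
echo
```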

For evaluation and inference smoke tests on a tiny subset:

evaluate.callbacks.embedding_visualization=false
evaluate.callbacks.max_eval_samples=8
dataset.test_dataset.batch_size=1
dataset.test_dataset.workers=0
inference.k=2
dataset.inference_dataset.batch_size=1
dataset.inference_dataset.workers=0

Data Format

The MSR-VTT path expects a local video glob and a JSON metadata file:

dataset:
  train_dataset:
    dataset_type: msrvtt
    mp4_urls: /data/video/*.mp4
    metadata: /data/msrvtt_test_1k.json

List-format metadata rows must include at least video and caption:

{"video_id": "video7020", "video": "video7020.mp4", "caption": "a woman creating a fondant baby and flower"}

The dataset loader derives the video id from the local .mp4 filename and filters to videos present in the metadata. If a run finds zero videos, check that mp4_urls points to a container-local glob and that metadata video names match the filenames.
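
A rough pre-flight check for the zero-videos case might look like the following. It is a sketch only: count_matching_videos is a hypothetical helper that approximates the filename match described above with grep and comm, and it is not the loader's own logic. It assumes the metadata stores names in a "video" field as in the example row.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight helper; approximates the loader's filename match.

# Print how many local .mp4 files have a name listed in a "video" field.
count_matching_videos() {
  local video_dir="$1" metadata="$2"
  comm -12 \
    <(find "$video_dir" -maxdepth 1 -type f -name '*.mp4' -exec basename {} \; | sort -u) \
    <(grep -o '"video"[[:space:]]*:[[:space:]]*"[^"]*"' "$metadata" |
        sed 's/.*"\([^"]*\)"$/\1/' | sort -u) |
    wc -l | tr -d '[:space:]'
}
```

Running count_matching_videos "$RUN_ROOT/data/video" "$RUN_ROOT/data/msrvtt_test_1k.json" on the host before starting the container gives a quick signal: a result of 0 means no local filename matches any metadata row.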

Model Weights

  • Local HF directory: mount it under /model and set model.pretrained_model_path=/model/Cosmos-Embed1-224p.
  • HuggingFace repo: set model.pretrained_model_path=nvidia/Cosmos-Embed1-224p and pass HF_TOKEN if access is gated.
  • Fine-tuned checkpoint: downstream actions default to /results/train/cosmos_embed1_model_latest.pth.
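
The first two options can be chosen automatically with a small helper. This is a sketch: pretrained_override is a hypothetical function name, and it assumes the host directory that Docker mounts at /model is "$RUN_ROOT/model", as in Quick Start.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prefer a local HF directory, else fall back to the repo id.
pretrained_override() {
  local run_root="$1"
  if [ -d "$run_root/model/Cosmos-Embed1-224p" ]; then
    # Container-side path, because the host model/ directory is mounted at /model.
    echo "model.pretrained_model_path=/model/Cosmos-Embed1-224p"
  else
    # Pass HF_TOKEN to the container if access to the repo is gated.
    echo "model.pretrained_model_path=nvidia/Cosmos-Embed1-224p"
  fi
}
```

The printed override can be appended to any of the cosmos-embed1 commands, for example "$(pretrained_override "$RUN_ROOT")".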

Variants:

| Variant | Resolution | Frames | Embedding dim |
|---|---|---|---|
| Cosmos-Embed1-224p | 224 x 224 | 8 | 256 |
| Cosmos-Embed1-336p | 336 x 336 | 8 | 768 |
| Cosmos-Embed1-448p | 448 x 448 | 8 | 768 |

Keep model.network.embed_dim, model.input_hw, and model.network.spatial_resolution aligned with the selected variant.
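
One way to keep these values aligned is a lookup keyed by variant name. The sketch below uses a hypothetical variant_settings function built from the variants table. It emits only the embed_dim override, because the value formats expected by model.input_hw and model.network.spatial_resolution are not shown here and should be taken from the packaged spec templates.

```shell
#!/usr/bin/env bash
# Hypothetical lookup built from the variants table; prints "<side> <embed_dim>".
variant_settings() {
  case "$1" in
    Cosmos-Embed1-224p) echo "224 256" ;;
    Cosmos-Embed1-336p) echo "336 768" ;;
    Cosmos-Embed1-448p) echo "448 768" ;;
    *) echo "unknown variant: $1" >&2; return 1 ;;
  esac
}

# Example: derive the embedding-dimension override for the 336p variant.
read -r side embed_dim <<<"$(variant_settings Cosmos-Embed1-336p)"
echo "model.network.embed_dim=$embed_dim  # input side length: ${side}px"
```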

Important Parameters

| Parameter | Notes |
|---|---|
| train.num_gpus | 1 for single GPU, >1 auto-launches torchrun, -1 auto-detects visible GPUs. |
| train.max_iter | Main training length. Use 1 only for smoke testing. |
| train.optim.optim | fused_adamw is faster when available; adamw is safer for smoke and portability. |
| model.lora.enabled | Enables LoRA. Set model.network.visual_encoder.transformer_engine=false when LoRA is on. |
| model.lora.lora_rank | LoRA rank. Start with 8; try 4, 8, or 16 for manual or AutoML-style sweeps. |
| model.lora.lora_alpha | LoRA scaling factor. Start with 16; keep near 2 * lora_rank unless experiments show otherwise. |
| model.lora.lora_dropout | LoRA dropout. Start with 0.1; sweep 0.0, 0.05, and 0.1 for small datasets. |
| model.lora.bias | Bias policy: none, all, or lora_only. Keep none unless intentionally training biases. |
| model.lora.use_rslora / use_dora | Optional LoRA variants. Enable one at a time and record the setting with the checkpoint. |
| model.lora.target_modules | Optional module-name patterns for LoRA injection. Leave empty for the default ViT + Q-Former attention/MLP targets. |
| model.lora.modules_to_save | Optional modules to keep fully trainable alongside LoRA. Leave empty unless preserving a task-specific head. |
| evaluate.load_dataset_pkl / save_dataset_pkl | Cache evaluation embeddings. |
| inference.load_dataset_pkl / save_dataset_pkl | Cache the search database for repeated retrieval. |
| export.mode | video, text, combined, or huggingface. |
| export.on_cpu | Recommended for export to avoid device mismatch issues. |

LoRA and AutoML Notes

For parameter-efficient fine-tuning, set model.lora.enabled=true and keep model.network.visual_encoder.transformer_engine=false; TAO Core's Cosmos-Embed1 config notes that PEFT cannot inject adapters into Transformer Engine layers. Treat the LoRA fields above as the first candidate parameters for manual tuning or AutoML-style search before unfreezing larger model blocks. Avoid changing target_modules or modules_to_save unless the user explicitly needs custom adapter placement.
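
For a manual sweep, the candidate values listed under Important Parameters can be expanded into one override set per run. The helper below is a sketch: lora_sweep_overrides is a hypothetical name, and the grid (ranks 4, 8, 16; alpha fixed at 2 * rank; dropouts 0.0, 0.05, 0.1) is simply the set of starting values this document suggests, not a tuned recommendation.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print one line of overrides per LoRA sweep candidate.
lora_sweep_overrides() {
  local rank alpha dropout
  for rank in 4 8 16; do
    alpha=$((2 * rank))  # keep alpha near 2 * lora_rank
    for dropout in 0.0 0.05 0.1; do
      echo "model.lora.enabled=true" \
        "model.network.visual_encoder.transformer_engine=false" \
        "model.lora.lora_rank=$rank" \
        "model.lora.lora_alpha=$alpha" \
        "model.lora.lora_dropout=$dropout"
    done
  done
}
```

Each of the nine printed lines can be appended to the train command for one run. Recording the line next to the resulting checkpoint keeps the sweep traceable.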

S3 Staging

The Cosmos-Embed1 CLI consumes local paths and Python globs, not raw s3://.../*.mp4 URIs. For S3-backed runs, first stage a subset or full dataset to the execution host/container filesystem, then use local paths such as /data/video/*.mp4 in the spec.

Recommended S3 layout for staged MSR-VTT data:

s3://bucket/path/cosmos-embed/msrvtt-subset/
├── msrvtt_test_1k.json
└── video/
    ├── video7020.mp4
    └── ...

After downloading/syncing that prefix into the mounted data/ directory, use the same Docker commands above.
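
A minimal staging sketch, assuming the AWS CLI is installed and credentials are already configured (this document does not prescribe a particular sync tool). The S3 prefix is the placeholder from the layout above, and STAGE_CMD is an illustrative name. The block prints the command for review rather than running it.

```shell
#!/usr/bin/env bash
# Assumption: the AWS CLI is available; the prefix below is a placeholder.
S3_PREFIX="${S3_PREFIX:-s3://bucket/path/cosmos-embed/msrvtt-subset/}"
RUN_ROOT="${RUN_ROOT:-$PWD}"

# Mirror the prefix into the host directory that Docker mounts at /data.
STAGE_CMD=(aws s3 sync "$S3_PREFIX" "$RUN_ROOT/data/")

# Print the command for review; execute it with: "${STAGE_CMD[@]}"
printf '%q ' "${STAGE_CMD[@]}"
echo
```

After the sync, the spec can keep using container-local paths such as /data/video/*.mp4 and /data/msrvtt_test_1k.json.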

Outputs

results/
├── train/
│   ├── cosmos_embed1_model_latest.pth
│   ├── cosmos_embed1_model_<iter>.pth
│   └── experiment.yaml
├── evaluate/
│   ├── metrics.json
│   └── experiment.yaml
├── inference/
│   ├── results.json
│   └── experiment.yaml
├── export/
│   ├── cosmos_embed1_combined.onnx
│   └── export_config.yaml
└── export_hf/
    └── cosmos_embed1_hf/
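
To confirm that a run produced the layout above, a small check can list whatever is absent. This is a sketch: missing_outputs is a hypothetical helper, and the artifact list is copied from the tree in this section, so trim it to the actions that were actually run.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print expected artifacts that are absent under a results dir.
missing_outputs() {
  local results="$1" rel
  for rel in \
    train/cosmos_embed1_model_latest.pth \
    train/experiment.yaml \
    evaluate/metrics.json \
    inference/results.json \
    export/cosmos_embed1_combined.onnx \
    export_hf/cosmos_embed1_hf; do
    [ -e "$results/$rel" ] || echo "$rel"
  done
  return 0
}
```

Calling missing_outputs "$RUN_ROOT/results" prints nothing when every listed artifact exists.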

Known Pitfalls

| Symptom | Cause | Fix |
|---|---|---|
| MSRVTTDataset: 0 videos found | mp4_urls is not a local glob or metadata filenames do not match videos. | Mount data into the container and set mp4_urls=/data/video/*.mp4. |
| HF download/auth failure | Missing or invalid HF_TOKEN, or model agreement not accepted. | Accept the model terms and pass -e HF_TOKEN. |
| LoRA injection failure | Transformer Engine visual encoder is enabled. | Set model.network.visual_encoder.transformer_engine=false. |
| ONNX/HF export complains about missing components | Export checkpoint is partial or adapter-only. | Use a full checkpoint or configure pretrained visual/text sources before export. |
| CUDA OOM | Batch/resolution too high for the GPU. | Reduce batch size, use 224p, enable LoRA, or use more GPUs. |
