Cosmos-Embed

Cosmos-Embed1 is a joint video-text embedder for text-to-video retrieval, video-to-video search, zero-shot/kNN classification, and semantic deduplication. The packaged CLI is cosmos-embed1 and supports train, evaluate, inference, and export.

Container image and per-action commands are in references/skill_info.yaml. Compact starting specs are in references/spec_template_*.yaml.

Train Action Policy

This model is AutoML-enabled at the model layer. Before handling any train-stage request, read references/skill_info.yaml and resolve the run override from either an explicit automl_policy value or the user's workflow request. Treat phrases like "turn off AutoML", "disable AutoML", "no HPO", or "plain training" as automl_policy: off for this run only; otherwise default to auto.

When automl_policy: auto, automl_enabled: true, and both schemas/train.schema.json and references/spec_template_train.yaml are packaged, route the train action through tao-skill-bank:tao-run-automl by default with this model's skill_dir. Preserve workflow/application overrides for datasets, specs, output directories, GPU/platform settings, parent checkpoints, and automl_policy.

Use direct model training only when automl_policy: off or the packaged train schema/template is missing; in the missing-schema case, report that AutoML is enabled but not runnable for this model until schemas are generated.

Non-train actions such as evaluate, inference, export, and deploy flows stay in this model skill. The per-run automl_policy override does not change model metadata.
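
The routing rules in this section can be read as a small decision function. The sketch below is illustrative only: the function names, the Route type, and the "direct-train" label are hypothetical, and it assumes that an explicit automl_policy value takes precedence over phrases found in the request text.

```python
from dataclasses import dataclass
from typing import Optional

# Phrases the policy treats as a per-run opt-out (matched case-insensitively).
OPT_OUT_PHRASES = ("turn off automl", "disable automl", "no hpo", "plain training")

AUTOML_ROUTE = "tao-skill-bank:tao-run-automl"
DIRECT_ROUTE = "direct-train"  # hypothetical label for direct model training


def resolve_policy(explicit: Optional[str], request_text: str) -> str:
    """Return the per-run policy ("off" or "auto"); model metadata is not changed."""
    if explicit is not None:
        # Assumption: an explicit value wins over the request wording.
        return explicit
    text = request_text.lower()
    if any(phrase in text for phrase in OPT_OUT_PHRASES):
        return "off"
    return "auto"


@dataclass
class Route:
    target: str
    note: str = ""


def choose_train_route(policy: str, automl_enabled: bool,
                       has_schema: bool, has_template: bool) -> Route:
    """Pick where a train request goes, following the rules stated above."""
    if policy == "off" or not automl_enabled:
        return Route(DIRECT_ROUTE)
    if has_schema and has_template:
        return Route(AUTOML_ROUTE)
    # AutoML is enabled, but the packaged schema or template is missing.
    return Route(
        DIRECT_ROUTE,
        "AutoML is enabled but not runnable for this model until schemas are generated.",
    )
```

Here has_schema and has_template stand for the presence of schemas/train.schema.json and references/spec_template_train.yaml in the skill directory.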

Quick Start

Use the published Cosmos-Embed container declared by references/skill_info.yaml and resolved through versions.yaml. Do not build from the private Cosmos-Embed1 source tree for normal skill use; build from source only when developing the container itself.

TAO_SKILL_BANK_PATH="${TAO_SKILL_BANK_PATH:-$PWD}"
COSMOS_EMBED_IMAGE="${COSMOS_EMBED_IMAGE:-$(
  python "$TAO_SKILL_BANK_PATH/scripts/resolve_tao_image.py" \
    --skill-bank "$TAO_SKILL_BANK_PATH" \
    --model cosmos-embed \
    --action train \
    --format json |
  python -c 'import json,sys; print(json.load(sys.stdin)["image"])'
)}"
docker pull "$COSMOS_EMBED_IMAGE"

Expected local workspace layout:

workspace/
├── data/
│   ├── msrvtt_test_1k.json
│   └── video/
│       ├── video7020.mp4
│       └── ...
├── model/
│   └── Cosmos-Embed1-224p/        # optional if using HF repo id
├── specs/
│   ├── train.yaml
│   ├── evaluate.yaml
│   ├── inference.yaml
│   ├── export_onnx.yaml
│   └── export_hf.yaml
└── results/

Use these Docker options for all actions unless the local Docker/platform skill gives a stricter environment-specific command:

TAO_SKILL_BANK_PATH="${TAO_SKILL_BANK_PATH:-$PWD}"
COSMOS_EMBED_IMAGE="${COSMOS_EMBED_IMAGE:-$(
  python "$TAO_SKILL_BANK_PATH/scripts/resolve_tao_image.py" \
    --skill-bank "$TAO_SKILL_BANK_PATH" \
    --model cosmos-embed \
    --action train \
    --format json |
  python -c 'import json,sys; print(json.load(sys.stdin)["image"])'
)}"
RUN_ROOT="${RUN_ROOT:-$PWD}"
DOCKER_COMMON=(
  --rm --gpus all --ipc=host --network=host
  --shm-size=64g
  --ulimit memlock=-1
  --ulimit stack=67108864
  -e HF_TOKEN
  -v "$RUN_ROOT/data:/data:ro"
  -v "$RUN_ROOT/model:/model"
  -v "$RUN_ROOT/specs:/specs:ro"
  -v "$RUN_ROOT/results:/results"
)

Train:

docker run "${DOCKER_COMMON[@]}" "$COSMOS_EMBED_IMAGE" \
  cosmos-embed1 train -e /specs/train.yaml results_dir=/results

Evaluate:

docker run "${DOCKER_COMMON[@]}" "$COSMOS_EMBED_IMAGE" \
  cosmos-embed1 evaluate -e /specs/evaluate.yaml results_dir=/results

Inference:

docker run "${DOCKER_COMMON[@]}" "$COSMOS_EMBED_IMAGE" \
  cosmos-embed1 inference -e /specs/inference.yaml \
  'inference.query.input_texts=["a man is singing on stage"]' \
  inference.k=5 \
  results_dir=/results

Export ONNX:

docker run "${DOCKER_COMMON[@]}" "$COSMOS_EMBED_IMAGE" \
  cosmos-embed1 export -e /specs/export_onnx.yaml \
  export.checkpoint=/results/train/cosmos_embed1_model_latest.pth \
  export.onnx_file=/results/export/cosmos_embed1_combined.onnx \
  results_dir=/results

Export HuggingFace format:

docker run "${DOCKER_COMMON[@]}" "$COSMOS_EMBED_IMAGE" \
  cosmos-embed1 export -e /specs/export_hf.yaml \
  export.checkpoint=/results/train/cosmos_embed1_model_latest.pth \
  export.hf_output_dir=/results/export_hf/cosmos_embed1_hf \
  results_dir=/results
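
When these commands are launched from Python rather than a shell, building the argument list directly avoids the shell quoting needed for list-valued overrides such as inference.query.input_texts. The following is a minimal sketch, not part of the packaged tooling: format_override and cli_argv are hypothetical helpers, and rendering lists as JSON is an assumption based on the override form shown in the inference command above.

```python
import json
from typing import List, Mapping, Sequence, Union

Value = Union[str, int, float, bool, None, Sequence[str]]


def format_override(key: str, value: Value) -> str:
    """Render one key=value override token for the cosmos-embed1 CLI."""
    if value is None:
        return f"{key}=null"
    if isinstance(value, bool):
        # Lowercase booleans, as in the smoke overrides in this document.
        return f"{key}={'true' if value else 'false'}"
    if isinstance(value, (list, tuple)):
        # Assumption: lists are written in JSON form, e.g. ["a", "b"].
        return f"{key}={json.dumps(list(value))}"
    return f"{key}={value}"


def cli_argv(action: str, spec: str, overrides: Mapping[str, Value]) -> List[str]:
    """Build the in-container command: cosmos-embed1 <action> -e <spec> k=v ..."""
    tokens = [format_override(k, v) for k, v in overrides.items()]
    return ["cosmos-embed1", action, "-e", spec] + tokens
```

The resulting list can be appended to a `docker run` argument list and passed to `subprocess.run` without a shell, so the brackets and quotes in the override reach the CLI unchanged.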

Smoke Overrides

For a small functional check, keep the same specs and override the expensive knobs:

train.max_iter=1
train.validation_iter=2
train.checkpoint_iter=1
train.optim.optim=adamw
dataset.train_dataset.batch_size=1
dataset.val_dataset.batch_size=1
dataset.train_dataset.workers=0
dataset.val_dataset.workers=0

If no local Cosmos-Embed1 pretrained checkpoint or HuggingFace token is available, set model.pretrained_model_path=null for a plumbing-only smoke train. The model quality is meaningless in that mode, but the train/evaluate/inference/export action paths can still be exercised.

For evaluation and inference smoke tests on a tiny subset:

evaluate.callbacks.embedding_visualization=false
evaluate.callbacks.max_eval_samples=8
dataset.test_dataset.batch_size=1
dataset.test_dataset.workers=0
inference.k=2
dataset.inference_dataset.batch_size=1
dataset.inference_dataset.workers=0
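
One way to keep smoke settings separate from real run settings is to store them per stage and merge them over the run's normal overrides, with the smoke values winning. The sketch below assumes this grouping of keys by stage (inferred from the key prefixes, with dataset.test_dataset assigned to evaluate); SMOKE and smoke_tokens are hypothetical names.

```python
from typing import Dict, List, Optional

# Smoke overrides from this section, grouped by stage (grouping is an assumption).
SMOKE: Dict[str, Dict[str, str]] = {
    "train": {
        "train.max_iter": "1",
        "train.validation_iter": "2",
        "train.checkpoint_iter": "1",
        "train.optim.optim": "adamw",
        "dataset.train_dataset.batch_size": "1",
        "dataset.val_dataset.batch_size": "1",
        "dataset.train_dataset.workers": "0",
        "dataset.val_dataset.workers": "0",
    },
    "evaluate": {
        "evaluate.callbacks.embedding_visualization": "false",
        "evaluate.callbacks.max_eval_samples": "8",
        "dataset.test_dataset.batch_size": "1",
        "dataset.test_dataset.workers": "0",
    },
    "inference": {
        "inference.k": "2",
        "dataset.inference_dataset.batch_size": "1",
        "dataset.inference_dataset.workers": "0",
    },
}


def smoke_tokens(stage: str, base: Optional[Dict[str, str]] = None) -> List[str]:
    """Merge smoke overrides over the base overrides; smoke values take priority."""
    merged = dict(base or {})
    merged.update(SMOKE[stage])
    return [f"{key}={value}" for key, value in merged.items()]
```

For example, `smoke_tokens("train", {"results_dir": "/results"})` yields tokens that can be appended to the train command shown in Quick Start.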

Data Format

The MSR-VTT path expects a local video glob and a JSON metadata file:

dataset:
  train_dataset:
    dataset_type: msrvtt
    mp4_urls: /data/video/*.mp4
    metadata: /data/msrvtt_test_1k.json

List-format metadata rows must include at least video and caption:

{"video_id": "video7020", "video": "video7020.mp4", "caption": "a woman creating a fondant baby and flower"}

The dataset loader derives the video id from the local .mp4 filename and filters to videos present in the metadata. If a run finds zero videos, check that mp4_urls points to a container-local glob and that metadata video names match the filenames.
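
A quick pre-flight check can catch the zero-videos case before a container is launched. This is a minimal sketch under two assumptions: the metadata file is a JSON array of row objects like the one above, and matching is done on the filename without its .mp4 extension. check_msrvtt_inputs is a hypothetical helper, not part of the CLI.

```python
import glob
import json
from pathlib import Path
from typing import Dict, List


def check_msrvtt_inputs(mp4_glob: str, metadata_path: str) -> Dict[str, List[str]]:
    """Compare video ids in the metadata with the .mp4 files matched by the glob."""
    # Assumption: the file holds a JSON array of rows with a "video" field.
    rows = json.loads(Path(metadata_path).read_text())
    listed = {Path(row["video"]).stem for row in rows}
    on_disk = {Path(path).stem for path in glob.glob(mp4_glob)}
    return {
        "usable": sorted(listed & on_disk),            # present in both places
        "missing_on_disk": sorted(listed - on_disk),   # in metadata, no file found
        "not_in_metadata": sorted(on_disk - listed),   # file found, no metadata row
    }
```

Run it against the host-side paths that will be mounted at /data; an empty "usable" list corresponds to the zero-videos symptom described above.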

Model Weights

Local HF directory: mount it under /model and set model.pretrained_model_path=/model/Cosmos-Embed1-224p.
HuggingFace repo: set model.pretrained_model_path=nvidia/Cosmos-Embed1-224p and pass HF_TOKEN if access is gated.
Fine-tuned checkpoint: downstream actions default to /results/train/cosmos_embed1_model_latest.pth.

Variants:

Variant	Resolution	Frames	Embedding dim
`Cosmos-Embed1-224p`	224 x 224	8	256
`Cosmos-Embed1-336p`	336 x 336	8	768
`Cosmos-Embed1-448p`	448 x 448	8	768

Keep model.network.embed_dim, model.input_hw, and model.network.spatial_resolution aligned with the selected variant.
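
The variant table can be encoded as a lookup for a simple consistency check before a run. The sketch below is illustrative: check_variant is a hypothetical helper, and it assumes model.input_hw is a [height, width] pair. It does not check model.network.spatial_resolution because this document does not state that field's value format.

```python
from typing import Dict, List, Sequence

# Values copied from the variant table above.
VARIANTS: Dict[str, Dict[str, object]] = {
    "Cosmos-Embed1-224p": {"resolution": (224, 224), "frames": 8, "embed_dim": 256},
    "Cosmos-Embed1-336p": {"resolution": (336, 336), "frames": 8, "embed_dim": 768},
    "Cosmos-Embed1-448p": {"resolution": (448, 448), "frames": 8, "embed_dim": 768},
}


def check_variant(variant: str, embed_dim: int, input_hw: Sequence[int]) -> List[str]:
    """Return a list of mismatches between the spec values and the chosen variant."""
    expected = VARIANTS[variant]
    problems = []
    if embed_dim != expected["embed_dim"]:
        problems.append(
            f"model.network.embed_dim={embed_dim}, expected {expected['embed_dim']}"
        )
    # Assumption: model.input_hw is given as [height, width].
    if tuple(input_hw) != expected["resolution"]:
        problems.append(
            f"model.input_hw={list(input_hw)}, expected {list(expected['resolution'])}"
        )
    return problems
```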

Important Parameters

Parameter	Notes
`train.num_gpus`	`1` runs on a single GPU; values `>1` auto-launch `torchrun`; `-1` auto-detects the visible GPUs.
`train.max_iter`	Main training length. Use `1` only for smoke testing.
`train.optim.optim`	`fused_adamw` is faster when available; `adamw` is safer for smoke and portability.
`model.lora.enabled`	Enables LoRA. Set `model.network.visual_encoder.transformer_engine=false` when LoRA is on.
`model.lora.lora_rank`	LoRA rank. Start with `8`; try `4`, `8`, or `16` for manual or AutoML-style sweeps.
`model.lora.lora_alpha`	LoRA scaling factor. Start with `16`; keep near `2 * lora_rank` unless experiments show otherwise.
`model.lora.lora_dropout`	LoRA dropout. Start with `0.1`; sweep `0.0`, `0.05`, and `0.1` for small datasets.
`model.lora.bias`	Bias policy: `none`, `all`, or `lora_only`. Keep `none` unless intentionally training biases.
`model.lora.use_rslora` / `use_dora`	Optional LoRA variants. Enable one at a time and record the setting with the checkpoint.
`model.lora.target_modules`	Optional module-name patterns for LoRA injection. Leave empty for the default ViT + Q-Former attention/MLP targets.
`model.lora.modules_to_save`	Optional modules to keep fully trainable alongside LoRA. Leave empty unless preserving a task-specific head.
`evaluate.load_dataset_pkl` / `save_dataset_pkl`	Load or save a pickle cache of evaluation embeddings for repeated evaluation.
`inference.load_dataset_pkl` / `save_dataset_pkl`	Cache the search database for repeated retrieval.
`export.mode`	`video`, `text`, `combined`, or `huggingface`.
`export.on_cpu`	Recommended for export to avoid device mismatch issues.

LoRA and AutoML Notes

For parameter-efficient fine-tuning, set model.lora.enabled=true and keep model.network.visual_encoder.transformer_engine=false; TAO Core's Cosmos-Embed1 config notes that PEFT cannot inject adapters into Transformer Engine layers. Treat the LoRA fields above as the first candidate parameters for manual tuning or AutoML-style search before unfreezing larger model blocks. Avoid changing target_modules or modules_to_save unless the user explicitly needs custom adapter placement.
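
For a manual sweep, the starting values from the Important Parameters table can be expanded into a small grid of override sets. This is a sketch of one possible grid, not the search space used by AutoML: it crosses the suggested ranks and dropout values, keeps lora_alpha at 2 * lora_rank, and always disables Transformer Engine in the visual encoder. lora_candidates is a hypothetical helper.

```python
from itertools import product
from typing import List

RANKS = (4, 8, 16)            # suggested LoRA ranks
DROPOUTS = (0.0, 0.05, 0.1)   # suggested dropout values for small datasets


def lora_candidates() -> List[List[str]]:
    """Return one list of CLI override tokens per (rank, dropout) combination."""
    candidates = []
    for rank, dropout in product(RANKS, DROPOUTS):
        candidates.append([
            "model.lora.enabled=true",
            # PEFT cannot inject adapters into Transformer Engine layers.
            "model.network.visual_encoder.transformer_engine=false",
            f"model.lora.lora_rank={rank}",
            f"model.lora.lora_alpha={2 * rank}",
            f"model.lora.lora_dropout={dropout}",
            "model.lora.bias=none",
        ])
    return candidates
```

Each inner list can be appended to the train command as extra overrides; target_modules and modules_to_save are deliberately left at their defaults.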

S3 Staging

The Cosmos-Embed1 CLI consumes local paths and Python globs, not raw s3://.../*.mp4 URIs. For S3-backed runs, first stage a subset or full dataset to the execution host/container filesystem, then use local paths such as /data/video/*.mp4 in the spec.

Recommended S3 layout for staged MSR-VTT data:

s3://bucket/path/cosmos-embed/msrvtt-subset/
├── msrvtt_test_1k.json
└── video/
    ├── video7020.mp4
    └── ...

After downloading/syncing that prefix into the mounted data/ directory, use the same Docker commands above.
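
The staging step can be sketched as a pair of small helpers: one that rejects remote URIs in spec paths, and one that builds an `aws s3 sync` command for the prefix. This assumes the AWS CLI is installed and credentials are already configured on the execution host; is_local_glob and s3_sync_argv are hypothetical names, and the command is only constructed here, not executed.

```python
from typing import List


def is_local_glob(path: str) -> bool:
    """True when the path has no URI scheme, so the CLI can glob it locally."""
    return "://" not in path


def s3_sync_argv(s3_prefix: str, run_root: str) -> List[str]:
    """Build the command that syncs an S3 prefix into <run_root>/data/."""
    if not s3_prefix.startswith("s3://"):
        raise ValueError(f"expected an s3:// prefix, got {s3_prefix!r}")
    source = s3_prefix.rstrip("/") + "/"
    target = f"{run_root.rstrip('/')}/data/"
    return ["aws", "s3", "sync", source, target]
```

Pass the returned list to `subprocess.run(..., check=True)` on the host, then point mp4_urls at the container-local glob such as /data/video/*.mp4.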

Outputs

results/
├── train/
│   ├── cosmos_embed1_model_latest.pth
│   ├── cosmos_embed1_model_<iter>.pth
│   └── experiment.yaml
├── evaluate/
│   ├── metrics.json
│   └── experiment.yaml
├── inference/
│   ├── results.json
│   └── experiment.yaml
├── export/
│   ├── cosmos_embed1_combined.onnx
│   └── export_config.yaml
└── export_hf/
    └── cosmos_embed1_hf/

Known Pitfalls

Symptom	Cause	Fix
`MSRVTTDataset: 0 videos found`	`mp4_urls` is not a local glob or metadata filenames do not match videos.	Mount data into the container, set `mp4_urls=/data/video/*.mp4`, and check that the metadata `video` names match the `.mp4` filenames.
HF download/auth failure	Missing or invalid `HF_TOKEN`, or model agreement not accepted.	Accept the model terms and pass `-e HF_TOKEN`.
LoRA injection failure	Transformer Engine visual encoder is enabled.	Set `model.network.visual_encoder.transformer_engine=false`.
ONNX/HF export complains about missing components	Export checkpoint is partial or adapter-only.	Use a full checkpoint or configure pretrained visual/text sources before export.
CUDA OOM	Batch/resolution too high for the GPU.	Reduce batch size, use 224p, enable LoRA, or use more GPUs.
