Video Reasoning Annotation Pipeline

Generate Chain-of-Thought training datasets from videos by producing multi-level captions, structured descriptions, and QA pairs (MCQ, binary, open-ended) with step-by-step reasoning traces. Domain-agnostic by default — customize prompts for any video domain.

Purpose

Transform raw videos into CoT Q&A training data for video understanding models. VLMs (e.g., Gemini, Qwen) act as "teacher" annotators: Steps 0–1 require the model to see the video (VLM calls); Steps 2–3 are text-to-text (cheaper LLM calls).

Pipeline architecture

Step 0:  [Optional] Filter & classify videos  → Keep domain-relevant, classify anomaly vs normal
Step 1a: Global + dense captions               → VLM: narrative summary + timestamped events
Step 1b: Chunk captions                         → VLM: fixed-duration segment micro-captions
Step 1c: [Optional, anomaly only] Highlight     → LLM extracts anomaly timestamp, VLM captions clip
Step 2:  Description synthesis                  → LLM: synthesize captions into structured narrative
Step 3:  QA generation                          → LLM: MCQ, binary, open-ended with reasoning
Step 4:  Parse outputs                          → Per-task `tao-vl-reason-v1.0` JSON files
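
Step 1b above captions fixed-duration segments. As a rough illustration only, the segment boundaries for one video could be computed as below. The function name and the `chunk_s` parameter are hypothetical; the pipeline's actual chunk length and its handling of the final partial segment are not stated here.

```python
def chunk_bounds(duration_s, chunk_s):
    """Split [0, duration_s] into consecutive windows of at most chunk_s seconds."""
    if duration_s < 0 or chunk_s <= 0:
        raise ValueError("duration_s must be >= 0 and chunk_s must be > 0")
    bounds = []
    start = 0.0
    while start < duration_s:
        # The last window is clipped to the end of the video (an assumption).
        end = min(start + chunk_s, duration_s)
        bounds.append((start, end))
        start = end
    return bounds
```

In this sketch a 25-second video with 10-second chunks yields three windows, the last one 5 seconds long.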

Steps are individually selectable via workflow.steps. The pipeline has built-in resume — each step skips already-processed videos, so re-running after a prompt tweak is safe.
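
The resume behaviour described above can be pictured with a small sketch. This is not the pipeline's implementation: it assumes a step's JSONL output identifies each video under a `video_path` key, which is a guess, and both function names are hypothetical.

```python
import json
from pathlib import Path


def processed_ids(jsonl_path, key="video_path"):
    """Collect video identifiers already present in a step's JSONL output."""
    path = Path(jsonl_path)
    if not path.exists():
        return set()
    done = set()
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            if key in record:  # the key name is an assumption
                done.add(record[key])
    return done


def pending(videos, jsonl_path, key="video_path"):
    """Return the videos that still need processing, preserving input order."""
    done = processed_ids(jsonl_path, key)
    return [v for v in videos if v not in done]
```

Calling `pending(all_videos, "results/step_1a_caption/captions.jsonl")` would then list only the videos that have no caption record yet.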

Initial consultation

When the user invokes this skill, walk through these questions in order. Don't skip — getting domain and VLM access right up front prevents wasted runs.

1. Videos

Path to the video directory and/or a JSONL with {"video_path": "..."} per line.
Confirm format (.mp4 preferred; .avi, .mov, .mkv also walked).
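
If the user has only a directory and wants a JSONL manifest, a helper along these lines could produce one. It is a sketch: `write_manifest` is a hypothetical name, and matching extensions case-insensitively is an assumption rather than documented pipeline behaviour.

```python
import json
from pathlib import Path

VIDEO_EXTS = {".mp4", ".avi", ".mov", ".mkv"}


def write_manifest(video_dir, out_jsonl):
    """Write one {"video_path": ...} line per video found under video_dir."""
    root = Path(video_dir)
    videos = sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in VIDEO_EXTS
    )
    with open(out_jsonl, "w", encoding="utf-8") as fh:
        for p in videos:
            fh.write(json.dumps({"video_path": str(p)}) + "\n")
    return len(videos)
```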

2. Domain — drives prompt selection

Ask the user: "What domain are these videos from?" Choose one of the following branches:

Domain	What to do
general	Use the default prompts. Set `prompts_module: ""` (or omit). The built-in `nvidia_tao_ds.auto_label.video_reasoning_annotation.prompts` covers domain-agnostic content.
traffic (CCTV intersections, highways; dashcam excluded)	Use the reference module. Set `prompts_module: "nvidia_tao_ds.auto_label.video_reasoning_annotation.prompts_traffic"`, or copy `references/prompts_traffic.py` into the user's project and tune for their specific camera angles, then point `prompts_module` at the copy.
warehouse (industrial site CCTV — safety, operations, security)	Same pattern. Set `prompts_module: "nvidia_tao_ds.auto_label.video_reasoning_annotation.prompts_warehouse"`, or copy `references/prompts_warehouse.py` and tune.
custom (any other domain)	Run the workshop in references/domain_adaptation.md. It walks through: Phase 1 — question types the user wants the model to answer; Phase 2 — caption-requirements checklist; Phase 3 — fill the `[PLACEHOLDER]` markers in `nvidia_tao_ds.auto_label.video_reasoning_annotation.prompt_template`. The two reference modules above are working examples to model after. Do this before any pipeline runs.

3. Anomaly / normal / mixed

Mixed dataset → workflow.mode: "auto" (Step 0 classifies each video).
Pre-split anomaly only → workflow.mode: "anomaly", drop Step 0.
Pre-split normal only → workflow.mode: "normal", drop Steps 0 and 1c.
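
The three branches above reduce to a small lookup. The sketch below uses the step identifiers from the workflow.steps default; the function name and the "mixed" / "anomaly" / "normal" labels for the dataset kind are illustrative.

```python
ALL_STEPS = ["0", "1a", "1b", "1c", "2", "3", "4"]


def workflow_for(dataset_kind):
    """Map a dataset kind to the workflow.mode and workflow.steps to request."""
    if dataset_kind == "mixed":
        return {"mode": "auto", "steps": list(ALL_STEPS)}
    if dataset_kind == "anomaly":
        # Already known to be anomalous: no need to classify in Step 0.
        return {"mode": "anomaly", "steps": [s for s in ALL_STEPS if s != "0"]}
    if dataset_kind == "normal":
        # No classification and no anomaly highlight clip.
        return {"mode": "normal",
                "steps": [s for s in ALL_STEPS if s not in {"0", "1c"}]}
    raise ValueError(f"unknown dataset kind: {dataset_kind!r}")
```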

4. VLM / LLM endpoint — confirm access before running

Gemini (default for both vlm.backend and llm.backend): user needs GOOGLE_API_KEY set, or to put the key in the YAML.
OpenAI-compatible (Qwen via vLLM, NIM endpoint, etc.): user provides base_url, model_name, and api_key.
Steps 2–3 are text-only — a smaller/cheaper LLM is fine for llm.backend even when vlm.backend is a frontier video model.

If the user has no endpoint at all and wants to self-host, point them at the skills/applications/tao-run-inference-service skill, a workflow that stands up a network-specific TAO inference microservice locally and exposes an OpenAI-compatible endpoint. It is expected to support Cosmos, Qwen, and Gemma, but check skills/applications/tao-run-inference-service/references/service.yaml for the current valid_network_arch_config_basenames list before relying on a specific model.

If the user doesn't have endpoint access ready and isn't ready to set one up, stop here and help them figure it out first.
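
A quick preflight check can catch missing credentials before any video is processed. The sketch below only tests that the required values are present; it does not contact the endpoint. The function name and the flat `settings` dict are hypothetical and do not reflect the real spec layout.

```python
import os


def endpoint_problems(backend, settings, env=None):
    """Return human-readable problems; an empty list means nothing is missing."""
    env = os.environ if env is None else env
    problems = []
    if backend == "gemini":
        # Either the YAML carries the key or GOOGLE_API_KEY is set.
        if not (settings.get("api_key") or env.get("GOOGLE_API_KEY")):
            problems.append("Gemini needs GOOGLE_API_KEY or an api_key in the YAML")
    elif backend == "openai":
        for field in ("base_url", "model_name", "api_key"):
            if not settings.get(field):
                problems.append(f"OpenAI-compatible backend is missing {field}")
    else:
        problems.append(f"unknown backend: {backend!r}")
    return problems
```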

5. Pilot vs full run

Recommend a 5–10 video pilot when domain is custom, when any prompt was edited, or when this is the user's first run.
A full run is fine for general / traffic / warehouse once the user has previously verified output quality on the same data type.
The pipeline has built-in resume, so a pilot followed by a full run does not re-process the pilot videos.
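
One way to pick the pilot videos is a seeded random sample, so the same subset is chosen on every invocation. This is an illustrative helper, not part of the pipeline; the name `pilot_subset` and the default size of 5 are assumptions.

```python
import random


def pilot_subset(videos, size=5, seed=0):
    """Pick a reproducible random sample of up to `size` videos."""
    videos = sorted(videos)  # sort first so the result does not depend on input order
    if len(videos) <= size:
        return videos
    return sorted(random.Random(seed).sample(videos, size))
```

The selected paths can then be written to a JSONL file with one {"video_path": "..."} object per line and passed to the pilot run.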

Quick start

The pipeline runs inside the TAO Toolkit container via the auto_label CLI:

auto_label generate -e /path/to/spec.yaml \
    results_dir=/results \
    video_reasoning_annotation.data.video_root=/videos \
    video_reasoning_annotation.vlm.gemini.api_key=$GOOGLE_API_KEY \
    video_reasoning_annotation.workflow.mode=auto

Generate a default spec to start from:

auto_label default_specs results_dir=/results module_name=auto_label
# then set:  autolabel_type: "video_reasoning_annotation"

All fields support Hydra dot-notation overrides on the command line. For the full YAML reference (every field, model/endpoint setup, error patterns), see references/configuration.md.
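
When the overrides are assembled programmatically, a nested dict can be flattened into the dot-notation shown above. The helper names below are hypothetical, and the sketch handles scalar values only; list-valued fields such as workflow.steps need Hydra's list syntax and are deliberately rejected here.

```python
import shlex


def to_overrides(config, prefix=""):
    """Flatten a nested dict into key=value overrides in dot-notation."""
    overrides = []
    for key, value in config.items():
        dotted = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            overrides.extend(to_overrides(value, dotted))
        elif isinstance(value, (list, tuple)):
            raise TypeError(f"{dotted}: list values are not handled by this sketch")
        else:
            overrides.append(f"{dotted}={value}")
    return overrides


def build_command(spec_path, config):
    """Build the argv list for `auto_label generate` without running it."""
    return ["auto_label", "generate", "-e", spec_path, *to_overrides(config)]
```

`shlex.join(build_command(...))` gives a copy-pasteable shell line; passing the list to `subprocess.run` inside the container would execute it.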

Pilot workflow

Use this when running a 5–10 video pilot:

Run the pipeline on the pilot subset with the chosen prompts_module and workflow.mode.
Inspect results_dir/step_1a_caption/captions.jsonl — captions accurate, capturing the right level of detail?
Inspect results_dir/step_3_qa/qa_output.jsonl — questions meaningful, answers correct, reasoning logical?
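If quality is insufficient: adjust the prompts (in prompts_module if domain-customized, or fall back to general if a domain module is over-tuned), and re-run. Each step skips videos it has already processed, so use a fresh results_dir when the pilot videos themselves should be regenerated with the edited prompts.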
Once satisfied, scale to the full dataset by pointing data.video_root (or data.input_jsonl_files) at the full set and re-running with the same results_dir (resume) or a fresh one (full re-run).

Quality compounds downstream — bad captions produce bad descriptions which produce bad QA. Focus iteration on Step 1a/1b output first; descriptions and QA usually improve once captions are right.

Configuration summary

Key fields (full reference in references/configuration.md):

Field	Default	Description
`workflow.steps`	`["0","1a","1b","1c","2","3","4"]`	Which pipeline steps to execute
`workflow.mode`	`"auto"`	`"auto"`, `"anomaly"`, or `"normal"`
`vlm.backend`	`"gemini"`	`"gemini"` or `"openai"` (OpenAI-compatible)
`llm.backend`	`"gemini"`	Same options; text-only, cheaper model works
`workflow.max_workers`	`4`	Parallel threads per step (watch API rate limits)
`license`	`""`	Optional: written to `metadata.license` in step 4 outputs (e.g. `"CC-BY-4.0"`)
`description_extra`	`""`	Optional: extra text appended to per-task descriptions in step 4 metadata
`prompts_module`	`""`	Dotted import path to custom prompts module
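
The allowed values in the table lend themselves to a quick sanity check before launching a run. The sketch below is illustrative: `config_errors` is a hypothetical helper that takes the video_reasoning_annotation section as a plain dict and falls back to the defaults listed above.

```python
ALLOWED_STEPS = {"0", "1a", "1b", "1c", "2", "3", "4"}
ALLOWED_MODES = {"auto", "anomaly", "normal"}
ALLOWED_BACKENDS = {"gemini", "openai"}


def config_errors(cfg):
    """Return a list of problems found in the key fields; empty means OK."""
    errors = []
    workflow = cfg.get("workflow", {})
    steps = workflow.get("steps", ["0", "1a", "1b", "1c", "2", "3", "4"])
    unknown = [s for s in steps if s not in ALLOWED_STEPS]
    if unknown:
        errors.append(f"unknown workflow.steps: {unknown}")
    mode = workflow.get("mode", "auto")
    if mode not in ALLOWED_MODES:
        errors.append(f"unknown workflow.mode: {mode!r}")
    for section in ("vlm", "llm"):
        backend = cfg.get(section, {}).get("backend", "gemini")
        if backend not in ALLOWED_BACKENDS:
            errors.append(f"unknown {section}.backend: {backend!r}")
    return errors
```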

Prompts

Built-in (general): nvidia_tao_ds.auto_label.video_reasoning_annotation.prompts — domain-agnostic, used by default.
Template: nvidia_tao_ds.auto_label.video_reasoning_annotation.prompt_template — same 26 keys with [PLACEHOLDER] markers for domain customization.
Reference modules (working examples for the consultation's traffic / warehouse branches): references/prompts_traffic.py, references/prompts_warehouse.py.
Custom domains: see references/domain_adaptation.md for the full workshop and placeholder reference.
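
Because a custom module is expected to carry the same keys as the template, a key comparison is a cheap check before a pilot. How the prompt modules expose their keys is not specified here, so the sketch works on plain mappings; the helper names and the literal "[PLACEHOLDER]" marker text are assumptions.

```python
def key_diff(template, custom):
    """Return (missing, extra) key names of `custom` relative to `template`."""
    missing = sorted(set(template) - set(custom))
    extra = sorted(set(custom) - set(template))
    return missing, extra


def unfilled(custom, marker="[PLACEHOLDER]"):
    """List keys whose prompt text still contains the placeholder marker."""
    return sorted(
        key for key, text in custom.items()
        if isinstance(text, str) and marker in text
    )
```

An empty result from both functions suggests the custom prompts are structurally complete; it says nothing about their quality, which the pilot run is for.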

Inputs

video_root: Directory of videos (walked recursively for .mp4, .avi, .mov, .mkv).
input_jsonl_files: List of JSONL files with {"video_path": "..."} per line. A "video" key is also accepted in place of "video_path"; extra fields are allowed.
filter_field: Optional boolean field to filter JSONL entries.

Provide video_root, input_jsonl_files, or both (lists merge).
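
To make the JSONL rules concrete, here is a sketch of a reader that accepts either key and applies an optional filter field. It assumes filter_field keeps entries whose field is truthy and drops entries that lack it; the actual semantics may differ, and `load_entries` is a hypothetical name.

```python
import json


def load_entries(jsonl_path, filter_field=None):
    """Read video paths from a JSONL file, honouring an optional boolean filter."""
    paths = []
    with open(jsonl_path, encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue
            record = json.loads(line)
            # Assumption: keep the entry only when the filter field is truthy.
            if filter_field and not record.get(filter_field, False):
                continue
            path = record.get("video_path") or record.get("video")
            if path:
                paths.append(path)
    return paths
```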

Outputs

All outputs go to results_dir/ with per-step subdirectories (step_0_filter/, step_1a_caption/, …, step_4_output/):

Steps 0–3: JSONL — one JSON object per video per line.
Step 4: One <task>.json per non-empty task type, in the tao-vl-reason-v1.0 envelope. Up to 10 files: mcq.json, mcq_openended.json, bcq.json, bcq_openended.json, open_qa.json, causal_linkage.json, temporal_localization.json, temporal_description.json, scene_description.json, video_summarization.json.

Each step 4 file looks like:

{
  "format": "tao-vl-reason-v1.0",
  "metadata": {"type": "annotation", "task": "<task>", "date": "YYYY-MM-DD",
               "description": "<per-task + description_extra>", "license": "<from config>"},
  "media_root": "<data.video_root>" | null,
  "items": [{"video_id": "...", "question": "...", "answer": "...", "reasoning": "..."}, ...]
}

media_root mirrors data.video_root (or null when unset); each item's video_id is the entry's video path with the video_root prefix stripped. Set license and description_extra in the spec to populate the metadata.
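
A downstream consumer might check the envelope and reproduce the video_id rule roughly as follows. This is a sketch under assumptions: only the keys named above are checked, per-task item fields are left to the caller through the hypothetical `item_keys` parameter, and returning the path unchanged when video_root is unset is a guess.

```python
import os

REQUIRED_TOP = ("format", "metadata", "media_root", "items")


def envelope_errors(doc, item_keys=("video_id",)):
    """Return a list of problems found in a parsed Step 4 JSON document."""
    errors = [f"missing key: {k}" for k in REQUIRED_TOP if k not in doc]
    if doc.get("format") != "tao-vl-reason-v1.0":
        errors.append(f"unexpected format: {doc.get('format')!r}")
    for i, item in enumerate(doc.get("items", [])):
        for k in item_keys:
            if k not in item:
                errors.append(f"items[{i}] missing {k}")
    return errors


def video_id_for(video_path, video_root):
    """Strip the video_root prefix from a path (path returned as-is if root is None)."""
    if video_root is None:
        return video_path
    return os.path.relpath(video_path, video_root)
```

Note that `os.path.relpath` returns a path starting with `..` when the video lies outside video_root, which a stricter check could reject.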

Prerequisites

Container: tao_toolkit.pyt (resolves to nvcr.io/nvidia/tao/tao-toolkit:6.26.3-pyt via versions.yaml).
ffmpeg / ffprobe: required for chunk captioning (Step 1b) and highlight extraction (Step 1c).
VLM endpoint: at least one — Gemini API key or OpenAI-compatible endpoint.
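
Inside the container, the ffmpeg / ffprobe prerequisite can be verified with the standard library before starting Steps 1b or 1c. The helper name is illustrative.

```python
import shutil


def missing_tools(tools=("ffmpeg", "ffprobe")):
    """Return the tools from `tools` that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]
```

An empty list means both binaries are on PATH; it does not check their versions.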
