Image Referring Expression Pipeline

Generate referring-expression and grounding annotations from images with KITTI-format bounding box labels. A single VLM (Gemini or any OpenAI-compatible endpoint) runs four steps: per-object region descriptions, holistic image captions, grouped grounding expressions tied to bboxes, and an optional double-check verification pass.

Purpose

Transform (image, KITTI labels) pairs into a unified annotations.jsonl containing rich, grounded referring expressions. The VLM acts as a "teacher" annotator: Steps 0 and 1 each see the image; Step 2 assigns Step 0 outputs to grouping phrases, each with its own bbox list; Step 3 (optional) re-examines those bboxes against the image and corrects mismatches.

Pipeline Architecture

Step 0: Region expression  ──┐
                              ├──▶  Step 2: Grounding expression  ──▶  [Step 3: Double check]
Step 1: Image caption  ──────┘                                                   (optional)

Step 0 (region_expr) — VLM emits one short discriminative phrase per KITTI bbox (bbox_2d, type, color, description).
Step 1 (image_caption) — VLM emits a holistic, location-agnostic scene caption.
Step 2 (grounding_expr) — VLM groups Step 0 objects into grouping phrases and returns one bbox list per group, optionally using Step 1's caption as extra context.
Step 3 (double_check) — VLM re-checks each Step 2 bbox against the image; bad matches are removed, slightly-off boxes get tightened.

Steps 0 and 1 run in parallel within a single thread pool, since both depend only on the seed records. Each step writes its own step_<N>_*/annotations.jsonl. On re-run, a step skips images it has already processed unless workflow.force_reprocess is set to true.
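
The parallel execution and resume behaviour described above can be sketched roughly as follows. This is an illustration only, not the pipeline's actual orchestrator: the function names (load_done_ids, run_step, run_steps_0_and_1) and the region_fn / caption_fn callables are hypothetical stand-ins for the real VLM calls, and the sketch submits one task per step rather than one per image to keep file writes simple.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def load_done_ids(out_path):
    """Return the image_ids already present in a step's annotations.jsonl."""
    if not out_path.exists():
        return set()
    with out_path.open() as f:
        return {json.loads(line)["image_id"] for line in f if line.strip()}


def run_step(step_dir, records, annotate, force_reprocess=False):
    """Annotate records missing from step_dir/annotations.jsonl; return how many ran."""
    step_dir.mkdir(parents=True, exist_ok=True)
    out_path = step_dir / "annotations.jsonl"
    if force_reprocess and out_path.exists():
        out_path.unlink()  # discard cached output and start over
    done = load_done_ids(out_path)
    todo = [r for r in records if r["image_id"] not in done]
    with out_path.open("a") as f:
        for rec in todo:
            # annotate() is a hypothetical stand-in for the VLM call
            f.write(json.dumps(annotate(rec)) + "\n")
    return len(todo)


def run_steps_0_and_1(results_dir, records, region_fn, caption_fn,
                      max_workers=4, force_reprocess=False):
    """Run Step 0 and Step 1 concurrently; both read only the seed records."""
    results_dir = Path(results_dir)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        f0 = pool.submit(run_step, results_dir / "step_0_region_expr",
                         records, region_fn, force_reprocess)
        f1 = pool.submit(run_step, results_dir / "step_1_image_caption",
                         records, caption_fn, force_reprocess)
        return f0.result(), f1.result()
```

Because each step appends to its own file and checks image_id before processing, an interrupted run can be restarted without repeating completed VLM calls.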

Instructions

Initial setup

When a user wants to run this pipeline, walk through these steps:

Images: Ask for data.image_dir, the directory containing .jpg, .jpeg, or .png images.
KITTI labels: Ask for data.kitti_label_dir, the directory containing one .txt label file per image. Each label line must use KITTI format: <type> <truncated> <occluded> <alpha> <bbox_left> <bbox_top> <bbox_right> <bbox_bottom> .... Lines with fewer than 8 fields are silently skipped. Set this even for Step 1-only runs because Steps 0 and 2 require it.
Resume from existing annotations: If the user already has a unified annotations.jsonl from a previous run, set data.input_annotations_jsonl to that file instead of seeding from data.image_dir and data.kitti_label_dir.
API access: Ask the user which VLM endpoint they want to use. Present these five options and act on the choice:
1. Gemini — set vlm.backend: "gemini"; require GOOGLE_API_KEY (env var or vlm.gemini.api_key).
2. NIM (e.g. https://inference-api.nvidia.com/v1) — set vlm.backend: "openai"; collect base_url, model_name, and api_key.
3. TAO inference microservice (self-hosted, OpenAI-compatible). Confirm whether the server is already running:
  - Running — collect base_url, model_name, and (optionally) api_key; set vlm.backend: "openai".
  - Not running — guide the user through the skills/applications/tao-run-inference-service skill, which stands up a local TAO inference microservice with an OpenAI-compatible API. Before promising a specific model, check skills/applications/tao-run-inference-service/references/service.yaml for valid_network_arch_config_basenames. Once the server is up, collect base_url, model_name, and (optionally) api_key; set vlm.backend: "openai".
4. vLLM (self-hosted, OpenAI-compatible). Confirm whether the server is already running:
  - Running — collect base_url, model_name, and (optionally) api_key; set vlm.backend: "openai".
  - Not running — follow references/vllm_server.md to install and launch a vLLM server, then collect base_url, model_name, and (optionally) api_key; set vlm.backend: "openai".
5. Custom (any other OpenAI-compatible endpoint) — set vlm.backend: "openai"; collect base_url, model_name, and (optionally) api_key.
If the user has no endpoint and does not want to set one up, stop and help resolve API access first.
Workflow steps: Choose one of:
- Full pipeline: ["0", "1", "2", "3"]
- No caption generation: ["0", "2", "3"], where Step 2 falls back to image-only context
- No verification: ["0", "1", "2"]
- Custom subset: any supported subset of steps
Output format: Choose one of:
- jsonl: unified schema only
- legacy: byte-compatible .txt.stepN files only
- both: writes both formats; the default spec sets this value so downstream tooling can use either one

Running the pipeline

The pipeline runs inside the TAO Toolkit container via the auto_label CLI:

auto_label generate -e /path/to/spec.yaml \
    results_dir=/results \
    image_referring_expression.data.image_dir=/data/images \
    image_referring_expression.data.kitti_label_dir=/data/labels \
    image_referring_expression.vlm.gemini.api_key=$GOOGLE_API_KEY

Generate a default spec: auto_label default_specs results_dir=/results module_name=auto_label, then set autolabel_type: "image_referring_expression". All fields support Hydra dot-notation overrides on the command line.

See references/configuration.md for the full YAML structure, all parameters, model/endpoint setup, and error patterns.

Recommended pilot workflow

Run on 5-10 images with all four steps.
Inspect step_0_region_expr/annotations.jsonl — are object types, colors, and discriminating phrases accurate?
Inspect step_2_grounding_expr/annotations.jsonl — are objects grouped sensibly, and do bbox coordinates match the described groups?
Inspect step_3_double_check/annotations.jsonl — were mismatched bboxes removed or tightened? Are any new errors introduced (rare)?
If quality is insufficient, switch the VLM to a stronger model (e.g. gemini-2.5-pro or a larger Qwen3-VL endpoint), raise media_resolution / max_output_tokens, then re-run with workflow.force_reprocess=true.
Scale to the full dataset once satisfied.
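
For the Step 3 inspection in the pilot, a small script can count how many Step 2 boxes survived unchanged. The sketch below assumes the expressions[] shape listed under Outputs and assumes that Step 3 keeps each expression's text unchanged so the two files can be matched by text; the helper names are hypothetical and not part of the auto_label CLI.

```python
import json
from pathlib import Path


def load_expressions(path):
    """Map image_id -> {expression text: [bbox tuples]} from an annotations.jsonl."""
    out = {}
    with Path(path).open() as f:
        for line in f:
            if not line.strip():
                continue
            rec = json.loads(line)
            out[rec["image_id"]] = {
                e["text"]: [tuple(i["bbox"]) for i in e.get("instances", [])]
                for e in rec.get("expressions", [])
            }
    return out


def diff_double_check(step2_path, step3_path):
    """Count Step 2 boxes that Step 3 kept as-is versus removed or adjusted."""
    before = load_expressions(step2_path)
    after = load_expressions(step3_path)
    kept = changed = 0
    for image_id, exprs in before.items():
        for text, boxes in exprs.items():
            # Assumption: expressions are matched across steps by identical text
            after_boxes = set(after.get(image_id, {}).get(text, []))
            for box in boxes:
                if box in after_boxes:
                    kept += 1
                else:
                    changed += 1
    return {"kept": kept, "removed_or_tightened": changed}
```

A high removed_or_tightened count on the pilot set suggests that Step 2 grounding is weak for the chosen model, which is a signal to try a stronger endpoint before scaling up.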

Configuration

Key configuration fields (full reference in references/configuration.md):

Field	Default	Description
`workflow.steps`	`["0","1","2","3"]`	Which steps to execute (`0`=region_expr, `1`=image_caption, `2`=grounding_expr, `3`=double_check)
`workflow.max_workers`	`4`	Parallel threads per step (watch API rate limits)
`workflow.force_reprocess`	`false`	Ignore cached per-step outputs and reprocess from scratch
`workflow.output_format`	`"jsonl"` (set to `"both"` in the default spec)	`"jsonl"`, `"legacy"`, or `"both"`
`vlm.backend`	`"gemini"`	`"gemini"` or `"openai"` (OpenAI-compatible endpoint)
`data.image_dir`	required	Directory of input images (`.jpg` / `.jpeg` / `.png`)
`data.kitti_label_dir`	required (unless resuming)	Directory of KITTI-format `.txt` label files
`data.input_annotations_jsonl`	`""`	Optional pre-seeded `annotations.jsonl` (skips KITTI seeding)

Inputs

Two ways to seed the pipeline:

Image directory + KITTI labels (default). Set data.image_dir and data.kitti_label_dir. The orchestrator walks the image directory, reads the matching <stem>.txt KITTI file, parses bboxes (fields 0 + 4-7), reads each image's width/height via PIL, and writes a seed_annotations.jsonl to results_dir/.
Pre-seeded annotations JSONL (resume / pre-computed regions). Set data.input_annotations_jsonl to a file with one {"image_id", "image_path", "width", "height", "kitti_bboxes": [...]} object per line.
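
As a rough illustration of the KITTI seeding described above, the following sketch parses field 0 (type) and fields 4-7 (bbox) from each label line and skips lines with fewer than 8 fields. The function names and the inner key names of each kitti_bboxes entry ("type", "bbox") are assumptions made for this example; only the top-level record keys come from the description above. Width and height are passed in rather than read with PIL so the sketch has no dependencies.

```python
def parse_kitti_line(line):
    """Parse one KITTI label line; return None if it has fewer than 8 fields."""
    fields = line.split()
    if len(fields) < 8:
        return None  # short lines are skipped silently
    left, top, right, bottom = (float(v) for v in fields[4:8])
    # Key names here are hypothetical; the real seed schema may differ
    return {"type": fields[0], "bbox": [left, top, right, bottom]}


def parse_kitti_labels(text):
    """Parse the full contents of a <stem>.txt label file into a list of boxes."""
    parsed = (parse_kitti_line(line) for line in text.splitlines())
    return [box for box in parsed if box is not None]


def build_seed_record(image_id, image_path, width, height, label_text):
    """Assemble one seed record with the top-level keys of the pre-seeded JSONL."""
    return {
        "image_id": image_id,
        "image_path": image_path,
        "width": width,
        "height": height,
        "kitti_bboxes": parse_kitti_labels(label_text),
    }
```

Writing one json.dumps(record) per line for each image yields a file in the same general shape as the pre-seeded annotations input.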

Outputs

All outputs go to results_dir/:

seed_annotations.jsonl — initial per-image records (unless input_annotations_jsonl was supplied).
step_0_region_expr/annotations.jsonl — adds regions[] (each with bbox/bbox_2d, type, color, description).
step_1_image_caption/annotations.jsonl — adds caption (string).
step_2_grounding_expr/annotations.jsonl — adds expressions[] (each {text, instances: [{bbox: [x1,y1,x2,y2]}]}).
step_3_double_check/annotations.jsonl — same shape as Step 2, with bboxes removed/updated.
results_dir/annotations.jsonl — copy of the last completed step's output.
When workflow.output_format is "legacy" or "both", each step also writes byte-compatible step_<N>_*/labels/<stem>.txt.stepN files for the original 2d-data-engine tooling.
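
To consume the final annotations.jsonl downstream, the nested expressions[] structure can be flattened into one row per grounded box. This is a minimal sketch that assumes the Step 2 / Step 3 shape listed above ({text, instances: [{bbox: [x1,y1,x2,y2]}]}); iter_grounding_rows is a hypothetical helper, not part of the pipeline.

```python
import json


def iter_grounding_rows(jsonl_lines):
    """Yield (image_id, expression text, bbox tuple) for every grounded instance."""
    for line in jsonl_lines:
        if not line.strip():
            continue  # tolerate blank lines
        rec = json.loads(line)
        for expr in rec.get("expressions", []):
            for inst in expr.get("instances", []):
                yield rec["image_id"], expr["text"], tuple(inst["bbox"])


if __name__ == "__main__":
    # Example path; substitute your own results_dir
    with open("/results/annotations.jsonl") as f:
        for image_id, text, bbox in iter_grounding_rows(f):
            print(image_id, text, bbox)
```

Records that stopped before Step 2 (for example, a run with only Steps 0 and 1) have no expressions key and simply produce no rows.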

Prerequisites

Container: nvcr.io/nvidia/tao/tao-toolkit:6.26.3-pyt
API access: At least one VLM endpoint (Gemini API key or OpenAI-compatible endpoint capable of image input)
PIL / Pillow: Required to read image dimensions during seeding (already present in the TAO container)
