Image Grounding Pipeline

Turn (image, caption) pairs into per-image grounded annotations: cleaned captions, referring expressions with character spans, and pixel-space bounding boxes for each expression. A single VLM (Gemini or any OpenAI-compatible endpoint) handles both steps.

Purpose

Generate phrase-grounded training data for referring-expression and grounding models. The VLM acts as a "teacher" annotator: Step 0 extracts referring expressions from the caption while looking at the image; Step 1 grounds each expression in the image, returning a set of pixel-space bboxes with scores per expression.

Pipeline Architecture

Step 0: Expression extraction  → VLM cleans caption, extracts referring expressions + char spans
Step 1: Phrase grounding       → VLM returns pixel bboxes + scores per expression

Steps are individually selectable via workflow.steps. Each step writes a per-sample checkpoint to step_<N>_*/.ckpt/<sample_id>.json and skips already-processed records on re-run. Set workflow.force_reprocess: true to ignore checkpoints and reprocess from scratch.
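
The resume behavior can be pictured with a minimal sketch. It is not the pipeline's actual implementation: the function name run_step, the process callback, and the use of image_id as the sample id are assumptions made for illustration.

```python
import json
from pathlib import Path


def run_step(samples, step_dir, process, force_reprocess=False):
    """Process samples, reusing per-sample checkpoints unless forced.

    Hypothetical helper: `process` stands in for the VLM call of a step,
    and `image_id` is assumed to serve as the sample id.
    """
    ckpt_dir = Path(step_dir) / ".ckpt"
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    results = []
    for sample in samples:
        ckpt_path = ckpt_dir / f"{sample['image_id']}.json"
        if ckpt_path.exists() and not force_reprocess:
            # Processed on an earlier run: load the stored record instead.
            results.append(json.loads(ckpt_path.read_text(encoding="utf-8")))
            continue
        record = process(sample)
        ckpt_path.write_text(json.dumps(record), encoding="utf-8")
        results.append(record)
    return results
```

Writing one file per sample means an interrupted run loses at most the records that were in flight, which is why a re-run without force_reprocess is cheap.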

Instructions

Initial setup

When a user wants to run this pipeline, walk through these steps:

Input JSONL: Ask for the JSONL path. Each line must be one object like {"image_path": "...", "caption": "..."}. image_path can be absolute or relative.
Image root: If any image_path values are relative, set data.image_root to the directory they should resolve from.
API access: Ask the user which VLM endpoint they want to use. Present these five options and act on the choice:
1. Gemini — set vlm.backend: "gemini"; require GOOGLE_API_KEY (env var or vlm.gemini.api_key).
2. NIM (e.g. https://inference-api.nvidia.com/v1) — set vlm.backend: "openai"; collect base_url, model_name, and api_key.
3. TAO inference microservice (self-hosted, OpenAI-compatible). Confirm whether the server is already running:
  - Running — collect base_url, model_name, and (optionally) api_key; set vlm.backend: "openai".
  - Not running — guide the user through the skills/applications/tao-run-inference-service skill, which stands up a local TAO inference microservice with an OpenAI-compatible API. Before promising a specific model, check skills/applications/tao-run-inference-service/references/service.yaml for valid_network_arch_config_basenames. Once the server is up, collect base_url, model_name, and (optionally) api_key; set vlm.backend: "openai".
4. vLLM (self-hosted, OpenAI-compatible). Confirm whether the server is already running:
  - Running — collect base_url, model_name, and (optionally) api_key; set vlm.backend: "openai".
  - Not running — follow references/vllm_server.md to install and launch a vLLM server, then collect base_url, model_name, and (optionally) api_key; set vlm.backend: "openai".
5. Custom (any other OpenAI-compatible endpoint) — set vlm.backend: "openai"; collect base_url, model_name, and (optionally) api_key.
If the user has no endpoint and does not want to set one up, stop and help resolve API access first.
Workflow steps: Choose one of:
- Full pipeline: ["0", "1"]
- Expression extraction only: ["0"]
- Grounding only: ["1"], which requires existing step-0 output at results_dir/step_0_expression_extraction/annotations.jsonl
Resume vs fresh run: By default, the workflow reuses checkpoints and skips completed records. To reprocess everything, set image_grounding.workflow.force_reprocess=true.
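
Before launching a run, it can help to check the input JSONL against the requirements above. The following is a sketch only: validate_input_jsonl is a hypothetical helper, not part of the auto_label CLI, and it assumes relative paths resolve as image_root joined with image_path.

```python
import json
from pathlib import Path


def validate_input_jsonl(jsonl_path, image_root=""):
    """Return a list of (line_number, problem) tuples for an input JSONL.

    Hypothetical pre-flight check; the pipeline may validate differently.
    """
    problems = []
    with open(jsonl_path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((lineno, f"invalid JSON: {exc.msg}"))
                continue
            if not isinstance(record, dict):
                problems.append((lineno, "line is not a JSON object"))
                continue
            for key in ("image_path", "caption"):
                value = record.get(key)
                if not isinstance(value, str) or not value:
                    problems.append((lineno, f"missing or empty {key}"))
            image_path = record.get("image_path")
            if isinstance(image_path, str) and image_path:
                path = Path(image_path)
                if not path.is_absolute():
                    # Assumed resolution rule: image_root / image_path.
                    path = Path(image_root) / path
                if not path.is_file():
                    problems.append((lineno, f"image not found: {path}"))
    return problems
```

An empty return value means every line has both required fields and points at an existing file.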

Running the pipeline

The pipeline runs inside the TAO Toolkit container via the auto_label CLI:

auto_label generate -e /path/to/spec.yaml \
    results_dir=/results \
    image_grounding.data.input_jsonl=/data/captions.jsonl \
    image_grounding.data.image_root=/data/images \
    image_grounding.vlm.gemini.api_key=$GOOGLE_API_KEY

To generate a default spec, run auto_label default_specs results_dir=/results module_name=auto_label, then set autolabel_type: "image_grounding" in the generated file. All fields support Hydra dot-notation overrides on the command line.

See references/configuration.md for the full YAML structure, all parameters, model/endpoint setup, and error patterns.

Recommended pilot workflow

1. Run both steps on 5-10 images.
2. Inspect step_0_expression_extraction/annotations.jsonl: are cleaned_caption and expressions[] accurate, and are the right noun phrases captured?
3. Inspect step_1_grounding/annotations.jsonl: do the bboxes in expressions[].instances[] look right, and are the confidence scores reasonable?
4. If quality is insufficient, switch the VLM to a stronger model (e.g. gemini-2.5-pro) or raise media_resolution/max_output_tokens, then re-run with image_grounding.workflow.force_reprocess=true.
5. Scale to the full dataset once satisfied.
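
One way to prepare the pilot input is to sample a handful of lines from the full JSONL into a separate file and point data.input_jsonl at it. This is a sketch; write_pilot_subset and its parameters are hypothetical names, not part of the pipeline.

```python
import random


def write_pilot_subset(input_jsonl, output_jsonl, n=10, seed=0):
    """Copy up to n randomly chosen non-empty lines into a pilot JSONL.

    Hypothetical helper; a fixed seed keeps the pilot set reproducible.
    """
    with open(input_jsonl, encoding="utf-8") as f:
        lines = [line for line in f if line.strip()]
    rng = random.Random(seed)
    chosen = rng.sample(lines, min(n, len(lines)))
    with open(output_jsonl, "w", encoding="utf-8") as f:
        for line in chosen:
            # Ensure every record ends with a newline.
            f.write(line if line.endswith("\n") else line + "\n")
    return len(chosen)
```

Random sampling, rather than taking the first lines, tends to give a more representative view of caption quality when the input file is sorted by source or category.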

Configuration

Key configuration fields (full reference in references/configuration.md):

Field	Default	Description
`workflow.steps`	`["0","1"]`	Which pipeline steps to execute (`"0"` = expressions, `"1"` = grounding)
`workflow.max_workers`	`4`	Parallel threads per step (watch API rate limits)
`workflow.force_reprocess`	`false`	Ignore per-sample checkpoints and reprocess from scratch
`vlm.backend`	`"gemini"`	`"gemini"` or `"openai"` (OpenAI-compatible endpoint)
`data.input_jsonl`	required	Path to input JSONL with `image_path` + `caption` per line
`data.image_root`	`""`	Optional prefix for resolving relative `image_path` entries

Inputs

A single JSONL file at data.input_jsonl. One JSON object per line:

Field	Required	Description
`image_path`	yes	Absolute path, or relative path resolved against `data.image_root`
`caption`	yes	Free-text caption for the image
`image_id`	no	Stable identifier; auto-derived from the filename if missing
`width`, `height`	no	Image dimensions in pixels; default to `1920×1080` for bbox clamping if missing
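
To illustrate how the width and height fallback affects bbox clamping, here is a minimal sketch. The function clamp_bbox is hypothetical and the pipeline's real clamping logic may differ; only the 1920×1080 default comes from the table above.

```python
DEFAULT_WIDTH, DEFAULT_HEIGHT = 1920, 1080  # fallback from the Inputs table


def clamp_bbox(bbox, record):
    """Clamp [x1, y1, x2, y2] to the image bounds recorded for a sample.

    Hypothetical helper: missing width/height fall back to the defaults.
    """
    width = record.get("width") or DEFAULT_WIDTH
    height = record.get("height") or DEFAULT_HEIGHT
    x1, y1, x2, y2 = bbox
    return [
        min(max(x1, 0), width),
        min(max(y1, 0), height),
        min(max(x2, 0), width),
        min(max(y2, 0), height),
    ]


# No dimensions in the record, so the 1920x1080 fallback applies.
print(clamp_bbox([-5, 10, 2500, 1200], {}))  # [0, 10, 1920, 1080]
```

Under this reading, images that are not 1920×1080 would be clamped against the wrong bounds unless width and height are supplied, so including them in the input is the safer choice.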

Outputs

All outputs go to results_dir/:

step_0_expression_extraction/annotations.jsonl — per-record output enriched with cleaned_caption and expressions[] (each with text, expression_id, char_span, noun_chunk, empty instances[]).
step_1_grounding/annotations.jsonl — same records with expressions[].instances[] filled in (each instance has bbox: [x1,y1,x2,y2] in pixel space, score in [0.0, 1.0], and bbox_id).
results_dir/annotations.jsonl — copy of the last step's output for convenience.
step_<N>_*/.ckpt/<sample_id>.json — per-sample checkpoints used for resume.
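
A quick sanity pass over the step 1 output might look like the sketch below. It relies only on the fields listed above (expressions[], instances[], bbox, score); the function name summarize_grounding and the choice of statistics are assumptions for illustration.

```python
import json


def summarize_grounding(jsonl_path):
    """Count expressions, instances, and suspicious instances in step 1 output.

    Hypothetical inspection helper, not part of the pipeline.
    """
    stats = {
        "records": 0,
        "expressions": 0,
        "ungrounded": 0,     # expressions with an empty instances[] list
        "instances": 0,
        "bad_instances": 0,  # inverted bbox or score outside [0.0, 1.0]
    }
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            record = json.loads(line)
            stats["records"] += 1
            for expr in record.get("expressions", []):
                stats["expressions"] += 1
                instances = expr.get("instances", [])
                if not instances:
                    stats["ungrounded"] += 1
                for inst in instances:
                    stats["instances"] += 1
                    x1, y1, x2, y2 = inst["bbox"]
                    score = inst["score"]
                    if not (x1 <= x2 and y1 <= y2 and 0.0 <= score <= 1.0):
                        stats["bad_instances"] += 1
    return stats
```

A high ungrounded count on a pilot run is one signal that the VLM or its resolution settings may need adjusting before scaling up.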

Prerequisites

Container: nvcr.io/nvidia/tao/tao-toolkit:6.26.3-pyt
API access: At least one VLM endpoint (Gemini API key or OpenAI-compatible endpoint capable of image input)
