Image Grounding Pipeline
Turn (image, caption) pairs into per-image grounded annotations: cleaned captions, referring expressions with character spans, and pixel-space bounding boxes for each expression. A single VLM (Gemini or any OpenAI-compatible endpoint) handles both steps.
Purpose
Generate phrase-grounded training data for referring-expression and grounding models. The VLM acts as a "teacher" annotator: Step 0 extracts referring expressions from the caption while looking at the image; Step 1 returns one bbox set per expression for each image.
Pipeline Architecture
Step 0: Expression extraction → VLM cleans caption, extracts referring expressions + char spans
Step 1: Phrase grounding → VLM returns pixel bboxes + scores per expression
Steps are individually selectable via workflow.steps. Each step writes a per-sample checkpoint to step_<N>_*/.ckpt/<sample_id>.json and skips already-processed records on re-run. Set workflow.force_reprocess: true to ignore checkpoints and reprocess from scratch.
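The resume behaviour described above can be pictured with a small sketch. This is not the pipeline's actual code: the helper names `checkpoint_path`, `should_process`, and `write_checkpoint` are hypothetical, and only the `.ckpt/<sample_id>.json` layout and the `force_reprocess` flag come from the text.

```python
import json
from pathlib import Path


def checkpoint_path(step_dir: Path, sample_id: str) -> Path:
    """Location of the per-sample checkpoint inside a step directory."""
    return step_dir / ".ckpt" / f"{sample_id}.json"


def should_process(step_dir: Path, sample_id: str, force_reprocess: bool = False) -> bool:
    """Return True when the sample still needs work for this step."""
    if force_reprocess:
        return True  # ignore any existing checkpoint
    return not checkpoint_path(step_dir, sample_id).exists()


def write_checkpoint(step_dir: Path, sample_id: str, record: dict) -> None:
    """Persist one finished record so a later run can skip it (hypothetical helper)."""
    path = checkpoint_path(step_dir, sample_id)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(record), encoding="utf-8")
```

Because each sample has its own checkpoint file, an interrupted run loses at most the samples that were in flight, and a re-run only pays for records that have no checkpoint yet.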
Instructions
Initial setup
When a user wants to run this pipeline, walk through these steps:
- Input JSONL: Ask for the JSONL path. Each line must be one object like {"image_path": "...", "caption": "..."}. image_path can be absolute or relative.
- Image root: If any image_path values are relative, set data.image_root to the directory they should resolve from.
- API access: Ask the user which VLM endpoint they want to use. Present these five options and act on the choice:
  - Gemini: set vlm.backend: "gemini"; require GOOGLE_API_KEY (env var or vlm.gemini.api_key).
  - NIM (e.g. https://inference-api.nvidia.com/v1): set vlm.backend: "openai"; collect base_url, model_name, and api_key.
  - TAO inference microservice (self-hosted, OpenAI-compatible). Confirm whether the server is already running:
    - Running: collect base_url, model_name, and (optionally) api_key; set vlm.backend: "openai".
    - Not running: guide the user through the skills/applications/tao-run-inference-service skill, which stands up a local TAO inference microservice with an OpenAI-compatible API. Before promising a specific model, check skills/applications/tao-run-inference-service/references/service.yaml for valid_network_arch_config_basenames. Once the server is up, collect base_url, model_name, and (optionally) api_key; set vlm.backend: "openai".
  - vLLM (self-hosted, OpenAI-compatible). Confirm whether the server is already running:
    - Running: collect base_url, model_name, and (optionally) api_key; set vlm.backend: "openai".
    - Not running: follow references/vllm_server.md to install and launch a vLLM server, then collect base_url, model_name, and (optionally) api_key; set vlm.backend: "openai".
  - Custom (any other OpenAI-compatible endpoint): set vlm.backend: "openai"; collect base_url, model_name, and (optionally) api_key.
  If the user has no endpoint and does not want to set one up, stop and help resolve API access first.
- Workflow steps: Choose one of:
  - Full pipeline: ["0", "1"]
  - Expression extraction only: ["0"]
  - Grounding only: ["1"], which requires existing step-0 output at results_dir/step_0_expression_extraction/annotations.jsonl
- Resume vs fresh run: By default, the workflow reuses checkpoints and skips completed records. To reprocess everything, set image_grounding.workflow.force_reprocess=true.
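Before launching, it can help to confirm that the chosen step list is runnable, in particular that a grounding-only run has step-0 output to read. The following is a minimal sketch of such a pre-flight check. `check_steps` and `step0_output` are hypothetical helpers, not part of the auto_label CLI; the step names and the step-0 output path are taken from the list above.

```python
from pathlib import Path

VALID_STEPS = ("0", "1")


def step0_output(results_dir: str) -> Path:
    """Path where a grounding-only run expects the step-0 annotations."""
    return Path(results_dir) / "step_0_expression_extraction" / "annotations.jsonl"


def check_steps(steps: list[str], results_dir: str) -> list[str]:
    """Return a list of problems; an empty list means the selection looks runnable."""
    problems = []
    unknown = [s for s in steps if s not in VALID_STEPS]
    if unknown:
        problems.append(f"unknown steps: {unknown}")
    if not steps:
        problems.append("no steps selected")
    # Grounding without extraction only works when step-0 output already exists.
    if "1" in steps and "0" not in steps and not step0_output(results_dir).is_file():
        problems.append(f"grounding-only run needs {step0_output(results_dir)}")
    return problems
```

For example, `check_steps(["1"], "/results")` returns one problem until `/results/step_0_expression_extraction/annotations.jsonl` exists.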
Running the pipeline
The pipeline runs inside the TAO Toolkit container via the auto_label CLI:
auto_label generate -e /path/to/spec.yaml \
results_dir=/results \
image_grounding.data.input_jsonl=/data/captions.jsonl \
image_grounding.data.image_root=/data/images \
image_grounding.vlm.gemini.api_key=$GOOGLE_API_KEY
Generate a default spec: auto_label default_specs results_dir=/results module_name=auto_label, then set autolabel_type: "image_grounding". All fields support Hydra dot-notation overrides on the command line.
See references/configuration.md for the full YAML structure, all parameters, model/endpoint setup, and error patterns.
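When the command is assembled programmatically, the dot-notation overrides map naturally onto a dictionary. The sketch below only builds the argument list; `build_command` is a hypothetical helper, and the override keys are the ones shown in this section. It assumes the overrides are plain `key=value` tokens, as in the command above.

```python
import shlex


def build_command(spec_path: str, overrides: dict[str, str]) -> list[str]:
    """Assemble the argv list for `auto_label generate` with key=value overrides."""
    argv = ["auto_label", "generate", "-e", spec_path]
    argv += [f"{key}={value}" for key, value in overrides.items()]
    return argv


cmd = build_command(
    "/path/to/spec.yaml",
    {
        "results_dir": "/results",
        "image_grounding.data.input_jsonl": "/data/captions.jsonl",
        "image_grounding.data.image_root": "/data/images",
        "image_grounding.workflow.force_reprocess": "true",
    },
)
print(shlex.join(cmd))  # shell-quoted form for copy and paste
```

Inside the container the list can be passed to `subprocess.run(cmd, check=True)`. The API key is deliberately left out of the printed command; supplying it through the GOOGLE_API_KEY environment variable keeps it out of logs.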
Recommended pilot workflow
- Run on 5-10 images with both steps.
- Inspect step_0_expression_extraction/annotations.jsonl: are cleaned_caption and expressions[] accurate? Are the right noun phrases captured?
- Inspect step_1_grounding/annotations.jsonl: do the bboxes in expressions[].instances[] look right? Are confidence scores reasonable?
- If quality is insufficient, switch the VLM to a stronger model (e.g. gemini-2.5-pro) or raise media_resolution / max_output_tokens, then re-run with force_reprocess=true.
- Scale to the full dataset once satisfied.
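One way to prepare the pilot run is to sample a handful of records from the full input file into a separate JSONL. The sketch below is one possible approach, not something the pipeline provides: `write_pilot_subset` is a hypothetical helper, and random sampling with a fixed seed is an assumption (taking the first N lines would also work).

```python
import json
import random
from pathlib import Path


def write_pilot_subset(src: str, dst: str, n: int = 10, seed: int = 0) -> int:
    """Copy up to n randomly chosen non-empty lines from src JSONL to dst JSONL."""
    lines = [ln for ln in Path(src).read_text(encoding="utf-8").splitlines() if ln.strip()]
    for ln in lines:
        json.loads(ln)  # fail early on malformed JSON
    rng = random.Random(seed)  # fixed seed keeps the pilot set reproducible
    picked = lines if len(lines) <= n else rng.sample(lines, n)
    Path(dst).write_text("".join(ln + "\n" for ln in picked), encoding="utf-8")
    return len(picked)
```

Point data.input_jsonl at the subset file for the pilot, and use a separate results_dir so pilot checkpoints do not mix with the full run.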
Configuration
Key configuration fields (full reference in references/configuration.md):
| Field | Default | Description |
|---|---|---|
| workflow.steps | ["0","1"] | Which pipeline steps to execute ("0" = expressions, "1" = grounding) |
| workflow.max_workers | 4 | Parallel threads per step (watch API rate limits) |
| workflow.force_reprocess | false | Ignore per-sample checkpoints and reprocess from scratch |
| vlm.backend | "gemini" | "gemini" or "openai" (OpenAI-compatible endpoint) |
| data.input_jsonl | required | Path to input JSONL with image_path + caption per line |
| data.image_root | "" | Optional prefix for resolving relative image_path entries |
Inputs
A single JSONL file at data.input_jsonl. One JSON object per line:
| Field | Required | Description |
|---|---|---|
| image_path | yes | Absolute path, or relative path resolved against data.image_root |
| caption | yes | Free-text caption for the image |
| image_id | no | Stable identifier; auto-derived from the filename if missing |
| width, height | no | Image dimensions in pixels; default to 1920×1080 for bbox clamping if missing |
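A quick pre-flight check of the input file can catch missing fields and unresolved image paths before any API calls are made. The sketch below follows the table above. `resolve_image_path` and `check_record` are hypothetical helpers, and the pipeline's own validation may differ (for instance, requiring width and height to be integers is an assumption here).

```python
import json
from pathlib import Path


def resolve_image_path(image_path: str, image_root: str = "") -> Path:
    """Absolute paths pass through; relative ones are joined onto image_root."""
    path = Path(image_path)
    return path if path.is_absolute() else Path(image_root) / path


def check_record(line: str, image_root: str = "") -> list[str]:
    """Return a list of problems for one JSONL line; empty means it looks usable."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["line is not a JSON object"]
    problems = []
    for field in ("image_path", "caption"):
        if not isinstance(record.get(field), str) or not record[field].strip():
            problems.append(f"missing or empty '{field}'")
    for field in ("width", "height"):
        # Optional fields: only checked when present (integer type is an assumption).
        if field in record and not (isinstance(record[field], int) and record[field] > 0):
            problems.append(f"'{field}' should be a positive integer")
    if not problems and not resolve_image_path(record["image_path"], image_root).is_file():
        problems.append("image file not found")
    return problems
```

Running `check_record` over every line and printing the line numbers with non-empty results gives a short list to fix before the first pilot run.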
Outputs
All outputs go to results_dir/:
- step_0_expression_extraction/annotations.jsonl: per-record output enriched with cleaned_caption and expressions[] (each with text, expression_id, char_span, noun_chunk, and an empty instances[]).
- step_1_grounding/annotations.jsonl: same records with expressions[].instances[] filled in (each instance has bbox: [x1,y1,x2,y2] in pixel space, score in [0.0, 1.0], and bbox_id).
- results_dir/annotations.jsonl: copy of the last step's output for convenience.
- step_<N>_*/.ckpt/<sample_id>.json: per-sample checkpoints used for resume.
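The output fields listed above lend themselves to an automated sanity pass over the grounded records. The sketch below is illustrative only: `clamp_bbox` and `check_grounded_record` are hypothetical helpers, the inclusive clamp range `[0, width]` by `[0, height]` is an assumption about how the pipeline clamps, and the 1920×1080 fallback mirrors the default stated for records without width and height.

```python
DEFAULT_SIZE = (1920, 1080)  # fallback used when width/height are missing


def clamp_bbox(bbox, width, height):
    """Clip [x1, y1, x2, y2] to the image rectangle (assumed inclusive bounds)."""
    x1, y1, x2, y2 = bbox
    return [
        min(max(x1, 0), width),
        min(max(y1, 0), height),
        min(max(x2, 0), width),
        min(max(y2, 0), height),
    ]


def check_grounded_record(record: dict) -> list[str]:
    """Flag degenerate boxes, boxes outside the image, and out-of-range scores."""
    width = record.get("width", DEFAULT_SIZE[0])
    height = record.get("height", DEFAULT_SIZE[1])
    problems = []
    for expr in record.get("expressions", []):
        for inst in expr.get("instances", []):
            label = f"{expr.get('expression_id')}/{inst.get('bbox_id')}"
            x1, y1, x2, y2 = inst["bbox"]
            if not (x1 < x2 and y1 < y2):
                problems.append(f"{label}: degenerate bbox")
            if clamp_bbox(inst["bbox"], width, height) != list(inst["bbox"]):
                problems.append(f"{label}: bbox outside image")
            if not 0.0 <= inst["score"] <= 1.0:
                problems.append(f"{label}: score out of range")
    return problems
```

Applied to each line of step_1_grounding/annotations.jsonl after `json.loads`, this gives a rough count of suspicious instances during the pilot; it does not replace visual inspection of the boxes.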
Prerequisites
- Container: nvcr.io/nvidia/tao/tao-toolkit:6.26.3-pyt
- API access: At least one VLM endpoint (Gemini API key or OpenAI-compatible endpoint capable of image input)