TAO Generate Image Grounding

Two-step image grounding pipeline: extracts referring expressions from (image, caption) pairs and grounds them to pixel-space bounding boxes via a VLM. Use when the user wants to ground captions to bboxes, generate phrase-grounded annotations, auto-label images for grounding, or run the image_grounding pipeline. Triggers include 'image grounding', 'phrase grounding', 'ground captions', 'auto-label image grounding', 'image_grounding'.

Published by @NVIDIA

Image Grounding Pipeline

Turn (image, caption) pairs into per-image grounded annotations: cleaned captions, referring expressions with character spans, and pixel-space bounding boxes for each expression. A single VLM (Gemini or any OpenAI-compatible endpoint) handles both steps.

Purpose

Generate phrase-grounded training data for referring-expression and grounding models. The VLM acts as a "teacher" annotator: Step 0 extracts referring expressions from the caption while looking at the image; Step 1 returns one bbox set per expression for each image.

Pipeline Architecture

Step 0: Expression extraction  → VLM cleans caption, extracts referring expressions + char spans
Step 1: Phrase grounding       → VLM returns pixel bboxes + scores per expression

Steps are individually selectable via workflow.steps. Each step writes a per-sample checkpoint to step_<N>_*/.ckpt/<sample_id>.json and skips already-processed records on re-run. Set workflow.force_reprocess: true to ignore checkpoints and reprocess from scratch.
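
The checkpoint layout described above can be inspected before a re-run. The following is a minimal sketch, assuming only the step_<N>_*/.ckpt/<sample_id>.json layout stated here; the helper name checkpointed_ids is hypothetical and is not part of the auto_label CLI.

```python
from __future__ import annotations

from pathlib import Path


def checkpointed_ids(results_dir: str, step: int) -> set[str]:
    """Return the sample ids that already have a checkpoint for one step.

    Hypothetical helper: it relies only on the documented layout
    step_<N>_*/.ckpt/<sample_id>.json under results_dir.
    """
    root = Path(results_dir)
    ids: set[str] = set()
    # Each checkpoint file is named after its sample id, so the stem is the id.
    for ckpt in root.glob(f"step_{step}_*/.ckpt/*.json"):
        ids.add(ckpt.stem)
    return ids
```

Comparing the returned ids with the ids in the input JSONL gives a rough view of how much work a resumed run would skip.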

Instructions

Initial setup

When a user wants to run this pipeline, walk through these steps:

  1. Input JSONL: Ask for the JSONL path. Each line must be one object like {"image_path": "...", "caption": "..."}. image_path can be absolute or relative.

  2. Image root: If any image_path values are relative, set data.image_root to the directory they should resolve from.

  3. API access: Ask the user which VLM endpoint they want to use. Present these five options and act on the choice:

    1. Gemini — set vlm.backend: "gemini"; require GOOGLE_API_KEY (env var or vlm.gemini.api_key).
    2. NIM (e.g. https://inference-api.nvidia.com/v1) — set vlm.backend: "openai"; collect base_url, model_name, and api_key.
    3. TAO inference microservice (self-hosted, OpenAI-compatible). Confirm whether the server is already running:
      • Running — collect base_url, model_name, and (optionally) api_key; set vlm.backend: "openai".
      • Not running — guide the user through the skills/applications/tao-run-inference-service skill, which stands up a local TAO inference microservice with an OpenAI-compatible API. Before promising a specific model, check skills/applications/tao-run-inference-service/references/service.yaml for valid_network_arch_config_basenames. Once the server is up, collect base_url, model_name, and (optionally) api_key; set vlm.backend: "openai".
    4. vLLM (self-hosted, OpenAI-compatible). Confirm whether the server is already running:
      • Running — collect base_url, model_name, and (optionally) api_key; set vlm.backend: "openai".
      • Not running — follow references/vllm_server.md to install and launch a vLLM server, then collect base_url, model_name, and (optionally) api_key; set vlm.backend: "openai".
    5. Custom (any other OpenAI-compatible endpoint) — set vlm.backend: "openai"; collect base_url, model_name, and (optionally) api_key.

    If the user has no endpoint and does not want to set one up, stop and help resolve API access first.

  4. Workflow steps: Choose one of:

    • Full pipeline: ["0", "1"]
    • Expression extraction only: ["0"]
    • Grounding only: ["1"], which requires existing step-0 output at results_dir/step_0_expression_extraction/annotations.jsonl

  5. Resume vs fresh run: By default, the workflow reuses checkpoints and skips completed records. To reprocess everything, set image_grounding.workflow.force_reprocess=true.
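
The step-selection rule in item 4 can be checked before launching a run. The sketch below assumes only the step-0 output path given above; the function name missing_prerequisites is hypothetical.

```python
from pathlib import Path


def missing_prerequisites(steps, results_dir):
    """Return the files that must exist before the chosen steps can run.

    Hypothetical pre-flight check: a grounding-only run (["1"]) needs the
    step-0 annotations that the pipeline writes under results_dir.
    """
    missing = []
    if "1" in steps and "0" not in steps:
        step0 = (
            Path(results_dir)
            / "step_0_expression_extraction"
            / "annotations.jsonl"
        )
        if not step0.is_file():
            missing.append(str(step0))
    return missing
```

An empty list means the selected steps have what they need; otherwise run step 0 first or point results_dir at a directory that already holds its output.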

Running the pipeline

The pipeline runs inside the TAO Toolkit container via the auto_label CLI:

auto_label generate -e /path/to/spec.yaml \
    results_dir=/results \
    image_grounding.data.input_jsonl=/data/captions.jsonl \
    image_grounding.data.image_root=/data/images \
    image_grounding.vlm.gemini.api_key=$GOOGLE_API_KEY

Generate a default spec: auto_label default_specs results_dir=/results module_name=auto_label, then set autolabel_type: "image_grounding". All fields support Hydra dot-notation overrides on the command line.

See references/configuration.md for the full YAML structure, all parameters, model/endpoint setup, and error patterns.
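
When the command is launched from a script, the Hydra overrides can be assembled from a dictionary. This is a sketch under stated assumptions: the override keys are the ones shown in the command above, the helper names build_command and as_shell are hypothetical, and booleans are lowercased to match the true/false spelling used in this document.

```python
import shlex


def build_command(spec_path, overrides):
    """Assemble the auto_label invocation as an argument list.

    Hypothetical helper: it only formats key=value pairs and does not
    validate that the keys exist in the spec.
    """
    cmd = ["auto_label", "generate", "-e", spec_path]
    for key, value in overrides.items():
        if isinstance(value, bool):
            # Render Python booleans as true/false, as written in this document.
            value = "true" if value else "false"
        cmd.append(f"{key}={value}")
    return cmd


def as_shell(cmd):
    """Render the argument list as a shell-quoted string for logging."""
    return shlex.join(cmd)
```

Passing the list to subprocess.run(cmd, check=True) avoids shell quoting issues. The API key is better supplied through the GOOGLE_API_KEY environment variable than placed in a logged command line.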

Recommended pilot workflow

  1. Run on 5-10 images with both steps.
  2. Inspect step_0_expression_extraction/annotations.jsonl — are cleaned_caption and expressions[] accurate? Are the right noun phrases captured?
  3. Inspect step_1_grounding/annotations.jsonl — do the bboxes in expressions[].instances[] look right? Are confidence scores reasonable?
  4. If quality is insufficient, switch the VLM to a stronger model (e.g. gemini-2.5-pro) or raise media_resolution/max_output_tokens, then re-run with image_grounding.workflow.force_reprocess=true so existing checkpoints are not reused.
  5. Scale to the full dataset once satisfied.
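
One way to prepare the pilot run in step 1 is to write a small random subset of the input JSONL to a separate file and point data.input_jsonl at it. The sketch below is illustrative; write_pilot_subset and its parameters are hypothetical names, and the fixed seed is an assumption made so the same subset is drawn on every call.

```python
import random


def write_pilot_subset(input_jsonl, output_jsonl, n=10, seed=0):
    """Copy up to n randomly chosen records into a pilot JSONL file.

    Hypothetical helper: lines are copied verbatim, so the pilot file has
    the same record format as the full input.
    """
    with open(input_jsonl, encoding="utf-8") as f:
        lines = [line for line in f if line.strip()]
    rng = random.Random(seed)
    sample = lines if len(lines) <= n else rng.sample(lines, n)
    with open(output_jsonl, "w", encoding="utf-8") as f:
        for line in sample:
            # Make sure every record ends with exactly one newline.
            f.write(line.rstrip("\n") + "\n")
    return len(sample)
```

Using a separate results_dir for the pilot keeps its checkpoints from being mixed with those of the full run.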

Configuration

Key configuration fields (full reference in references/configuration.md):

Field | Default | Description
workflow.steps | ["0","1"] | Which pipeline steps to execute ("0" = expressions, "1" = grounding)
workflow.max_workers | 4 | Parallel threads per step (watch API rate limits)
workflow.force_reprocess | false | Ignore per-sample checkpoints and reprocess from scratch
vlm.backend | "gemini" | "gemini" or "openai" (OpenAI-compatible endpoint)
data.input_jsonl | required | Path to input JSONL with image_path + caption per line
data.image_root | "" | Optional prefix for resolving relative image_path entries

Inputs

A single JSONL file at data.input_jsonl. One JSON object per line:

Field | Required | Description
image_path | yes | Absolute path, or relative path resolved against data.image_root
caption | yes | Free-text caption for the image
image_id | no | Stable identifier; auto-derived from the filename if missing
width, height | no | Image dimensions in pixels; default to 1920×1080 for bbox clamping if missing
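
The input contract above can be checked before a run. This is a minimal sketch and not the pipeline's own validation: validate_input is a hypothetical helper that checks the two required fields and resolves relative image_path values against an image root, as described for data.image_root.

```python
import json
from pathlib import Path


def validate_input(input_jsonl, image_root=""):
    """Return (line_number, message) pairs for records that look invalid.

    Hypothetical pre-flight check based on the field table above.
    """
    problems = []
    with open(input_jsonl, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((lineno, f"invalid JSON: {exc.msg}"))
                continue
            if not isinstance(record, dict):
                problems.append((lineno, "record is not a JSON object"))
                continue
            for field in ("image_path", "caption"):
                if not isinstance(record.get(field), str):
                    problems.append((lineno, f"missing field: {field}"))
            path = record.get("image_path")
            if isinstance(path, str):
                resolved = Path(path)
                if not resolved.is_absolute():
                    # Relative paths resolve against the image root.
                    resolved = Path(image_root) / resolved
                if not resolved.is_file():
                    problems.append((lineno, f"image not found: {resolved}"))
    return problems
```

An empty result means every record has both required fields and an image file that exists on disk. The paths must be checked from the same filesystem view the container will see.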

Outputs

All outputs go to results_dir/:

  • step_0_expression_extraction/annotations.jsonl — per-record output enriched with cleaned_caption and expressions[] (each with text, expression_id, char_span, noun_chunk, empty instances[]).
  • step_1_grounding/annotations.jsonl — same records with expressions[].instances[] filled in (each instance has bbox: [x1,y1,x2,y2] in pixel space, score in [0.0, 1.0], and bbox_id).
  • results_dir/annotations.jsonl — copy of the last step's output for convenience.
  • step_<N>_*/.ckpt/<sample_id>.json — per-sample checkpoints used for resume.
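
The step-1 output can be spot-checked programmatically during the pilot. The sketch below assumes the record layout listed above (expressions[].instances[] with bbox, score, and bbox_id). It also assumes that width and height, when present in the input, are carried through to the output record, and that a box may touch the image edge; neither is confirmed by this document. check_grounded_record is a hypothetical helper.

```python
def check_grounded_record(record, default_width=1920, default_height=1080):
    """Return human-readable issues found in one step-1 output record.

    Hypothetical sanity check: boxes must be non-empty and inside the image,
    and scores must lie in [0.0, 1.0]. The fallback size mirrors the
    documented 1920x1080 default.
    """
    width = record.get("width", default_width)
    height = record.get("height", default_height)
    issues = []
    for expr in record.get("expressions", []):
        for inst in expr.get("instances", []):
            label = f"{expr.get('text')!r} bbox_id={inst.get('bbox_id')}"
            x1, y1, x2, y2 = inst["bbox"]
            if not (0 <= x1 < x2 <= width and 0 <= y1 < y2 <= height):
                issues.append(f"{label}: bbox out of bounds or empty")
            if not 0.0 <= inst["score"] <= 1.0:
                issues.append(f"{label}: score outside [0.0, 1.0]")
    return issues
```

Applying this to each line of step_1_grounding/annotations.jsonl (parsed with json.loads) flags structural problems only; whether a box actually covers the right object still needs visual inspection.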

Prerequisites

  • Container: nvcr.io/nvidia/tao/tao-toolkit:6.26.3-pyt
  • API access: At least one VLM endpoint (Gemini API key or OpenAI-compatible endpoint capable of image input)

Bundled with this artifact

7 files

Reference files that ship alongside this artifact. Agents pull these in only when the task needs them.
