TAO Analyze Gaps VLM BCQ

Extract false-positive and false-negative gaps from VLM binary-classification-question (BCQ, yes/no) predictions. Use after running VLM evaluation when you have a predictions JSON and need to identify failure cases for DEFT root cause analysis on a binary-classification VLM workflow.

Published by @NVIDIA

VLM Binary Classification Gap Analysis

Reads a VLM predictions JSON, compares each model response against ground truth, and writes FP/FN failure cases to a JSONL file with a summary report.

Purpose

After running a VLM on a binary yes/no evaluation task, the predictions need to be compared against ground truth to identify failure cases. This skill produces a structured list of FP (false positive) and FN (false negative) samples that downstream RCCA stages (e.g., cosmos generation, root cause analysis) consume to drive a DEFT iteration.
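
The FP/FN split described above can be sketched as follows. This is a minimal illustration, not the tool's actual code: `classify_gap` is a hypothetical helper, and it assumes "yes" is the positive class, which the description implies but does not state.

```python
from typing import Optional

VALID_LABELS = ("yes", "no")


def classify_gap(predicted: str, ground_truth: str) -> Optional[str]:
    """Return 'FP', 'FN', or None when the prediction matches ground truth.

    Assumption: 'yes' is the positive class, so a 'yes' prediction on a
    'no' sample is a false positive and the reverse is a false negative.
    """
    if predicted not in VALID_LABELS or ground_truth not in VALID_LABELS:
        raise ValueError("labels must be 'yes' or 'no'")
    if predicted == ground_truth:
        return None  # correct prediction, not a gap
    return "FP" if predicted == "yes" else "FN"
```

Under that assumption, the sample in the predictions format (response "Yes, there is a collision.", gt "B. No") would be reported as an FP.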

Usage

Invoke the vlm_bcq action inside the TAO Toolkit data services container with Hydra-style key=value overrides:

gap_analysis vlm_bcq \
  predictions_json=/path/to/results.json \
  results_dir=/path/to/output/gaps

Include videos_dir when video_id values in the predictions are relative paths:

gap_analysis vlm_bcq \
  predictions_json=/path/to/results.json \
  results_dir=/path/to/output/gaps \
  videos_dir=/path/to/videos/root
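
A rough sketch of how a relative video_id might be resolved against videos_dir. The helper name `resolve_video_path` is hypothetical, and leaving an already-absolute video_id unchanged is an assumption rather than documented behavior.

```python
from pathlib import PurePosixPath
from typing import Optional


def resolve_video_path(video_id: str, videos_dir: Optional[str] = None) -> str:
    """Join a relative video_id onto videos_dir; pass other values through.

    Pure path manipulation: nothing is checked against the filesystem.
    Assumption: absolute video_id values are never re-rooted.
    """
    path = PurePosixPath(video_id)
    if videos_dir is not None and not path.is_absolute():
        path = PurePosixPath(videos_dir) / path
    return str(path)
```

With `videos_dir=/path/to/videos/root`, a video_id of `clips/a.mp4` would resolve to `/path/to/videos/root/clips/a.mp4` in this sketch.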

After the run, surface the FP/FN counts from kpi_gaps_report.txt and point downstream stages at kpi_gaps.jsonl.

Inputs

  • predictions_json: Path to the predictions JSON file. Must be a JSON array in which each item has video_id, response, and gt fields. response and gt are parsed with word-boundary matching: 'yes' or 'no' is recognized anywhere in the string. A sample is skipped with a warning when its response or gt contains both words or neither.
  • videos_dir (optional): Base directory for resolving relative video_id paths. If omitted, video_id values are used as absolute paths.
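
The word-boundary matching described for response and gt could look roughly like this. It is a sketch under assumptions: `parse_yes_no` is a hypothetical name, and case-insensitive matching is inferred from the sample values ("Yes, ..." and "B. No") rather than stated.

```python
import re
from typing import Optional

# Assumption: whole-word, case-insensitive matching.
_YES = re.compile(r"\byes\b", re.IGNORECASE)
_NO = re.compile(r"\bno\b", re.IGNORECASE)


def parse_yes_no(text: str) -> Optional[str]:
    """Return 'yes' or 'no', or None when the text has both words or neither."""
    has_yes = bool(_YES.search(text))
    has_no = bool(_NO.search(text))
    if has_yes == has_no:
        return None  # ambiguous or empty of both: the sample would be skipped
    return "yes" if has_yes else "no"
```

Because the match is on word boundaries, a string such as "nothing notable" contains neither word and would be skipped in this sketch.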

Predictions JSON format:

[
  {
    "video_id": "/path/to/video.mp4",
    "response": "Yes, there is a collision.",
    "gt": "B. No",
    "question": "Is there a collision?"
  }
]
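
Before running the action, the file can be checked against the format above. The following is a hedged pre-flight sketch, not the tool's own validation: `load_predictions` is a hypothetical helper and its error messages are illustrative only.

```python
import json
from pathlib import Path

REQUIRED_FIELDS = ("video_id", "response", "gt")


def load_predictions(path: str) -> list:
    """Load a predictions file and check it is an array of complete items."""
    file_path = Path(path)
    if not file_path.is_file():
        raise FileNotFoundError(f"predictions_json not found: {path}")
    with file_path.open(encoding="utf-8") as handle:
        data = json.load(handle)
    if not isinstance(data, list):
        raise ValueError("predictions must be a JSON array")
    for index, item in enumerate(data):
        if not isinstance(item, dict):
            raise ValueError(f"item {index} is not a JSON object")
        missing = [name for name in REQUIRED_FIELDS if name not in item]
        if missing:
            raise ValueError(f"item {index} is missing {missing}")
    return data
```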

Outputs

  • kpi_gaps.jsonl: One JSON object per line for each FP/FN case. Fields: video_id (absolute path), error_type (FP or FN), question, ground_truth, response.
  • kpi_gaps_report.txt: Human-readable table with total FP/FN counts.

If no gaps are found, no files are written and a message is logged.
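
A downstream stage might tally the gaps directly from kpi_gaps.jsonl. This is a minimal sketch with a hypothetical `count_gaps` helper; it relies only on the error_type field listed above and treats a missing file as "no gaps", matching the behavior described for runs that find none.

```python
import json
from collections import Counter
from pathlib import Path


def count_gaps(jsonl_path: str) -> Counter:
    """Count FP and FN records in a gaps JSONL file."""
    counts = Counter()
    file_path = Path(jsonl_path)
    if not file_path.is_file():
        return counts  # no file is written when no gaps are found
    with file_path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            counts[record["error_type"]] += 1
    return counts
```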

Key Parameters

| Parameter | Required | Description |
|---|---|---|
| predictions_json | Yes | Path to predictions JSON file |
| results_dir | Yes | Output directory; created if it does not exist |
| videos_dir | No | Base directory for resolving relative video_id paths |

Error Patterns

| Error | Cause | Fix |
|---|---|---|
| FileNotFoundError | predictions_json does not exist | Check the path |
| ValueError: must be a JSON array | Predictions file is not a list | Wrap predictions in [...] |
| ValueError: missing 'gt'/'response'/'video_id' | A prediction item is missing a required field | Inspect and fix the predictions JSON |
| Samples skipped without raising an error | response or gt contains both or neither 'yes'/'no' | Check logs for warnings; inspect those samples |
