Tao Mine Aoi Images

Runs the DEFT embed-then-mine workflow for VCN AOI iterations — embeds the gap-analysis target parquet, embeds a source pool, and mines nearest-neighbour source images for downstream augmentation. Use as the immediate next step after `tao-route-visual-changenet-samples` when expanding a real-image augmentation queue from the mining subset.

Published by @NVIDIA·from NVIDIA/skills·0 agent reads / 30d·0 saves·

DEFT Mining and Embedding Skill

You are the operator of the DEFT embed-then-mine workflow for VCN AOI. Your job is to take a parquet of weak target images (the gap-analysis or routing output) and a source pool, then produce a deduplicated parquet of mined source images that look similar to the targets — ready to feed into the next training round.

The workflow is fixed and deterministic: embed the targets, embed the source pool, then mine nearest neighbours. Each step's output parquet is the next step's input. There is no iterative search, no clustering pass, no human-in-the-loop selection — depth comes from picking the right encoder and the right topn, not from a multi-phase investigation.

The whole skill is a thin wrapper around three direct docker run invocations against the tao_toolkit.data_services image declared in versions.yaml (resolved at runtime — see Setup). The container's entrypoint takes <category> <action> -e <spec.yaml> [hydra overrides...]: embedding image_embeddings -e <embedding_spec.yaml> … for embedding and tmm nearest_neighbors -e <mining_spec.yaml> … for mining. The -e flag points at a YAML of schema defaults; anything afterward is a bare Hydra override (key=value) applied per run. There is no dataset keyword inside the container — that's the TAO launcher's pillar prefix and is dropped here. Schema keys can rename between data-services releases, so when in doubt introspect once per image with docker run --rm "$DS_IMAGE" embedding image_embeddings --cfg=job and ... tmm nearest_neighbors --cfg=job. See references/invocation.md for the full entrypoint contract, --cfg=job introspection, and the paste-and-edit end-to-end recipe.


Inputs

  1. Target parquet — the gap-analysis output, typically mining_gaps.parquet from tao-route-visual-changenet-samples (or gaps.parquet from tao-analyze-gaps-visual-changenet if routing was skipped). Required column: filepath. If label is also present, label-aware filtering during mining is available; otherwise the mining task silently no-ops the filter.
  2. Source pool — a parquet of candidate images to mine against, with a filepath column. If the user only has a CSV, convert it to a parquet with the same columns before Step 2. For label-aware filtering, the pool must also carry a label column.
  3. Embedding spec file — a YAML containing model, model_path, batch_size, and (only when model_path is a TAO .pth/.ckpt) model_config_path. Reused across Steps 1 and 2; input_parquet/output_parquet are supplied per run as Hydra overrides. The same spec MUST drive both embedding steps — embeddings from different encoders are not comparable, and mismatched encoders are the most common cause of "the mined images look unrelated" reports.
  4. Mining spec file — a YAML containing topn, knn_metric, filter_by_label, and (rarely changed) source_embed_column_name/target_embed_column_name. source_parquet/target_parquet/output_parquet are Hydra overrides at run time. SigLIP and CLIP embeddings should use knn_metric: cosine. When filter_by_label: true but either embedding parquet lacks a label column, the container logs a warning and proceeds without filtering.

Setup

The mining and embedding tasks live inside the tao_toolkit.data_services image declared in versions.yaml. Resolve the concrete URI once at the top of the run, then confirm Docker, the NVIDIA container toolkit, and a GPU are present before anything else:

# Resolve tao_toolkit.data_services → concrete nvcr.io/... URI from versions.yaml
DS_IMAGE=$(python3 -c "import yaml,os; print(yaml.safe_load(open(os.environ['TAO_SKILL_BANK_PATH']+'/versions.yaml'))['images']['tao_toolkit']['data_services'])")
echo "DS_IMAGE=$DS_IMAGE"

docker info > /dev/null && echo "OK: docker"
nvidia-smi > /dev/null && echo "OK: GPU"
docker image inspect "$DS_IMAGE" > /dev/null \
  || docker pull "$DS_IMAGE"

TAO_SKILL_BANK_PATH is exported by the plugin's session_start hook. If it is unset (e.g. running outside the Claude Code plugin), point it at the skill-bank repo root before resolving. A GPU is required for both the encoder forward pass and the cuML/cuDF k-NN search; both steps will fail without CUDA.

Path mounting. Every host path the container reads or writes — input parquets, output dirs, and the source-pool image root — must be bind-mounted. The simplest, most predictable approach mounts the workspace root with identical paths inside and outside the container so absolute paths in the parquet args resolve the same way on both sides:

WORKSPACE=<absolute path that contains all parquets, outputs, and the source-pool images>
DOCKER="docker run --gpus all --rm --ipc=host --user $(id -u):$(id -g) -v $WORKSPACE:$WORKSPACE -w $WORKSPACE $DS_IMAGE"

Reuse $DOCKER for the three invocations below.

CSV source pool. If the source pool is provided only as a CSV, convert it to a parquet up front with pd.read_csv(...).to_parquet(..., index=False), preserving the filepath column verbatim (and label if present). Do not add a path prefix — the container reads input parquets as-is and the $WORKSPACE mount keeps host and container paths identical.

Author the two spec files once per iteration. Both files live under $WORKSPACE so the -e argument resolves on both sides of the mount. Per-run values stay out of the spec and are passed as Hydra overrides at invocation time. The defaults are model: SigLIP, model_path: google/siglip-base-patch16-224, batch_size: 64 for embedding, and topn: 5, knn_metric: cosine, filter_by_label: "false" (quoted — the schema reads it as a string) for mining. Use cosine for SigLIP/CLIP, euclidean/manhattan otherwise; add model_config_path only when model_path is a TAO checkpoint. Any field can still be overridden inline at the CLI (e.g. topn=10) — Hydra applies CLI overrides on top of the spec.

See references/invocation.md for the verbatim spec-file templates, the CSV conversion snippet, and the full mounting and image-resolution detail.


Method

Three commands, in order. Each command's output parquet is the next command's input. Run them as plain Bash; the $DOCKER alias from Setup handles the container, GPU, and mounts. Every invocation follows the same shape: -e <spec> for the baked-in defaults, then a handful of Hydra overrides for run-specific paths.

Step 1 — Embed the target images

$DOCKER embedding image_embeddings -e <embedding_spec.yaml> \
    input_parquet=<target_parquet> output_parquet=<target_embeddings_parquet>

Reads the gap-analysis / routing output and writes a parquet with filepath, embedding, and any extra metadata columns (e.g. label, siamese_score, weakness) carried forward verbatim. Print the output schema (pd.read_parquet(...).columns) to stdout so the script-check hook can confirm the embedding column exists. To override model / model_path / batch_size for one run without editing the spec, append them as Hydra overrides.

Step 2 — Embed the source pool

$DOCKER embedding image_embeddings -e <embedding_spec.yaml> \
    input_parquet=<source_pool_parquet> output_parquet=<source_embeddings_parquet>

Same command shape as Step 1, applied to the source pool. Use the identical embedding_spec.yaml as Step 1, and do not override model / model_path / batch_size differently here — mismatched encoder configs across the two steps produce non-comparable embeddings.

Step 3 — Mine nearest neighbours

$DOCKER tmm nearest_neighbors -e <mining_spec.yaml> \
    source_parquet=<source_embeddings_parquet> \
    target_parquet=<target_embeddings_parquet> output_parquet=<mined_parquet>

For each target embedding, finds the topn closest source embeddings under the chosen metric, deduplicates across targets, and writes a single-column (filepath) parquet of unique mined source paths. The container also drops a mining_summary.txt next to the output parquet with: query count, neighbour count, duplicates removed, and (when label filtering is on) kept-vs-dropped pair counts. Tweak topn, knn_metric, or filter_by_label via inline Hydra override when sweeping — no need to rewrite the spec. When filter_by_label=true but one embedding parquet is missing the label column, the container logs a warning and proceeds without filtering; if the mined output looks too large or contains cross-label pairs, scan the docker log for that warning first.

See references/invocation.md for the complete paste-and-edit recipe that runs all three steps as one streamed Bash block with row-count sanity prints.


Outputs

Write everything into a timestamped folder under the experiment / iteration directory. The packaging hook will add mining_config/ and claude_session.jsonl automatically when Mining_Report.md is written.

<output_dir>/mining_results/YYYY-MM-DD_HHMMSS/
├── Mining_Report.md            # Full mining report
├── embedding_spec.yaml         # The -e spec used for Steps 1 and 2
├── mining_spec.yaml            # The -e spec used for Step 3
├── target_embeddings.parquet   # Step 1 output (filepath, embedding, + carried metadata)
├── source_embeddings.parquet   # Step 2 output (filepath, embedding, + carried metadata)
├── mined.parquet               # Step 3 output — unique mined source filepaths
├── mining_summary.txt          # Auto-emitted next to mined.parquet by the container
├── mining_config/              # Auto-copied by hook
└── claude_session.jsonl        # Auto-copied by hook

At the start of the run, get the real timestamp by running date +%Y-%m-%d_%H%M%S in Bash. Do NOT hardcode or guess. If the user specifies a custom output path, use it directly but maintain the same internal layout.

The mined parquet is the artifact downstream training consumes. The two embedding parquets are intermediate but worth retaining: they are reusable across multiple mining runs against the same source pool, and they are the only place to look when a "looks unrelated" report needs encoder-level debugging.


Common pitfalls

The single most common cause of garbage output is mismatched encoders — both embedding steps must consume the same embedding_spec.yaml, and any model / model_path / batch_size override must apply to both steps or neither. Other frequent issues: skipping an embedding step, a missing label column under filter_by_label=true (silent no-op), spec files outside $WORKSPACE, unresolved ??? sentinels, TAO checkpoints without model_config_path, CSV pools not converted to parquet, host/container path mismatches, no GPU, the wrong image tag, and topn × N_targets exceeding the source size (expected, not a bug — report the actual mined count).

See references/troubleshooting.md for the full diagnosis and fix for each of these.


Report Structure

Keep the report tight (600–1200 words). Mining is a deterministic pipeline; the value is making the encoder choice, the row counts, and any silent filter no-ops auditable — not narrative. The report has seven sections: Verdict, Inputs, Encoder Consistency, Mining Run, Per-Label Breakdown (skipped if the target parquet has no label column), Output Sanity, and Recommended Actions.

See references/reporting_spec.md for the complete fill-in report template with every section and field.


Execution Order

  1. Resolve DS_IMAGE from versions.yaml (images.tao_toolkit.data_services), then run docker info, nvidia-smi, and docker image inspect "$DS_IMAGE" (pulling if missing) once to confirm the environment. Abort with a clear message if any fail.
  2. Run date +%Y-%m-%d_%H%M%S to get the timestamp; create <output_dir>/mining_results/<timestamp>/.
  3. Write embedding_spec.yaml and mining_spec.yaml into the timestamped dir, filling in the encoder choice and mining knobs. Keep these under $WORKSPACE so the -e path resolves inside the container.
  4. If the source pool is a CSV, convert to parquet first (preserve filepath and label).
  5. Run Step 1 (embed targets) via docker run … embedding image_embeddings -e embedding_spec.yaml input_parquet=… output_parquet=…. Print the output parquet's row count and columns to stdout.
  6. Run Step 2 (embed source pool) with the identical embedding_spec.yaml as Step 1. Print output row count and columns.
  7. Run Step 3 (mine nearest neighbours) via docker run … tmm nearest_neighbors -e mining_spec.yaml source_parquet=… target_parquet=… output_parquet=…. Confirm mining_summary.txt was written next to mined.parquet.
  8. Compute the per-label breakdown (Section 5) by joining the target embeddings parquet with the mined output on filepath, if both carry label.
  9. Write Mining_Report.md last — writing it triggers the packaging hook, which copies session logs and skill config alongside.

Bundled with this artifact

12 files

Reference files that ship alongside this artifact. Agents pull these in only when the task needs them.

More on the bench

SKILL0

Whisper

OpenAI's general-purpose speech recognition model. Supports 99 languages, transcription, translation to English, and language identification. Six model sizes from tiny (39M params) to large (1550M params). Use for speech-to-text, podcast transcription, or multilingual audio processing. Best for robust, multilingual ASR.

data-science-ml+2
0
SKILL0

Guidance

Control LLM output with regex and grammars, guarantee valid JSON/XML/code generation, enforce structured formats, and build multi-step workflows with Guidance - Microsoft Research's constrained generation framework

ai-prompt-engineering+2
0
SKILL0

Pinecone

Managed vector database for production AI applications. Fully managed, auto-scaling, with hybrid search (dense + sparse), metadata filtering, and namespaces. Low latency (<100ms p95). Use for production RAG, recommendation systems, or semantic search at scale. Best for serverless, managed infrastructure.

data-science-ml+2
0