AutoML + DEFT Pipeline

A workflow-bridge skill that runs three phases in sequence by delegating to two existing skills — tao-run-automl for HPO and a DEFT application skill (default tao-run-deft-aoi for AOI; other skills/applications/deft-* skills for non-AOI cases) for the iterative data-improvement loop.

This skill does not re-implement AutoML or DEFT. It owns only the connective tissue: HPO spec inputs, the spec-handoff between AutoML and DEFT, and the post-DEFT AutoML re-run on the augmented dataset.

When this skill applies

User asks to "run the AOI workflow" or "improve my AOI ChangeNet model" — default to this skill, not tao-run-deft-aoi directly. The bare DEFT loop is the inner stage of this pipeline.
User wants AutoML and DEFT chained on the same model/dataset
User says "AutoML at both ends", "tune HPs then DEFT", "warm-start DEFT", "AutoML before and after DEFT"
User has an AutoML-tuned spec and asks how to feed it into DEFT

When this skill does NOT apply

User explicitly asks for the DEFT loop only ("run JUST the DEFT loop", "skip AutoML") → use tao-run-deft-aoi directly
User wants only AutoML with no follow-on DEFT → use tao-run-automl directly
User is doing zero-shot eval, RAG, or non-training workflows

The mental model

Phase 1 (AutoML baseline)        Phase 2 (DEFT loop, plain train)        Phase 3 (AutoML refinement)
─────────────────────────        ────────────────────────────────        ───────────────────────────
specs/baseline_spec.yaml         (Phase 1 winner pre-seeds baseline      ${RESULTS_DIR}/iter${N}/dataset/
train/base/training_set.csv       — DEFT skips its baseline train)       train_combined_iter${N}.csv
        │                                       │                                       │
        ▼                                       ▼                                       ▼
[ AutoML HPO sweep ]               [ DEFT: baseline-inference → RCA       [ AutoML HPO sweep ]
   N recommendations                 → iter 1..N (plain retrain) ]        re-tunes HPs against the
   pick best by val_loss / FAR      RCA / route / SDG / mining             DEFT-augmented dataset
        │                                       │                                       │
        ▼                                       ▼                                       ▼
best HPs spec + ckpt ─────►      DEFT-augmented CSV ───────────►        final best checkpoint
                                 + iter winner checkpoint               (the deliverable; no
                                 (Phase 3 warm-starts from it)           further retrain)

The handoffs are:

Phase 1 → Phase 2: a spec file AND the winning checkpoint. Retraining the same HPs in DEFT's baseline step would waste compute, so the bridge deep-merges Phase 1's winning HPs onto baseline_spec.yaml, copies the winning checkpoint into ${RESULTS_DIR}/baseline/train/ under the filename DEFT expects, and pre-populates deft_state.json and loop_log.jsonl so that DEFT resumes at baseline inference → evaluate → RCA → iter 1. DEFT itself stays plain-train (automl_policy: off is preserved). The verbatim four-step procedure is in references/handoff.md.
Phase 2 → Phase 3: a training CSV AND the iter winner's checkpoint. The CSV (train_combined_iter${N_final}.csv) is AutoML's training data; the checkpoint (iterations.<best>.best_ckpt_path from deft_state.json) is wired into each rec's train.pretrained_model_path so Phase 3 fine-tunes from Phase 2's winner rather than from scratch. Without this warm-start Phase 3 routinely regresses vs the iter winner. Phase 3's winning checkpoint is the deliverable — no separate retrain after Phase 3. See references/handoff.md.
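The deep-merge in the Phase 1 → Phase 2 handoff can be sketched as follows. This is a minimal illustration, not the code from references/handoff.md: the function name `deep_merge` and the spec keys in the example (`num_epochs`, `optim`, `lr`, `weight_decay`) are hypothetical placeholders, and YAML loading and writing are left out to keep the sketch dependency-free.

```python
from copy import deepcopy


def deep_merge(base: dict, overrides: dict) -> dict:
    """Return a new dict with `overrides` recursively merged onto `base`.

    Nested dicts are merged key by key; any other value in `overrides`
    replaces the value in `base`. Neither input is mutated.
    """
    merged = deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical spec fragments, for illustration only.
baseline_spec = {"train": {"num_epochs": 20, "optim": {"lr": 1e-3, "weight_decay": 0.01}}}
winning_hps = {"train": {"optim": {"lr": 3e-4}}}

merged_spec = deep_merge(baseline_spec, winning_hps)
```

In this example only `lr` changes; `num_epochs` and `weight_decay` keep their baseline values, which is the property the handoff relies on when AutoML sweeps only a subset of the spec.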

Why three phases instead of two

Phase 1 alone finds good HPs on the original training distribution, but the model still has the distributional gaps DEFT is designed to fill.
Phase 2 alone (just DEFT) fills the gaps but uses whatever HPs specs/baseline_spec.yaml was hand-authored with — usually not optimal.
Phase 3 without Phase 1 (DEFT followed by a single AutoML pass) would still tune against the augmented dataset, but the DEFT loop would start from an untuned baseline, so it converges more slowly and needs more iterations to hit the KPI.

Running all three: AutoML cheap-tunes once on the original data, DEFT does the heavy data work with reasonable HPs, then AutoML tunes again on the now-richer dataset. Phase 3 is the most important of the three for the final deployed FAR/recall.

Cost up-front

The pipeline is sequential. Total wall-clock ≈ Phase 1 (N_automl × per-rec train) + Phase 2 (M iterations × per-iter cost) + Phase 3 (N_automl × per-rec train).

Note that Phase 2 has no separate baseline train — Phase 1's winning checkpoint is reused as DEFT's baseline, so the baseline cost lands inside Phase 1's N_automl trainings rather than as an extra retrain. Surface this to the user before kickoff. Typically Phase 2's iterations still dominate (each includes SDG + retrain), but Phase 1 and Phase 3 each add several hours on a single-GPU box. Use the per-job estimate from the user's setup (if they have one) rather than guessing minutes. See references/pitfalls.md for the per-phase cost breakdown.
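The wall-clock formula above can be turned into a small per-phase estimate once the user has supplied their own per-job times. The sketch below is only arithmetic on that formula; the function name and the numbers in the example are hypothetical placeholders, not measurements.

```python
def estimate_wall_clock_hours(n_automl, per_rec_hours, m_iterations, per_iter_hours):
    """Per-phase and total wall-clock estimate for the sequential pipeline.

    Phase 2 has no separate baseline train, so it is just iterations x cost.
    """
    phase1 = n_automl * per_rec_hours
    phase2 = m_iterations * per_iter_hours
    phase3 = n_automl * per_rec_hours
    return {
        "phase1": phase1,
        "phase2": phase2,
        "phase3": phase3,
        "total": phase1 + phase2 + phase3,
    }


# Placeholder inputs: 5 recs at 1.0 h each, 3 DEFT iterations at 4.0 h each.
estimate = estimate_wall_clock_hours(
    n_automl=5, per_rec_hours=1.0, m_iterations=3, per_iter_hours=4.0
)
```

With these placeholder inputs the Phase 2 term is the largest, which matches the expectation that DEFT iterations typically dominate.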

Consolidated Pre-Flight — one gate, all three phases

The pipeline has exactly one user gate. Before any side-effecting action (docker pull, docker login, any job-launch call delegated to a downstream skill, file mutations under ${RESULTS_DIR}/), the agent must produce a single consolidated Pre-Flight Summary that subsumes every downstream skill's preflight. Once the user approves, the run is autonomous through all three phases — no further interactive pauses.

The user explicitly does not want to be paged between phases. The DEFT loop's own inline ## Pre-Flight Summary gate becomes a zero-question display step (every value pre-supplied), as does tao-run-automl's shared launch preflight in Phases 1 and 3.

Before printing the gate, the agent must read every downstream preflight section in full and run every read-only check those sections prescribe, surfacing each outcome in the summary. Running every step of the DEFT skill's ## Pre-Flight is mandatory: if any step is skipped, the consolidated gate is invalid and the pipeline must not advance.

The summary must include, in order:

(1) workspace/host/platform/network
(2) credentials SET/UNSET status
(3) resolved container image URIs with PRESENT/MISSING
(4) dataset table with leakage check
(5) Phase 1 config
(6) Phase 2 config, including the pre-seeded baseline source
(7) Phase 3 config
(8) compute estimate
(9) the confirmation line

After the gate, pass every collected value through to each downstream skill so it has nothing to ask. The only allowed post-gate pauses are mid-run hard-stop safety gates (e.g. DEFT's KPI regression gate); call them out in the summary.

See references/preflight.md for the full build procedure, the exact mandatory contents of each summary section (with the GPU memory rule of thumb, DEFT loop defaults, and required inputs verbatim), the downstream gate-suppression inputs, and the fallback when an older skill-bank version hard-codes its own STOP gate.
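One read-only check that feeds the dataset table is the leakage check between the validation set and the KPI test set. A minimal sketch is shown below, assuming each CSV has a column that uniquely identifies a sample. The column name `input_path`, the label values, and the function name are assumptions for illustration; the real CSV schema is defined by the downstream skills.

```python
import csv
import io


def find_leaked_rows(val_lines, kpi_lines, key_columns):
    """Return the set of key tuples that appear in both CSVs.

    Each argument is any iterable of CSV lines with a header row,
    for example a file opened with open(path, newline="").
    """
    def row_keys(lines):
        return {tuple(row[col] for col in key_columns) for row in csv.DictReader(lines)}

    return row_keys(val_lines) & row_keys(kpi_lines)


# Hypothetical CSV contents standing in for validation_set.csv and testing_set.csv.
validation_csv = io.StringIO("input_path,label\nimg_001.png,PASS\nimg_002.png,DEFECT\n")
kpi_test_csv = io.StringIO("input_path,label\nimg_002.png,DEFECT\nimg_003.png,PASS\n")

leaked = find_leaked_rows(validation_csv, kpi_test_csv, key_columns=["input_path"])
leakage_status = "LEAK" if leaked else "CLEAN"
```

Because the check only reads files, it is safe to run before the gate; a non-empty result would be surfaced in the dataset table rather than silently fixed.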

Phase 1 — AutoML baseline

Invoke tao-skill-bank:tao-run-automl with:

| Input | AOI default | Notes |
| --- | --- | --- |
| `network_arch` | `visual-changenet` | Same model the DEFT loop expects |
| `train_dataset_uri` | `<workspace>/train/base/training_set.csv` | Same training set DEFT will start from |
| `eval_dataset_uri` | `<workspace>/train/base/validation_set.csv` | Held-out; must NOT be the KPI test set (`<workspace>/kpi/testing_set.csv`), since that set is reserved for DEFT's final reporting |
| `metric` | FAR @ 100% recall (preferred) or `val_loss` | See `references/pitfalls.md`: ChangeNet AOI is class-imbalanced, so val_loss alone can mode-collapse |
| `algorithm` | `bayesian` | LLM-brain or `autoresearch` if compute is tight |
| `automl_max_recommendations` | 5–10 for AOI | More recs = better HPs but linear in compute |
| `spec_overrides` | Pin epochs / batch_size; sweep optimizer-related HPs only | Otherwise AutoML wanders into long-train regimes that blow Phase 2's budget |

After the sweep finishes, AutoML's result["best"]["specs"] is the winning hyperparameter dict.
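As a rough sketch, the Phase 1 inputs from the table can be assembled in one place with a guard that the eval set is not the KPI test set. The helper name `build_phase1_inputs` is hypothetical, the actual invocation of tao-run-automl is not shown, and the `result` dict at the end is a stub whose contents are invented; only the `result["best"]["specs"]` access path comes from this document.

```python
from pathlib import PurePosixPath


def build_phase1_inputs(workspace, metric, max_recommendations=5, eval_dataset_uri=None):
    """Assemble the AOI-default Phase 1 inputs; reject the KPI test set as eval data."""
    ws = PurePosixPath(workspace)
    kpi_csv = str(ws / "kpi/testing_set.csv")
    eval_csv = eval_dataset_uri or str(ws / "train/base/validation_set.csv")
    if eval_csv == kpi_csv:
        raise ValueError("eval_dataset_uri must not be the KPI test set")
    return {
        "network_arch": "visual-changenet",
        "train_dataset_uri": str(ws / "train/base/training_set.csv"),
        "eval_dataset_uri": eval_csv,
        "metric": metric,
        "algorithm": "bayesian",
        "automl_max_recommendations": max_recommendations,
    }


phase1_inputs = build_phase1_inputs("/ws", metric="val_loss")

# Stub standing in for the sweep's return value (contents are hypothetical).
result = {"best": {"specs": {"train": {"optim": {"lr": 3e-4}}}}}
winning_hps = result["best"]["specs"]
```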

Handoff to Phase 2

Phase 1 hands over two artifacts: the winning spec and the winning checkpoint. Instead of retraining the same HPs in DEFT's baseline step, pre-seed DEFT's baseline state from Phase 1's outputs so DEFT starts at baseline inference → evaluate → RCA → iter 1. The four steps — write the merged baseline_spec_automl.yaml, copy the winning checkpoint into ${RESULTS_DIR}/baseline/train/, initialise deft_state.json with iterations.baseline.stage_completed == "train" (and append the matching loop_log.jsonl entry), then invoke DEFT — are given verbatim with the exact code in references/handoff.md. automl_policy: off inside the loop is preserved.
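The checkpoint copy and state pre-seeding can be sketched as below. The exact code lives in references/handoff.md; this version is an illustration under assumptions. Only `iterations.baseline.stage_completed == "train"` and the file names `deft_state.json` and `loop_log.jsonl` come from this document. The extra `best_ckpt_path` field on the baseline entry, every field of the log entry, and the `expected_ckpt_name` argument are hypothetical.

```python
import json
import shutil
from pathlib import Path


def preseed_baseline(results_dir, winning_ckpt, expected_ckpt_name):
    """Copy the Phase 1 winner into baseline/train/ and mark baseline training done."""
    results = Path(results_dir)
    train_dir = results / "baseline" / "train"
    train_dir.mkdir(parents=True, exist_ok=True)

    # Copy under the filename DEFT expects (passed in, not guessed here).
    dest = train_dir / expected_ckpt_name
    shutil.copy2(winning_ckpt, dest)

    # Minimal state: baseline training is recorded as already completed.
    # Note: this overwrites any existing deft_state.json in results_dir.
    state = {
        "iterations": {
            "baseline": {"stage_completed": "train", "best_ckpt_path": str(dest)}
        }
    }
    (results / "deft_state.json").write_text(json.dumps(state, indent=2))

    # Append (never overwrite) a matching log entry; field names are assumed.
    entry = {"iteration": "baseline", "stage": "train", "source": "phase1_automl"}
    with (results / "loop_log.jsonl").open("a") as log:
        log.write(json.dumps(entry) + "\n")

    return dest
```

The sketch writes under ${RESULTS_DIR}/, so it is a side-effecting action and belongs after the consolidated gate, not before it.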

Quality check before handing off

Run a quick eval of the winning checkpoint against the held-out set: per-class prediction counts (if it collapsed to one class, evaluate the 2nd or 3rd best instead) and a comparison to a zero-shot ChangeNet baseline (if AutoML did not improve over zero-shot, surface that and pause). See references/handoff.md.
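The per-class prediction count check can be sketched as follows, assuming the agent already has the predicted labels for each of the top-ranked recs on the held-out set. The function name, the rec names, and the `DEFECT` label are hypothetical, and "collapsed" is taken here in the crudest sense: every prediction falls in one class.

```python
from collections import Counter


def pick_non_collapsed(ranked_candidates, max_rank=3):
    """Return (name, per-class counts) of the best-ranked rec that predicts >1 class.

    `ranked_candidates` is a best-first list of (name, predicted_labels).
    Only the top `max_rank` entries are considered; (None, {}) means all collapsed.
    """
    for name, predictions in ranked_candidates[:max_rank]:
        counts = Counter(predictions)
        if len(counts) > 1:
            return name, dict(counts)
    return None, {}


# Hypothetical predictions: the top rec predicts PASS for everything.
ranked = [
    ("rec_4", ["PASS"] * 6),
    ("rec_2", ["PASS"] * 4 + ["DEFECT"] * 2),
]
chosen, pred_counts = pick_non_collapsed(ranked)
```

A stricter variant could require a minimum share of minority-class predictions; the threshold for that would be a project-specific choice.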

Phase 2 — DEFT loop (plain training, baseline pre-seeded from Phase 1)

Invoke tao-skill-bank:tao-run-deft-aoi (read its SKILL.md for the full interface). For non-AOI applications, invoke the matching DEFT skill; the handoff shape is the same.

The DEFT loop's baseline-train sub-step is skipped. Phase 1 already produced a checkpoint trained at the winning HPs, and Phase 1's handoff (see above) pre-populated ${RESULTS_DIR}/baseline/train/ and ${RESULTS_DIR}/deft_state.json so DEFT resumes at baseline inference → evaluate → RCA → iter 1. The rest of the DEFT loop runs unchanged. Do not modify its automl_policy: off invariant.

The DEFT loop owns:

The Pre-Flight Summary display step — not a fresh user gate. The Consolidated Pre-Flight (above) is the single gate; the DEFT summary still prints as an audit-trail display of the pre-seeded baseline/train/ source but must not re-prompt, since every input was collected in the consolidated gate.
Baseline inference → evaluate → RCA on the pre-seeded checkpoint, and the full per-iteration RCA → routing → SDG → mining → assemble → train cycle.
KPI gating and stop conditions; ${RESULTS_DIR}/ layout, deft_state.json, loop_log.jsonl, DEFT_Loop_Report.html.

After the loop exits (KPI met or max_iterations reached), capture two values from deft_state.json:

iterations.<best>.best_ckpt_path — the loop's best plain-train checkpoint
The final iteration label N_final — used to locate the augmented training CSV

If the DEFT loop hard-stops on an unrecoverable gate, skip Phase 3. There is no validated augmented CSV to feed AutoML.
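Locating the augmented training CSV from the final iteration label can be sketched as below. The path template comes from this document, but the iteration key format (`iter1`, `iter2`, ...) inside `deft_state.json` is an assumption, as are the function names and the example state.

```python
import re
from pathlib import PurePosixPath


def final_iteration_number(state):
    """Return the highest N among iteration keys shaped like 'iter<N>' (assumed format)."""
    numbers = []
    for label in state["iterations"]:
        match = re.fullmatch(r"iter(\d+)", label)
        if match:
            numbers.append(int(match.group(1)))
    if not numbers:
        raise ValueError("no completed DEFT iterations; skip Phase 3")
    return max(numbers)


def augmented_csv_path(results_dir, n_final):
    """Build ${RESULTS_DIR}/iter${N}/dataset/train_combined_iter${N}.csv."""
    root = PurePosixPath(results_dir)
    return str(root / f"iter{n_final}" / "dataset" / f"train_combined_iter{n_final}.csv")


# Hypothetical state, standing in for the parsed deft_state.json.
state = {"iterations": {"baseline": {}, "iter1": {}, "iter2": {}}}
n_final = final_iteration_number(state)
train_csv = augmented_csv_path("/results", n_final)
```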

Phase 3 — AutoML refinement on the DEFT-augmented dataset

Re-invoke tao-skill-bank:tao-run-automl with the augmented training CSV as the train dataset, the same held-out validation CSV as before, and Phase 2's iter winner checkpoint as the warm-start:

| Input | AOI value |
| --- | --- |
| `network_arch` | `visual-changenet` |
| `train_dataset_uri` | `${RESULTS_DIR}/iter${N_final}/dataset/train_combined_iter${N_final}.csv` |
| `eval_dataset_uri` | Same as Phase 1 (`<workspace>/train/base/validation_set.csv`), to keep the comparison apples-to-apples |
| `metric` | Same metric as Phase 1 |
| `algorithm` | Same as Phase 1 |
| `automl_max_recommendations` | 5–10 |
| Initial spec | Start from `<workspace>/specs/baseline_spec_automl.yaml` (Phase 1's winner), which gives the sweep a strong centroid to refine around |
| Warm-start checkpoint | `iterations.<best>.best_ckpt_path` from `${RESULTS_DIR}/deft_state.json`; set `spec_overrides["train"]["pretrained_model_path"]` to this path. Each Phase 3 rec then fine-tunes from Phase 2's winner instead of training from scratch. |

The warm-start is mandatory. Without it, every rec starts from random init with only 10-20 epochs to reconverge, Phase 3's val_loss regresses by 0.03-0.05 vs iter1, and the _pick_best safety net silently rolls back to the iter winner, wasting Phase 3's compute. references/handoff.md covers the rest: the concrete spec_overrides code (selecting the lowest-far_pct iteration, excluding any prior final_automl), the broad-exploration tradeoff, output to ${RESULTS_DIR}/final_automl/, and wiring Phase 3's checkpoint back into the DEFT report via iterations.final_automl and a re-run of prepare_inference_spec.py (with the _pick_best regression safety net).
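A minimal sketch of the warm-start wiring is shown below; the authoritative version is in references/handoff.md. It assumes each entry under `iterations` in `deft_state.json` carries a `far_pct` and a `best_ckpt_path` field. The function names and the example state (labels, values, paths) are hypothetical.

```python
def pick_best_iteration(state):
    """Return the label of the lowest-far_pct iteration, ignoring any prior final_automl."""
    candidates = {
        label: info
        for label, info in state["iterations"].items()
        if label != "final_automl" and "far_pct" in info and "best_ckpt_path" in info
    }
    if not candidates:
        raise ValueError("no iteration with far_pct and best_ckpt_path; cannot warm-start")
    return min(candidates, key=lambda label: candidates[label]["far_pct"])


def phase3_spec_overrides(state):
    """Build spec_overrides so every Phase 3 rec fine-tunes from Phase 2's winner."""
    best = pick_best_iteration(state)
    ckpt = state["iterations"][best]["best_ckpt_path"]
    return {"train": {"pretrained_model_path": ckpt}}


# Hypothetical state, standing in for the parsed deft_state.json.
state = {
    "iterations": {
        "baseline": {"far_pct": 12.0, "best_ckpt_path": "/results/baseline/train/best.pth"},
        "iter1": {"far_pct": 7.5, "best_ckpt_path": "/results/iter1/train/best.pth"},
        "iter2": {"far_pct": 9.0, "best_ckpt_path": "/results/iter2/train/best.pth"},
        "final_automl": {"far_pct": 5.0, "best_ckpt_path": "/results/final_automl/best.pth"},
    }
}
spec_overrides = phase3_spec_overrides(state)
```

Excluding `final_automl` matters on re-runs: otherwise a previous Phase 3 result could be selected as its own warm-start.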

Pitfalls and quality checks

These apply to both AutoML phases — bake them into agent behavior, don't just paste once. The full detail is in references/pitfalls.md:

Metric pitfalls (AOI is class-imbalanced). ChangeNet AOI is PASS-dominant; val_loss can mode-collapse to a zero-recall PASS-everything model. Prefer FAR @ 100%-recall directly, or gate val_loss with a pred_counts sanity check, or decide top-K by FAR @ 100%-recall. For balanced / regression tasks, val_loss is fine.
Run-to-run noise. AutoML can show 2–3× metric variance for the same config. If the winner looks suspiciously better than the runner-up, re-run with a fresh seed before committing the spec to Phase 2.
Cleanliness (data leakage). Both AutoML phases use a validation set distinct from the KPI test set (kpi/testing_set.csv), which stays untouched until DEFT's evaluate stage. Phase 3 trains on the augmented CSV but keeps the same val set so Phase 1 and Phase 3 numbers stay comparable.
Compute budget. Surface the per-phase structure up front and only give a wall-clock range after the user supplies their per-job time.
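To make the metric pitfall concrete, here is a sketch of FAR @ 100% recall under one common definition, which is an assumption since this document does not spell out the computation: choose the highest threshold that still flags every true defect, then report the fraction of PASS samples flagged at that threshold. Scores are assumed to be "higher means more likely defect", and labels use 1 for defect and 0 for PASS.

```python
def far_at_full_recall(scores, labels):
    """False-alarm rate at the threshold where every defect (label 1) is still caught."""
    defect_scores = [s for s, y in zip(scores, labels) if y == 1]
    pass_scores = [s for s, y in zip(scores, labels) if y == 0]
    if not defect_scores or not pass_scores:
        raise ValueError("need at least one defect and one PASS sample")

    # Highest threshold that keeps recall at 100%: the lowest defect score.
    threshold = min(defect_scores)
    false_alarms = sum(1 for s in pass_scores if s >= threshold)
    return false_alarms / len(pass_scores)


# Hypothetical scores for three defects followed by three PASS samples.
labels = [1, 1, 1, 0, 0, 0]
far_healthy = far_at_full_recall([0.9, 0.8, 0.4, 0.6, 0.2, 0.1], labels)

# A mode-collapsed model that scores everything identically.
far_collapsed = far_at_full_recall([0.0] * 6, labels)
```

In this toy data the collapsed model gets the worst possible FAR (every PASS sample is a false alarm), even though such a model can look acceptable on val_loss when PASS dominates. That is why the metric is preferred for ranking recs.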

Quick Start (AOI worked example)

When starting fresh from "run the AOI workflow", the agent first delivers a three-phase worded message to the user: Phase 1 AutoML baseline → Phase 2 DEFT loop → Phase 3 AutoML refinement, with the cost framing and an "OK to proceed?" close. After confirmation it invokes tao-run-automl (Phase 1), writes the merged spec, pre-seeds deft_state.json, invokes tao-run-deft-aoi (Phase 2) with every input pre-supplied, and invokes tao-run-automl again (Phase 3). There are no further pauses unless a downstream skill hits an unrecoverable hard-stop gate. Finally, the agent summarizes the trajectory (baseline AutoML best → DEFT iter 1 → ... → DEFT iter N_final → Phase 3 best).

See references/quick-start.md for the verbatim customer-facing message and the exact post-confirmation invoke sequence.

Non-AOI DEFT applications

The same three-phase pattern applies to other DEFT skills — swap network_arch, the Phase 2 DEFT skill, the spec/checkpoint path conventions, and the Phase 3 augmented-CSV path. The handoff shape (Phase 1 emits spec + checkpoint that pre-seeds the DEFT baseline, Phase 2 emits an augmented dataset, Phase 3 emits the final checkpoint) is identical, and the baseline-skip mechanism is generic to any DEFT-style loop with a resumable baseline state. See references/quick-start.md.
