Nemo Mbridge Recipe Recommender

Recommend and customize Megatron Bridge recipes for a user's model, GPU count, and training goal. Indexes library recipes (pretrain/SFT/PEFT) and performance recipes.

Published by @NVIDIA

Auto Recipe — Recipe Index & Recommendation

This skill indexes every shipped recipe and helps users pick the right starting config, adjust parallelism, and avoid common pitfalls.

How to Use This Skill

  1. Ask the user for: model name/size, GPU count & type, training goal (pretrain / SFT / PEFT), and sequence length (if non-default).
  2. Look up the best-match recipe in the index below.
  3. Recommend the recipe function name + entry-point command.
  4. Provide adjustment advice (parallelism resizing, batch tuning, pitfalls).
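
The lookup in steps 2 and 3 can be sketched in a few lines of Python. This is a minimal illustration, not part of Megatron Bridge: `LLAMA_INDEX` and `recommend` are hypothetical names, and the table is a hand-copied subset of the Llama rows from the Library Recipe Index.

```python
# Hypothetical helper: a hand-copied subset of the Llama rows in the index.
# Keys are (model, mode); values are (recipe function name, minimum GPUs).
LLAMA_INDEX = {
    ("llama3_8b", "pretrain"): ("llama3_8b_pretrain_config", 2),
    ("llama3_8b", "sft"): ("llama3_8b_sft_config", 2),
    ("llama3_8b", "peft"): ("llama3_8b_peft_config", 1),
    ("llama3_70b", "pretrain"): ("llama3_70b_pretrain_config", 32),
    ("llama3_70b", "sft"): ("llama3_70b_sft_config", 16),
    ("llama3_70b", "peft"): ("llama3_70b_peft_config", 8),
}


def recommend(model, mode, num_gpus):
    """Return (recipe, note) for a model/mode pair, or None if not indexed."""
    entry = LLAMA_INDEX.get((model, mode))
    if entry is None:
        return None
    recipe, min_gpus = entry
    if num_gpus < min_gpus:
        note = f"below the recipe minimum of {min_gpus} GPUs"
    elif num_gpus > min_gpus:
        note = f"above the recipe minimum of {min_gpus} GPUs"
    else:
        note = "matches the recipe default"
    return recipe, note


print(recommend("llama3_70b", "sft", 16))
```

A `None` result corresponds to the "no exact match" branch, where the closest size is used and parallelism is adjusted by hand.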

First Answer Checklist

When recommending recipes, always include these distinctions before the long index details:

  1. Library recipes under src/megatron/bridge/recipes/ are for functional training and use scripts/training/run_recipe.py.
  2. Performance recipes under scripts/performance/ are for upper-bound throughput benchmarks. They use mock data and should not be presented as production training recipes.
  3. For a first-time Bridge smoke test, recommend llama3_8b_sft_config with mock data via --dataset llm-pretrain-mock. Do not use llm-finetune for the setup-only tryout unless the user specifically asks for an SFT data path.
  4. For normal SFT recommendations, use --dataset llm-finetune; for pretrain and mock validation recommendations, use --dataset llm-pretrain-mock.
  5. After the recipe and dataset, give the required resizing rules: TP must divide num_key_value_heads, keep TP within one node unless using NVL72-class interconnect, enable SP when TP > 1, configure CP for long context, DP is implicit, and reduce micro_batch_size first on OOM.

Entry Points

Library recipes (functional training)

```bash
# Pretrain with mock data
uv run python -m torch.distributed.run --nproc_per_node=8 scripts/training/run_recipe.py \
    --recipe <recipe_function_name> \
    --dataset llm-pretrain-mock

# SFT with SQuAD
uv run python -m torch.distributed.run --nproc_per_node=8 scripts/training/run_recipe.py \
    --recipe <recipe_function_name> \
    --dataset llm-finetune

# Override any field via CLI
uv run python -m torch.distributed.run --nproc_per_node=8 scripts/training/run_recipe.py \
    --recipe llama3_8b_pretrain_config \
    --dataset llm-pretrain-mock \
    'model.tensor_model_parallel_size=2' \
    'training.global_batch_size=64'
```
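
When these commands are generated programmatically, a small builder keeps the flag order consistent. The sketch below is an assumption-laden convenience: `build_run_recipe_cmd` is a hypothetical helper, and it uses only the flags and override syntax shown in the commands above.

```python
import shlex


def build_run_recipe_cmd(recipe, dataset, nproc_per_node=8, overrides=None):
    """Assemble the run_recipe.py launch command as an argument list."""
    cmd = [
        "uv", "run", "python", "-m", "torch.distributed.run",
        f"--nproc_per_node={nproc_per_node}",
        "scripts/training/run_recipe.py",
        "--recipe", recipe,
        "--dataset", dataset,
    ]
    # Overrides are dotted key=value pairs appended after the named flags.
    for key, value in (overrides or {}).items():
        cmd.append(f"{key}={value}")
    return cmd


cmd = build_run_recipe_cmd(
    "llama3_8b_pretrain_config",
    "llm-pretrain-mock",
    overrides={
        "model.tensor_model_parallel_size": 2,
        "training.global_batch_size": 64,
    },
)
print(shlex.join(cmd))
```

Returning a list rather than a string lets the caller pass it straight to `subprocess.run` without shell quoting concerns.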

Performance recipes (throughput benchmarks)

```bash
python scripts/performance/run_script.py \
    --recipe <model_family> \
    --gpu_type h100 \
    --num_gpus 64 \
    --data mock
```

See the Performance Recipe Index for important caveats before using these for anything beyond throughput benchmarking.


Recipe Unification (Coming Soon — PR #2803)

PR #2803 is unifying performance recipes into the same Python function format used by library recipes. Key changes:

  • Perf recipes move from scripts/performance/configs/ to src/megatron/bridge/recipes/<family>/<model>_perf.py
  • Each perf recipe becomes a self-contained Python function (e.g. llama3_8b_h100_bf16_pretrain_config())
  • The old WorkloadBaseConfig → set_workload_base_configs → get_perf_optimized_recipe pipeline is removed
  • Shared helpers: _benchmark_common() (50 iters, timing, TE RNG), _perf_precision() (bf16 / fp8_cs / fp8_mx / nvfp4)

Why Python, not YAML? Previous YAML-based approaches had problems: recipe logic was split across multiple indirection layers, configs were not self-contained, and the two-level pipeline made maintenance and debugging difficult. Python functions are explicit, greppable, and composable.

After #2803 lands, both library and perf recipes will be invocable through the same run_recipe.py entry point.
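
To make "self-contained and composable" concrete, here is a rough sketch of the shape such a function could take. Everything in it is a hypothetical stand-in: plain dicts replace the real ConfigContainer, the `sketch_*` names only imitate the roles of the helpers listed above, and the dictionary keys are invented for illustration. The actual signatures in PR #2803 may differ.

```python
# Hypothetical stand-ins only: plain dicts replace the real ConfigContainer,
# and the sketch_* helpers imitate the roles described above, not real APIs.
PRECISIONS = ("bf16", "fp8_cs", "fp8_mx", "nvfp4")


def sketch_benchmark_common(cfg):
    """Apply benchmark defaults shared by every perf recipe."""
    cfg["train_iters"] = 50           # perf recipes run 50 iterations
    cfg["log_timing"] = True          # hypothetical key for timing output
    cfg["use_te_rng_tracker"] = True  # TE RNG tracker, per the helper summary
    return cfg


def sketch_perf_precision(cfg, precision):
    """Record one of the precision modes named in the helper summary."""
    if precision not in PRECISIONS:
        raise ValueError(f"unknown precision: {precision}")
    cfg["precision"] = precision
    return cfg


def sketch_llama3_8b_h100_bf16_pretrain_config():
    """One function per model, GPU type, and precision; no hidden layers."""
    cfg = {"model": "llama3_8b", "gpu_type": "h100", "data": "mock"}
    cfg = sketch_benchmark_common(cfg)
    return sketch_perf_precision(cfg, "bf16")


print(sketch_llama3_8b_h100_bf16_pretrain_config())
```

The point of the shape is that every setting is visible by reading one function top to bottom, which is what makes the recipe greppable.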


Library Recipe Index

All recipes live under src/megatron/bridge/recipes/. Each function returns a ConfigContainer with model, training, optimizer, and data settings.

Llama

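| Recipe | Mode | TP | PP | CP | SP | GPUs (min) | Seq Len |
|---|---|---|---|---|---|---|---|
| llama2_7b_pretrain_config | Pretrain | 2 | 1 | - | - | 2 | 4K |
| llama3_8b_pretrain_config | Pretrain | 2 | 1 | - | - | 2 | 8K |
| llama3_8b_16k_pretrain_config | Pretrain | 2 | 1 | 2 | - | 4 | 16K |
| llama3_8b_64k_pretrain_config | Pretrain | 2 | 1 | 4 | - | 8 | 64K |
| llama3_8b_128k_pretrain_config | Pretrain | 2 | 1 | 8 | - | 16 | 128K |
| llama3_70b_pretrain_config | Pretrain | 8 | 4 | - | - | 32 | 8K |
| llama3_70b_16k_pretrain_config | Pretrain | 8 | 4 | 2 | - | 64 | 16K |
| llama3_70b_64k_pretrain_config | Pretrain | 8 | 4 | 4 | - | 128 | 64K |
| llama31_405b_pretrain_config | Pretrain | 8 | 16 | - | - | 128 | 8K |
| llama3_8b_sft_config | SFT | 2 | 1 | - | - | 2 | 8K |
| llama3_70b_sft_config | SFT | 4 | 4 | - | - | 16 | 8K |
| llama31_405b_sft_config | SFT | 8 | 8 | - | - | 64 | 8K |
| llama3_8b_peft_config | PEFT | 1 | 1 | - | - | 1 | 8K |
| llama3_70b_peft_config | PEFT | 2 | 4 | - | - | 8 | 8K |
| llama31_405b_peft_config | PEFT | 4 | 8 | - | - | 32 | 8K |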

Qwen2 / Qwen2.5

| Recipe | Mode | TP | PP | Sizes |
|---|---|---|---|---|
| qwen2_*_{pretrain,sft,peft}_config | All | 1–8 | 1–4 | 500M, 1.5B, 7B, 14B, 32B, 72B |
| qwen25_*_{pretrain,sft,peft}_config | All | 1–8 | 1–4 | 500M, 1.5B, 3B, 7B, 14B, 32B, 72B |

Qwen3 (Dense)

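| Recipe | Mode | TP | PP | CP | Sizes |
|---|---|---|---|---|---|
| qwen3_*_pretrain_config | Pretrain | 1–8 | 1–2 | - | 600M–32B |
| qwen3_*_sft_config | SFT | 1–8 | 1–2 | - | 600M–32B |
| qwen3_600m_sft_128k_config | SFT | 1 | 1 | 8 | 600M (128K seq) |
| qwen3_*_peft_config | PEFT | 1 | 1 | - | 600M–32B |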

Qwen3 MoE

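| Recipe | Mode | TP | PP | EP | CP | GPUs |
|---|---|---|---|---|---|---|
| qwen3_30b_a3b_pretrain_config | Pretrain | 1 | 1 | 8 | - | 8 |
| qwen3_30b_a3b_sft_config | SFT | 1 | 1 | 8 | - | 8 |
| qwen3_30b_a3b_peft_config | PEFT | 1 | 1 | 1 | - | 1 |
| qwen3_235b_a22b_pretrain_config | Pretrain | 4 | 16 | 8 | 2 | 512+ |
| qwen3_235b_a22b_sft_config | SFT | 4 | 8 | 8 | - | 256 |
| qwen3_235b_a22b_peft_config | PEFT | 1 | 4 | 4 | - | 16 |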

Qwen3-Next

| Recipe | Mode | TP | PP | EP |
|---|---|---|---|---|
| qwen3_next_80b_a3b_pretrain_config | Pretrain | 1 | 4 | 8 |
| qwen3_next_80b_a3b_sft_config | SFT | 1 | 2 | 8 |
| qwen3_next_80b_a3b_peft_config | PEFT | 1 | 1 | 4 |

DeepSeek

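| Recipe | Mode | TP | PP | EP | GPUs |
|---|---|---|---|---|---|
| deepseek_v2_lite_pretrain_config | Pretrain | 1 | 1 | 8 | 8 |
| deepseek_v2_pretrain_config | Pretrain | 1 | 4 | 32 | 128 |
| deepseek_v3_pretrain_config | Pretrain | 2 | 16 | 64 | 2048 |
| deepseek_v3_pretrain_config_32nodes | Pretrain | 2 | 8 | 32 | 256 |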

GLM-4.5

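| Recipe | Mode | TP | PP | EP | GPUs |
|---|---|---|---|---|---|
| glm45_355b_pretrain_config | Pretrain | 2 | 8 | 16 | 256 |
| glm45_air_106b_pretrain_config | Pretrain | 1 | 4 | 8 | 32 |
| glm45_355b_sft_config | SFT | 2 | 8 | 16 | 256 |
| glm45_air_106b_sft_config | SFT | 1 | 4 | 8 | 32 |
| glm45_355b_peft_config | PEFT | 2 | 4 | 4 | 32 |
| glm45_air_106b_peft_config | PEFT | 1 | 2 | 4 | 8 |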

Gemma

| Recipe | Mode | TP | PP | Sizes |
|---|---|---|---|---|
| gemma2_*_{pretrain,sft,peft}_config | All | 2–8 | 1–2 | 2B, 9B, 27B |
| gemma3_1b_{pretrain,sft,peft}_config | All | 1 | 1 | 1B (32K seq) |

NemotronH / Nemotron

| Recipe | Mode | TP | PP | EP | Notes |
|---|---|---|---|---|---|
| nemotronh_{4b,8b,47b,56b}_*_config | P/S/PEFT | 1–8 | 1–4 | - | Dense SSM-hybrid |
| nemotron_3_nano_*_config | P/S/PEFT | varies | 1 | 8 | MoE + Mamba |
| nemotron_3_super_*_config | P/S/PEFT | 4 | 1 | 8 | MoE + Mamba, ~40% CUDA graph gain |
| nemotron_nano_{9b,12b}_v2_*_config | P/S/PEFT | varies | 1 | - | Dense |

Other Models

| Recipe | Mode | Notes |
|---|---|---|
| moonlight_16b_{pretrain,sft,peft}_config | All | MoE EP=8 |
| olmoe_7b_{pretrain,sft,peft}_config | All | MoE EP=8 |
| ministral3_{3b,8b,14b}_{sft,peft}_config | SFT/PEFT | Dense |
| gpt_oss_20b_*_config | All | MoE + FP8/MXFP8 variants |
| gpt_oss_120b_*_config | All | MoE |
| vanilla_gpt_pretrain_config | Pretrain | MLM/Bridge parity baseline |
| gpt3_175b_pretrain_config | Pretrain | TP=4, PP=8, VP=6 |
| kimi_k2_pretrain_config | Pretrain | 1T MoE, TP=2 PP=16 EP=32 |

VLM Recipes

| Recipe | Mode | TP | PP | EP | GPUs |
|---|---|---|---|---|---|
| gemma3_vl_{4b,12b,27b}_{sft,peft}_config | SFT/PEFT | 1–8 | 1–2 | - | 1–16 |
| qwen25_vl_{3b,7b,32b,72b}_{sft,peft}_config | SFT/PEFT | 1–8 | 1–4 | - | 1–32 |
| qwen3_vl_{8b,30b_a3b,235b_a22b}_{sft,peft}_config | SFT/PEFT | 1–4 | 1–8 | 1–32 | 1–512 |
| qwen35_vl_*_{sft,peft}_config | SFT/PEFT | varies | varies | varies | varies |
| glm_45v_{sft,peft}_config | SFT/PEFT | 1 | 8 | 4–16 | 64–512 |
| nemotron_nano_v2_vl_12b_{sft,peft}_config | SFT/PEFT | 2–4 | 1 | - | 8 |

Diffusion Recipes

| Recipe | Mode | TP | CP |
|---|---|---|---|
| wan_1_3B_{pretrain,sft}_config | P/SFT | 1 | 8 |
| wan_14B_{pretrain,sft}_config | P/SFT | 2 | 4 |
| flux_12b_{pretrain,sft}_config | P/SFT | 2 | 1 |

Performance Recipe Index

All perf recipes live under scripts/performance/. They are invoked via run_script.py and use WorkloadBaseConfig presets per GPU type.

Important: Perf recipes are designed for upper-bound throughput benchmarks, not production training. They run 50 iterations on mock data by default. Throughput numbers are aspirational targets, not validated convergence configs.

Llama 3 / 3.1

| Model | GPUs | GPU Types | Key Features |
|---|---|---|---|
| Llama 3 8B | 8 | H100, B200, B300, GB200, GB300, R100 | CUDA graphs (local), FSDP on GB variants |
| Llama 3 70B | 64 | H100, B200, B300, GB200, GB300 | TP comm overlap (userbuffers), FSDP, CUDA graphs |
| Llama 3.1 405B | 128–1024 | H100, B200, B300, GB200, GB300 | TP+CP comm overlap (userbuffers), FSDP, heavy PP/VP |

SFT/LoRA variants also exist (e.g. 8B SFT with packed sequences, 70B SFT on 32 GPUs).

DeepSeek V3

| Model | GPUs | GPU Types | Key Features |
|---|---|---|---|
| DeepSeek V3 (671B MoE) | 256–1024 | H100, B200, B300, GB200, GB300 | HybridEP dispatcher, MLA recompute, CUDA graphs (TE scoped) |

Qwen3 MoE

| Model | GPUs | GPU Types | Key Features |
|---|---|---|---|
| Qwen3 30B-A3B | 8–16 | H100, B200, B300, GB200, GB300 | MoE alltoall/flex dispatcher |
| Qwen3 235B-A22B | 64–256 | H100, B200, B300, GB200, GB300 | TP comm overlap, CUDA graphs, MoE a2a overlap |
| Qwen3-Next 80B-A3B | 64–128 | H100, B200, B300, GB200, GB300 | EP 64–128 |

Qwen3-VL

| Model | GPUs | GPU Types | Key Features |
|---|---|---|---|
| Qwen3-VL 30B-A3B | 8–16 | H100, B200, B300, GB200, GB300 | VLM + MoE |
| Qwen3-VL 235B-A22B | 64–256 | H100, B200, B300, GB200, GB300 | VLM + MoE, TP comm overlap |

Kimi K2

| Model | GPUs | GPU Types | Key Features |
|---|---|---|---|
| Kimi K2 (1T MoE) | 256–1024 | H100, B200, B300, GB200, GB300 | Muon/Adam optimizer, HybridEP, pipeline layout helpers |

NemotronH

| Model | GPUs | GPU Types | Key Features |
|---|---|---|---|
| Nemotron 3 Nano (30B MoE+Mamba) | 8–16 | H100, B200, B300, GB200, GB300 | TE CUDA graphs (attn+mamba+moe), HybridEP |
| Nemotron 3 Super | 64 | H100, B200, B300, GB200, GB300 | TE CUDA graphs, EP=64 |
| NemotronH 56B | 64 | H100, B200, B300 | TP=2–8, TE graphs (mamba+attn) |

GPT-OSS

| Model | GPUs | GPU Types | Key Features |
|---|---|---|---|
| GPT-OSS 120B | 64 | H100, B200, GB200 | EP=64, HybridEP on GB200 |

Recommendation Decision Tree

User wants to train a model
│
├─ Know the model name?
│   ├─ Yes → Look up in Library Recipe Index above
│   │   ├─ Has a recipe for their size + mode? → Use it directly
│   │   └─ No exact match? → Use closest size, adjust parallelism
│   └─ No → Ask for model name, size, and HF model ID
│
├─ What's the training goal?
│   ├─ Pretrain → Use *_pretrain_config
│   ├─ SFT (full fine-tune) → Use *_sft_config
│   └─ PEFT (LoRA/DoRA) → Use *_peft_config (lowest GPU requirement)
│
├─ How many GPUs?
│   ├─ 1 GPU → Only PEFT recipes work (TP=1, PP=1)
│   ├─ 8 GPUs (1 node) → Most 8B–16B models, small MoE (EP=8)
│   ├─ 16–64 GPUs → 70B dense, medium MoE
│   └─ 128+ GPUs → 405B+, large MoE (DeepSeek V3, Kimi K2)
│
├─ Want throughput benchmarks?
│   ├─ Yes → Use perf recipes (scripts/performance/)
│   │   └─ ⚠️ These run on mock data for upper-bound perf only
│   └─ No → Use library recipes (scripts/training/run_recipe.py)
│
└─ Long context?
    ├─ > 8K → Need CP (context parallelism), check *_16k / *_64k / *_128k variants
    └─ ≤ 8K → Default recipes work
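
The goal, GPU-count, and context-length branches of the tree can be written as three small functions. This is a minimal sketch with hypothetical names. The tree lists only 1, 8, 16–64, and 128+ GPUs, so treating the values in between as belonging to the next tier up is an assumption, as is taking 8K to mean 8192 tokens.

```python
def recipe_suffix(goal):
    """Map a training goal to the recipe-name suffix used in the index."""
    suffixes = {
        "pretrain": "_pretrain_config",
        "sft": "_sft_config",
        "peft": "_peft_config",
    }
    return suffixes[goal.lower()]


def scale_tier(num_gpus):
    """Rough model class per GPU count; in-between counts are an assumption."""
    if num_gpus == 1:
        return "PEFT recipes only (TP=1, PP=1)"
    if num_gpus <= 8:
        return "most 8B-16B models, small MoE (EP=8)"
    if num_gpus <= 64:
        return "70B dense, medium MoE"
    return "405B+, large MoE"


def needs_context_parallel(seq_length, default_max=8192):
    """Sequences longer than 8K call for CP or a *_16k/_64k/_128k variant."""
    return seq_length > default_max


print(recipe_suffix("SFT"), "|", scale_tier(8), "|", needs_context_parallel(32768))
```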

Adjustment Advice (When Recommending)

Parallelism Resizing Rules

When the user's GPU count differs from the recipe default:

  1. TP must divide num_key_value_heads (GQA constraint). E.g. if num_key_value_heads=8, valid TP = {1, 2, 4, 8}.
  2. TP should stay within a single node (NVLink). TP > 8 requires inter-node NVLink (e.g., GB200 NVL72).
  3. PP adds pipeline bubbles. Minimize PP; only increase when TP alone can't fit the model. Use VP (virtual pipeline) to mitigate bubble overhead.
  4. EP doesn't reduce dense-layer memory. Only expert parameters shard with EP. Shared attention/embeddings are replicated. For "OOM with MoE", increase EP first, not TP.
  5. SP should be True whenever TP > 1. It eliminates redundant activation copies and is essentially free.
  6. CP requires all-to-all or ring attention. Check cp_comm_type. For GQA models, a2a+p2p hierarchical CP allows CP > num_kv_heads.
  7. world_size = DP × TP × PP × CP × EP. DP is implicit. Make sure the product of explicit parallelisms divides your total GPU count.
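
Rules 1 and 7 are simple arithmetic and easy to check before launching a job. The sketch below uses hypothetical helper names. The `max_tp=8` default assumes 8 GPUs per node (rule 2), and `implied_dp` divides by whatever explicit sizes the caller passes, following rule 7 as written.

```python
from math import prod


def valid_tp_sizes(num_key_value_heads, max_tp=8):
    """TP sizes that divide num_key_value_heads, capped at one node."""
    return [tp for tp in range(1, max_tp + 1) if num_key_value_heads % tp == 0]


def implied_dp(world_size, **explicit_sizes):
    """Data-parallel size left over after the explicit parallel sizes."""
    explicit = prod(explicit_sizes.values())
    if world_size % explicit != 0:
        raise ValueError(
            f"explicit parallelism product {explicit} "
            f"does not divide world_size {world_size}"
        )
    return world_size // explicit


print(valid_tp_sizes(8))          # [1, 2, 4, 8]
print(implied_dp(8, tp=2, pp=1))  # 4
```

The second call mirrors running a TP=2, PP=1 recipe on 8 GPUs: the remaining factor of 4 becomes data parallelism.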

Batch Size Tuning

  • Start with the recipe's micro_batch_size. If OOM, reduce to 1.
  • global_batch_size determines learning dynamics. Scale with DP: GBS = micro_batch_size × DP × gradient_accumulation_steps.
  • For MoE, micro_batch_size=1 is typical at scale.

Common Pitfalls to Warn About

| Pitfall | Symptom | Fix |
|---|---|---|
| TP > num_kv_heads | Crash: "TP must divide num_query_groups" | Reduce TP to a divisor of num_kv_heads |
| PP without VP | Poor throughput (large bubble) | Set virtual_pipeline_model_parallel_size |
| EP too low for large MoE | OOM on expert params | Increase EP; each EP rank holds num_experts / EP experts |
| CUDA graphs + packed sequences | Assert: "CUDA graph accepts only Tensor inputs" | Disable packing or use local full-iteration graphs |
| CUDA graphs + full recompute | Assert: "full recompute only with full iteration CUDA graph" | Disable recompute or switch to local impl |
| use_te_rng_tracker not set | Assert on provider init when CUDA graphs enabled | Set cfg.model.use_te_rng_tracker = True and cfg.rng.te_rng_tracker = True |
| FSDP + TP > 1 on H100 | Possible comm bottleneck | Prefer FSDP with TP=1 or TP=2 on H100; FSDP shines on GB/B-series |
| Long context without CP | OOM on activations | Add CP=2/4/8; use *_16k, *_64k, or *_128k recipe variants |
| MoE overlap_grad_reduce on H100 | May hurt perf (False in many H100 presets) | Set overlap_grad_reduce=False for MoE on H100 |
| VLM SFT missing image data | Runs but produces garbage | Provide actual multimodal dataset or use mock VLM data |
| Qwen35-VL MoE FSDP | Tested on Blackwell only | May not work on H100; validate first |
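
Several of these pitfalls can be caught with a static check before a job is submitted. The sketch below is hypothetical: it reads a flat dict rather than a real ConfigContainer, the key names follow the override fields used in this document plus `pipeline_model_parallel_size`, and the layout is an assumption for illustration.

```python
def preflight(cfg, num_key_value_heads):
    """Return warnings for a flat, hypothetical config dict."""
    warnings = []
    tp = cfg.get("tensor_model_parallel_size", 1)
    pp = cfg.get("pipeline_model_parallel_size", 1)
    cp = cfg.get("context_parallel_size", 1)

    # GQA constraint: TP must divide the number of KV heads.
    if num_key_value_heads % tp != 0:
        warnings.append(f"TP={tp} does not divide num_kv_heads={num_key_value_heads}")

    # Pipeline parallelism without virtual pipeline leaves a large bubble.
    if pp > 1 and not cfg.get("virtual_pipeline_model_parallel_size"):
        warnings.append("PP > 1 without virtual_pipeline_model_parallel_size")

    # CUDA graphs need both RNG tracker flags set.
    if cfg.get("cuda_graph_impl") not in (None, "none"):
        if not (cfg.get("use_te_rng_tracker") and cfg.get("te_rng_tracker")):
            warnings.append("CUDA graphs enabled but TE RNG tracker flags not set")

    # Sequences beyond 8K without CP tend to OOM on activations.
    if cfg.get("seq_length", 8192) > 8192 and cp == 1:
        warnings.append("seq_length > 8K without context parallelism")

    return warnings


for message in preflight(
    {"tensor_model_parallel_size": 3, "seq_length": 32768}, num_key_value_heads=8
):
    print(message)
```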

Recipe Override Examples

```bash
# Scale Llama3 8B from 2 GPUs to 8 GPUs (increase DP)
uv run python -m torch.distributed.run --nproc_per_node=8 scripts/training/run_recipe.py \
    --recipe llama3_8b_pretrain_config \
    --dataset llm-pretrain-mock

# Reduce parallelism for Qwen3-MoE 30B to fit on 4 GPUs
uv run python -m torch.distributed.run --nproc_per_node=4 scripts/training/run_recipe.py \
    --recipe qwen3_30b_a3b_sft_config \
    --dataset llm-finetune \
    'model.expert_model_parallel_size=4'

# Add long context to an existing recipe
uv run python -m torch.distributed.run --nproc_per_node=8 scripts/training/run_recipe.py \
    --recipe llama3_8b_pretrain_config \
    --dataset llm-pretrain-mock \
    'model.seq_length=32768' \
    'model.context_parallel_size=4'

# Enable CUDA graphs on any recipe
uv run python -m torch.distributed.run --nproc_per_node=8 scripts/training/run_recipe.py \
    --recipe qwen3_30b_a3b_pretrain_config \
    --dataset llm-pretrain-mock \
    'model.cuda_graph_impl=transformer_engine' \
    'model.cuda_graph_scope=[attn,moe_router,moe_preprocess]' \
    'model.use_te_rng_tracker=True' \
    'rng.te_rng_tracker=True'
```

Quick Reference: Which Recipe for My Situation?

| I want to... | Start with | GPUs needed |
|---|---|---|
| Try Bridge for the first time | llama3_8b_sft_config + mock data | 2 |
| Fine-tune a 7-8B model | llama3_8b_sft_config or qwen3_8b_sft_config | 2–8 |
| LoRA on 1 GPU | llama3_8b_peft_config or qwen3_8b_peft_config | 1 |
| Pretrain a dense 70B | llama3_70b_pretrain_config | 32–64 |
| Train a small MoE | qwen3_30b_a3b_pretrain_config | 8 |
| Train a large MoE (235B+) | qwen3_235b_a22b_pretrain_config | 256–512 |
| Benchmark throughput | Perf recipes via run_script.py | Varies |
| Long-context training | llama3_8b_128k_pretrain_config or add CP override | 16+ |
| VLM fine-tuning | qwen3_vl_8b_sft_config or gemma3_vl_*_sft_config | 4–8 |
| Diffusion training | wan_1_3B_pretrain_config or flux_12b_pretrain_config | 8 |

Code Anchors

| What | Path |
|---|---|
| Library recipes root | src/megatron/bridge/recipes/ |
| Recipe __init__.py (all exports) | src/megatron/bridge/recipes/__init__.py |
| Common recipe helpers | src/megatron/bridge/recipes/common.py |
| Training entry point | scripts/training/run_recipe.py |
| Perf recipes root | scripts/performance/ |
| Perf entry point | scripts/performance/run_script.py |
| Perf workload configs | scripts/performance/configs/<family>/ |
| Perf overrides (benchmark defaults) | scripts/performance/utils/overrides.py |
