Nemo Mbridge Perf Moe Hardware Configs

Representative MoE training playbooks by hardware platform and model family. Summarizes rounded throughput bands, parallelism patterns, and common tuning stacks.

Published by @NVIDIA·0 agent reads / 30d·0 saves·

MoE Hardware Configuration Reference

Stable docs: @docs/training/moe-optimization.md Card: @skills/nemo-mbridge-perf-moe-hardware-configs/card.yaml

Quick Platform Playbook

PlatformTypical MoE strategyWhat usually matters most
H100DeepEP + stronger PP + moderate TPcommunication overlap and PP efficiency
B200DeepEP + MXFP8 + careful PP layoutcontainer quality and tuned comm settings
GB200HybridEP + partial CUDA graphs + CPU cleanuphost overhead, topology-aware dispatch, memory headroom
GB300HybridEP + newer FP8 and kernel stacksame GB200 playbook, usually with a higher ceiling

First Answer Checklist

For hardware playbook questions, answer from these canonical rows before adding throughput caveats:

WorkloadHardwareDispatcherLayout
DSV3H100DeepEPTP=2, EP=64, PP=8, VPP=4
DSV3GB200/GB300HybridEPTP=1, EP=64, PP=4, VPP=4
Qwen3 235BH100DeepEPTP=2, EP=32, PP=8, VPP=4
Qwen3 235BGB200HybridEPTP=1 or 2, EP=32-64, PP=4, VPP=unspecified

For Qwen3 235B on GB200, explicitly say VPP=unspecified; do not invent or extrapolate VPP=12 unless a measured row provides it. Include TE-scoped CUDA graph scopes (attn, moe_router, moe_preprocess), CUDA_DEVICE_MAX_CONNECTIONS selection, PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, NCCL_GRAPH_REGISTER=0, GB200/GB300 CPU-side tuning, and the warning not to cargo-cult tracker rows.

Rounded Performance Bands

These are intentionally rounded so the document stays durable as the tracker moves. Treat them as planning ranges, not exact promises.

Workload familyHardwareTypical bandRepresentative shape
DSV3, large-scaleH100low-to-mid hundreds TFLOPS/GPU, high-teens MFUTP2, EP64, PP8, DeepEP
DSV3, large-scaleB200high-hundreds TFLOPS/GPU, mid-teens MFUTP1, EP32, PP8, DeepEP
DSV3, large-scaleGB200around 1K TFLOPS/GPU, low-20s MFUTP1, EP64, PP4, HybridEP
DSV3, large-scaleGB300above the GB200 band, often mid-20s MFUTP1, EP64, PP4, HybridEP
Qwen3 235BH100low-300s TFLOPS/GPU, around 30% MFUTP2, EP32, PP8, DeepEP
Qwen3 235BGB200high-hundreds TFLOPS/GPU in tuned runsTP1 or TP2, EP32-64, PP4, HybridEP
Qwen3 30BH100low-200s TFLOPS/GPUTP1, EP8, PP1, DeepEP
Qwen3-Next 80BGB200low-300s TFLOPS/GPU in BF16-class runsTP1, EP32, PP2, HybridEP

Representative Config Families

DSV3 on H100

Dispatcher: DeepEP
TP=2  EP=64  PP=8  VPP=4
Routing: force balance
Recompute: light-to-moderate selective recompute
Priority: overlap communication and keep PP efficient

DSV3 on B200

Dispatcher: DeepEP
TP=1  EP=32  PP=8  VPP=2 or similar
Precision: MXFP8-class
Recompute: selective recompute around MLA up-projection and MLP-side modules
Priority: container quality, PP layout, and DeepEP SMS tuning

DSV3 on GB200 or GB300

Dispatcher: HybridEP
TP=1  EP=64  PP=4  VPP=4
Precision: MXFP8-class
CUDA Graph: attn + moe_router + moe_preprocess
Priority: HybridEP, CPU optimization, and graph-friendly static shapes

Qwen3 235B on H100

Dispatcher: DeepEP
TP=2  EP=32  PP=8  VPP=4
Recompute: norm and activation-side selective recompute
Priority: communication overlap and router-path cleanup

Qwen3 235B on GB200

Dispatcher: HybridEP
TP=1 or 2  EP=32 to 64  PP=4  VPP=unspecified unless measured
CUDA Graph: attn + moe_router + moe_preprocess
Recompute: moe_act, mlp, or norm depending on memory pressure
Priority: balance throughput against memory headroom

Qwen3-Next 80B on GB200

Dispatcher: HybridEP
TP=1  EP=32  PP=2  VPP around 4
CUDA Graph: attn + moe_router + moe_preprocess
Priority: pipeline layout and grouped GEMM quality

Cross-Cutting Patterns

PP layout

  • E = embedding
  • t = transformer
  • m = MTP
  • L = loss
  • | = stage boundary

The biggest platform difference is usually not just the dispatcher. It is the combination of dispatcher, PP shape, and whether VPP keeps each stage balanced.

Recompute strategy

Memory pressureStarting point
lownone or a very narrow selective set
moderatemoe_act, mlp, norm, or similar selective modules
highmodel-specific up-projection plus selective MoE and MLP modules
extreme or long-contextfull recompute only if the selective path still does not fit

Environment variables

CUDA_DEVICE_MAX_CONNECTIONS=1
CUDA_DEVICE_MAX_CONNECTIONS=32   # common when EP overlap and CUDA graphs are combined
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
NCCL_GRAPH_REGISTER=0

CPU-side tuning

On GB200 and GB300, CPU affinity and general host-overhead cleanup can move the needle almost as much as a dispatcher swap. Treat them as first-class tuning work, not as afterthoughts.

Pitfalls

  1. Do not cargo-cult a tracker row: the winning config usually depends on routing mode, container, and PP layout as much as on hardware name.

  2. Container quality matters: large regressions can come from the software stack rather than the model recipe.

  3. VPP must be intentional: a bad VPP split can erase the gain from a better dispatcher.

  4. Compare absolute throughput, not only MFU: MFU can mislead when switching between BF16, FP8, and other precision modes.

  5. Force-balance routing is the safer benchmark default: keep routing mode fixed when comparing hardware or dispatcher stacks.

Bundled with this artifact

5 files

Reference files that ship alongside this artifact. Agents pull these in only when the task needs them.

More on the bench

SKILL0

Tensorflow And Deep Learning Rules

TensorFlow and deep learning rules for building, training, evaluating, and deploying neural network models

data-science-ml+1
0
SKILL0

Fortran Programming Guidelines

Modern Fortran rules for scientific computing, modules, explicit interfaces, kind parameters, memory safety, and testing

software-engineering+1
0
SKILL0

Automl And Hyperparameter Optimization Rules

AutoML and hyperparameter optimization rules for Python ML projects using Ray Tune, Optuna, PyCaret, and time-series AutoML libraries

data-science-ml+1
0