Nemo Mbridge Perf Expert Parallel Overlap

Validate and use MoE expert-parallel communication overlap in Megatron-Bridge, including overlap_moe_expert_parallel_comm, delay_wgrad_compute, and flex dispatcher backends such as DeepEP and HybridEP.

Published by @NVIDIA

MoE Expert-Parallel Overlap Skill

References

  • Stable docs: @docs/training/communication-overlap.md
  • Structured metadata: @skills/nemo-mbridge-perf-expert-parallel-overlap/card.yaml

What It Is

Expert-parallel (EP) overlap hides the cost of the token dispatch/combine all-to-all communication by running it concurrently with expert FFN compute. Optionally, delayed expert weight-gradient computation (delay_wgrad_compute) adds further overlap by deferring wgrad so that it runs alongside the next layer's forward.

Bridge supports two dispatcher paths:

| Dispatcher | Backend | When to use |
|---|---|---|
| alltoall | Standard MoE all-to-all | Default, broadest compatibility |
| flex | DeepEP or HybridEP | Higher overlap on Ampere/Hopper/Blackwell |

Quick Decision

Use EP overlap when:

  • the model is MoE with EP > 1
  • expert dispatch/combine communication is a meaningful part of step time
  • you have memory headroom and are tuning for throughput

Prefer:

  • alltoall dispatcher for the first rollout (broader compatibility)
  • flex + DeepEP/HybridEP when running on supported GPUs and seeking additional gains

Avoid EP overlap when:

  • full activation recompute is enabled
  • moe_shared_expert_overlap is enabled
  • the run is still being brought up for correctness
  • PyTorch < 2.6.0

Expected outcome:

  • if all-to-all dispatch is a clear profile bottleneck, overlap can produce a modest to meaningful speedup (about 1.32x in the 16-GPU Qwen3 MoE 30B-A3B benchmark in this document)
  • if the run is tiny, communication-light, or dominated by another wall, the gain may be negligible
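A rough way to judge which of these two outcomes to expect is to bound the gain from profile numbers. The sketch below is a simplified model, not a Megatron-Bridge API: the function name and its inputs are illustrative, and it assumes that at most min(all-to-all time, expert compute time) can be hidden per step.

```python
def ep_overlap_speedup_bound(step_time_s: float, a2a_time_s: float, expert_compute_s: float) -> float:
    """Best-case speedup if all-to-all time is hidden behind expert compute.

    Simplifying assumption: communication can only hide behind expert compute,
    so the time saved per step is at most min(a2a_time_s, expert_compute_s).
    """
    if step_time_s <= 0 or a2a_time_s < 0 or expert_compute_s < 0:
        raise ValueError("times must be non-negative and step_time_s must be positive")
    if a2a_time_s + expert_compute_s > step_time_s:
        raise ValueError("a2a and expert compute cannot exceed the total step time")
    hidden_s = min(a2a_time_s, expert_compute_s)
    return step_time_s / (step_time_s - hidden_s)


# Communication-heavy step: a quarter of the step is all-to-all.
heavy = ep_overlap_speedup_bound(step_time_s=10.0, a2a_time_s=2.5, expert_compute_s=5.0)
# Communication-light step: all-to-all is about 1% of the step.
light = ep_overlap_speedup_bound(step_time_s=10.0, a2a_time_s=0.1, expert_compute_s=5.0)
print(f"heavy: {heavy:.3f}x, light: {light:.3f}x")
```

Real runs land below this bound because overlap is never perfect, so treat the result as a ceiling when deciding whether the tuning effort is justified.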

Correctness-First alltoall Benchmark

For the plain EP-overlap isolation benchmark, keep flex dispatch and delayed wgrad disabled. The measured shape was Qwen3 MoE 30B-A3B SFT on 16 H100 GPUs: EP=16, alltoall, BF16, global batch size 1024, CUDA graphs disabled, moe_permute_fusion=false, measured over iterations 3-8.

Use these overrides for the plain-overlap case:

--cuda_graph_impl none \
--moe_flex_dispatcher_backend None \
--moe_a2a_overlap false \
comm_overlap.overlap_moe_expert_parallel_comm=true \
comm_overlap.delay_wgrad_compute=false \
model.moe_shared_expert_overlap=false

Do not use --moe_a2a_overlap true for this isolation test: the performance harness helper enables both overlap_moe_expert_parallel_comm and delay_wgrad_compute, so it does not isolate plain EP overlap.

Steady-window timing from that benchmark:

| Case | Steady mean | Relative |
|---|---|---|
| no EP overlap | 41.25s | 1.000x |
| EP overlap | 31.31s | 1.317x |
| EP overlap plus delay_wgrad_compute | 31.20s | 1.322x |

This is evidence for enabling plain EP overlap on this inter-node all-to-all shape. It does not show a meaningful independent win from delayed wgrad, and it does not validate fused MoE permutation because that path was disabled for the runtime stack.
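The relative column and the size of the delayed-wgrad increment can be recomputed from the steady means. The snippet below uses only the three timings reported for this benchmark; the variable names are illustrative.

```python
# Steady-window mean step times (seconds) from the benchmark above.
steady_mean_s = {
    "no EP overlap": 41.25,
    "EP overlap": 31.31,
    "EP overlap plus delay_wgrad_compute": 31.20,
}

baseline_s = steady_mean_s["no EP overlap"]

# Relative throughput versus the no-overlap baseline.
relative = {case: round(baseline_s / t, 3) for case, t in steady_mean_s.items()}

# Extra gain from delayed wgrad on top of plain EP overlap.
wgrad_increment = (
    steady_mean_s["EP overlap"] / steady_mean_s["EP overlap plus delay_wgrad_compute"] - 1.0
)

for case, value in relative.items():
    print(f"{case}: {value:.3f}x")
print(f"delayed wgrad over plain overlap: {wgrad_increment:.2%}")
```

The increment from delayed wgrad comes out well under one percent, which is why the table supports plain EP overlap but not an independent delayed-wgrad win.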

Enablement

alltoall dispatcher

cfg.comm_overlap.overlap_moe_expert_parallel_comm = True
cfg.comm_overlap.delay_wgrad_compute = False
cfg.model.moe_shared_expert_overlap = False

cfg.model.expert_model_parallel_size = 8
cfg.model.num_moe_experts = 64
cfg.model.moe_token_dispatcher_type = "alltoall"
cfg.model.bf16 = True
cfg.model.fp16 = False

Enable delay_wgrad_compute=True only after the plain overlap path is known to work and its extra compatibility constraints have been checked.

flex dispatcher (DeepEP or HybridEP)

from megatron.bridge.training.flex_dispatcher_backend import apply_flex_dispatcher_backend

cfg.comm_overlap.overlap_moe_expert_parallel_comm = True
cfg.comm_overlap.delay_wgrad_compute = True
cfg.model.moe_shared_expert_overlap = False

apply_flex_dispatcher_backend(cfg.model, moe_flex_dispatcher_backend="deepep")
# or: apply_flex_dispatcher_backend(cfg.model, moe_flex_dispatcher_backend="hybridep")

Compatibility And Constraints

  • expert_model_parallel_size > 1
  • num_moe_experts > 1
  • moe_token_dispatcher_type must be "alltoall" or "flex"
  • moe_shared_expert_overlap = False
  • Base precision is BF16 or FP16
  • PyTorch >= 2.6.0
  • If PP > 1, virtual_pipeline_model_parallel_size must be set
  • recompute_granularity != "full", recompute_method = None, recompute_num_layers = None
  • mtp_num_layers must be None or 1
  • delay_wgrad_compute requires overlap_moe_expert_parallel_comm as a prerequisite
  • delay_wgrad_compute with overlap_grad_reduce requires TE >= 2.7.0
  • delay_wgrad_compute with gradient_accumulation_fusion requires TE >= 2.7.0
  • CUDA graph attn scope + delay_wgrad_compute requires TE >= 2.12.0, gradient_accumulation_fusion = True, and no attention bias
  • DeepEP: Ampere, Hopper, B200, B300 GPUs only
  • HybridEP: Ampere, Hopper, B200, B300, GB200/GB300 with NVL72
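The core constraints above can be checked before launching a job. The following is a minimal pre-flight sketch, not the Bridge validation code: OverlapSettings, find_violations, and version_tuple are hypothetical names, the field pipeline_model_parallel_size is an assumed name for PP, and the TE-version, CUDA-graph, and GPU-architecture rules are deliberately left out.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple


def version_tuple(version: str) -> Tuple[int, int, int]:
    """Parse strings such as '2.6.0+cu124' into a comparable (major, minor, patch)."""
    parts = []
    for piece in version.split("+")[0].split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    while len(parts) < 3:
        parts.append(0)
    return (parts[0], parts[1], parts[2])


@dataclass
class OverlapSettings:
    """Hypothetical flat view of the fields named in the constraint list."""
    expert_model_parallel_size: int = 8
    num_moe_experts: int = 64
    moe_token_dispatcher_type: str = "alltoall"
    moe_shared_expert_overlap: bool = False
    bf16: bool = True
    fp16: bool = False
    pipeline_model_parallel_size: int = 1  # assumed field name for PP
    virtual_pipeline_model_parallel_size: Optional[int] = None
    recompute_granularity: Optional[str] = None
    recompute_method: Optional[str] = None
    recompute_num_layers: Optional[int] = None
    mtp_num_layers: Optional[int] = None
    overlap_moe_expert_parallel_comm: bool = True
    delay_wgrad_compute: bool = False
    torch_version: str = "2.6.0"


def find_violations(s: OverlapSettings) -> List[str]:
    """Return human-readable messages for every violated constraint."""
    problems: List[str] = []
    if s.delay_wgrad_compute and not s.overlap_moe_expert_parallel_comm:
        problems.append("delay_wgrad_compute requires overlap_moe_expert_parallel_comm")
    if not s.overlap_moe_expert_parallel_comm:
        return problems  # the remaining checks apply only when EP overlap is on
    pp = s.pipeline_model_parallel_size
    checks = [
        (s.expert_model_parallel_size > 1, "expert_model_parallel_size must be > 1"),
        (s.num_moe_experts > 1, "num_moe_experts must be > 1"),
        (s.moe_token_dispatcher_type in ("alltoall", "flex"),
         'moe_token_dispatcher_type must be "alltoall" or "flex"'),
        (not s.moe_shared_expert_overlap, "moe_shared_expert_overlap must be False"),
        (s.bf16 or s.fp16, "base precision must be BF16 or FP16"),
        (version_tuple(s.torch_version) >= (2, 6, 0), "PyTorch must be >= 2.6.0"),
        (pp <= 1 or s.virtual_pipeline_model_parallel_size is not None,
         "virtual_pipeline_model_parallel_size must be set when PP > 1"),
        (s.recompute_granularity != "full", 'recompute_granularity must not be "full"'),
        (s.recompute_method is None, "recompute_method must be None"),
        (s.recompute_num_layers is None, "recompute_num_layers must be None"),
        (s.mtp_num_layers in (None, 1), "mtp_num_layers must be None or 1"),
    ]
    problems.extend(message for ok, message in checks if not ok)
    return problems


for message in find_violations(OverlapSettings(expert_model_parallel_size=1)):
    print("violation:", message)
```

A check like this only mirrors the list; the assertions raised during CommOverlapConfig finalization remain the source of truth.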

Minimal Working Config

cfg.comm_overlap.overlap_moe_expert_parallel_comm = True
cfg.comm_overlap.delay_wgrad_compute = False
cfg.model.expert_model_parallel_size = 4
cfg.model.num_moe_experts = 64
cfg.model.moe_token_dispatcher_type = "alltoall"
cfg.model.moe_shared_expert_overlap = False
cfg.model.bf16 = True

Use this as the correctness-first starting point. Add delayed wgrad, flex dispatch, and CUDA-graph interactions only after the plain overlap path is known to work.

Minimal Runnable Command

Performance harness example inside a Slurm allocation. Keep the model, parallelism, dispatcher, and runtime fixed, and vary only the two overlap overrides:

uv run python scripts/performance/run_script.py \
  -m qwen \
  -mr qwen3_30b_a3b \
  --task pretrain \
  -g h100 \
  -c bf16 \
  -ng 16 \
  -gn 8 \
  --max_steps 8 \
  --config_variant v1 \
  --cuda_graph_impl none \
  --moe_flex_dispatcher_backend None \
  --moe_a2a_overlap false \
  --tokenizer_type NullTokenizer \
  comm_overlap.overlap_moe_expert_parallel_comm=true \
  comm_overlap.delay_wgrad_compute=false \
  model.moe_shared_expert_overlap=false

Do not use --moe_a2a_overlap true when separating plain EP overlap from delayed wgrad: the performance harness helper enables both overlap_moe_expert_parallel_comm and delay_wgrad_compute.

Unit test verification:

uv run python -m pytest \
  tests/unit_tests/training/test_comm_overlap.py -k "moe" \
  tests/unit_tests/training/test_deepep.py -q

Verification

Unit tests

uv run python -m pytest \
  tests/unit_tests/training/test_comm_overlap.py \
  tests/unit_tests/training/test_deepep.py -q

Log checks

After a successful run with EP overlap:

  1. Confirm no assertion errors during CommOverlapConfig finalization
  2. Confirm overlap_moe_expert_parallel_comm appears as True in the logged config
  3. If using flex dispatcher, confirm moe_token_dispatcher_type = "flex" and the correct backend in logs
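These three checks can be scripted against a captured log. The sketch below assumes the config is logged as key/value pairs such as `key: value`, `key=value`, or `key ..... value`; the actual Bridge log layout may differ, and logged_value and check_ep_overlap_log are hypothetical helper names.

```python
import re
from typing import List, Optional


def logged_value(log_text: str, key: str) -> Optional[str]:
    """Return the last value logged for `key`, or None if the key never appears."""
    # Accepts `key: v`, `key=v`, `key = "v"`, and dotted padding `key .... v`.
    pattern = r"\b" + re.escape(key) + r"\b\s*(?:[:=]|\.{2,})\s*[\"']?([\w.\-]+)"
    values = re.findall(pattern, log_text)
    return values[-1] if values else None


def check_ep_overlap_log(log_text: str, expect_flex: bool = False) -> List[str]:
    """Apply the log checks listed above and return a list of problems."""
    problems: List[str] = []
    if "AssertionError" in log_text:
        problems.append("assertion error found in log")
    overlap = logged_value(log_text, "overlap_moe_expert_parallel_comm")
    if overlap is None or overlap.lower() != "true":
        problems.append(f"overlap_moe_expert_parallel_comm logged as {overlap!r}")
    if expect_flex:
        dispatcher = logged_value(log_text, "moe_token_dispatcher_type")
        if dispatcher != "flex":
            problems.append(f"moe_token_dispatcher_type logged as {dispatcher!r}")
    return problems


# Illustrative log excerpt, not real Bridge output.
sample_log = """
comm_overlap:
  overlap_moe_expert_parallel_comm: True
  delay_wgrad_compute: False
model:
  moe_token_dispatcher_type: alltoall
"""
print(check_ep_overlap_log(sample_log))
print(check_ep_overlap_log(sample_log, expect_flex=True))
```

If the regular expression does not match the real log format, adjust the separator group rather than loosening the value check.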

Success criteria

  • Config validation passes for the selected dispatcher and overlap settings
  • Training runs complete without hangs or assertion failures
  • Throughput improves or at least does not regress for the target workload
  • Loss trajectory matches baseline (overlap should not affect convergence)
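The throughput and loss criteria can be made concrete with two small helpers. This is a sketch under stated assumptions: the function names are illustrative, iterations are treated as 1-indexed, the default window of iterations 3-8 follows the benchmark in this document, and any loss tolerance you apply to the result is a judgment call rather than a documented threshold.

```python
from statistics import fmean
from typing import Sequence


def steady_mean(step_times_s: Sequence[float], first_iter: int = 3, last_iter: int = 8) -> float:
    """Mean step time over 1-indexed iterations first_iter..last_iter (inclusive)."""
    window = step_times_s[first_iter - 1:last_iter]
    if first_iter < 1 or len(window) != last_iter - first_iter + 1:
        raise ValueError("not enough iterations for the requested steady window")
    return fmean(window)


def max_relative_loss_gap(baseline: Sequence[float], candidate: Sequence[float]) -> float:
    """Largest per-iteration relative difference between two loss curves."""
    if len(baseline) != len(candidate) or not baseline:
        raise ValueError("loss curves must be non-empty and the same length")
    return max(abs(c - b) / max(abs(b), 1e-12) for b, c in zip(baseline, candidate))


# Illustrative numbers only, not measured data.
baseline_times = [60.0, 45.0, 41.0, 41.5, 41.2, 41.3, 41.1, 41.4]
overlap_times = [58.0, 40.0, 31.2, 31.4, 31.3, 31.3, 31.2, 31.4]
speedup = steady_mean(baseline_times) / steady_mean(overlap_times)
gap = max_relative_loss_gap([2.50, 2.31, 2.20], [2.50, 2.31, 2.21])
print(f"steady speedup: {speedup:.3f}x, max relative loss gap: {gap:.4f}")
```

Excluding the first iterations keeps warmup and compilation cost out of the comparison, which matters when the measured window is only a few steps long.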

Code Anchors

Bridge overlap validation

if self.user_comm_overlap_cfg.overlap_moe_expert_parallel_comm is True:
    assert model_cfg.expert_model_parallel_size > 1, ...
    assert model_cfg.num_moe_experts > 1, ...
    assert model_cfg.moe_token_dispatcher_type in ["alltoall", "flex"], ...
    assert model_cfg.bf16 or model_cfg.fp16, ...
    assert is_torch_min_version("2.6.0"), ...
    # ... PP + VPP check, recompute checks, shared_expert_overlap check ...

Delayed wgrad validation

if self.user_comm_overlap_cfg.delay_wgrad_compute is True:
    # TE version checks for overlap_grad_reduce and gradient_accumulation_fusion
    # CUDA graph scope validations for delayed wgrad
    assert overlap_moe_expert_parallel_comm, ...

Flex-dispatcher activation

def apply_flex_dispatcher_backend(...):
    # GPU architecture check for DeepEP / HybridEP
    model_config.moe_token_dispatcher_type = "flex"
    model_config.moe_flex_dispatcher_backend = moe_flex_dispatcher_backend
    model_config.moe_shared_expert_overlap = False

Perf harness override

def _set_moe_a2a_overlap_overrides(recipe, moe_a2a_overlap=False):
    if moe_a2a_overlap:
        recipe.comm_overlap.overlap_moe_expert_parallel_comm = True
        recipe.comm_overlap.delay_wgrad_compute = True
        recipe.model.moe_shared_expert_overlap = False

Tests

| File | Coverage |
|---|---|
| tests/unit_tests/training/test_comm_overlap.py | EP overlap validation, delayed wgrad, CUDA graph + wgrad interaction |
| tests/unit_tests/training/test_deepep.py | DeepEP/HybridEP helper activation and GPU gating |

Failure Diagnosis

| Symptom | Likely Cause | How To Confirm | Fix |
|---|---|---|---|
| assert expert_model_parallel_size > 1 | EP not configured | Check expert_model_parallel_size | Set EP > 1 |
| assert moe_token_dispatcher_type | Wrong dispatcher | Check dispatcher type | Use "alltoall" or "flex" |
| assert on BF16/FP16 | Wrong precision | Check bf16 and fp16 | Set bf16 = True |
| hang during training | PyTorch < 2.6 | Check PyTorch version | Upgrade to >= 2.6.0 |
| assert virtual_pipeline_model_parallel_size | PP > 1 without VPP | Check PP and VPP config | Set VPP when PP > 1 |
| assert recompute_granularity | Full recompute enabled | Check recompute settings | Disable full recompute |
| assert overlap_moe_expert_parallel_comm required | delayed wgrad without EP overlap | Check delay_wgrad_compute without overlap | Enable EP overlap first |
| assert gradient_accumulation_fusion | CUDA graph + delayed wgrad | Check graph scope + wgrad settings | Enable gradient_accumulation_fusion |
| assert on attention bias | CUDA graph attn + delayed wgrad + bias | Check add_bias_linear / add_qkv_bias | Disable attention bias |
| no throughput gain from flex dispatcher | apply_flex_dispatcher_backend not called | Check moe_token_dispatcher_type in logs | Call apply_flex_dispatcher_backend(...) |
| DeepEP/HybridEP silently skipped | Unsupported GPU | Check warning logs | Run on Ampere/Hopper/Blackwell |

Known Limitations

  • Setting moe_flex_dispatcher_backend alone does not activate flex dispatch; you must call apply_flex_dispatcher_backend(...).
  • Public recipes are often conservative and leave MoE overlap disabled by default.
  • End-to-end throughput gains have not yet been measured in a controlled Bridge experiment for every model family. Code validation is stronger than a single universal performance claim.
  • MoE overlap and shared-expert overlap are mutually exclusive.
  • CUDA graph plus delayed wgrad is a multi-constraint path that requires careful TE version and scope validation.
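The first limitation is easy to miss because the backend field can look correct while the dispatcher type is still "alltoall". A small guard like the one below can catch that before a run; flex_dispatch_is_active is a hypothetical helper, and the SimpleNamespace objects stand in for a real model config.

```python
from types import SimpleNamespace

# Backend names used with apply_flex_dispatcher_backend in this document.
FLEX_BACKENDS = ("deepep", "hybridep")


def flex_dispatch_is_active(model_cfg) -> bool:
    """True only if both the dispatcher type and a flex backend are set."""
    return (
        getattr(model_cfg, "moe_token_dispatcher_type", None) == "flex"
        and getattr(model_cfg, "moe_flex_dispatcher_backend", None) in FLEX_BACKENDS
    )


# Backend set by hand, dispatcher type left at its previous value.
backend_only = SimpleNamespace(
    moe_token_dispatcher_type="alltoall",
    moe_flex_dispatcher_backend="deepep",
)

# State that apply_flex_dispatcher_backend is documented to produce.
fully_applied = SimpleNamespace(
    moe_token_dispatcher_type="flex",
    moe_flex_dispatcher_backend="deepep",
)

print(flex_dispatch_is_active(backend_only))
print(flex_dispatch_is_active(fully_applied))
```

On an unsupported GPU the real helper may skip activation, so running a guard of this kind after the call is a cheap way to notice a silent fallback.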
