Nemo Mbridge Perf Tp Dp Comm Overlap

Operational guide for enabling TP, DP, and PP communication overlap in Megatron-Bridge, including config knobs, code anchors, pitfalls, and verification.

Published by @NVIDIA

TP / DP / PP Communication Overlap Skill

For stable background and recommendation level, see:

  • @docs/training/communication-overlap.md

Enablement

Minimal Bridge override:

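from megatron.bridge.training.comm_overlap import CommOverlapConfig

# cfg is assumed to be an existing Bridge training config (for example, from a recipe).
# TP overlap needs TP >= 2 and sequence parallelism; otherwise Bridge turns it off.
cfg.model.tensor_model_parallel_size = 4
cfg.model.sequence_parallel = True
# PP > 1 together with VPP > 1 is what makes Bridge select overlap_p2p_comm.
cfg.model.pipeline_model_parallel_size = 4
cfg.model.virtual_pipeline_model_parallel_size = 2

cfg.comm_overlap = CommOverlapConfig(
    tp_comm_overlap=True,
)

# DP overlap: hide gradient reduce and parameter gather behind compute.
cfg.ddp.use_distributed_optimizer = True
cfg.ddp.overlap_grad_reduce = True
cfg.ddp.overlap_param_gather = True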

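Optional TP preset (the name encodes the shape it targets: BF16, H100, hidden size 12288, TP 4, micro-batch size 1, sequence length 2048):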

from megatron.bridge.training.comm_overlap import userbuffers_bf16_h100_h12288_tp4_mbs1_seqlen2048

cfg.comm_overlap.tp_comm_overlap_cfg = userbuffers_bf16_h100_h12288_tp4_mbs1_seqlen2048

Precision knobs belong on the mixed precision config, not the DDP config:

cfg.mixed_precision.grad_reduce_in_fp32 = False
cfg.mixed_precision.fp8_param_gather = False

Code Anchors

Bridge overlap gating:

if self.user_comm_overlap_cfg.tp_comm_overlap is True:
    if model_cfg.tensor_model_parallel_size < 2:
        ...
    elif not model_cfg.sequence_parallel:
        ...
    elif not HAVE_TE:
        ...
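The gating order above can be mirrored in a small pre-flight check that runs before launch. This is a minimal sketch, not Bridge's own code: `tp_overlap_will_stay_enabled` and its return shape are hypothetical names introduced for illustration, and the Transformer Engine probe assumes the package is importable as `transformer_engine`.

```python
import importlib.util


def tp_overlap_will_stay_enabled(tp_size, sequence_parallel, have_te):
    """Mirror the three gating checks in order; return (enabled, reason)."""
    if tp_size < 2:
        return False, "tensor_model_parallel_size < 2"
    if not sequence_parallel:
        return False, "sequence_parallel is False"
    if not have_te:
        return False, "Transformer Engine not available"
    return True, "ok"


# Assumption: Transformer Engine is importable as `transformer_engine`.
have_te = importlib.util.find_spec("transformer_engine") is not None

enabled, reason = tp_overlap_will_stay_enabled(4, True, have_te)
print(f"tp_comm_overlap stays enabled: {enabled} ({reason})")
```

Running a check like this before submitting a job makes a disabled overlap visible up front instead of showing up later as an unexplained throughput gap.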

PP overlap selection:

if model_cfg.pipeline_model_parallel_size > 1:
    if vp_size > 1:
        comm_overlap_cfg.overlap_p2p_comm = True
        comm_overlap_cfg.batch_p2p_comm = False
    else:
        comm_overlap_cfg.overlap_p2p_comm = False
        comm_overlap_cfg.batch_p2p_comm = True
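To predict which P2P mode a given layout will get, the selection above can be restated as a pure function. This is a sketch under stated assumptions: `select_p2p_mode` is a hypothetical helper, an unset VPP is assumed to arrive as `None`, and the PP <= 1 case returns `None` because the excerpt does not show that branch.

```python
def select_p2p_mode(pp_size, vp_size):
    """Return the P2P flags the excerpt would pick, or None when PP <= 1."""
    if pp_size <= 1:
        # Not covered by the excerpt; nothing is inferred for this case.
        return None
    # Assumption: an unset virtual pipeline size is represented as None.
    if vp_size is not None and vp_size > 1:
        return {"overlap_p2p_comm": True, "batch_p2p_comm": False}
    return {"overlap_p2p_comm": False, "batch_p2p_comm": True}


print(select_p2p_mode(pp_size=4, vp_size=2))
print(select_p2p_mode(pp_size=4, vp_size=None))
```

The two flags are always set as opposites here: overlapped P2P with VPP, batched P2P without it.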

DP overlap defaults:

if self.data_parallel_size > 1:
    comm_overlap_cfg.bucket_size = 128 * 1024 * 1024
    comm_overlap_cfg.overlap_grad_reduce = True
    comm_overlap_cfg.overlap_param_gather = True
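These defaults only apply when the data-parallel size exceeds 1, so it helps to work out that size from the GPU count first. The sketch below assumes the usual Megatron layout where the data-parallel size is the world size divided by TP x PP x CP; `data_parallel_size` is a hypothetical helper, not a Bridge function.

```python
def data_parallel_size(world_size, tp, pp, cp=1):
    """Data-parallel size left over after model parallelism is carved out."""
    model_parallel = tp * pp * cp
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by TP*PP*CP={model_parallel}"
        )
    return world_size // model_parallel


# With TP=4 and PP=4, 16 GPUs leave no data parallelism at all.
print(data_parallel_size(world_size=16, tp=4, pp=4))  # 1
print(data_parallel_size(world_size=64, tp=4, pp=4))  # 4
```

In the 16-GPU case the `data_parallel_size > 1` branch is never taken, so the DP overlap defaults are not applied.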

Launch-time env tuning:

executor.env_vars["CUDA_DEVICE_MAX_CONNECTIONS"] = str(cuda_device_max_connections)
...
executor.env_vars["NVTE_FWD_LAYERNORM_SM_MARGIN"] = str(self.layernorm_sm_margin)
executor.env_vars["NVTE_BWD_LAYERNORM_SM_MARGIN"] = str(self.layernorm_sm_margin)
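The same wiring can be sketched without an executor to see what ends up in the environment. `build_overlap_env` is a hypothetical helper, the values passed in are placeholders rather than recommendations, and only the three variables visible in the excerpt are covered; whatever the elided lines set is left out.

```python
def build_overlap_env(cuda_device_max_connections, layernorm_sm_margin):
    """Build the env var mapping shown in the excerpt; values must be strings."""
    return {
        "CUDA_DEVICE_MAX_CONNECTIONS": str(cuda_device_max_connections),
        "NVTE_FWD_LAYERNORM_SM_MARGIN": str(layernorm_sm_margin),
        "NVTE_BWD_LAYERNORM_SM_MARGIN": str(layernorm_sm_margin),
    }


# Placeholder values for illustration only.
env = build_overlap_env(cuda_device_max_connections=1, layernorm_sm_margin=16)
for name, value in env.items():
    print(f"{name}={value}")
```

Both LayerNorm margins are driven by a single setting, so the forward and backward values cannot drift apart.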

Pitfalls

  1. TP overlap disables itself instead of raising an error if tensor_model_parallel_size < 2, sequence_parallel=False, or Transformer Engine is unavailable.
  2. PP overlap is not enabled for all PP cases. Bridge only auto-selects overlap_p2p_comm=True when PP > 1 and VPP > 1; with PP > 1 and no VPP it selects batch_p2p_comm=True instead.
  3. bucket_size is a parameter-count knob, not a byte-size knob: the default 128 * 1024 * 1024 means 134,217,728 parameters per bucket, not 128 MiB.
  4. grad_reduce_in_fp32 and fp8_param_gather should be set through cfg.mixed_precision rather than tuned directly on the DDP config.
  5. CUDA_DEVICE_MAX_CONNECTIONS and LayerNorm SM margin are launch-time plugin settings, not CommOverlapConfig fields.
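Because bucket_size counts parameters, the bytes moved per bucket depend on the element width of the buffer being communicated. The arithmetic below is a minimal sketch; `bucket_bytes` is a hypothetical helper, and the 2-byte and 4-byte widths stand for BF16 and FP32 respectively.

```python
MIB = 1024 * 1024

# Default from the DP overlap snippet: a parameter count, not a byte count.
BUCKET_SIZE_PARAMS = 128 * 1024 * 1024


def bucket_bytes(num_params, bytes_per_element):
    """Bytes occupied by one bucket for a given element width."""
    return num_params * bytes_per_element


print(BUCKET_SIZE_PARAMS)                            # 134217728 parameters
print(bucket_bytes(BUCKET_SIZE_PARAMS, 2) // MIB)    # 256 (MiB at 2 bytes/element)
print(bucket_bytes(BUCKET_SIZE_PARAMS, 4) // MIB)    # 512 (MiB at 4 bytes/element)
```

Reading the default as "128 MiB" therefore understates the bucket by 2x or 4x, depending on the element width.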

Verification

Use the checked-in overlap unit coverage first:

uv run python -m pytest tests/unit_tests/training/test_comm_overlap.py -q

Optional second check if nemo_run is available:

uv run python -m pytest tests/unit_tests/recipes/test_run_plugins.py -q

Success criteria:

  • first command reports 26 passed
  • second command validates plugin-owned env wiring when not skipped
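If the first criterion needs to be checked in a script rather than by eye, the pass count can be pulled from pytest's summary line. This is a sketch only: `passed_count` is a hypothetical helper, and it assumes the summary contains the usual "N passed" phrase.

```python
import re


def passed_count(pytest_output):
    """Return the number after 'N passed' in pytest output, or 0 if absent."""
    match = re.search(r"(\d+) passed", pytest_output)
    return int(match.group(1)) if match else 0


sample = "26 passed in 4.20s"  # illustrative summary line, not a captured log
print(passed_count(sample) == 26)
```

The text to parse can come from `subprocess.run(..., capture_output=True, text=True).stdout` wrapped around the first command.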
