Nemo Mbridge Perf Parallelism Strategies

Operational guide for choosing and combining parallelism strategies in Megatron Bridge, including sizing rules, hardware topology mapping, and combined parallelism configuration.

Published by @NVIDIA · from NVIDIA/skills

Parallelism Strategy Selection Skill

For stable background on each parallelism type, see:

  • @docs/parallelisms.md
  • @skills/nemo-mbridge-perf-parallelism-strategies/card.yaml

Decision by Model Size

Dense models

| Model size | GPUs | Recommended starting point |
|---|---|---|
| < 1B | 1-8 | DP only |
| 1-10B | 8-16 | TP=2-4 + DP |
| 10-70B | 16-64 | TP=4-8 + PP=2-4 + DP |
| 70-175B | 64-256 | TP=8 + PP=4-8 + DP |
| 175-500B | 256-1024 | TP=8 + PP=8-16 + CP=2 + DP |
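The dense-model table above can be read as a simple threshold lookup. The sketch below is only an illustration: `dense_starting_point` and `DENSE_STARTING_POINTS` are hypothetical names, and it assumes that a size sitting exactly on a boundary falls into the larger bucket and that sizes of 500B and above are outside the table.

```python
# Thresholds (in billions of parameters) and recommendations copied from the
# dense-model table. Names are hypothetical, not part of Megatron Bridge.
DENSE_STARTING_POINTS = [
    (1, "DP only"),
    (10, "TP=2-4 + DP"),
    (70, "TP=4-8 + PP=2-4 + DP"),
    (175, "TP=8 + PP=4-8 + DP"),
    (500, "TP=8 + PP=8-16 + CP=2 + DP"),
]


def dense_starting_point(params_billions: float) -> str:
    """Return the table's starting recommendation for a dense model size."""
    for upper_bound, recommendation in DENSE_STARTING_POINTS:
        # Assumption: upper bounds are exclusive, so 70B maps to the 70-175B row.
        if params_billions < upper_bound:
            return recommendation
    raise ValueError("model size is outside the range covered by the table")


if __name__ == "__main__":
    for size in (0.5, 7, 70, 400):
        print(f"{size}B -> {dense_starting_point(size)}")
```

The result is a starting point to profile, not a final configuration.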

MoE models

MoE parallelism differs from dense models. Because only a fraction of parameters are active per token, TP can often stay at 1 or 2 — the active parameter shard already fits on a single GPU. EP is the primary scaling dimension, with PP handling cross-node layer distribution.

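| Model (total / active) | TP | PP | EP | Notes |
|---|---|---|---|---|
| OLMoE 7B / 1B | 1 | 1 | 8 | EP only, fits single node |
| Moonlight 16B / 3B | 2 | 1 | 8 | small TP for shared layers |
| DeepSeek-V2 236B / 21B | 1 | 4 | 32 | no TP at all |
| GLM-4.5 Air 106B / 12B | 1 | 4 | 8 | no TP at all |
| Qwen3 30B-A3B | 4 | 2 | 4 | |
| GLM-4.5 355B / 32B | 2 | 8 | 16 | |
| Qwen3 235B-A22B | 4 | 16 | 8 | CP=2 for pretrain |
| DeepSeek-V3 671B / 37B | 2 | 16 | 64 | TP=2, not 8 |
| Kimi-K2 1T | 2 | 16 | 32 | |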

Key patterns:

  • TP is sized by active params, not total params. A 671B MoE with 37B active needs far less TP than a 70B dense model.
  • EP scales with expert count. Common: EP = num_experts or num_experts / experts_per_gpu.
  • PP handles depth. Large MoE models use PP=8-16 across nodes.
  • ETP (expert tensor parallelism) is rarely used. Llama 4 is an exception (ETP=4).

These are starting points, not hard rules. Always profile the first iteration to verify memory and communication.
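The EP pattern above (EP = num_experts, or num_experts / experts_per_gpu) is plain integer arithmetic. A minimal sketch follows, assuming experts must divide evenly across EP ranks. `expert_parallel_size` is a hypothetical helper, and the 64-expert example is illustrative rather than taken from a specific recipe.

```python
def expert_parallel_size(num_experts: int, experts_per_gpu: int = 1) -> int:
    """EP size that places `experts_per_gpu` experts on each EP rank."""
    if experts_per_gpu < 1:
        raise ValueError("experts_per_gpu must be at least 1")
    # Assumption: experts are split evenly, so the division must be exact.
    if num_experts % experts_per_gpu != 0:
        raise ValueError("num_experts must be divisible by experts_per_gpu")
    return num_experts // experts_per_gpu


if __name__ == "__main__":
    # Illustrative values: 64 experts, 8 experts held by each GPU.
    print(expert_parallel_size(64, experts_per_gpu=8))
```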

Decision by Hardware Topology

Single node with NVLink:

cfg.model.tensor_model_parallel_size = 8

Multiple nodes with InfiniBand:

cfg.model.tensor_model_parallel_size = 8
cfg.model.pipeline_model_parallel_size = N

Limited network (Ethernet):

cfg.model.tensor_model_parallel_size = 4
cfg.model.pipeline_model_parallel_size = M

The stable rule is: keep TP within a single NVLink domain. Use PP or DP for cross-node scaling. TP across nodes is almost always a performance loss.
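The topology rule can be checked before launching a job. The sketch below assumes an NVLink domain of 8 GPUs (one typical node). Both `tp_fits_nvlink_domain` and the `gpus_per_nvlink_domain` default are assumptions to adjust for the actual hardware.

```python
def tp_fits_nvlink_domain(tp: int, gpus_per_nvlink_domain: int = 8) -> bool:
    """True if every TP group can sit inside one NVLink domain.

    Assumption: TP groups are packed onto consecutive ranks, so TP must also
    divide the domain size or some group would straddle two domains.
    """
    if tp < 1:
        raise ValueError("tp must be at least 1")
    return tp <= gpus_per_nvlink_domain and gpus_per_nvlink_domain % tp == 0


if __name__ == "__main__":
    for tp in (2, 4, 8, 16):
        status = "ok" if tp_fits_nvlink_domain(tp) else "crosses nodes"
        print(f"TP={tp}: {status}")
```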

Decision by Sequence Length

| Sequence length | Recommendation |
|---|---|
| < 2K | standard TP + PP + DP |
| 2K-8K | add SP (sequence_parallel=True) |
| 8K-32K | add CP=2 |
| 32K+ | add CP=4-8, consider a2a+p2p for large CP |
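The sequence-length guidance can be sketched as a lookup keyed on token count. This assumes 1K = 1024 tokens and that a length sitting exactly on a boundary takes the recommendation for the longer range. `sequence_length_recommendation` is a hypothetical name.

```python
def sequence_length_recommendation(seq_length: int) -> str:
    """Map a sequence length (in tokens) to the guidance in the table above."""
    k = 1024  # Assumption: "K" in the table means 1024 tokens.
    if seq_length < 2 * k:
        return "standard TP + PP + DP"
    if seq_length < 8 * k:
        return "add SP (sequence_parallel=True)"
    if seq_length < 32 * k:
        return "add CP=2"
    return "add CP=4-8, consider a2a+p2p for large CP"


if __name__ == "__main__":
    for length in (1024, 4096, 16384, 131072):
        print(length, "->", sequence_length_recommendation(length))
```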

Combined Parallelism Enablement

3D parallelism (TP + PP + DP):

cfg.model.tensor_model_parallel_size = 4
cfg.model.pipeline_model_parallel_size = 4
cfg.model.sequence_parallel = True

4D parallelism (TP + PP + CP + DP):

cfg.model.tensor_model_parallel_size = 8
cfg.model.pipeline_model_parallel_size = 8
cfg.model.context_parallel_size = 2
cfg.model.sequence_parallel = True

MoE with EP + PP (e.g. DeepSeek-V2 236B on 128 GPUs):

cfg.model.tensor_model_parallel_size = 1
cfg.model.pipeline_model_parallel_size = 4
cfg.model.expert_model_parallel_size = 32
cfg.model.sequence_parallel = False

MoE with small TP + PP + EP (e.g. DeepSeek-V3 671B on 1024 GPUs):

cfg.model.tensor_model_parallel_size = 2
cfg.model.pipeline_model_parallel_size = 16
cfg.model.expert_model_parallel_size = 64
cfg.model.sequence_parallel = True

DP size is always implicit:

data_parallel_size = world_size / (TP * PP * CP)        # dense path
expert_data_parallel_size = world_size / (PP * EP * ETP) # MoE path
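Both formulas only make sense when the divisions are exact. The sketch below computes the two implicit sizes and rejects world sizes that do not divide evenly. `data_parallel_sizes` is a hypothetical helper, not a Megatron Bridge function.

```python
def data_parallel_sizes(world_size, tp=1, pp=1, cp=1, ep=1, etp=1):
    """Return (data_parallel_size, expert_data_parallel_size)."""
    dense_group = tp * pp * cp    # GPUs consumed by one dense replica
    expert_group = pp * ep * etp  # GPUs consumed by one expert replica
    if world_size % dense_group != 0:
        raise ValueError("world_size must be divisible by TP * PP * CP")
    if world_size % expert_group != 0:
        raise ValueError("world_size must be divisible by PP * EP * ETP")
    return world_size // dense_group, world_size // expert_group


if __name__ == "__main__":
    # DeepSeek-V2 example from above: 128 GPUs, TP=1, PP=4, EP=32.
    dp, edp = data_parallel_sizes(128, tp=1, pp=4, ep=32)
    print(f"DP={dp}, EDP={edp}")
```

For the DeepSeek-V2 layout this gives DP=32 on the dense path and EDP=1 on the expert path.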

Minimum GPU Count

The minimum GPUs needed to run a config (i.e. with DP=1, EDP=1) is not the product of all parallelism dimensions. The dense path uses a TP*CP-mesh and the MoE path uses an EP*ETP-mesh, and within each PP stage these two meshes share the same set of GPUs — they overlap, they don't multiply. Only PP stages multiply (they're disjoint slices of the model). So:

min_gpus = PP * max(TP * CP, EP * ETP)

Common simplification (WRONG): PP * TP * CP * EP * ETP. This over-allocates GPUs and shows up in many READMEs and slurm sizing tables. Don't propagate it.

The decoupling of attention and MoE parallelism (different mesh shapes for the dense and expert paths sharing the same PP-stage GPUs) is detailed in Pangu Ultra MoE (arXiv:2504.14960).

Examples

| Config | Wrong (PP·TP·CP·EP·ETP) | Correct (PP·max(TP·CP, EP·ETP)) |
|---|---|---|
| PP=1, TP=2, CP=1, EP=8, ETP=1 | 16 | 8 (1 node) |
| PP=1, TP=4, CP=1, EP=8, ETP=1 | 32 | 8 (max(4, 8)) |
| PP=1, TP=2, CP=2, EP=8, ETP=1 | 32 | 8 (max(4, 8)) |
| PP=1, TP=2, CP=4, EP=8, ETP=1 | 64 | 8 (max(8, 8)) |
| PP=2, TP=2, CP=1, EP=8, ETP=1 | 32 | 16 (2 · max(2, 8)) |
| PP=1, TP=2, CP=1, EP=4, ETP=2 | 16 | 8 (max(2, 8)) |
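The rows above can be reproduced with a few lines of Python. This is a sketch of the formula as stated in this section. `min_gpus` and `naive_product` are hypothetical helper names, and `naive_product` exists only to show the over-allocation.

```python
def min_gpus(pp, tp, cp, ep, etp):
    """Minimum GPU count: the dense and expert meshes share GPUs per PP stage."""
    return pp * max(tp * cp, ep * etp)


def naive_product(pp, tp, cp, ep, etp):
    """The wrong full product, kept only for comparison."""
    return pp * tp * cp * ep * etp


if __name__ == "__main__":
    configs = [
        dict(pp=1, tp=2, cp=1, ep=8, etp=1),
        dict(pp=1, tp=4, cp=1, ep=8, etp=1),
        dict(pp=1, tp=2, cp=2, ep=8, etp=1),
        dict(pp=1, tp=2, cp=4, ep=8, etp=1),
        dict(pp=2, tp=2, cp=1, ep=8, etp=1),
        dict(pp=1, tp=2, cp=1, ep=4, etp=2),
    ]
    for cfg in configs:
        print(cfg, "wrong:", naive_product(**cfg), "correct:", min_gpus(**cfg))
```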

Scaling above the minimum

Adding GPUs scales DP and/or EDP (the world_size must satisfy both equations simultaneously). At min_gpus the larger-mesh side has DP (or EDP) = 1 and the smaller side absorbs the slack.

Example — TP=2, CP=1, EP=8, ETP=1, PP=1:

  • 8 GPUs (min_gpus): dense DP = 8/2 = 4, MoE EDP = 8/8 = 1
  • 16 GPUs: dense DP = 8, MoE EDP = 2 → 2× global batch
  • 32 GPUs: dense DP = 16, MoE EDP = 4 → 4× global batch

When sizing slurm scripts, compute --nodes from min_gpus (or a multiple of it for higher throughput via DP/EDP).
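A rough way to turn this into a node count is to scale `min_gpus` by a whole-number multiple and divide by GPUs per node. The sketch below assumes 8 GPUs per node and that both DP formulas divide evenly. `scaling_rows` and its parameters are hypothetical names.

```python
import math


def min_gpus(pp, tp, cp, ep, etp):
    return pp * max(tp * cp, ep * etp)


def scaling_rows(pp, tp, cp, ep, etp, multiples=(1, 2, 4), gpus_per_node=8):
    """List world sizes at multiples of min_gpus with the implied DP and EDP."""
    base = min_gpus(pp, tp, cp, ep, etp)
    dense_group = tp * pp * cp
    expert_group = pp * ep * etp
    rows = []
    for k in multiples:
        world_size = base * k
        # Assumption: the world size must satisfy both DP equations exactly.
        if world_size % dense_group or world_size % expert_group:
            raise ValueError(f"world_size={world_size} does not divide evenly")
        rows.append({
            "world_size": world_size,
            "nodes": math.ceil(world_size / gpus_per_node),
            "dp": world_size // dense_group,
            "edp": world_size // expert_group,
        })
    return rows


if __name__ == "__main__":
    for row in scaling_rows(pp=1, tp=2, cp=1, ep=8, etp=1):
        print(row)
```

The `nodes` value is what would go into `--nodes` under the 8-GPUs-per-node assumption.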

When answering MoE sizing prompts, include this checklist:

  • compute min_gpus = PP * max(TP * CP, EP * ETP) with the requested values
  • explicitly reject the wrong PP * TP * CP * EP * ETP full product
  • give both DP formulas: dense world_size / (TP * PP * CP) and MoE world_size / (PP * EP * ETP)
  • mention TP topology, SP, CP divisibility, and long-sequence CP guidance

Memory Estimation

Without parallelism (70B model, FP16):

parameters:       140 GB
gradients:        140 GB
optimizer states: 280 GB (Adam)
activations:       48 GB (batch=1, seq=4K)
total:            608 GB

With TP=4, PP=4, DP=4 (64 GPUs):

parameters:        8.75 GB per GPU
gradients:         8.75 GB per GPU
optimizer states: 17.50 GB per GPU
activations:       3.00 GB per GPU
total:           ~38    GB per GPU
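The per-GPU figures above follow from dividing each component by TP * PP. The sketch below reproduces that arithmetic under the same simplifications the numbers imply: 2 bytes per parameter and gradient, optimizer state at twice the parameter size, activations split evenly across TP * PP, and DP replicas that do not reduce per-GPU memory. `per_gpu_memory_gb` is a hypothetical helper, and real usage should be confirmed by profiling.

```python
def per_gpu_memory_gb(params_billions, tp, pp,
                      bytes_per_param=2, activations_gb=48.0):
    """Back-of-envelope per-GPU memory in GB, mirroring the estimate above."""
    shards = tp * pp  # model-parallel shards; DP replicas do not reduce memory
    # Billions of parameters times bytes per parameter gives GB directly.
    params = params_billions * bytes_per_param / shards
    grads = params                # same size as the parameter shard
    optimizer = 2 * params        # assumption: two Adam moments at param precision
    activations = activations_gb / shards  # assumption: even split across shards
    return {
        "parameters": params,
        "gradients": grads,
        "optimizer_states": optimizer,
        "activations": activations,
        "total": params + grads + optimizer + activations,
    }


if __name__ == "__main__":
    for name, gb in per_gpu_memory_gb(70, tp=4, pp=4).items():
        print(f"{name:>17}: {gb:6.2f} GB")
```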

Code Anchors

Parallelism dimensions set in model provider:

model_config = GPTModelProvider(
    tensor_model_parallel_size=2,
    # ... other model parameters
)

DP size calculation:

data_parallel_size = world_size / (tensor_model_parallel_size * pipeline_model_parallel_size * context_parallel_size)

Bridge initialization wires parallelism into process groups:

parallel_state.initialize_model_parallel(
    tensor_model_parallel_size=model_config.tensor_model_parallel_size,
    pipeline_model_parallel_size=model_config.pipeline_model_parallel_size,
    ...
    context_parallel_size=model_config.context_parallel_size,
    hierarchical_context_parallel_sizes=model_config.hierarchical_context_parallel_sizes,
    expert_model_parallel_size=model_config.expert_model_parallel_size,
    ...
)

Pitfalls

  1. TP across nodes destroys throughput. Always keep TP within a single NVLink domain.

  2. PP without interleaving has large pipeline bubbles. Use virtual_pipeline_model_parallel_size when possible.

  3. SP requires tensor_model_parallel_size > 1. Enabling SP alone without TP is a config error.

  4. CP requires seq_length % (2 * context_parallel_size) == 0.

  5. EP is only for MoE models. Setting expert_model_parallel_size on a dense model is a no-op or error.

  6. The model-size-to-parallelism table above is a starting heuristic. Always profile the first iteration to check memory and communication.

  7. CUDA_DEVICE_MAX_CONNECTIONS and related env vars interact with overlap settings. See @skills/nemo-mbridge-perf-tp-dp-comm-overlap/SKILL.md.

  8. The minimum GPU count for an MoE config is PP * max(TP*CP, EP*ETP), not the product of all dimensions. The dense TP*CP-mesh and MoE EP*ETP-mesh share the same GPUs in each PP stage. See "Minimum GPU Count" section above.
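Several of these pitfalls are mechanical and can be caught before a job is submitted. The sketch below checks pitfalls 3, 4, and 5 on plain integers. `check_parallelism` and its arguments are hypothetical and are not a Megatron Bridge API. It also assumes the CP divisibility rule only matters when CP is greater than 1.

```python
def check_parallelism(tp, cp, seq_length, sequence_parallel,
                      ep=1, is_moe=False) -> list[str]:
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    # Pitfall 3: SP only makes sense together with TP > 1.
    if sequence_parallel and tp <= 1:
        problems.append("sequence_parallel=True requires tensor_model_parallel_size > 1")
    # Pitfall 4: CP needs seq_length divisible by 2 * CP (assumed relevant for CP > 1).
    if cp > 1 and seq_length % (2 * cp) != 0:
        problems.append(f"seq_length must be divisible by 2 * CP = {2 * cp}")
    # Pitfall 5: EP applies to MoE models only.
    if ep > 1 and not is_moe:
        problems.append("expert_model_parallel_size > 1 set on a dense model")
    return problems


if __name__ == "__main__":
    print(check_parallelism(tp=1, cp=2, seq_length=4098, sequence_parallel=True))
    print(check_parallelism(tp=4, cp=2, seq_length=8192, sequence_parallel=True))
```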

Verification

Quick sanity check that combined parallelism initializes correctly using the smallest available recipe with overridden parallelism:

CUDA_VISIBLE_DEVICES=0,1,2,3 uv run python -m torch.distributed.run --nproc_per_node=4 \
  scripts/training/run_recipe.py \
  --recipe llama32_1b_pretrain_config \
  model.tensor_model_parallel_size=2 \
  model.pipeline_model_parallel_size=2 \
  model.sequence_parallel=True \
  train.train_iters=3 train.global_batch_size=8 train.micro_batch_size=1 \
  scheduler.lr_warmup_iters=0 \
  validation.eval_iters=0 validation.eval_interval=0 \
  checkpoint.save_interval=0 \
  logger.log_interval=1

Success criteria:

  • exit code 0
  • finite loss at iteration 3 (e.g. lm loss: 1.003808E+01)
  • log shows TP=2 PP=2 DP=1 layout with 4 ranks
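The finite-loss criterion can be checked from captured output. The sketch below assumes the log contains entries shaped like the example above (`lm loss: 1.003808E+01`). The exact log format is an assumption, and `final_loss_is_finite` is a hypothetical helper.

```python
import math
import re

# Assumed format: "lm loss: <float>", where the value may also be nan or inf.
LOSS_PATTERN = re.compile(
    r"lm loss:\s*([-+]?(?:nan|inf|\d+(?:\.\d*)?(?:[eE][-+]?\d+)?))",
    re.IGNORECASE,
)


def extract_losses(log_text: str) -> list[float]:
    """Return every lm loss value found in the log, in order of appearance."""
    return [float(value) for value in LOSS_PATTERN.findall(log_text)]


def final_loss_is_finite(log_text: str) -> bool:
    """True if at least one loss was logged and the last one is finite."""
    losses = extract_losses(log_text)
    return bool(losses) and math.isfinite(losses[-1])


if __name__ == "__main__":
    sample = "iteration 3/3 | lm loss: 1.003808E+01 |"
    print(extract_losses(sample), final_loss_is_finite(sample))
```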
