Distributed Training in NeMo AutoModel
Purpose
NeMo AutoModel uses PyTorch-native distributed training.
All parallelism is orchestrated through a single MeshContext object that
holds device meshes, strategy configs, and axis names.
Instructions
For conceptual distributed-training questions, answer directly from the quick patterns in this skill without inspecting the repository. Start with the strategy choice, then list only the YAML fields and constraints relevant to the question.
Use direct action verbs in the final answer: recommend the strategy, show the minimal YAML, state the sizing constraint, and name the unsupported strategies. Do not discuss model onboarding, recipes, Slurm, SkyPilot, or checkpointing unless the user asks.
Examples
TP plus PP for a large multi-node model
Recommend strategy: fsdp2. Mention tp_size, pp_size, cp_size,
ep_size, and the pipeline sub-config. State that dp_size is inferred from
world_size / (tp_size * pp_size * cp_size).
distributed:
strategy: fsdp2
tp_size: 8
pp_size: 4
cp_size: 1
ep_size: 1
pipeline:
pp_schedule: interleaved1f1b
pp_microbatch_size: 1
MoE expert parallelism
Recommend strategy: fsdp2 with ep_size > 1. Say this creates a separate
moe_mesh; include the moe sub-config when relevant; state that ep_size
must divide dp_size * cp_size. Do not recommend megatron_fsdp or ddp.
distributed:
strategy: fsdp2
ep_size: 8
moe:
reshard_after_forward: false
MegatronFSDP limitations
Say no for pipeline parallelism, expert parallelism, and sequence_parallel.
Recommend fsdp2 for PP, EP, or sequence_parallel; mention that DDP is only
simple data parallelism.
Strategy Selection
Three strategies are available, selected via the distributed.strategy YAML key:
| Strategy | YAML value | Best for |
|---|---|---|
| FSDP2 | fsdp2 | General use, recommended default. Supports TP, PP, CP, EP, HSDP. |
| MegatronFSDP | megatron_fsdp | NVIDIA Megatron-style FSDP. No PP, no EP, no sequence_parallel. |
| DDP | ddp | Simple data parallelism only. No TP, PP, CP, or EP. |
Decision tree:
- Single GPU: no distributed config needed (FSDP2Manager skips parallelization when world_size=1).
- Multi-GPU single node:
fsdp2(default). Useddponly if you need the simplest possible setup. - Multi-node:
fsdp2with appropriate TP/PP sizing. - MoE models with expert parallelism:
fsdp2withep_size > 1(creates a separatemoe_mesh). - Large models (70B+):
fsdp2with PP + TP. - Long sequences (8K+): add CP (
cp_size > 1).
When answering strategy-selection questions, state the chosen distributed.strategy
first, then enumerate the YAML fields the user must set.
Quick TP + PP answer:
- Use
strategy: fsdp2; do not usemegatron_fsdpwhen pipeline parallelism is required. - Set
tp_sizefor tensor parallelism andpp_sizefor pipeline parallelism. - Add a
pipeline:sub-config withpp_scheduleandpp_microbatch_size. - Leave
dp_sizeunset ornone; it is inferred asworld_size / (tp_size * pp_size * cp_size). - Keep TP inside a fast intra-node domain when possible, and use PP across model depth for 70B+ models.
Quick MoE expert-parallel answer:
- Start with
strategy: fsdp2andep_size > 1. - Include a
moe:sub-config only whenep_size > 1; it maps toMoEParallelizerConfig. - Expect a separate
moe_meshfor expert parallelism in addition to the maindevice_mesh. - Do not recommend
megatron_fsdporddpfor expert parallelism;megatron_fsdphas no EP support. - Before finishing an MoE EP answer, explicitly state that
ep_sizemust dividedp_size * cp_sizeand thatmegatron_fsdpdoes not support EP, PP, orsequence_parallel.
YAML Config Structure
The distributed section in the recipe YAML maps directly to
parse_distributed_section() in recipes/_dist_setup.py:
distributed:
strategy: fsdp2 # fsdp2 | megatron_fsdp | ddp
dp_size: none # auto-calculated from world_size / (tp * pp * cp)
dp_replicate_size: none # FSDP2-only, for HSDP
tp_size: 1
pp_size: 1
cp_size: 1
ep_size: 1
# Strategy-specific flags (forwarded to the strategy dataclass):
sequence_parallel: false
activation_checkpointing: false
defer_fsdp_grad_sync: true # FSDP2 only
# Sub-configs (optional):
pipeline:
pp_schedule: 1f1b
pp_microbatch_size: 1
# ... see PipelineConfig fields
moe:
reshard_after_forward: false
# ... see MoEParallelizerConfig fields
The dp_size is always inferred:
dp_size = world_size / (tp_size * pp_size * cp_size)
Infrastructure Flow
YAML distributed section
-> parse_distributed_section() [recipes/_dist_setup.py]
-> setup_distributed() [recipes/_dist_setup.py]
-> create_device_mesh() [components/distributed/device_mesh.py]
-> MeshContext(...) [components/distributed/mesh.py]
-> instantiate_infrastructure() [_transformers/infrastructure.py]
-> _instantiate_distributed() -> FSDP2Manager / MegatronFSDPManager / DDPManager
-> _instantiate_pipeline() -> AutoPipeline (if pp_size > 1)
-> parallelize_fn -> MoE parallelizer (if ep_size > 1) or PP wrapper
-> apply_model_infrastructure() [_transformers/infrastructure.py]
-> _shard_pp() or _shard_ep_fsdp() (applies sharding to the model)
FSDP2 Configuration
Basic FSDP2 (data parallelism only)
distributed:
strategy: fsdp2
tp_size: 1
cp_size: 1
This auto-calculates dp_size = world_size and applies fully_shard() per
transformer block via DTensor-based sharding.
FSDP2 with Tensor Parallelism
Keep TP within a single NVLink domain (typically one node):
distributed:
strategy: fsdp2
tp_size: 4 # 2, 4, or 8 -- must divide GPUs per node
sequence_parallel: true
The TP plan is auto-selected based on the model type. Pass a custom plan via the Python API if needed:
config = FSDP2Config(sequence_parallel=True, tp_plan=my_custom_plan)
FSDP2 with Pipeline Parallelism
distributed:
strategy: fsdp2
pp_size: 2
pipeline:
pp_schedule: interleaved1f1b # 1f1b, gpipe, interleaved_1f1b, etc.
pp_microbatch_size: 4
scale_grads_in_schedule: false
The model must have a _pp_plan attribute (set on the HF model class) for
AutoPipeline to know how to split layers across stages. Models without
_pp_plan are not compatible with PP.
FSDP2 with HSDP (Hybrid Sharded Data Parallel)
Intra-node full sharding + inter-node replication via a 2D DeviceMesh:
distributed:
strategy: fsdp2
dp_replicate_size: 2 # must divide dp_size
Constraint: dp_replicate_size < dp_size (pure replication with no sharding
is not supported by FSDP2).
Activation Checkpointing
Trades compute for memory by recomputing activations during backward:
distributed:
activation_checkpointing: true
This is forwarded to the strategy config for non-EP models, or read from
MeshContext.activation_checkpointing for EP models.
Gradient Sync Deferral
FSDP2 defers gradient sync to the final micro-batch by default for communication overlap:
distributed:
defer_fsdp_grad_sync: true # default
Mixed Precision
FSDP2Config defaults to bfloat16 for all three precision knobs via
MixedPrecisionPolicy(param_dtype=bf16, reduce_dtype=bf16, output_dtype=bf16, cast_forward_inputs=True). Override via the Python API:
from torch.distributed.fsdp import MixedPrecisionPolicy
config = FSDP2Config(
mp_policy=MixedPrecisionPolicy(param_dtype=torch.float16, reduce_dtype=torch.float32),
)
Pipeline Parallelism
Requirements
- Model class must define
_pp_plan(a dict mapping module FQNs to stages). pp_size > 1in the distributed section.- A
pipelinesub-config with schedule and microbatch size.
Supported schedules
Defined in PipelineConfig.pp_schedule:
1f1b(one-forward-one-backward, default)gpipeinterleaved_1f1b/interleaved1f1blooped_bfsdfsv_schedulezero_bubble
Example (8B model on 8 GPUs, PP=2 + DP=4)
distributed:
strategy: fsdp2
pp_size: 2
pipeline:
pp_schedule: interleaved1f1b
pp_microbatch_size: 4
scale_grads_in_schedule: false
checkpoint:
model_save_format: safetensors
save_consolidated: true
How it works
AutoPipeline.build() calls pipeline_model() which splits the model into
stages using the model's _pp_plan, creates PipelineStage objects, and
builds the schedule. During training, schedule.step() drives forward and
backward through the pipeline.
Context Parallelism
Use CP for long sequences (8K+). CP shards Q/K/V on the sequence dimension as DTensors.
Config
distributed:
strategy: fsdp2
cp_size: 2 # or 4, 8
Requirements
- SDPA (Flash Attention or Efficient Attention backend) or Transformer Engine attention. SDPBackend.MATH is not compatible with DTensor.
- Attention masks are automatically stripped;
is_causal=Trueis set via forward pre-hooks registered byattach_context_parallel_hooks().
How it works
- After model sharding,
apply_model_infrastructure()callsattach_context_parallel_hooks()on each model part (for non-TE models). - At each training step,
make_cp_batch_and_ctx()creates a CP context manager that shards the batch along the sequence dimension and sets upcontext_parallel()fromtorch.distributed.tensor.experimental. - For TE attention models,
make_cp_batch_for_te()uses THD format and TE'sthd_get_partitioned_indicesfor sharding.
CP with Sequence Packing
CP works with packed sequences. The packed_sequence_size must be divisible
by cp_size. When using TE, chunks are sharded per-chunk via
_shard_thd_chunk_for_te().
Sequence Packing
Packing multiple sequences into a single training sample for efficiency.
Config
packed_sequence:
packed_sequence_size: 4096 # 0 = disabled
step_scheduler:
local_batch_size: 1 # must be 1 for packed sequences
When packed_sequence_size > 0, the dataset collator packs sequences up to
that length. local_batch_size must be 1 because each "sample" is already a
packed batch.
MoE Distributed Training
Expert Parallelism
Set ep_size > 1 to distribute experts across GPUs. This creates a separate
moe_mesh alongside the main device_mesh:
distributed:
strategy: fsdp2
ep_size: 8
activation_checkpointing: true
The moe_mesh shape is (pp_size, ep_shard_size, ep_size) with dimension
names ("pp", "ep_shard", "ep").
Constraint: dp_cp_size (= dp_size * cp_size) must be divisible by
ep_size.
MoE sub-config
distributed:
strategy: fsdp2
ep_size: 8
activation_checkpointing: true
moe:
reshard_after_forward: false
ignore_router_for_ac: false
wrap_outer_model: true
The moe sub-section maps to MoEParallelizerConfig and is only
instantiated when ep_size > 1.
Full MoE example (Qwen3-30B-A3B on 8 GPUs)
distributed:
strategy: fsdp2
tp_size: 1
cp_size: 1
pp_size: 1
ep_size: 8
sequence_parallel: false
activation_checkpointing: true
MegatronFSDP limitations
Despite its name, megatron_fsdp does not support expert parallelism
(ep_size > 1), pipeline parallelism (pp_size > 1), or
sequence_parallel. Use fsdp2 for these features.
Parallelism Sizing Guidelines
Dense models
| Model size | TP | PP | CP | Strategy |
|---|---|---|---|---|
| < 3B | 1 | 1 | 1 | FSDP2 (DP only) |
| 3-13B | 2-4 | 1 | 1 | FSDP2 + TP |
| 13-70B | 4-8 | 2-4 | 1 | FSDP2 + TP + PP |
| 70B+ | 8 | 4-8 | 1 | FSDP2 + TP + PP |
| Any + long seq (8K+) | as above | as above | 2-8 | add CP |
MoE models
MoE models need less TP than dense models of similar total parameter count because only a fraction of parameters are active per token. EP is the primary scaling dimension:
| Model | TP | PP | EP | Notes |
|---|---|---|---|---|
| Small MoE (<10B total) | 1 | 1 | 8 | EP only |
| Medium MoE (10-30B total) | 1-2 | 1 | 8 | small TP for shared layers |
| Large MoE (100B+ total) | 1-2 | 4+ | 8-64 | PP for depth, EP for experts |
Hardware topology rules
- TP must stay within a single NVLink domain (one node, typically 8 GPUs).
- Use PP or DP for cross-node scaling.
- TP across InfiniBand degrades throughput severely.
Code Anchors
components/distributed/config.py: FSDP2Config, MegatronFSDPConfig, DDPConfig.components/distributed/mesh.py: MeshContext, strategy map, and mesh sizes.components/distributed/device_mesh.py: device mesh andmoe_meshcreation.components/distributed/pipelining/config.py: PipelineConfig fields.components/moe/config.py: MoEParallelizerConfig and MoEConfig.recipes/_dist_setup.py: YAML parsing and distributed setup.
Pitfalls
-
TP across nodes destroys throughput. Always keep TP within a single NVLink domain. Use PP or DP for cross-node scaling.
-
PP requires
_pp_planon the model class. Not all HF models have this. Checkvalidate_hf_model_for_pipeline_support()before enabling PP. -
PP bubbles reduce GPU utilization. Use interleaved schedules (
interleaved_1f1b) and smaller microbatches to reduce bubble time. -
FSDP2 requires DTensor-aware state dict saving. Use
safetensorswithsave_consolidated: truefor checkpoint compatibility. -
CP requires compatible attention. SDPA (Flash Attention or Efficient Attention) or TE attention only.
SDPBackend.MATHis not compatible with DTensor. -
MoE EP size must evenly divide
dp_size * cp_size. The device mesh creation assertsdp_cp_size % ep_size == 0. -
MegatronFSDP is more limited than FSDP2. It does not support PP (
pp_size > 1), EP (ep_size > 1), orsequence_parallel. TheMeshContextvalidation raises on these combinations. -
DDP supports nothing beyond data parallelism. No TP, PP, CP, EP, or HSDP. Validation raises on any of these.
-
Activation checkpointing increases compute. It saves memory by recomputing activations during backward, but adds ~30% compute overhead.
-
Mixed precision policy must match model expectations. The default bfloat16 policy works for most models. FP16 models may need a custom
MixedPrecisionPolicy. -
packed_sequence_sizemust be divisible bycp_sizewhen using CP with packed sequences. -
dp_replicate_sizeis FSDP2-only. Passing it withmegatron_fsdporddpraises aValueError.
Verification
Run the smallest recipe that exercises the requested strategy. Success means exit code 0, finite loss, no NCCL timeout, and log output matching the expected TP/PP/CP/EP sizes.