NeMo Mbridge Perf Megatron FSDP

Operational guide for enabling Megatron FSDP in Megatron-Bridge, including config knobs, code anchors, pitfalls, and verification.

Published by @NVIDIA, from NVIDIA/skills

Megatron FSDP Skill

For stable background material and the recommendation level of this feature, see:

  • @docs/training/megatron-fsdp.md
  • @skills/nemo-mbridge-perf-megatron-fsdp/card.yaml

Enablement

Minimal Megatron FSDP override in Bridge:

cfg.dist.use_megatron_fsdp = True
cfg.ddp.use_megatron_fsdp = True
cfg.ddp.data_parallel_sharding_strategy = "optim_grads_params"
cfg.ddp.average_in_collective = False
cfg.checkpoint.ckpt_format = "fsdp_dtensor"

Example recipe fixup:

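cfg = llama3_8b_pretrain_config()
# Turn on Megatron FSDP in both the distributed-init and DDP configs.
cfg.dist.use_megatron_fsdp = True
cfg.ddp.use_megatron_fsdp = True
# Shard optimizer state, gradients, and parameters.
cfg.ddp.data_parallel_sharding_strategy = "optim_grads_params"
cfg.ddp.average_in_collective = False
# Megatron FSDP only supports the fsdp_dtensor checkpoint format.
cfg.checkpoint.ckpt_format = "fsdp_dtensor"
cfg.checkpoint.save = "/tmp/fsdp_ckpts"
# No checkpoint to resume from; training starts from scratch.
cfg.checkpoint.load = None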
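
The overrides above can be grouped into one helper so that every recipe gets the same set. This is a minimal sketch, not part of Megatron-Bridge: the function name `enable_megatron_fsdp` and the `SimpleNamespace` stand-in config are hypothetical, and only the attribute names come from the overrides listed above.

```python
from types import SimpleNamespace


def enable_megatron_fsdp(cfg, save_dir=None):
    """Apply the Megatron FSDP overrides to a config-like object (hypothetical helper)."""
    cfg.dist.use_megatron_fsdp = True
    cfg.ddp.use_megatron_fsdp = True
    # Shard optimizer state, gradients, and parameters.
    cfg.ddp.data_parallel_sharding_strategy = "optim_grads_params"
    cfg.ddp.average_in_collective = False
    # Megatron FSDP only supports the fsdp_dtensor checkpoint format.
    cfg.checkpoint.ckpt_format = "fsdp_dtensor"
    if save_dir is not None:
        cfg.checkpoint.save = save_dir
    return cfg


# Stand-in for a real recipe config, used only to exercise the helper.
demo_cfg = SimpleNamespace(
    dist=SimpleNamespace(use_megatron_fsdp=False),
    ddp=SimpleNamespace(use_megatron_fsdp=False),
    checkpoint=SimpleNamespace(ckpt_format="torch_dist", save=None, load=None),
)
enable_megatron_fsdp(demo_cfg, save_dir="/tmp/fsdp_ckpts")
```

With a real recipe, the same call would take the object returned by a recipe function such as llama3_8b_pretrain_config() in place of demo_cfg.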

To enable it from the performance harness:

python scripts/performance/launch.py --use_megatron_fsdp true

Code Anchors

Bridge config definition:

use_megatron_fsdp: bool = False
"""Use Megatron's Fully Sharded Data Parallel. Cannot be used together with use_torch_fsdp2."""

use_torch_fsdp2: bool = False
"""Use the torch FSDP2 implementation. FSDP2 is not currently working with Pipeline Parallel.
It is still not in a stable release stage, and may therefore contain bugs or other
potential issues."""

Bridge validation:

if self.dist.use_megatron_fsdp and self.dist.use_torch_fsdp2:
    raise ValueError(...)
...
assert not self.dist.use_tp_pp_dp_mapping, "use_tp_pp_dp_mapping is not supported with Megatron FSDP"
...
assert self.checkpoint.ckpt_format == "fsdp_dtensor", (
    "Megatron FSDP only supports fsdp_dtensor checkpoint format"
)
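
To see how these checks interact before launching a job, the excerpt can be restated as a standalone function. This is a sketch under assumptions: `validate_fsdp_config` is a hypothetical name, the config is a `SimpleNamespace` stand-in, and because the excerpt elides the surrounding conditions, guarding the two assertions on `use_megatron_fsdp` is a guess rather than Bridge's actual control flow.

```python
from types import SimpleNamespace


def validate_fsdp_config(cfg):
    """Restate the quoted Bridge checks; raises ValueError on the first violation."""
    if cfg.dist.use_megatron_fsdp and cfg.dist.use_torch_fsdp2:
        raise ValueError("use_megatron_fsdp and use_torch_fsdp2 are mutually exclusive")
    # Assumption: the remaining checks apply only when Megatron FSDP is on.
    if cfg.dist.use_megatron_fsdp:
        if cfg.dist.use_tp_pp_dp_mapping:
            raise ValueError("use_tp_pp_dp_mapping is not supported with Megatron FSDP")
        if cfg.checkpoint.ckpt_format != "fsdp_dtensor":
            raise ValueError("Megatron FSDP only supports fsdp_dtensor checkpoint format")


def make_cfg(megatron=True, fsdp2=False, mapping=False, fmt="fsdp_dtensor"):
    """Build a stand-in config with only the fields the checks read."""
    return SimpleNamespace(
        dist=SimpleNamespace(
            use_megatron_fsdp=megatron,
            use_torch_fsdp2=fsdp2,
            use_tp_pp_dp_mapping=mapping,
        ),
        checkpoint=SimpleNamespace(ckpt_format=fmt),
    )


def violation(cfg):
    """Return the error message for cfg, or None if it passes."""
    try:
        validate_fsdp_config(cfg)
    except ValueError as err:
        return str(err)
    return None
```

For example, violation(make_cfg(fmt="torch_dist")) returns the checkpoint format message, while violation(make_cfg()) returns None.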

Runtime wrapper selection:

if use_megatron_fsdp:
    DP = FullyShardedDataParallel
elif use_torch_fsdp2:
    DP = TorchFullyShardedDataParallel
else:
    DP = DistributedDataParallel
...
DP(
    config=get_model_config(model_chunk),
    ddp_config=ddp_config,
    module=model_chunk,
    ...
    pg_collection=pg_collection,
)

Perf harness overrides:

recipe.ddp.use_megatron_fsdp = True
recipe.ddp.data_parallel_sharding_strategy = "optim_grads_params"
recipe.ddp.keep_fp8_transpose_cache = False
recipe.ddp.average_in_collective = False
...
recipe.checkpoint.load = None

Pitfalls

  1. Public recipes often expose use_megatron_fsdp but still default to ckpt_format="torch_dist". If checkpoint save or load is enabled, set ckpt_format="fsdp_dtensor".
  2. use_torch_fsdp2 exists, but on the validated branch Bridge still fails before training starts because _ddp_wrap passes pg_collection.
  3. CPU offloading is valid only when pipeline_model_parallel_size == 1 and activation recomputation is disabled.
  4. Upstream warns that FSDP and TP/CP may need different CUDA_DEVICE_MAX_CONNECTIONS settings on Hopper and earlier architectures.
  5. Megatron FSDP and FSDP2 are mutually exclusive: setting both use_megatron_fsdp and use_torch_fsdp2 raises a ValueError.
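
Pitfalls 1 and 3 are simple enough to check mechanically before a launch. The sketch below is illustrative only: `fsdp_pitfall_warnings` and all of its parameter names are hypothetical, plain arguments are used because the exact Bridge field names for CPU offloading and recomputation are not shown here, and the conditions are taken from the list above.

```python
def fsdp_pitfall_warnings(
    ckpt_format,
    save=None,
    load=None,
    cpu_offloading=False,
    pipeline_model_parallel_size=1,
    activation_recompute=False,
):
    """Return human-readable warnings for pitfalls 1 and 3 (hypothetical helper)."""
    warnings = []
    # Pitfall 1: checkpointing is enabled but the format is not fsdp_dtensor.
    if (save or load) and ckpt_format != "fsdp_dtensor":
        warnings.append(
            f"ckpt_format={ckpt_format!r} with save/load enabled; use 'fsdp_dtensor'"
        )
    # Pitfall 3: CPU offloading needs PP size 1 and no activation recomputation.
    if cpu_offloading and (pipeline_model_parallel_size != 1 or activation_recompute):
        warnings.append(
            "CPU offloading requires pipeline_model_parallel_size == 1 "
            "and activation recomputation disabled"
        )
    return warnings


demo_warnings = fsdp_pitfall_warnings(
    "torch_dist",
    save="/tmp/fsdp_ckpts",
    cpu_offloading=True,
    pipeline_model_parallel_size=2,
)
```

An empty list means neither pitfall was detected; it does not cover pitfalls 2, 4, or 5.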

Verification

Use the existing 2-GPU functional smoke test:

CUDA_VISIBLE_DEVICES=0,1 uv run python -m torch.distributed.run --nproc_per_node=2 \
  -m pytest tests/functional_tests/training/test_megatron_fsdp.py::TestMegatronFSDP::test_fsdp_pretrain_basic -v -s

Success criteria:

  • Pytest reports 1 passed
  • The log shows finite loss at the last iteration
  • The run finishes without a checkpoint format assertion
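
The three criteria can be combined into a single check when scripting the smoke test. This is a rough sketch with assumptions: `smoke_test_succeeded` is a hypothetical helper, the captured pytest output and the last-iteration loss are passed in by the caller because the training log format is not specified here, and the assertion text is the one quoted in the Bridge validation excerpt.

```python
import math
import re

# Message from the Bridge checkpoint format assertion.
CKPT_ASSERT_MSG = "Megatron FSDP only supports fsdp_dtensor checkpoint format"


def smoke_test_succeeded(pytest_output, last_loss):
    """Return True if all three success criteria hold (hypothetical helper)."""
    one_passed = re.search(r"\b1 passed\b", pytest_output) is not None
    finite_loss = math.isfinite(last_loss)
    no_ckpt_assert = CKPT_ASSERT_MSG not in pytest_output
    return one_passed and finite_loss and no_ckpt_assert
```

The caller is responsible for extracting the last-iteration loss from the log and converting it to a float before calling the helper.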
