What&#x27;s on the bench.

Nemo Mbridge Perf Moe Vlm Training

Practical guidance for training MoE VLMs in Megatron Bridge. Compares FSDP and 3D-parallel approaches, using rounded lessons from Qwen3-VL, Qwen3-Next, and other multimodal experiments.

Nemo Mbridge Perf Moe Optimization Workflow

Systematic workflow for MoE training optimization in Megatron Bridge, based on the Megatron-Core MoE paper. Covers the Three Walls framework, parallel folding, recompute strategy, dispatcher choice, and CUDA-graph bring-up.

Nemo Mbridge Perf Moe Long Context

Long-context MoE training guidance for Megatron Bridge. Covers CP sizing, selective recompute, dispatcher choices, and practical patterns from DSV3, Qwen3, and Qwen3-Next long-context experiments.

Nemo Mbridge Perf Moe Hardware Configs

Representative MoE training playbooks by hardware platform and model family. Summarizes rounded throughput bands, parallelism patterns, and common tuning stacks.

Nemo Mbridge Perf Moe Dispatcher Selection

Choose the right MoE token dispatcher (`alltoall`, DeepEP, or HybridEP) for the hardware, EP degree, and optimization stage. Summarizes patterns from DSV3, Qwen3, Qwen3-Next, and VLM bring-up work.

Nemo Mbridge Perf Moe Comm Overlap

MoE expert-parallel communication overlap in Megatron Bridge. Covers dispatch/combine overlap, flex dispatcher backends, and expert wgrad scheduling.

Nemo Mbridge Perf Memory Tuning

Techniques for reducing peak GPU memory in Megatron Bridge — expandable segments, parallelism resizing, activation recompute, CPU offloading constraints, and common OOM fixes.

Nemo Mbridge Perf Megatron Fsdp

Operational guide for enabling Megatron FSDP in Megatron-Bridge, including config knobs, code anchors, pitfalls, and verification.

Nemo Mbridge Perf Hierarchical Context Parallel

Operational guide for enabling hierarchical context parallelism in Megatron-Bridge, including config knobs, code anchors, pitfalls, and verification.

Nemo Mbridge Perf Expert Parallel Overlap

Validate and use MoE expert-parallel communication overlap in Megatron-Bridge, including overlap_moe_expert_parallel_comm, delay_wgrad_compute, and flex dispatcher backends such as DeepEP and HybridEP.

Nemo Mbridge Perf Cuda Graphs

Validate and use CUDA graph capture in Megatron Bridge, including local full-iteration graphs and Transformer Engine scoped graphs for attention, MLP, and MoE modules.

Nemo Mbridge Perf Cpu Offloading

Validate and use CPU offloading in Megatron Bridge, including layer-level activation offloading and fractional optimizer state offloading with HybridDeviceOptimizer.

Nemo Mbridge Perf Activation Recompute

Validate and use selective and full activation recompute in Megatron Bridge to reduce GPU memory usage at the cost of extra compute.

Nemo Mbridge Multi Node Slurm

Convert single-node scripts to multi-node Slurm sbatch jobs and debug common multi-node failures. Covers srun-native vs uv run torch.distributed approaches, container setup, NCCL timeouts, OOM sizing for MoE models, and interactive allocation.

Nemo Mbridge Mlm Bridge Training

Run Megatron-LM (MLM) and Megatron Bridge training with mock or real data. Covers correlation testing, available recipes, and multi-GPU examples.

Nemo Evaluator Plugin

Use when working on the Evaluator plugin CLI, jobs, SDK-backed specs, metric types, or plugin-owned Evaluator skills.

Nemo Data Designer Plugin

Use when the user wants to create a dataset, generate synthetic data, or build a data generation pipeline.

Nemo Automodel Recipe Development

Create and modify NeMo AutoModel training and evaluation recipes, including YAML structure, builders, and execution flow.

Nemo Automodel Model Onboarding

Guide for onboarding new model architectures into NeMo AutoModel, including architecture discovery, implementation patterns, registration, and validation.

Nemo Automodel Launcher Config

Configure NeMo AutoModel job launches for interactive runs, Slurm clusters, and SkyPilot cloud execution.

Nemo Automodel Distributed Training

Guide for selecting and configuring distributed training strategies in NeMo AutoModel, including FSDP2, Megatron FSDP, DDP, and parallelism settings.

Mcore Testing

Test system for Megatron-LM. Covers test layout, recipe YAML structure, adding and running unit and functional tests, golden values, marker filters, and CI parity.

Mcore Split Pr

Split a PR into multiple PRs to reduce the number of required CODEOWNERS reviewer groups.

Mcore Run On Slurm

How to launch distributed Megatron-LM training jobs on a SLURM cluster. Covers a minimal sbatch skeleton, environment-variable setup for torch.distributed.run, CUDA_DEVICE_MAX_CONNECTIONS rules across hardware and parallelism modes, container conventions, monitoring, and per-rank failure diagnosis.