Nemo Mbridge Perf Cpu Offloading

Validate and use CPU offloading in Megatron Bridge, including layer-level activation offloading and fractional optimizer state offloading with HybridDeviceOptimizer.

Published by @NVIDIA

CPU Offloading

References

  • Stable docs: @docs/training/cpu-offloading.md
  • Structured metadata: @skills/nemo-mbridge-perf-cpu-offloading/card.yaml

What It Is

Two independent mechanisms to move data from GPU to CPU memory:

| Mechanism | Config namespace | What gets offloaded | PP restriction |
|---|---|---|---|
| Activation offloading | model.cpu_offloading* | Activations (and optionally weights) per transformer layer | PP must be 1 |
| Optimizer offloading | optimizer.optimizer_cpu_offload | Adam optimizer states (momentum + variance) via HybridDeviceOptimizer | None |

Quick Decision

| Situation | Recommendation |
|---|---|
| Large MoE model (30B+), needs PP > 1 | Optimizer offloading; activation offloading is ruled out because it requires PP=1 |
| Small/medium model, PP=1 fits, activation memory dominates | Activation offloading |
| Want tunable memory-speed tradeoff | Optimizer offloading with a fractional optimizer_offload_fraction |
| Throughput is top priority | Don't enable either; offloading always adds overhead |
| CUDA graphs are needed | Optimizer offloading only; activation offloading is incompatible |
| Memory pressure is moderate | Optimizer offloading at a 25–50% fraction for best efficiency |
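
The decision table can be read as a small priority-ordered rule set. The sketch below is one possible encoding; `Workload` and `recommend_offloading` are hypothetical names for illustration and are not part of Megatron Bridge.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical summary of a training setup (not a Megatron Bridge type)."""
    needs_pipeline_parallel: bool = False      # model only fits with PP > 1
    needs_cuda_graphs: bool = False
    throughput_first: bool = False
    activation_memory_dominates: bool = False


def recommend_offloading(w: Workload) -> str:
    """Return 'none', 'optimizer', or 'activation' following the table."""
    if w.throughput_first:
        return "none"          # offloading always adds overhead
    if w.needs_pipeline_parallel or w.needs_cuda_graphs:
        return "optimizer"     # activation offloading needs PP=1 and no CUDA graphs
    if w.activation_memory_dominates:
        return "activation"
    return "optimizer"         # tunable through optimizer_offload_fraction


print(recommend_offloading(Workload(needs_pipeline_parallel=True)))
```

The ordering of the checks is a design choice of this sketch: hard incompatibilities are tested before memory preferences.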

Enablement

Optimizer CPU offloading (recommended for large models)

cfg.optimizer.optimizer_cpu_offload = True
cfg.optimizer.optimizer_offload_fraction = 1.0
cfg.optimizer.overlap_cpu_optimizer_d2h_h2d = True

CLI overrides:

optimizer.optimizer_cpu_offload=True \
optimizer.optimizer_offload_fraction=0.5 \
optimizer.overlap_cpu_optimizer_d2h_h2d=True

Activation CPU offloading (small/medium models only)

cfg.model.cpu_offloading = True
cfg.model.cpu_offloading_num_layers = 16
cfg.model.cpu_offloading_activations = True
cfg.model.cpu_offloading_weights = False

cfg.model.pipeline_model_parallel_size = 1
cfg.model.recompute_granularity = None
cfg.model.cuda_graph_impl = "none"

Config Parameter Reference

Optimizer offloading

| Parameter | Default | Description |
|---|---|---|
| optimizer_cpu_offload | False | Master switch |
| optimizer_offload_fraction | 0.0 | Fraction of optimizer states on CPU (0.0–1.0) |
| overlap_cpu_optimizer_d2h_h2d | False | Overlap GPU↔CPU transfers with compute |
| use_torch_optimizer_for_cpu_offload | False | Use torch.optim instead of fused optimizer for CPU portion |
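
To get a feel for what a given optimizer_offload_fraction buys, the arithmetic can be sketched as below. The assumptions are mine, not statements about HybridDeviceOptimizer internals: two fp32 Adam states per parameter (momentum and variance), states sharded evenly across data-parallel ranks by the distributed optimizer, and the fraction applied uniformly by bytes. `adam_state_bytes_on_gpu` is a hypothetical helper.

```python
def adam_state_bytes_on_gpu(num_params: int, dp_size: int,
                            offload_fraction: float,
                            bytes_per_state: int = 4) -> float:
    """Rough per-GPU bytes of Adam momentum + variance that stay on the GPU.

    Assumes 2 states per parameter, even sharding across dp_size ranks,
    and that offload_fraction is applied uniformly (all assumptions).
    """
    if not 0.0 <= offload_fraction <= 1.0:
        raise ValueError("offload_fraction must be in [0.0, 1.0]")
    if dp_size < 1:
        raise ValueError("dp_size must be >= 1")
    per_rank = 2 * bytes_per_state * num_params / dp_size
    return per_rank * (1.0 - offload_fraction)


# Illustrative numbers only: 1B parameters, 8 data-parallel ranks.
for fraction in (0.0, 0.5, 1.0):
    gib = adam_state_bytes_on_gpu(1_000_000_000, 8, fraction) / 2**30
    print(f"fraction={fraction:.1f}: ~{gib:.2f} GiB of Adam states per GPU")
```

Measured savings can differ, since master weights, gradients, and allocator behaviour are not modelled here.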

Activation offloading

| Parameter | Default | Description |
|---|---|---|
| cpu_offloading | False | Master switch |
| cpu_offloading_num_layers | 0 | Number of transformer layers to offload (0 to num_layers-1) |
| cpu_offloading_activations | True | Offload activations |
| cpu_offloading_weights | False | Offload weights |
| cpu_offloading_double_buffering | False | Double-buffer across layers while reloading |

Compatibility And Constraints

Activation offloading

  • pipeline_model_parallel_size must be 1
  • recompute_granularity must be None
  • Cannot combine with fine_grained_activation_offloading
  • Cannot combine with CUDA graphs
  • cpu_offloading_num_layers must be in [0, num_layers), i.e. 0 to num_layers-1 inclusive
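
A pre-flight check can surface all of these conflicts at once instead of one exception per launch. The sketch below mirrors the MCore checks quoted under Code Anchors; `ActivationOffloadSettings` and `activation_offload_problems` are hypothetical stand-ins, not the real config classes, and the CUDA graph test assumes "none" is the disabled value of cuda_graph_impl.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ActivationOffloadSettings:
    """Hypothetical stand-in for the relevant model config fields."""
    num_layers: int
    cpu_offloading: bool = False
    cpu_offloading_num_layers: int = 0
    pipeline_model_parallel_size: int = 1
    recompute_granularity: Optional[str] = None
    fine_grained_activation_offloading: bool = False
    cuda_graph_impl: str = "none"


def activation_offload_problems(s: ActivationOffloadSettings) -> list[str]:
    """Collect every constraint violation instead of stopping at the first."""
    if not s.cpu_offloading:
        return []
    problems = []
    if not 0 <= s.cpu_offloading_num_layers < s.num_layers:
        problems.append("cpu_offloading_num_layers must be in [0, num_layers)")
    if s.pipeline_model_parallel_size > 1:
        problems.append("pipeline_model_parallel_size must be 1")
    if s.recompute_granularity is not None:
        problems.append("recompute_granularity must be None")
    if s.fine_grained_activation_offloading:
        problems.append("cannot combine with fine_grained_activation_offloading")
    if s.cuda_graph_impl != "none":
        problems.append("cannot combine with CUDA graphs")
    return problems


ok = ActivationOffloadSettings(num_layers=48, cpu_offloading=True,
                               cpu_offloading_num_layers=16)
print(activation_offload_problems(ok))
```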

Optimizer offloading

  • Requires use_distributed_optimizer = True (default in most recipes)
  • No PP, recompute, or CUDA graph restrictions
  • optimizer_offload_fraction must be in [0.0, 1.0]
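
The optimizer-side constraints are few enough to check in a handful of lines. This is a sketch under the constraints listed above; `optimizer_offload_problems` is a hypothetical helper, and the real validation in Megatron Bridge or MCore may be stricter or phrased differently.

```python
def optimizer_offload_problems(optimizer_cpu_offload: bool,
                               optimizer_offload_fraction: float,
                               use_distributed_optimizer: bool) -> list[str]:
    """Return the violated optimizer-offloading constraints, if any."""
    problems = []
    if not 0.0 <= optimizer_offload_fraction <= 1.0:
        problems.append("optimizer_offload_fraction must be in [0.0, 1.0]")
    if optimizer_cpu_offload and not use_distributed_optimizer:
        problems.append("optimizer_cpu_offload requires use_distributed_optimizer = True")
    return problems


print(optimizer_offload_problems(True, 0.5, use_distributed_optimizer=True))
```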

Practical: large MoE models

Activation offloading is effectively blocked for Qwen3-30B-A3B and similar large MoE models. The PP=1 constraint means each GPU holds all 48 layers, and model weights plus optimizer states alone (~70 GB) consume most of an H100's 80 GB, leaving too little headroom for activations.

Minimal Runnable Command

uv run python scripts/training/run_recipe.py \
  --recipe qwen3_30b_a3b_pretrain_config \
  optimizer.optimizer_cpu_offload=True \
  optimizer.optimizer_offload_fraction=0.5 \
  train.train_iters=20 \
  train.global_batch_size=8 \
  train.micro_batch_size=1

Verification

Unit tests

uv run python -m pytest \
  tests/unit_tests/models/test_gpt_full_te_layer_autocast_spec.py \
  tests/unit_tests/peft/test_utils.py \
  -k "cpu_offload" -q

Success criteria

  • Config validation passes for the selected offloading mode
  • Training completes without OOM or NCCL errors
  • Loss matches the non-offloaded baseline (max delta < 0.001)
  • Memory usage drops proportionally to offload fraction
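
The loss criterion can be checked mechanically once per-iteration losses from both runs are available. The sketch below assumes the losses have already been parsed into two equal-length lists; `max_abs_loss_delta` is a hypothetical helper and the loss values are made-up placeholders, not measured results.

```python
def max_abs_loss_delta(baseline: list[float], offloaded: list[float]) -> float:
    """Largest per-iteration absolute difference between two loss curves."""
    if not baseline or len(baseline) != len(offloaded):
        raise ValueError("need two non-empty loss lists of equal length")
    return max(abs(b - o) for b, o in zip(baseline, offloaded))


# Placeholder values for illustration only.
baseline_losses = [10.5, 9.8, 9.1]
offloaded_losses = [10.5004, 9.7997, 9.1002]

delta = max_abs_loss_delta(baseline_losses, offloaded_losses)
print(f"max delta = {delta:.4f}, passes = {delta < 0.001}")
```

Both runs should use the same seed, data order, and batch sizes; otherwise the comparison says little about offloading itself.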

Code Anchors

MCore activation offload constraints

        if self.cpu_offloading and (
            self.cpu_offloading_num_layers < 0 or self.cpu_offloading_num_layers >= self.num_layers
        ):
            raise ValueError(...)

        if self.cpu_offloading and self.pipeline_model_parallel_size > 1:
            raise ValueError(
                "Currently there is no support for Pipeline parallelism with CPU offloading"
            )

        if self.cpu_offloading and self.recompute_granularity is not None:
            raise ValueError(
                "CPU offloading does not work when activation recomputation is enabled"
            )

MCore CUDA graph incompatibility

            if self.cpu_offloading:
                raise ValueError("CUDA graphs not supported with CPU offloading.")

MCore fine-grained offloading mutual exclusion

        if self.fine_grained_activation_offloading:
            assert (
                not self.cpu_offloading
            ), "fine_grained_activation_offloading cannot be enabled with cpu_offloading."

MCore HybridDeviceOptimizer instantiation

        if config.optimizer_cpu_offload:
            # ... setup cpu/gpu optimizer classes ...
            optimizer = HybridDeviceOptimizer(
                param_groups,
                offload_fraction=config.optimizer_offload_fraction,
                cpu_optimizer_cls=cpu_optimizer_cls,
                gpu_optimizer_cls=gpu_optimizer_cls,
                overlap_cpu_optimizer_d2h_h2d=config.overlap_cpu_optimizer_d2h_h2d,
                pin_cpu_grads=config.pin_cpu_grads,
                pin_cpu_params=config.pin_cpu_params,
            )

Bridge CUDA graph guard

        assert not config.cpu_offloading and config.recompute_granularity is None, "Cudagraphs not supported"

Bridge activation offloading in PEFT

        if self.config.cpu_offloading and self.config.cpu_offloading_activations:
            x.activation_offloading = True
        x, _ = self.linear_in(x)
        x = self.activation(x)
        if self.config.cpu_offloading and self.config.cpu_offloading_activations:
            x.activation_offloading = True
        x, _ = self.linear_out(x)

Failure Diagnosis

| Symptom | Likely Cause | How To Confirm | Fix |
|---|---|---|---|
| Currently there is no support for Pipeline parallelism with CPU offloading | Activation offload + PP > 1 | Check pipeline_model_parallel_size | Set PP=1 or use optimizer offloading |
| CPU offloading does not work when activation recomputation is enabled | Activation offload + recompute | Check recompute_granularity | Set recompute_granularity=null |
| fine_grained_activation_offloading cannot be enabled with cpu_offloading | Both offloading modes enabled | Check both flags | Use one or the other |
| CUDA graphs not supported with CPU offloading | CUDA graphs + activation offload | Check cuda_graph_impl | Set cuda_graph_impl="none" |
| OOM with activation offloading | Model too large for PP=1 | Check allocated memory vs 80 GB | Use optimizer offloading with PP > 1 |
| Extreme slowdown (>4x) | 100% optimizer offload, CPU Adam bottleneck | Compare iter time at different fractions | Reduce fraction or enable overlap_cpu_optimizer_d2h_h2d |
| OOM at partial optimizer offload | Insufficient offload for this config | Check memory at different fractions | Increase fraction or add PP |
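
For the rows keyed on an exact error message, triage can be a substring lookup over the training log. This is a minimal sketch: `KNOWN_ERRORS` and `suggest_fixes` are hypothetical names, and the message strings are copied from the code anchors above, so they may drift if the upstream wording changes.

```python
KNOWN_ERRORS = {
    "Currently there is no support for Pipeline parallelism with CPU offloading":
        "Set PP=1 or use optimizer offloading",
    "CPU offloading does not work when activation recomputation is enabled":
        "Set recompute_granularity=null",
    "fine_grained_activation_offloading cannot be enabled with cpu_offloading":
        "Use one or the other",
    "CUDA graphs not supported with CPU offloading":
        'Set cuda_graph_impl="none"',
}


def suggest_fixes(log_text: str) -> list[str]:
    """Return the suggested fix for every known message found in the log."""
    return [fix for message, fix in KNOWN_ERRORS.items() if message in log_text]


print(suggest_fixes("ValueError: CUDA graphs not supported with CPU offloading."))
```

The OOM and slowdown rows have no fixed message and still need the manual checks listed in the table.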

Known Limitations

  • Activation offloading requires PP=1, making it impractical for large models (30B+ MoE) that need pipeline parallelism.
  • Optimizer offloading throughput penalty grows roughly linearly with the offload fraction (~1.9x at 25%, ~4.2x at 100% for Qwen3-30B-A3B).
  • D2H/H2D overlap provides only ~7% speedup because CPU Adam compute is the dominant bottleneck.
  • fine_grained_activation_offloading is a separate module-level approach that works with PP > 1 but cannot be combined with layer-level cpu_offloading.
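
For planning purposes, the two reported Qwen3-30B-A3B slowdown figures can be interpolated to guess the penalty at intermediate fractions. This is only a sketch: it assumes the penalty is linear between the two reported points, which two data points cannot confirm, and `estimated_slowdown` is a hypothetical helper. Other models and hardware will behave differently.

```python
# (offload fraction, slowdown factor) as reported above for Qwen3-30B-A3B.
REPORTED = [(0.25, 1.9), (1.0, 4.2)]


def estimated_slowdown(fraction: float) -> float:
    """Linearly interpolate the slowdown between the two reported points."""
    (f0, s0), (f1, s1) = REPORTED
    if not f0 <= fraction <= f1:
        raise ValueError("fraction is outside the reported range; no estimate")
    return s0 + (fraction - f0) * (s1 - s0) / (f1 - f0)


for fraction in (0.25, 0.5, 0.75, 1.0):
    print(f"fraction={fraction:.2f}: ~{estimated_slowdown(fraction):.2f}x")
```

The helper refuses to extrapolate below 25%, since the reported points do not pass through 1.0x at a fraction of zero.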
