nemo-mbridge-perf-memory-tuning

Techniques for reducing peak GPU memory in Megatron Bridge — expandable segments, parallelism resizing, activation recompute, CPU offloading constraints, and common OOM fixes.

Skill file

Preview skill file↓↑

---
name: nemo-mbridge-perf-memory-tuning
description: Techniques for reducing peak GPU memory in Megatron Bridge — expandable segments, parallelism resizing, activation recompute, CPU offloading constraints, and common OOM fixes.
license: Apache-2.0
when_to_use: GPU OOM errors, reducing peak memory, or tracing an OOM regression to a specific commit or config change; 'out of memory', 'OOM', 'memory fragmentation', 'expandable_segments', 'reduce GPU memory', 'PYTORCH_CUDA_ALLOC_CONF'.
---

# Memory Tuning

Stable docs: @docs/parallelisms.md
Card: @skills/nemo-mbridge-perf-memory-tuning/card.yaml

## What It Is

GPU OOM failures during training often stem from memory **fragmentation** rather
than raw capacity.  PyTorch's default CUDA allocator can leave unusable gaps
between allocations.  The single most effective fix is:

```bash
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
```

This tells PyTorch to use expandable (non-fixed-size) memory segments, which
dramatically reduces fragmentation and often eliminates borderline OOM without
any model or parallelism changes.

Beyond fragmentation, actual peak memory is determined by:

- **Parameter + optimizer state memory** — controlled by TP, PP, DP sharding
  (distributed optimizer, FSDP)
- **Activation memory** — controlled by activation recompute, sequence length,
  micro-batch size
- **Temporary / workspace memory** — CUDA kernels, NCCL buffers, CUDA graphs

For configuration planning, use the Bridge theoretical estimator before launching
large jobs:

```python
from megatron.bridge.training.utils.theoretical_memory_utils import estimate_training_memory

estimate = estimate_training_memory(cfg, num_microbatches=num_microbatches)
```

The estimator reports the most-loaded GPU shard and separates dense/embedding,
routed MoE expert, and activation components. It does not include allocator
fragmentation, CUDA/NCCL workspace, CUDA graph buffers, token imbalance, or
dispatcher workspace, so validate final configs with runtime memory metrics.

## Quick Decision

When a training run OOMs or is close to the memory limit:

1. **Set `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` first.** This fixes
   fragmentation-induced OOM with zero performance cost. Most Slurm launch
   templates already include it.
2. **Add selective activation recompute** (`recompute_modules=[core_attn]`) if
   not already enabled. See @skills/nemo-mbridge-perf-activation-recompute/SKILL.md.
3. **Avoid increasing TP** as a memory fix — doubling TP dramatically increases
   NVLink all-reduce volume and often kills throughput (-28% on Llama3 70B).
4. **Avoid increasing PP at the cost of DP** — halving DP doubles gradient
   accumulation steps and hurts throughput (~6%).
5. Consider `mlp` recompute if still OOM. Saves ~3 GB but costs ~16% GPU
   utilization on large dense models (Llama3 70B).
6. CPU offloading is **blocked when PP > 1**.

## Enablement

### Expandable segments (recommended first step)

Set in the job's environment before launching:

```bash
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
```

In Slurm scripts this is typically placed alongside other env vars:

```bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
export NVTE_ALLOW_NONDETERMINISTIC_ALGO=1
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
```

No model config changes needed. Zero throughput cost.

### Parallelism resizing

If the model genuinely does not fit (not fragmentation), adjust parallelism:

| Strategy | Memory effect | Throughput cost | Notes |
|---|---|---|---|
| Increase PP (keeping DP) | Fewer layers per stage | Moderate (~6% if DP halved) | Only if GPU count allows |
| Increase TP | Fewer params per GPU | Severe (-28% on 70B) | Last resort |
| Distributed optimizer | Shards optimizer state across DP ranks | ~1-2% | Recommended for large models |
| FSDP | Shards params + grads + optimizer | Varies | See @skills/nemo-mbridge-perf-megatron-fsdp/SKILL.md |

### Activation recompute

See @skills/nemo-mbridge-perf-activation-recompute/SKILL.md for full details.

### CPU offloading

```python
cfg.model.cpu_offloading = True
```

**Incompatible with PP > 1.** Only usable when `pipeline_model_parallel_size = 1`.

## A Note on VPP

Virtual pipeline parallelism (VPP) is primarily a **throughput** optimization
that reduces pipeline bubble overhead by interleaving smaller model chunks. Its
effect on peak memory is minimal — changing VPP does not meaningfully change
the total activation, parameter, or optimizer memory on a GPU.

In earlier experiments we incorrectly attributed an OOM fix to VPP tuning
(VPP 5→10). The actual fix was `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`
which eliminated memory fragmentation. The VPP=10 run actually used slightly
**more** peak memory (60.2 GB vs 58.8 GB) but did not OOM because expandable
segments prevented fragmentation.

VPP should be tuned for pipeline bubble reduction (see @docs/parallelisms.md),
not as a memory fix.

## Compatibility and Constraints

- `expandable_segments:True` is incompatible with `--use-nccl-ub` (NCCL
  user-buffer registration). See Megatron-FSDP docs.
- When using CUDA graphs with `expandable_segments:True`, set
  `NCCL_GRAPH_REGISTER=0` (required on pre-Blackwell GPUs, enforced by MCore
  `CudaGraphManager`).
- CPU offloading requires `pipeline_model_parallel_size = 1`.
- Distributed optimizer requires `use_distributed_optimizer = True` in the
  optimizer config.

## Measured Results

Llama3 70B SFT on 32x H100 80GB, FP8 (Current Scaling):
- Baseline: TP=4, PP=4, VPP=5, DP=2, MBS=1, GBS=32, seq_len=4096
- Golden GPU utilization: 709.93 TFLOP/s/GPU
- Regression threshold: 5%

### Strategy comparison: parallelism changes for memory reduction

| Experiment | TP | PP | VPP | DP | TFLOP/s/GPU | vs Golden | Peak Mem (GB) | Result |
|---|---|---|---|---|---|---|---|---|
| Baseline | 4 | 4 | 5 | 2 | ~704 | -0.8% | 58.8 | OOM (fragmentation) |
| More PP | 4 | 8 | 5 | 1 | 668.0 | -5.9% | 53.2 | Borderline perf |
| More TP | 8 | 4 | 5 | 1 | 508.7 | -28.4% | 50.2 | Severe regression |
| Baseline + expandable_segments | 4 | 4 | 5 | 2 | ~704 | -0.8% | ~59 | **Passed** |

Key takeaways:

- **`expandable_segments:True` is the winner.** The baseline OOM was caused by
  memory fragmentation, not insufficient capacity. Setting this env var
  eliminated the OOM with zero throughput cost and no parallelism changes.
- **PP=8 works for memory but loses DP** (2→1), meaning 32 gradient accumulation
  steps per batch, which hurts throughput by ~6%.
- **TP=8 is catastrophic** (-28%) because doubling TP increases all-reduce
  communication volume proportionally across NVLink, and DP=1 means no
  micro-batch overlap.

### CPU offloading: blocked

| Experiment | offload_layers | Result |
|---|---|---|
| Exp 4 | 2 | Incompatible (PP > 1) |
| Exp 5 | 4 | Incompatible (PP > 1) |
| Exp 6 | 6 | Incompatible (PP > 1) |

`ValueError: Currently there is no support for Pipeline parallelism with CPU
offloading.` This approach is blocked for any model using PP > 1.

### Activation recompute: expensive alternative

Selective activation recompute with `mlp` saved ~3 GB peak memory but cost
~16% GPU utilization on this workload. See
@skills/nemo-mbridge-perf-activation-recompute/SKILL.md for full results.

## Code Anchors

### CPU offloading PP incompatibility (MCore)

```1303:1306:3rdparty/Megatron-LM/megatron/core/transformer/transformer_config.py
        if self.cpu_offloading and self.pipeline_model_parallel_size > 1:
            raise ValueError(
                "Currently there is no support for Pipeline parallelism with CPU offloading"
            )
```

### VPP config and layer divisibility validation (MCore)

```1581:1592:3rdparty/Megatron-LM/megatron/core/transformer/transformer_config.py
            if pipeline_parallel_size and self.virtual_pipeline_model_parallel_size is not None:
                num_layers_per_middle_pipeline_rank = num_layers // pipeline_parallel_size
                if (
                    not num_layers_per_middle_pipeline_rank
                    % self.virtual_pipeline_model_parallel_size
                    == 0
                ):
                    raise ValueError(
                        f"number of layers on each middle pipeline rank:"
                        f"{num_layers_per_middle_pipeline_rank} must be divisible by virtual"
                        f"pipeline parallel degree {self.virtual_pipeline_model_parallel_size}"
                    )
```

### Parallelism docs on interleaved pipeline schedule

```116:124:docs/parallelisms.md
To minimize the pipeline bubble, the computation on each GPU can be divided into multiple subsets of layers (referred to as model chunks), rather than a single contiguous block. Enable this by setting `virtual_pipeline_model_parallel_size`:

model_config = GPTModelProvider(
    pipeline_model_parallel_size=4,
    virtual_pipeline_model_parallel_size=2,  # 2 model chunks per pipeline stage
    # ... other model parameters
)
```

## Failure Diagnosis

| Symptom | Cause | Confirm | Fix |
|---|---|---|---|
| OOM on a single rank despite headroom on others | Memory fragmentation | check if `expandable_segments:True` is set | set `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` |
| OOM with `expandable_segments` already set | Genuine capacity limit | check `nvidia-smi` for param/optimizer memory | increase PP, use distributed optimizer, or add recompute |
| Estimated memory exceeds GPU capacity before launch | model state or activations genuinely too large | run `estimate_training_memory` and inspect the largest component | adjust PP/TP/CP/EP, distributed optimizer, or recompute before launching |
| `ValueError: PP + CPU offloading` | using cpu_offloading with PP > 1 | check PP config | disable CPU offloading or set PP=1 |
| `RuntimeError` with `--use-nccl-ub` + expandable segments | NCCL UB incompatible with expandable allocator | check env vars | remove `expandable_segments:True` or disable `--use-nccl-ub` |

## Known Limitations

- CPU offloading is blocked when PP > 1
- Parallelism resizing (TP/PP) often has significant throughput costs
- The theoretical estimator is formula-based and does not replace runtime
  profiling or CUDA memory reports

## Verification

Quick check that `expandable_segments:True` is active:

```python
import os
assert "expandable_segments:True" in os.environ.get("PYTORCH_CUDA_ALLOC_CONF", "")
```

For Slurm jobs, verify the env var is exported before the training command
in the launch script.

Source

Creator's repository · nvidia/skills

View on GitHub ↗

License: Apache-2.0

Security

Security checks in progress

Results will appear here once audits complete

What this skill can do

Reads your filesConnects to the internetRuns code on your machine

Checked by 3 independent security firms

Does it try to trick the AI?Not yet checkedPending · Gen Agent Trust Hub

Does it sneak in hidden code?Not yet checkedPending · Socket

Does it have known bugs?Not yet checkedPending · Snyk