FP8 Memory Leak Investigation & Reproduction Guide
1. Executive Summary
We identified that standard FP8 RL suffers from severe memory growth caused by uncleared FP8 scaling factors and workspace buffers. Manually clearing these caches fixes the leak, but leaves behind memory fragmentation (high reserved memory). Enabling expandable_segments:True resolves the fragmentation but introduces a ~33% performance penalty.
2. Experiment Results (8-Node Qwen 30B)
We compared four configurations to isolate the issue. The "Memory after refit" snapshot is the critical metric, representing the "clean slate" memory usage between training steps (a sketch of how to capture it follows the table).
| Configuration | Time/Step | Memory (Allocated) | Memory (Reserved) | Status |
|---|---|---|---|---|
| BF16 (Baseline) | ~95s | ~0.5 GB | ~5.2 GB | ✅ Gold standard |
| FP8 (Default) | ~87s | ~8.9 GB | ~26.0 GB | ❌ Huge leak (both high) |
| FP8 + ClearCaches | ~91s | ~0.6 GB | ~15.5 GB* | ⚠️ Leak fixed, heavy fragmentation |
| FP8 + Clear + Expandable (defragmentation flag) | ~127s | ~1.1 GB | ~3.0 GB | ⚠️ Memory fixed, ~33% slower |
*Note: In the ClearCaches run, some ranks showed ~3GB reserved while others showed ~15.5GB, indicating severe non-deterministic memory fragmentation.
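For reference, the per-rank numbers above come from a memory snapshot taken right after refit. A minimal sketch of such a snapshot using PyTorch's built-in memory counters (the helper name is ours; the tag mirrors the "GPU Memory after refit complete" log line grepped for in section 5):

```python
import torch

def log_memory_snapshot(tag: str = "GPU Memory after refit complete") -> None:
    """Print allocated vs. reserved CUDA memory on this rank (minimal sketch)."""
    allocated_gb = torch.cuda.memory_allocated() / 2**30  # bytes held by live tensors
    reserved_gb = torch.cuda.memory_reserved() / 2**30    # bytes held by the caching allocator
    print(f"{tag}: allocated={allocated_gb:.2f} GB, reserved={reserved_gb:.2f} GB")
```

The gap between the two counters is what distinguishes a true leak (both high) from fragmentation (allocated low, reserved high).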
3. The Issue: Huge Memory Overhead
In the standard FP8 implementation, we observe massive memory usage compared to BF16. This is driven by three factors (a clean-up sketch follows the list):
- FP8 Scaling Factors: Accumulating int16 tensors that aren't cleared after use. It remains unclear why these account for an additional ~8 GB allocated and ~20 GB reserved.
- Workspace Buffers: Temporary buffers for FP8 operations that persist across steps.
- Fragmentation: Small, frequent allocations cause the CUDA allocator to reserve large blocks of memory that cannot be efficiently reused, leading to OOMs even when "allocated" memory is low.
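As a rough illustration of what a fix like FORCE_CLEAR_FP8_CACHES has to do, the sketch below walks a model's modules, drops any cached FP8 scaling/workspace tensors, and returns the freed blocks to the allocator. The attribute names (`fp8_scale_cache`, `fp8_workspace`) are hypothetical placeholders, not the actual NeMo-RL/Transformer-Engine internals:

```python
import gc
import torch

def force_clear_fp8_caches(model: torch.nn.Module) -> None:
    """Minimal sketch: drop per-module FP8 caches and release cached blocks.

    `fp8_scale_cache` / `fp8_workspace` are hypothetical attribute names;
    the real framework keeps these in its own internal structures.
    """
    for module in model.modules():
        for attr in ("fp8_scale_cache", "fp8_workspace"):
            if hasattr(module, attr):
                setattr(module, attr, None)  # drop the Python reference
    gc.collect()               # collect any reference cycles still holding tensors
    torch.cuda.synchronize()   # ensure pending kernels no longer use the buffers
    torch.cuda.empty_cache()   # return fully-free cached segments to the driver
```

Note that `empty_cache()` can only release segments that are completely free; fragmented segments stay reserved, which is exactly the ~15.5 GB plateau seen in configuration B below.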
4. Detailed Analysis
A. FP8 (Default)
- Observation: Both allocated and reserved memory are massive and grow over time.
- Root Cause: FP8 workspaces and scaling factors are not being cleaned up by the framework.
B. FP8 + forceClearFP8Caches
- Observation: Allocated memory drops to near baseline (~0.6 GB), proving the objects are successfully freed. However, reserved memory remains dangerously high (~15.5 GB).
- Root Cause: The CUDA allocator cannot release the fragmented memory blocks back to the pool. The memory is "free" but trapped in unusable chunks (see the standalone demo below).
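The allocated-vs-reserved gap can be reproduced in isolation, independent of FP8 or NeMo-RL. In this tiny PyTorch demo (requires a CUDA device), freeing many small tensors drops allocated memory to near zero while the caching allocator keeps the segments reserved, and even `empty_cache()` cannot release a segment pinned by a single surviving tensor:

```python
import torch

def mb(nbytes: int) -> float:
    return nbytes / 2**20

# Allocate 256 small (1 MiB) CUDA tensors, then free all but one.
tensors = [torch.empty(1 << 18, device="cuda") for _ in range(256)]
survivor = tensors[128]  # one live tensor pins the middle of a cached segment
del tensors
print(f"allocated={mb(torch.cuda.memory_allocated()):.0f} MiB, "
      f"reserved={mb(torch.cuda.memory_reserved()):.0f} MiB")  # allocated low, reserved high

torch.cuda.empty_cache()  # releases only segments that are completely free
print(f"after empty_cache: allocated={mb(torch.cuda.memory_allocated()):.0f} MiB, "
      f"reserved={mb(torch.cuda.memory_reserved()):.0f} MiB")  # survivor's segment stays reserved
```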
C. FP8 + forceClearFP8Caches + expandable_segments:True
- Observation: The fragmentation is resolved. Reserved memory drops to ~3 GB and remains stable.
- Trade-off: ~33% slowdown (95s → 127s per step).
- Hypothesis: The performance hit comes from the overhead of the allocator managing expandable segments during the high-frequency allocations typical of FP8 training.
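Enabling the flag is a one-line environment change. Since PYTORCH_CUDA_ALLOC_CONF is read when the CUDA caching allocator initializes, it must be set before the first CUDA allocation; a minimal sketch of doing this from Python (setting it in the job script works equally well):

```python
import os

# Must be set before the first CUDA tensor is allocated; the allocator
# reads PYTORCH_CUDA_ALLOC_CONF once, at initialization.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch

x = torch.empty(1 << 20, device="cuda")  # first allocation now uses expandable segments
print(torch.cuda.memory_reserved())
```

Expandable segments let the allocator grow existing virtual-memory mappings instead of reserving new fixed-size blocks, which removes fragmentation at the cost of extra bookkeeping per allocation, the suspected source of the ~33% slowdown.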
5. Reproduction Guide
Use the script local_qwen_fp8_8node.sh to reproduce these configurations. It is designed to be simple and switchable via environment variables.
- Branch: https://github.com/NVIDIA-NeMo/RL/blob/zhiyul/fp_memory_cleanup (close to r0.5.0)
```bash
# For interactive mode, use ray.sub
CONTAINER=/lustre/fsw/portfolios/coreai/users/zhiyul/enroot-images/nvcr.io#nvidia/nemo-rl:v0.5.0.squashfs \
MOUNTS="$PWD:$PWD" \
sbatch \
  --nodes=8 \
  --account=YOUR_ACCOUNT \
  --job-name=YOUR_JOBNAME \
  --partition=YOUR_PARTITION \
  --time=4:0:0 \
  --gres=gpu:8 \
  ray.sub

# Then run the following command inside the container:
bash local_qwen_fp8_8node.sh
```
Usage Commands
1. Baseline BF16 (Fast, Low Memory)
```bash
bash local_qwen_fp8_8node.sh
```
2. FP8 Default (Fast, High Memory Leak)
```bash
FP8=true bash local_qwen_fp8_8node.sh
```
3. FP8 + Fix Leak (Fast, High Fragmentation)
```bash
FP8=true FORCE_CLEAR_FP8_CACHES=true bash local_qwen_fp8_8node.sh
```
4. FP8 + Fix Leak + Fix Fragmentation (Slow, Low Memory)
```bash
FP8=true FORCE_CLEAR_FP8_CACHES=true EXPANDABLE=true bash local_qwen_fp8_8node.sh
```
How to Verify
Search the logs for the "Refit Complete" memory snapshot. This line appears after model weights are reloaded but before the next training step begins.
grep "GPU Memory after refit complete" logs_fp_memory/your_log_file6. Verification
The issue would be considered fully resolved once FP8 matches BF16's close-to-zero allocated memory (after refit) and low reserved memory with no extra speed cost.