
[FP8] Memory Leak and Fragmentation Investigation #2003

@ZhiyuLi-Nvidia

FP8 Memory Leak Investigation & Reproduction Guide

1. Executive Summary

We identified that standard FP8 RL training suffers from large memory growth due to uncleared FP8 scaling factors and workspace buffers. Manually clearing these caches fixes the leak, but leaves severe memory fragmentation (low allocated, high reserved memory). Enabling expandable_segments:True solves the fragmentation but introduces a ~33% performance penalty.

2. Experiment Results (8-Node Qwen 30B)

We compared 4 configurations to isolate the issue. The "Memory after refit" snapshot is the critical metric, representing the "clean slate" memory usage between training steps.

| Configuration | Time/Step | Memory (Allocated) | Memory (Reserved) | Status |
|---|---|---|---|---|
| BF16 (Baseline) | ~95s | ~0.5 GB | ~5.2 GB | Gold standard |
| FP8 (Default) | ~87s | ~8.9 GB | ~26.0 GB | Huge leak (both high) |
| FP8 + ClearCaches | ~91s | ~0.6 GB | ~15.5 GB* | ⚠️ Fragmentation (allocated low, reserved high) |
| FP8 + Clear + Expandable (de-fragmentation flag) | ~127s | ~1.1 GB | ~3.0 GB | ⚠️ Leak fixed, but ~33% slower |

*Note: In the ClearCaches run, some ranks showed ~3GB reserved while others showed ~15.5GB, indicating severe non-deterministic memory fragmentation.

3. The Issue: Huge Memory Overhead

In the standard FP8 implementation, we observe massive memory usage compared to BF16. This is driven by three factors:

  1. FP8 Scaling Factors: Accumulated int16 tensors that are not cleared after use. It remains unclear why these account for an additional ~8 GB allocated and ~20 GB reserved.
  2. Workspace Buffers: Temporary buffers for FP8 operations that persist across steps.
  3. Fragmentation: Small, frequent allocations cause the CUDA allocator to reserve large blocks of memory that cannot be efficiently reused, leading to OOMs even when "allocated" memory is low.
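The fragmentation failure mode in item 3 can be made concrete with a toy model. This is a simplifying sketch in pure Python, not PyTorch's actual CUDA caching allocator: the fixed 20 MB segment size and first-fit placement are assumptions for illustration only. The key property it shares with a real caching allocator is that a reserved segment can only be returned to the device once *every* allocation inside it has been freed.

```python
# Toy model of a caching allocator (simplified sketch, NOT PyTorch's real
# allocator). Segments are reserved whole; a segment is releasable only
# when every allocation carved out of it has been freed.

class ToyCachingAllocator:
    def __init__(self, segment_mb=20):
        self.segment_mb = segment_mb
        self.segments = []   # one dict per segment: {handle_id: size_mb}
        self._next_id = 0

    def alloc(self, size_mb):
        # First-fit into an existing segment, else reserve a new one.
        for seg_idx, seg in enumerate(self.segments):
            if sum(seg.values()) + size_mb <= self.segment_mb:
                break
        else:
            self.segments.append({})
            seg_idx = len(self.segments) - 1
        handle = (seg_idx, self._next_id)
        self.segments[seg_idx][self._next_id] = size_mb
        self._next_id += 1
        return handle

    def free(self, handle):
        seg_idx, handle_id = handle
        del self.segments[seg_idx][handle_id]

    @property
    def allocated_mb(self):  # memory held by live allocations
        return sum(sum(seg.values()) for seg in self.segments)

    @property
    def reserved_mb(self):   # non-empty segments cannot be released
        return sum(self.segment_mb for seg in self.segments if seg)


alloc = ToyCachingAllocator()
handles = [alloc.alloc(1) for _ in range(100)]  # 100 x 1 MB -> 5 segments

# Free everything except one small allocation per segment. Each segment
# now holds a single live 1 MB block, so no segment can be released.
survivors = {}
for h in handles:
    survivors.setdefault(h[0], h)
for h in handles:
    if survivors[h[0]] != h:
        alloc.free(h)

print(alloc.allocated_mb)  # 5   -> "allocated" looks near-zero
print(alloc.reserved_mb)   # 100 -> "reserved" stays high: fragmentation
```

This is the same shape as the ClearCaches row in the results above: allocated drops to near baseline while reserved stays an order of magnitude higher.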

4. Detailed Analysis

A. FP8 (Default)

  • Observation: Both allocated and reserved memory are massive and grow over time.
  • Root Cause: FP8 workspaces and scaling factors are not being cleaned up by the framework.
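A minimal, framework-free way to flag this growth pattern is to check whether the post-refit allocated number increases at every step. The helper below is hypothetical; in a real run the per-step values would come from `torch.cuda.memory_allocated()` snapshots taken after each refit, but they are passed in explicitly here so the sketch stays self-contained.

```python
# Hypothetical leak check: flag allocated memory (in GB) that grows
# monotonically across training steps, as seen in the FP8 default run.

def leaks(allocated_per_step, min_steps=4, tol_gb=0.0):
    """Return True if allocated memory grew at every recorded step."""
    if len(allocated_per_step) < min_steps:
        return False  # too few samples to call it a leak
    return all(b - a > tol_gb
               for a, b in zip(allocated_per_step, allocated_per_step[1:]))

# FP8 (default): allocated climbs every step -> leak
print(leaks([1.0, 2.1, 3.3, 4.4, 5.6]))   # True
# BF16 baseline: allocated returns to ~0.5 GB after each refit -> no leak
print(leaks([0.5, 0.6, 0.5, 0.6, 0.5]))   # False
```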

B. FP8 + forceClearFP8Caches

  • Observation: Allocated memory drops to near baseline (~0.6GB), proving the objects are successfully freed. However, Reserved memory remains dangerously high (~15.5GB).
  • Root Cause: The CUDA allocator cannot release the fragmented memory blocks back to the pool. The memory is "free" but trapped in unusable chunks.

C. FP8 + forceClearFP8Caches + expandable_segments:True

  • Observation: Fragmentation is resolved. Reserved memory drops to ~3GB and stays stable across steps.
  • Trade-off: ~33% slowdown (95s → 127s per step).
  • Hypothesis: The performance hit comes from the overhead of the allocator managing expandable segments during the high-frequency allocations typical of FP8 training.
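For reference, this configuration uses PyTorch's allocator setting, passed via the `PYTORCH_CUDA_ALLOC_CONF` environment variable. One caveat worth verifying in your launcher: the variable must be set before the process performs its first CUDA allocation, and with Ray-style multi-process training it has to reach every worker's environment, not just the driver.

```python
import os

# Must be set before the first CUDA allocation in each process;
# the allocator reads it once at initialization.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
```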

5. Reproduction Guide

Use the script local_qwen_fp8_8node.sh to reproduce these configurations. It is designed to be simple and switchable via environment variables.

```shell
# interactive mode: use ray.sub
CONTAINER=/lustre/fsw/portfolios/coreai/users/zhiyul/enroot-images/nvcr.io#nvidia/nemo-rl:v0.5.0.squashfs \
MOUNTS="$PWD:$PWD" \
sbatch \
    --nodes=8 \
    --account=YOUR_ACCOUNT \
    --job-name=YOUR_JOBNAME \
    --partition=YOUR_PARTITION \
    --time=4:0:0 \
    --gres=gpu:8 \
    ray.sub
# then, inside the container:
bash local_qwen_fp8_8node.sh
```

Usage Commands

1. Baseline BF16 (Fast, Low Memory)

```shell
bash local_qwen_fp8_8node.sh
```

2. FP8 Default (Fast, High Memory Leak)

```shell
FP8=true bash local_qwen_fp8_8node.sh
```

3. FP8 + Fix Leak (Fast, High Fragmentation)

```shell
FP8=true FORCE_CLEAR_FP8_CACHES=true bash local_qwen_fp8_8node.sh
```

4. FP8 + Fix Leak + Fix Fragmentation (Slow, Low Memory)

```shell
FP8=true FORCE_CLEAR_FP8_CACHES=true EXPANDABLE=true bash local_qwen_fp8_8node.sh
```

How to Verify

Search the logs for the "Refit Complete" memory snapshot. This line appears after model weights are reloaded but before the next training step begins.

```shell
grep "GPU Memory after refit complete" logs_fp_memory/your_log_file
```
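To compare runs programmatically, the matching log lines can be parsed into (allocated, reserved) pairs. The exact log format below is an assumption for illustration; adjust the regex to whatever your actual snapshot lines look like.

```python
# Hypothetical parser for the refit memory snapshot lines; the log format
# is assumed, not confirmed from the actual training logs.
import re

LINE_RE = re.compile(
    r"GPU Memory after refit complete.*?"
    r"allocated[:=]\s*([\d.]+)\s*GB.*?reserved[:=]\s*([\d.]+)\s*GB",
    re.IGNORECASE,
)

def parse_refit_snapshots(lines):
    """Return (allocated_gb, reserved_gb) for each matching log line."""
    out = []
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            out.append((float(m.group(1)), float(m.group(2))))
    return out

sample = [
    "step 3 | GPU Memory after refit complete: allocated: 0.6 GB, reserved: 15.5 GB",
    "unrelated log line",
]
print(parse_refit_snapshots(sample))  # [(0.6, 15.5)]
```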

6. Verification

The issue is considered fully resolved once FP8 reaches close-to-zero allocated memory (after refit) and similarly low reserved memory, with no extra per-step time cost.
