FP8 Memory Leak Investigation & Reproduction Guide
1. Executive Summary
We identified that standard FP8 RL suffers from severe memory growth caused by uncleared FP8 scaling factors and workspace buffers. Manually clearing these caches fixes the leak, but leaves behind memory fragmentation (high reserved memory). Enabling expandable_segments:True resolves the fragmentation but introduces a ~33% performance penalty.
2. Experiment Results (8-Node Qwen 30B)
We compared four configurations to isolate the issue. The "Memory after refit" snapshot is the critical metric, representing the "clean slate" memory usage between training steps (a sketch of how to capture it follows the table).
| Configuration | Time/Step | Memory (Allocated) | Memory (Reserved) | Status |
|---|---|---|---|---|
| BF16 (Baseline) | ~95s | ~0.5 GB | ~5.2 GB | ✅ Gold standard |
| FP8 (Default) | ~87s | ~8.9 GB | ~26.0 GB | ❌ Huge leak (both high) |
| FP8 + ClearCaches | ~91s | ~0.6 GB | ~15.5 GB* | ⚠️ Leak fixed, heavy fragmentation |
| FP8 + Clear + Expandable (defragmentation flag) | ~127s | ~1.1 GB | ~3.0 GB | ⚠️ Memory fixed, ~33% slower |
*Note: In the ClearCaches run, some ranks showed ~3GB reserved while others showed ~15.5GB, indicating severe non-deterministic memory fragmentation.
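For reference, the per-rank numbers above come from a memory snapshot taken right after refit. A minimal sketch of such a snapshot using PyTorch's built-in memory counters (the helper name is ours; the tag mirrors the "GPU Memory after refit complete" log line grepped for in section 5):

```python
import torch

def log_memory_snapshot(tag: str = "GPU Memory after refit complete") -> None:
    """Print allocated vs. reserved CUDA memory on this rank (minimal sketch)."""
    allocated_gb = torch.cuda.memory_allocated() / 2**30  # bytes held by live tensors
    reserved_gb = torch.cuda.memory_reserved() / 2**30    # bytes held by the caching allocator
    print(f"{tag}: allocated={allocated_gb:.2f} GB, reserved={reserved_gb:.2f} GB")
```

The gap between the two counters is what distinguishes a true leak (both high) from fragmentation (allocated low, reserved high).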
3. The Issue: Huge Memory Overhead
In the standard FP8 implementation, we observe massive memory usage compared to BF16. This is driven by three factors (a clean-up sketch follows the list):
- FP8 Scaling Factors: Accumulating int16 tensors that aren't cleared after use. It remains unclear why these account for an additional ~8 GB allocated and ~20 GB reserved.
- Workspace Buffers: Temporary buffers for FP8 operations that persist across steps.
- Fragmentation: Small, frequent allocations cause the CUDA allocator to reserve large blocks of memory that cannot be efficiently reused, leading to OOMs even when "allocated" memory is low.
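As a rough illustration of what a fix like FORCE_CLEAR_FP8_CACHES has to do, the sketch below walks a model's modules, drops any cached FP8 scaling/workspace tensors, and returns the freed blocks to the allocator. The attribute names (`fp8_scale_cache`, `fp8_workspace`) are hypothetical placeholders, not the actual NeMo-RL/Transformer-Engine internals:

```python
import gc
import torch

def force_clear_fp8_caches(model: torch.nn.Module) -> None:
    """Minimal sketch: drop per-module FP8 caches and release cached blocks.

    `fp8_scale_cache` / `fp8_workspace` are hypothetical attribute names;
    the real framework keeps these in its own internal structures.
    """
    for module in model.modules():
        for attr in ("fp8_scale_cache", "fp8_workspace"):
            if hasattr(module, attr):
                setattr(module, attr, None)  # drop the Python reference
    gc.collect()               # collect any reference cycles still holding tensors
    torch.cuda.synchronize()   # ensure pending kernels no longer use the buffers
    torch.cuda.empty_cache()   # return fully-free cached segments to the driver
```

Note that `empty_cache()` can only release segments that are completely free; fragmented segments stay reserved, which is exactly the ~15.5 GB plateau seen in configuration B below.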
4. Detailed Analysis
A. FP8 (Default)
- Observation: Both allocated and reserved memory are massive and grow over time.
- Root Cause: FP8 workspaces and scaling factors are not being cleaned up by the framework.
B. FP8 + forceClearFP8Caches
- Observation: Allocated memory drops to near baseline (~0.6 GB), proving the objects are successfully freed. However, reserved memory remains dangerously high (~15.5 GB).
- Root Cause: The CUDA allocator cannot release the fragmented memory blocks back to the pool. The memory is "free" but trapped in unusable chunks (see the standalone demo below).
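The allocated-vs-reserved gap can be reproduced in isolation, independent of FP8 or NeMo-RL. In this tiny PyTorch demo (requires a CUDA device), freeing many small tensors drops allocated memory to near zero while the caching allocator keeps the segments reserved, and even `empty_cache()` cannot release a segment pinned by a single surviving tensor:

```python
import torch

def mb(nbytes: int) -> float:
    return nbytes / 2**20

# Allocate 256 small (1 MiB) CUDA tensors, then free all but one.
tensors = [torch.empty(1 << 18, device="cuda") for _ in range(256)]
survivor = tensors[128]  # one live tensor pins the middle of a cached segment
del tensors
print(f"allocated={mb(torch.cuda.memory_allocated()):.0f} MiB, "
      f"reserved={mb(torch.cuda.memory_reserved()):.0f} MiB")  # allocated low, reserved high

torch.cuda.empty_cache()  # releases only segments that are completely free
print(f"after empty_cache: allocated={mb(torch.cuda.memory_allocated()):.0f} MiB, "
      f"reserved={mb(torch.cuda.memory_reserved()):.0f} MiB")  # survivor's segment stays reserved
```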
C. FP8 + forceClearFP8Caches + expandable_segments:True
- Observation: The fragmentation is resolved. Reserved memory drops to ~3 GB and remains stable.
- Trade-off: ~33% slowdown (95s → 127s per step).
- Hypothesis: The performance hit comes from the overhead of the allocator managing expandable segments during the high-frequency allocations typical of FP8 training.
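Enabling the flag is a one-line environment change. Since PYTORCH_CUDA_ALLOC_CONF is read when the CUDA caching allocator initializes, it must be set before the first CUDA allocation; a minimal sketch of doing this from Python (setting it in the job script works equally well):

```python
import os

# Must be set before the first CUDA tensor is allocated; the allocator
# reads PYTORCH_CUDA_ALLOC_CONF once, at initialization.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch

x = torch.empty(1 << 20, device="cuda")  # first allocation now uses expandable segments
print(torch.cuda.memory_reserved())
```

Expandable segments let the allocator grow existing virtual-memory mappings instead of reserving new fixed-size blocks, which removes fragmentation at the cost of extra bookkeeping per allocation, the suspected source of the ~33% slowdown.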
5. Reproduction Guide
Use the script local_qwen_fp8_8node.sh to reproduce these configurations. It is designed to be simple and switchable via environment variables.
- Branch: https://github.com/NVIDIA-NeMo/RL/blob/zhiyul/fp_memory_cleanup (close to r0.5.0)
```bash
# For interactive mode, use ray.sub
CONTAINER=/lustre/fsw/portfolios/coreai/users/zhiyul/enroot-images/nvcr.io#nvidia/nemo-rl:v0.5.0.squashfs \
MOUNTS="$PWD:$PWD" \
sbatch \
  --nodes=8 \
  --account=YOUR_ACCOUNT \
  --job-name=YOUR_JOBNAME \
  --partition=YOUR_PARTITION \
  --time=4:0:0 \
  --gres=gpu:8 \
  ray.sub

# Then run the following command inside the container:
bash local_qwen_fp8_8node.sh
```
Usage Commands
1. Baseline BF16 (Fast, Low Memory)
```bash
bash local_qwen_fp8_8node.sh
```
2. FP8 Default (Fast, High Memory Leak)
```bash
FP8=true bash local_qwen_fp8_8node.sh
```
3. FP8 + Fix Leak (Fast, High Fragmentation)
```bash
FP8=true FORCE_CLEAR_FP8_CACHES=true bash local_qwen_fp8_8node.sh
```
4. FP8 + Fix Leak + Fix Fragmentation (Slow, Low Memory)
```bash
FP8=true FORCE_CLEAR_FP8_CACHES=true EXPANDABLE=true bash local_qwen_fp8_8node.sh
```
How to Verify
Search the logs for the "Refit Complete" memory snapshot. This line appears after model weights are reloaded but before the next training step begins.
grep "GPU Memory after refit complete" logs_fp_memory/your_log_file6. Verification
The issue would be considered fully resolved once FP8 matches BF16's close-to-zero allocated memory (after refit) and low reserved memory with no extra speed cost.