
Improving memory requirements on AMDGPU #885

Merged
luraess merged 9 commits into JuliaGPU:master from neoblizz:neoblizz/memory-improvements
Mar 4, 2026

Conversation

@neoblizz
Contributor

@neoblizz neoblizz commented Feb 23, 2026

@luraess reviews would be appreciated!

  • Sets eager_gc to true by default.
  • The maybe_collect() function proactively triggers GC.gc(false) when GPU memory pressure exceeds ~75%, preventing excessive pool growth.
  • Allocation statistics and pool tracking.
  • Background pool cleanup every 60 seconds.
  • A reclaim() function that trims the pool and returns the number of bytes reclaimed.
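As a rough sketch of the eager-collection heuristic described above: the ~75% threshold and the `GC.gc(false)` call come from this description, but the names `EAGER_GC`, `PRESSURE_THRESHOLD`, and the function body below are illustrative, not AMDGPU.jl's actual implementation.

```julia
# Illustrative sketch of the eager-GC heuristic described in this PR.
# (Names and structure are hypothetical; the 75% threshold is from the
# PR description.)

const EAGER_GC = Ref(true)        # eager_gc defaults to true in this PR
const PRESSURE_THRESHOLD = 0.75   # trigger GC above ~75% memory pressure

# In practice `used` and `total` would come from a HIP memory-info query.
function maybe_collect(used::Integer, total::Integer)
    EAGER_GC[] || return false
    if used / total > PRESSURE_THRESHOLD
        GC.gc(false)              # incremental collection, not a full sweep
        return true
    end
    return false
end

maybe_collect(50, 100)   # 50% pressure: below threshold, no collection
maybe_collect(80, 100)   # 80% pressure: collects
```

Below the threshold the pool is allowed to grow; above it, an incremental collection runs so freed GPU buffers can be returned before the pool balloons.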

Testing Plan & Summary

Ran all tests locally on gfx942 (MI300X):

  • 16,717 passed
  • 13 failed
  • 13 errors
  • 30 broken
  • 16,773 total (5m 42.5s)

| Test | Failure | Reason |
| --- | --- | --- |
| wmma_tests | ArgumentError: invalid base 10 digit ':' in "942:sramecc+:xnack-" | Pre-existing bug parsing the gfx942 architecture string (fixed in 93fe6c6) |
| hip_rocarray/solver | dA \ dB ≈ Af \ Bf -- numerical inaccuracy in linear solvers | Unrelated rocSOLVER numerical precision issue (produces wildly wrong results, e.g. 1e7 vs. 0.05) |
| gpuarrays/broadcasting | Float16 broadcast comparison failure | Should also be fixed by 93fe6c6 |

Disclaimer: Parts of this PR are developed using claude-4.6-opus AI.

@neoblizz
Contributor Author

neoblizz commented Feb 23, 2026

Reran it after the last fix:

  • 1 failed -- the single failure is the broadcast Float16 precision issue at GPUArrays/test/testsuite/broadcasting.jl:131 (pre-existing)
  • 12 failed, 12 errors -- all in Float16 \ matrix division (rocSOLVER numerical precision issue, pre-existing)

@gbaraldi
Member

Can you split the arch thing into a separate PR?

@gbaraldi
Member

I guess the main question is: what is motivating this? Changing the whole memory allocation logic should have some benchmarks/motivation associated with the PR.
I don't want to discourage work on this BTW.

Also, FYI: if this was written with AI help, please do mention it. We have nothing against AI as long as the user is in the loop, but disclosure is needed.

@luraess
Member

luraess commented Feb 23, 2026

Thanks for looking into memory allocation and management challenges.

Given this has possibly critical impacts in terms of reliability and performance of the software, I am thankful that other JuliaGPU devs are looking into it as well.

I'd second @gbaraldi's suggestions and would also like to see some benchmarks in order to assess the situation. Running GC too often may have significant performance impacts, and not all workflows or applications may actually benefit from e.g. having eager_gc enabled.

@neoblizz
Contributor Author

@luraess thanks!
@gbaraldi Thank you for the review! Definitely not discouraged.

I guess the main question is: what is motivating this? Changing the whole memory allocation logic should have some benchmarks/motivation associated with the PR.

I understand this is a big change. I was running into situations where the same code used far more memory on AMD than on CUDA -- it ran out of memory on an AMD GPU with much more memory than a CUDA GPU that handled it fine. After a good bit of investigation I narrowed it down to differences in the GC behavior and in how memory is reclaimed in CUDA.jl vs. AMDGPU.jl. I can provide some motivating data associated with this change soon.

Also, FYI: if this was written with AI help, please do mention it. We have nothing against AI as long as the user is in the loop, but disclosure is needed.

Understood -- will add a disclosure. It's not entirely AI-written; I did use it to help me understand the differences between CUDA.jl's implementation and this one.

@neoblizz
Contributor Author

Can you split the arch thing into a separate PR?

Got it, will create a separate PR.

@neoblizz
Contributor Author

Can you split the arch thing into a separate PR?

#886

@neoblizz
Contributor Author

@gbaraldi @luraess I am running more tests, but I created a PR into my PR that showcases a very solid use case for this change: neoblizz#1

| Eager GC | Peak Used | % of Free | Completed |
| --- | --- | --- | --- |
| enabled | 106.125 GiB | 55% | 267/267 |
| disabled | 190.739 GiB | 99.7% (OOM) | 133/267 |
  • Array size: 489.516 MiB
  • Free GPU: 191.225 GiB
  • Total iterations: 267
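The gap between the two rows can be illustrated with a toy model of pool growth. This is a sketch under stated assumptions: the three-allocations-per-iteration figure, the 75% trigger, and the reset-to-a-small-working-set policy are guesses for illustration; the real test lives in neoblizz#1 and runs on an actual GPU, so the eager-GC peak here will not match the table exactly.

```julia
# Toy model of the benchmark above: 267 iterations, each assumed to
# allocate roughly three ~489.5 MiB arrays (temporaries included)
# against ~191.2 GiB of free device memory. Pure arithmetic, no GPU.
const ARRAY_BYTES = round(Int, 489.516 * 2^20)   # ~489.5 MiB per array
const FREE_BYTES  = round(Int, 191.225 * 2^30)   # free device memory

function simulate(eager::Bool; iters = 267, allocs_per_iter = 3, live = 64)
    pool = 0; peak = 0; done = 0
    for _ in 1:iters
        pool += allocs_per_iter * ARRAY_BYTES
        pool > FREE_BYTES && break          # OOM: allocation fails
        done += 1
        peak = max(peak, pool)
        # Above ~75% pressure, eager GC frees everything except a small
        # live working set, mirroring the maybe_collect() heuristic.
        if eager && pool / FREE_BYTES > 0.75
            pool = live * ARRAY_BYTES
        end
    end
    return (peak_gib = peak / 2^30, completed = done)
end

simulate(false)   # OOMs partway through, peak near 100% of free memory
simulate(true)    # completes all 267 iterations at a lower peak
```

Without eager GC the pool grows monotonically until allocation fails; with it, pressure-triggered collections keep the peak well below the device limit.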

@gbaraldi
Member

That looks great!

@neoblizz
Contributor Author

neoblizz commented Feb 25, 2026

@gbaraldi I found a separate issue where the pool isn't actually being used... 😄 Should I make additional fixes (memory leak and actually using the pool) in a separate PR?

https://github.com/JuliaGPU/AMDGPU.jl/blob/master/src/runtime/memory/hip.jl#L49-L50

(cc @luraess )

@gbaraldi
Member

That seems related enough :)

@neoblizz neoblizz force-pushed the neoblizz/memory-improvements branch from 18127d0 to 9f84735 Compare February 25, 2026 02:14
@neoblizz neoblizz requested a review from gbaraldi February 25, 2026 15:53
@neoblizz
Contributor Author

Requesting a new review because of the changes to HostBuffer (fixing a memory leak: it was unpinning all types of allocations, even those that needed to be properly freed) and the use of the pool. @gbaraldi
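A hypothetical sketch of the distinction behind that fix: on release, memory the buffer owns must be freed, while pre-existing host memory that was merely pinned should only be unpinned. Everything below is illustrative -- the type, field, and function names are not AMDGPU.jl's actual API, and the two stubs stand in for hipHostFree/hipHostUnregister-style calls.

```julia
# Stubs standing in for HIP runtime calls, so the sketch is runnable.
const FREED = Ref(0)
const UNPINNED = Ref(0)
hip_host_free(ptr) = (FREED[] += 1; nothing)        # hipHostFree stand-in
hip_host_unregister(ptr) = (UNPINNED[] += 1; nothing)  # hipHostUnregister stand-in

mutable struct HostBufferSketch
    ptr::Ptr{Cvoid}
    own::Bool      # we allocated this memory and must free it
    pinned::Bool   # we pinned pre-existing host memory
end

function release!(buf::HostBufferSketch)
    if buf.own
        hip_host_free(buf.ptr)         # owned allocations must be freed,
    elseif buf.pinned
        hip_host_unregister(buf.ptr)   # pinned-only memory is just unpinned
    end
    buf.ptr = C_NULL
    return nothing
end

release!(HostBufferSketch(C_NULL, true, false))   # owned buffer
release!(HostBufferSketch(C_NULL, false, true))   # pinned-only buffer
```

Unconditionally taking the unpin path for both kinds of buffer would leave owned allocations never freed, which is the leak described above.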

@neoblizz
Contributor Author

@gbaraldi just checking if I can get a review again, appreciate the time!

@gbaraldi
Member

LGTM. Though all of this makes me think we may want to have a package to share this code with CUDA.jl somewhere, given HIP seems to follow the CUDA APIs pretty closely @maleadt @vchuravy

@neoblizz
Contributor Author

LGTM. Though all of this makes me think we may want to have a package to share this code with CUDA.jl somewhere, given HIP seems to follow the CUDA APIs pretty closely @maleadt @vchuravy

I agree -- in fact, most of the cuda* APIs have hip* equivalents. Something to consider as a major refactor in the future. Also, please merge if it's good, and thank you for the reviews! 😊

@luraess
Member

luraess commented Feb 28, 2026

Is there a need for updating anything in the docs wrt this PR (e.g. in https://amdgpu.juliagpu.org/dev/api/memory)?

Also, are the new additions causing any API changes that would require a major release?

Besides the pool statistics benchmark results report (which looks great), I do not see the test from neoblizz#1 actually merged in this PR branch. Do I overlook something? Also, besides the % of free, do you have any reports on the timing with the new memory management approach versus previous?

@neoblizz
Contributor Author

Is there a need for updating anything in the docs wrt this PR (e.g. in https://amdgpu.juliagpu.org/dev/api/memory)?

Also, are the new additions causing any API changes that would require a major release?

Besides the pool statistics benchmark results report (which looks great), I do not see the test from neoblizz#1 actually merged in this PR branch. Do I overlook something? Also, besides the % of free, do you have any reports on the timing with the new memory management approach versus previous?

I have the test in the other branch; I will create a separate PR for that.

I did some testing on a Quantum Error Correction (QEC) workload -- there's no significant difference in runtime. If there are workloads you'd like me to run, I can run them to make sure this holds true. Let me share raw latency results for this version and the one on master for QEC.

@neoblizz
Contributor Author

Is there a need for updating anything in the docs wrt this PR (e.g. in https://amdgpu.juliagpu.org/dev/api/memory)?

Going through the docs now!

@neoblizz
Contributor Author

neoblizz commented Mar 3, 2026

Measuring the runtime diffs vs. master for some of the tests. Effectively no change on an AMD MI300X GPU.

| Test | memory-improvements (s) | master (s) | Delta (s) | Delta (%) |
| --- | --- | --- | --- | --- |
| gpuarrays/reductions/minimum maximum extrema | 210.04 | 211.45 | -1.41 | -0.7% |
| hip_rocarray/blas | 185.75 | 184.47 | +1.28 | +0.7% |
| gpuarrays/linalg/norm | 183.70 | 182.08 | +1.62 | +0.9% |
| gpuarrays/reductions/sum prod | 171.39 | 173.80 | -2.41 | -1.4% |
| gpuarrays/linalg/kron | 159.97 | 159.21 | +0.76 | +0.5% |
| gpuarrays/linalg/core | 148.97 | 147.94 | +1.03 | +0.7% |
| gpuarrays/reductions/mapreduce | 138.75 | 138.32 | +0.43 | +0.3% |
| gpuarrays/reductions/mapreducedim! | 135.21 | 137.19 | -1.98 | -1.4% |

@neoblizz
Contributor Author

neoblizz commented Mar 3, 2026

GC Time & GC %

| Test | memory-improvements GC (s) | master GC (s) | memory-improvements GC % | master GC % |
| --- | --- | --- | --- | --- |
| gpuarrays/reductions/min max extrema | 4.76 | 4.82 | 2.3 | 2.3 |
| hip_rocarray/blas | 5.23 | 5.25 | 2.8 | 2.8 |
| gpuarrays/linalg/norm | 3.56 | 3.58 | 1.9 | 2.0 |
| gpuarrays/reductions/sum prod | 3.41 | 3.49 | 2.0 | 2.0 |
| gpuarrays/linalg/kron | 27.07 | 27.61 | 16.9 | 17.3 |
| gpuarrays/linalg/core | 3.82 | 3.77 | 2.6 | 2.6 |
| gpuarrays/reductions/mapreduce | 2.99 | 2.74 | 2.2 | 2.0 |
| gpuarrays/reductions/mapreducedim! | 2.82 | 2.91 | 2.1 | 2.1 |
| gpuarrays/reductions/reduce | 3.14 | 2.93 | 2.3 | 2.2 |

GC time is effectively identical between branches.

CPU Allocations (MB)

| Test | memory-improvements alloc (MB) | master alloc (MB) | Delta (MB) | Delta (%) |
| --- | --- | --- | --- | --- |
| gpuarrays/reductions/min max extrema | 16,490 | 16,310 | +180 | +1.1% |
| hip_rocarray/blas | 20,661 | 20,528 | +133 | +0.6% |
| gpuarrays/linalg/norm | 12,457 | 12,262 | +195 | +1.6% |
| gpuarrays/reductions/sum prod | 12,020 | 11,822 | +198 | +1.7% |
| gpuarrays/linalg/kron | 33,856 | 33,727 | +129 | +0.4% |
| gpuarrays/linalg/core | 14,212 | 14,053 | +159 | +1.1% |
| gpuarrays/reductions/mapreduce | 10,367 | 10,187 | +180 | +1.8% |
| gpuarrays/reductions/mapreducedim! | 9,017 | 8,888 | +128 | +1.4% |
| gpuarrays/reductions/reduce | 10,149 | 9,972 | +177 | +1.8% |

Slightly more allocations on the CPU side.

RSS

| Test | memory-improvements RSS (MB) | master RSS (MB) | Delta (MB) |
| --- | --- | --- | --- |
| gpuarrays/reductions/min max extrema | 4,301 | 4,304 | -3 |
| hip_rocarray/blas | 4,305 | 4,309 | -4 |
| gpuarrays/linalg/norm | 4,305 | 4,309 | -4 |
| gpuarrays/reductions/sum prod | 4,305 | 4,309 | -4 |
| gpuarrays/linalg/kron | 4,279 | 4,309 | -30 |
| gpuarrays/linalg/core | 4,279 | 4,309 | -30 |
| gpuarrays/reductions/mapreduce | 4,279 | 4,309 | -30 |
| gpuarrays/reductions/mapreducedim! | 4,279 | 4,309 | -30 |
| gpuarrays/reductions/reduce | 4,279 | 4,309 | -30 |

Roughly the same, slightly better.

@luraess
Member

luraess commented Mar 4, 2026

Thanks for reporting these benchmark results! LGTM from what I can see. Would it make sense to modify/add something wrt the changes in https://amdgpu.juliagpu.org/dev/api/memory? @gbaraldi anything else I would overlook?

@gbaraldi
Member

gbaraldi commented Mar 4, 2026

Nope. Looks good to me

@neoblizz
Contributor Author

neoblizz commented Mar 4, 2026

Thanks for reporting these benchmark results! LGTM from what I can see. Would it make sense to modify/add something wrt the changes in https://amdgpu.juliagpu.org/dev/api/memory? @gbaraldi anything else I would overlook?

HostAlloc and this passage will need to change: https://amdgpu.juliagpu.org/dev/api/memory#:~:text=Passing%20own%3Dtrue%20keyword%20will%20make%20the%20wrapped%20array%20take%20the%20ownership%20of%20the%20memory.%20For%20host%20memory%20it%20will%20unpin%20it%20on%20destruction%20and%20for%20device%20memory%20it%20will%20free%20it.

I can propose docs changes in a separate PR (will do this over the weekend).

@luraess luraess merged commit b7fb0b0 into JuliaGPU:master Mar 4, 2026
3 checks passed
@luraess
Member

luraess commented Mar 4, 2026

Thanks! And it'd be great if the docs could be fixed in the near future as suggested.

@neoblizz
Contributor Author

neoblizz commented Mar 4, 2026

Will do, thank you for letting me contribute! @luraess @gbaraldi

I'll bring up a few more PRs based on what we discussed.

@neoblizz neoblizz deleted the neoblizz/memory-improvements branch March 4, 2026 21:03