Improving memory requirements on AMDGPU #885
Conversation
Reran it after the last fix;
Can you split the arch thing into a separate PR?
I guess the main question is: what is motivating this? Changing the whole memory allocation logic should have some benchmarks/motivation associated with the PR. Also, FYI, if this was written with AI help please do mention it. We have nothing against AI if the user is in the loop, but disclosure is needed.
Thanks for looking into memory allocation and management challenges. Given that this has potentially critical impacts on the reliability and performance of the software, I am thankful that other JuliaGPU devs are looking into it as well. I'd second @gbaraldi's suggestions and would also like to see some benchmarks in order to assess the situation. Running GC too often may have significant performance impacts, and not all workflows or applications may actually benefit from e.g. having …
@luraess thanks!
I understand this is a big change. I was running into situations where the same code resulted in much larger memory usage on AMD than on CUDA (it ran out of memory on an AMD GPU with much larger memory than a CUDA GPU with much smaller memory). After a good bit of investigation I narrowed it down to differences in GC behavior and in how memory is reclaimed in CUDA.jl vs. AMDGPU.jl. I can provide some motivating data associated with this change soon.
Understood; I will add a disclosure. It's not entirely AI-written, but I did use AI to help me understand the differences between CUDA.jl's implementation and this one.
Got it, will create a separate PR.
…skip it (same as CUDA.jl)
@gbaraldi @luraess I am running more tests, but I created a PR into my PR that showcases a very solid use case for this change: neoblizz#1
That looks great!
@gbaraldi I found a separate issue where the pool isn't actually being used... 😄 Should I make the additional fixes (the memory leak and actually using the pool) in a separate PR? https://github.com/JuliaGPU/AMDGPU.jl/blob/master/src/runtime/memory/hip.jl#L49-L50 (cc @luraess)
That seems related enough :)
…d free/alloc to avoid memory leaks.
Force-pushed from 18127d0 to 9f84735
Requesting a new review because of the changes to HostBuffer (to fix a memory leak: it was unpinning all types of allocations, even ones that needed to be properly freed) and the use of the pool. @gbaraldi
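For context, the free-vs-unpin distinction behind that HostBuffer fix can be sketched roughly like this. This is a hypothetical illustration, not the actual AMDGPU.jl source: `HostBufferSketch`, `free_host`, and `unpin_host` are made-up names standing in for the real HostBuffer type and the underlying HIP calls (hipHostFree / hipHostUnregister):

```julia
# Hypothetical stand-ins for the HIP runtime calls:
free_host(ptr::Ptr{Cvoid})  = println("freeing owned host allocation")
unpin_host(ptr::Ptr{Cvoid}) = println("unpinning registered host memory")

struct HostBufferSketch
    ptr::Ptr{Cvoid}
    own::Bool   # true if we allocated the memory ourselves,
                # false if we only pinned (registered) existing memory
end

# The bug described above: unconditionally unpinning leaks owned
# allocations. The fix: branch on ownership so owned memory is freed.
function release!(buf::HostBufferSketch)
    buf.own ? free_host(buf.ptr) : unpin_host(buf.ptr)
end
```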
@gbaraldi just checking if I can get a review again, appreciate the time! |
I agree; in fact most of the cuda* functions should have hip* equivalents. Something to consider as a major refactor in the future. Also, please merge if it's good, and thank you for the reviews! 😊
Co-authored-by: Ludovic Räss <61313342+luraess@users.noreply.github.com>
Is there a need to update anything in the docs wrt this PR (e.g. here: https://amdgpu.juliagpu.org/dev/api/memory)? Also, are the new additions causing any API changes that would require a major release? Besides the pool statistics benchmark report (which looks great), I do not see the test from neoblizz#1 actually merged into this PR branch. Am I overlooking something? Also, besides the % of free memory, do you have any reports on the timing of the new memory management approach versus the previous one?
I have the test in the other branch; I will create a separate PR for that. I did some testing on a Quantum Error Correction (QEC) workload, and there's no significant difference in runtime. If there are workloads you'd like me to run, I can do so to make sure this holds true. Let me share raw latency results with this version and the one in …
Going through the docs now!
Measuring the runtime diffs vs. …
GC Time & GC %: identical between branches.
CPU Allocations (MB): slightly more allocations on the CPU side.
RSS: roughly the same, slightly better.
Thanks for reporting these benchmark results! LGTM from what I can see. Would it make sense to modify/add something wrt the changes in https://amdgpu.juliagpu.org/dev/api/memory? @gbaraldi anything else I might be overlooking?
Nope. Looks good to me
HostAlloc and this section will need to change: https://amdgpu.juliagpu.org/dev/api/memory#:~:text=Passing%20own%3Dtrue%20keyword%20will%20make%20the%20wrapped%20array%20take%20the%20ownership%20of%20the%20memory.%20For%20host%20memory%20it%20will%20unpin%20it%20on%20destruction%20and%20for%20device%20memory%20it%20will%20free%20it. I can propose docs changes in a separate PR? (Will do this over the weekend.)
Thanks! And it would be great if the docs could be fixed in the near future as suggested.
@luraess reviews would be appreciated!
- `eager_gc` to `true` by default.
- `maybe_collect()` function proactively triggers `GC.gc(false)` when GPU memory pressure exceeds ~75%, preventing excessive pool growth.
- `reclaim()` function supports trimming the pool and returns the bytes cleaned.

Testing Plan & Summary
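As a rough sketch of the eager-collection idea described in this PR (the `eager_gc` default and the ~75% threshold are from the description above; `maybe_collect_sketch` and its signature are hypothetical, and the real implementation in the diff tracks memory pressure through the pool statistics):

```julia
# Hypothetical sketch of pressure-triggered collection, not the actual
# AMDGPU.jl source.
const EAGER_GC = Ref(true)          # eager_gc now defaults to true
const PRESSURE_THRESHOLD = 0.75     # trigger collection above ~75% pressure

function maybe_collect_sketch(used_bytes::Integer, total_bytes::Integer)
    EAGER_GC[] || return false
    if used_bytes / total_bytes > PRESSURE_THRESHOLD
        # An incremental (non-full) collection frees dead GPU array
        # wrappers, letting their device buffers return to the pool
        # instead of growing it further.
        GC.gc(false)
        return true
    end
    return false
end
```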
Ran all tests locally on gfx942 (MI300X):
Disclaimer: Parts of this PR are developed using claude-4.6-opus AI.