Fix shared memory offset not reset between CUDA kernels. #442
tests/python/test_shared_array.py (outdated)
    @test_utils.test(arch=[qd.cuda], print_full_traceback=False)
    def test_shared_array_not_accumulated_across_offloads():
        """Shared memory from one offloaded task must not leak into the next."""
add a comment about how this test works conceptually.
notably there seem to be no asserts in this test?
I could add asserts. For now it is just checking that it does not crash.
yes, let's have some asserts please.
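For context, the failure mode this test guards against can be sketched in plain Python. This is a hypothetical toy allocator, not the actual qd implementation: a shared-memory offset that keeps accumulating across offloaded tasks instead of being reset per task.

```python
# Toy model of the bug: a shared-memory allocator whose running offset
# is not reset between offloaded tasks (hypothetical, for illustration).
class SharedMemAllocator:
    def __init__(self, capacity):
        self.capacity = capacity
        self.offset = 0

    def alloc(self, nbytes):
        if self.offset + nbytes > self.capacity:
            raise MemoryError("shared memory exhausted")
        base = self.offset
        self.offset += nbytes
        return base

    def reset(self):
        # Must run at the start of every offloaded task; the bug was
        # effectively skipping this step between successive tasks.
        self.offset = 0

alloc = SharedMemAllocator(capacity=48 * 1024)
alloc.alloc(30 * 1024)   # task 1: allocates more than half the capacity
alloc.reset()            # without this reset, task 2 would overflow
alloc.alloc(30 * 1024)   # task 2: succeeds only because the offset was reset
```

Two tasks each asking for more than half of the capacity is exactly the shape of the real test: it passes only if the offset is reset in between.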
> Could you explain step by step:

Done (see PR description and unit test description).

No AI was involved in this PR, including its description, code comments, and conversations. I take full responsibility for the lines added and removed in this PR. I'm confident that this PR is rock solid and does not introduce any bug that was not preexisting. I won't blame any issue on anybody or anything but me.
    @test_utils.test(arch=[qd.cuda, qd.metal])
    def test_shared_array_not_accumulated_across_offloads():
        # Execute 2 successive offloaded tasks both allocating more than half of
Thanks.
Could we have a test where the 'type' of the shared memory in each of several offloaded tasks, within the same qd kernel, is different in some way? (eg scalar vs vector vs matrix; and maybe f32 vs i8 vs f64 etc)
I cannot without breaking support of Metal.
Well, actually I can if limited to 32 bits.
I was thinking like:

    @pytest.mark.parametrize("dtype1", [qd.u8, qd.i8, qd.f32])
    @pytest.mark.parametrize("dtype2", [qd.u8, qd.i8, qd.f32])

Although to be honest, now that I've understood the issue, it seems the tests I asked for don't actually match what was broken. 🤔 Probably good to test with different types though.

Perhaps the test should be:
- always use the same shape and type for both arrays
- parametrize shape over [(100,), (100, 50)]
- parametrize type over [qd.u8, qd.i8, qd.f32]

so conceptually:

    @parametrize shape over [(100,), (100, 50)]
    @parametrize dtype over [qd.u8, qd.i8, qd.f32]
    def (dtype, shape):
        dtype1 = dtype2 = dtype
        shape1 = shape2 = shape
        ... test here ...

Thoughts?
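The proposed parameter grid can be sketched in pure Python. Note this is only a sketch of the parametrization, not a runnable qd test; the strings "u8", "i8", "f32" are stand-ins for qd.u8, qd.i8, qd.f32.

```python
from itertools import product

# Stand-ins for qd.u8, qd.i8, qd.f32 (hypothetical; kept 32-bit or
# smaller so the test remains portable to Metal, as discussed above).
DTYPES = ["u8", "i8", "f32"]
SHAPES = [(100,), (100, 50)]

def iter_test_cases():
    # Both shared arrays use one dtype and one shape per case,
    # matching the "dtype1 = dtype2 = dtype" idea in the sketch.
    for dtype, shape in product(DTYPES, SHAPES):
        yield {"dtype1": dtype, "dtype2": dtype,
               "shape1": shape, "shape2": shape}

cases = list(iter_test_cases())   # 3 dtypes x 2 shapes = 6 combinations
```

In a real test file the same grid would be expressed with two stacked `@pytest.mark.parametrize` decorators, as sketched in the comment above.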
    if qd.cfg.arch == qd.cuda:
        max_shared_bytes = qd.lang.impl.get_max_shared_memory_bytes()
    else:
        # Metal guarantees 32KB of threadgroup memory
didn't you say elsewhere that AMD provides 64?
I feel it would be more portable to simply put this information into the function above though, so it's all in one place?
> didn't you say elsewhere that AMD provides 64?

It just happens that this is currently the case for most AMD GPUs, but it could change (and maybe very old AMD GPUs use a smaller size, I don't know).

> I feel it would be more portable to simply put this information into the function above though, so it's all in one place?

I don't want to hardcode something that is just an educated guess in a function that is supposed to provide the true information.
Sure, but can we put all this knowledge into the function, rather than spraying hardcoded values around the code?
I feel that whatever heuristic we are using (e.g. "always 32 KB") should be localized in the get_max_shared_memory_bytes function, please.
What about adding an optional argument, def get_max_shared_memory_bytes(strict: bool = True) -> int, that would return the best guess available when strict is False?
What happens if you pass in False? Does it throw an exception?
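One possible reading of the proposal is the following standalone sketch. Everything here is hypothetical: the function signature mirrors the suggestion above, the 32 KB Metal guarantee and 64 KB AMD figure are the values discussed in this thread, and the CUDA value is purely illustrative (standing in for a real runtime driver query).

```python
# Hypothetical sketch: all shared-memory-size knowledge lives in one
# function, as requested in the review above.
QUERIED = {"cuda": 48 * 1024}          # stand-in for a runtime driver query
BEST_GUESS = {"metal": 32 * 1024,      # guaranteed minimum threadgroup memory
              "amd": 64 * 1024}        # educated guess for most current GPUs

def get_max_shared_memory_bytes(arch: str, strict: bool = True) -> int:
    if arch in QUERIED:
        # An authoritative value is available: always return it.
        return QUERIED[arch]
    if not strict and arch in BEST_GUESS:
        # strict=False: fall back to the best available guess.
        return BEST_GUESS[arch]
    # strict=True with no authoritative value: fail loudly rather than
    # present an educated guess as the truth.
    raise NotImplementedError(f"no reliable shared-memory size for {arch!r}")
```

Under this sketch, strict=False never throws for a known architecture; it is strict=True that raises when only a guess is available.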
tests/python/test_shared_array.py (outdated)
    qd.simt.block.sync()
    a[i] = acc

    # gpu_graph requires device-resident ndarrays
cf passing in numpy arrays?
Added more details.
tests/python/test_shared_array.py (outdated)
    d_np = np.random.randn(N).astype(np.float32)

    @qd.kernel
    def calc(
I know it's not part of your changes, but could we rename this calc_reference please? It was confusing for me until I got to line 142, and then I realized how these two functions relate to each other.
Also, could we add a comment about what the kernel is doing? Maybe it's doing a dot product (or perhaps an outer product)? If so, can we call them dot_product_reference and dot_product_shared_array (or outer_product..., if it's an outer product)?
This PR fixes 2 independent bugs related to large shared memory:

[1] CUDA Graph not supporting large shared memory at all.

[2] All CUDA kernels of the same compilation unit (i.e. part of the same Quadrants kernel) share the same pool of singleton tensor types for efficiency. The current version was mutating in place the shape of the tensor types corresponding to shared memory, thereby corrupting other tasks being compiled at the same time. The effect of the corruption is that the amount of shared memory requested by other tasks will be 0 (i.e. shared_array_bytes = 0 because tensor_type->get_num_elements() == 0 from then on). Because of this, no shared memory will be available for any other task, leading to illegal memory accesses at runtime.

To address bug [1], the flag CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES needed to be set on the kernel context, similar to what we are already doing for "classical" CUDA kernels.

To address bug [2], a new type instance of the correct dtype and size 0 is created specifically for large shared memory (so-called "dynamically allocated shared memory").
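Bug [2] and its fix can be illustrated with a toy model of a singleton type pool. This is a hypothetical Python sketch of the mechanism, not the actual C++ compiler code:

```python
class TensorType:
    """Toy stand-in for the compiler's pooled tensor type (hypothetical)."""
    def __init__(self, dtype, shape):
        self.dtype, self.shape = dtype, tuple(shape)

    def get_num_elements(self):
        n = 1
        for s in self.shape:
            n *= s
        return n

_pool = {}

def get_tensor_type(dtype, shape):
    # Singleton pool: the same (dtype, shape) always yields the same object,
    # shared by every task in the compilation unit.
    key = (dtype, tuple(shape))
    return _pool.setdefault(key, TensorType(dtype, shape))

shared = get_tensor_type("f32", (128,))
# BUG: mutating the pooled singleton in place, e.g. shared.shape = (0,),
# would zero out get_num_elements() for every task sharing this type.
# FIX: build a fresh, un-pooled size-0 instance for dynamically
# allocated shared memory instead, leaving the pooled type intact.
dyn = TensorType(shared.dtype, (0,))
```

The fresh instance carries the correct dtype with zero elements, so it can mark dynamically allocated shared memory without corrupting any other task's view of the pooled type.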
Note that I just discovered these limitations:
I suggest addressing them at a later stage.