
Fix shared memory offset not reset between CUDA kernels.#442

Open
duburcqa wants to merge 13 commits into main from duburcqa/fix_shared_mem_offset

Conversation

@duburcqa (Contributor) commented Mar 31, 2026

This PR fixes two independent bugs related to large shared memory:
[1] CUDA Graph not supporting large shared memory at all.
[2] All CUDA kernels of the same compilation unit (i.e. part of the same Quadrants kernel) share the same pool of singleton tensor types for efficiency. The current version was mutating the shape of the tensor types corresponding to shared memory in place, thereby corrupting other tasks being compiled at the same time. The effect of the corruption is that the amount of shared memory requested by those other tasks drops to 0 (i.e. shared_array_bytes = 0, because tensor_type->get_num_elements() == 0 from then on). As a result, no shared memory is available for any other task, leading to illegal memory accesses at runtime.

To address bug [1], the 'CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES' attribute needed to be set in the kernel context, similar to what we already do for "classical" CUDA kernels.

To address bug [2], a new type instance with the correct dtype and size 0 is created specifically for large shared memory (so-called "dynamically allocated shared memory"), instead of mutating the pooled one.
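To make bug [2] concrete, here is a minimal Python sketch of the aliasing problem (names like `TensorType`, `get_tensor_type`, and the mark functions are illustrative only, not the actual Quadrants internals, which are C++): because pooled types are singletons, mutating a shape in place is visible to every task that looked up the same (dtype, shape) key, while the fix returns a fresh instance.

```python
# Illustrative sketch of the singleton-pool corruption, with hypothetical names.
from dataclasses import dataclass


@dataclass
class TensorType:
    dtype: str
    shape: tuple

    def get_num_elements(self) -> int:
        n = 1
        for s in self.shape:
            n *= s
        return n


# Pool of singleton tensor types shared across tasks of one compilation unit.
_pool = {}


def get_tensor_type(dtype, shape):
    key = (dtype, tuple(shape))
    if key not in _pool:
        _pool[key] = TensorType(dtype, tuple(shape))
    return _pool[key]


def mark_dynamic_shared_buggy(t):
    # Buggy: mutates the pooled singleton in place, corrupting every
    # other task that holds a reference to the same type instance.
    t.shape = (0,)
    return t


def mark_dynamic_shared_fixed(t):
    # Fixed: a fresh, un-pooled instance of the correct dtype and size 0.
    return TensorType(t.dtype, (0,))
```

Two tasks requesting the same shared-array type receive the very same object, so the in-place mutation zeroes out the second task's request as well, matching the illegal-memory-access symptom described above.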


Note that I just discovered these limitations:

// By default, CUDA can allocate static shared arrays of up to 48KB.
// Allocating a larger array requires dynamic shared memory.
// Therefore, when a shared array requests a size greater than 48KB,
// we switch it to dynamic allocation.
// In the current version, only one dynamic instance is allowed.
// TODO: remove the limit.
constexpr std::size_t cuda_dynamic_shared_array_threshold_bytes = 49152;

// used by auto mesh_local to determine shared-mem size per block (in bytes)
// TODO: get this at runtime
constexpr std::size_t default_shared_mem_size = 65536;

I suggest addressing them at a later stage.
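The 48KB threshold above amounts to a simple decision rule. A minimal sketch of that rule, using hypothetical names (this is not the actual C++ source, which also enforces the one-dynamic-instance limit mentioned in the TODO):

```python
# Sketch of the static-vs-dynamic shared allocation decision described by the
# comments above. Names are illustrative, not the actual Quadrants internals.

CUDA_DYNAMIC_SHARED_THRESHOLD_BYTES = 49152  # 48KB static shared-array limit


def choose_shared_alloc(requested_bytes: int,
                        dynamic_already_used: bool = False) -> str:
    """Return 'static' or 'dynamic' for a shared-array request.

    Only one dynamically allocated instance is allowed in the current version.
    """
    if requested_bytes <= CUDA_DYNAMIC_SHARED_THRESHOLD_BYTES:
        return "static"
    if dynamic_already_used:
        raise RuntimeError(
            "only one dynamically allocated shared array is supported")
    return "dynamic"
```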


@test_utils.test(arch=[qd.cuda], print_full_traceback=False)
def test_shared_array_not_accumulated_across_offloads():
"""Shared memory from one offloaded task must not leak into the next."""
@hughperkins (Collaborator) commented Mar 31, 2026

add a comment about how this test works conceptually.

Collaborator

notably there seem to be no asserts in this test?

Contributor Author

I could add asserts. For now it is just checking that it does not crash.

Collaborator

yes, let's have some asserts please.

Contributor Author

done.

@hughperkins (Collaborator)

Could you explain step by step:

  • what is your hypothesis for what the bug is?
  • how this PR addresses it?

@duburcqa (Contributor Author)

> Could you explain step by step:

Done (see PR description and unit test description).

@duburcqa (Contributor Author) commented Mar 31, 2026

No AI was involved in this PR, including its description, code comments, and conversations. I take full responsibility for the lines added and removed in this PR. I'm confident that this PR is rock solid and does not introduce any additional bug that was not preexisting. I won't blame any issue on anybody or anything but me.


@test_utils.test(arch=[qd.cuda, qd.metal])
def test_shared_array_not_accumulated_across_offloads():
# Execute 2 successive offloaded tasks both allocating more than half of
Collaborator

Thanks.

Could we have a test where the 'type' of the shared memory in each of several offloaded tasks, within the same qd kernel, is different in some way? (eg scalar vs vector vs matrix; and maybe f32 vs i8 vs f64 etc)

Contributor Author

I cannot without breaking support of Metal.

Contributor Author

Well, actually I can, if limited to 32 bits.

Contributor Author

done.

Collaborator

I was thinking like:

@pytest.mark.parametrize("dtype1", [qd.u8, qd.i8, qd.f32])
@pytest.mark.parametrize("dtype2", [qd.u8, qd.i8, qd.f32])

Although to be honest, now that I've understood the issue, it seems the tests I asked for don't actually match what was broken. 🤔 Probably good to test with different types, though.

Perhaps the test should be:

  • always use the same shape and type for both arrays
  • parametrize shape over [(100,), (100, 50)]
  • parametrize type over [qd.u8, qd.i8, qd.f32]

So conceptually:

@pytest.mark.parametrize("shape", [(100,), (100, 50)])
@pytest.mark.parametrize("dtype", [qd.u8, qd.i8, qd.f32])
def test_shared_arrays(dtype, shape):
    dtype1 = dtype2 = dtype
    shape1 = shape2 = shape
    ... test here ...

Thoughts?

if qd.cfg.arch == qd.cuda:
    max_shared_bytes = qd.lang.impl.get_max_shared_memory_bytes()
else:
    # Metal guarantees 32KB of threadgroup memory
Collaborator

didn't you say elsewhere that AMD provides 64?

Collaborator

I feel it would be more portable to simply put this information into the function above though, so it's all in one place?

Contributor Author

> didn't you say elsewhere that AMD provides 64?

It just happens to be the case for most AMD GPUs at the moment, but this could change (and maybe very old AMD GPUs use a smaller size, I don't know).

> I feel it would be more portable to simply put this information into the function above though, so it's all in one place?

I don't want to hardcode something that is just an educated guess in a function that is supposed to provide the true information.

Collaborator

Sure, but can we put all this knowledge into the function, rather than having hardcoded values sprayed around the code?

Collaborator

I feel that whatever heuristic we are using (e.g. "always 32KB") should be localized in the get_max_shared_memory_bytes function, please.

Contributor Author

I disagree.

@duburcqa (Contributor Author) commented Mar 31, 2026

What about adding an optional argument, def get_max_shared_memory_bytes(strict: bool = True) -> int, that would return the best available guess when strict is False?
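A minimal sketch of what that signature could look like (the helper `_query_device_shared_memory` and the 32KB fallback are assumptions for illustration, not the real implementation, which would query the CUDA device):

```python
# Sketch of the proposed API. The 32KB fallback is an educated guess for
# backends without a runtime query (e.g. Metal), not an authoritative value.

FALLBACK_SHARED_MEM_BYTES = 32768  # conservative guess, e.g. Metal threadgroup memory


def _query_device_shared_memory():
    """Hypothetical backend query; returns None when unsupported."""
    return None  # pretend this backend has no runtime query


def get_max_shared_memory_bytes(strict: bool = True) -> int:
    exact = _query_device_shared_memory()
    if exact is not None:
        return exact
    if strict:
        # Strict callers asked for the true value; fail loudly instead of
        # returning a guess.
        raise RuntimeError(
            "exact shared memory size unavailable on this backend")
    return FALLBACK_SHARED_MEM_BYTES
```

This keeps the heuristic localized in one function while still letting callers that need the true value opt into a hard failure.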

Collaborator

What happens if you pass in False? Does it throw an exception?

qd.simt.block.sync()
a[i] = acc

# gpu_graph requires device-resident ndarrays
Collaborator

cf passing in numpy arrays?

@duburcqa (Contributor Author) commented Mar 31, 2026

Added more details.

d_np = np.random.randn(N).astype(np.float32)

@qd.kernel
def calc(
Collaborator

I know it's not part of your changes, but could we rename this to calc_reference please? It was confusing for me until I got to line 142, when I realized how these two functions relate to each other.

Collaborator

Also, could we add a comment about what the kernel is doing? Maybe it's doing a dot product (or perhaps an outer product)? If so, can we call them dot_product_reference and dot_product_shared_array (or outer_product_*, if it's an outer product)?

Contributor Author

done.

@duburcqa duburcqa force-pushed the duburcqa/fix_shared_mem_offset branch from 0467f46 to 51bc424 Compare March 31, 2026 22:18
@duburcqa duburcqa force-pushed the duburcqa/fix_shared_mem_offset branch from f9848ee to 0670652 Compare March 31, 2026 22:36
