Skip to content

Build native cubins for Ada (sm_89) and Hopper (sm_90)#62

Merged
cmcneil merged 2 commits into
mainfrom
build-native-sm89-sm90
Jun 2, 2026
Merged

Build native cubins for Ada (sm_89) and Hopper (sm_90)#62
cmcneil merged 2 commits into
mainfrom
build-native-sm89-sm90

Conversation

@cmcneil
Copy link
Copy Markdown
Contributor

@cmcneil cmcneil commented Jun 2, 2026

Problem

CMAKE_CUDA_ARCHITECTURES is 75 only (Turing/T4, for the GitHub CI runner), so the published wheel ships an sm_75 cubin plus compute_75 PTX — and nothing else.

On a non-Turing GPU there's no matching cubin, so the driver must JIT the PTX at load time. A PTX JIT requires a driver at least as new as the toolkit that emitted the PTX (this package's is CUDA 12.x). That breaks on Google Cloud Dataflow, whose GPU workers run an R535 / CUDA 12.2 driver.

  • L4 = sm_89 → no native cubin → driver tries to JIT compute_75 PTX
  • the PTX is ISA 8.3 (CUDA 12.3); R535 only parses ≤ 8.2 (CUDA 12.2)

Why it wasn't caught

The failure needs two things at once: a GPU that isn't sm_75 (forces a JIT) and a driver too old to JIT the PTX. Neither dev nor CI hits both:

Environment GPU Driver Path Result
GitHub CI T4 sm_75 native sm_75 cubin, no JIT
pathfinder RTX 5090 sm_120 R580 / CUDA 13 JITs PTX (new driver can)
Dataflow L4 sm_89 R535 / CUDA 12.2 must JIT, old driver can't ❌ 222

Fix

set(CMAKE_CUDA_ARCHITECTURES 75 89 90)
  • 75 Turing (T4 / CI) — unchanged
  • 89 Ada (L4, A40, RTX 40-series)
  • 90 Hopper (H100/H200)

Verification

On sm_89 hardware (RTX 4090):

  • ptxas compiles this package's PTX cleanly to both sm_89 and sm_90.
  • A beamform with a locally-rebuilt sm_89-native _cuda_impl.so produces bit-identical output (max abs diff 0.0) to the original sm_75+PTX (JIT'd) path — same math, just no JIT.

Notes

  • A release (e.g. 0.1.2) is needed for downstream consumers (mergelabs pins mach-beamform) to pick this up.

CMAKE_CUDA_ARCHITECTURES was 75 only (Turing/T4 CI runner), so the wheel
ships an sm_75 cubin + compute_75 PTX and nothing else. On a non-Turing GPU
the driver must JIT that PTX, and a JIT requires a driver at least as new as
the build toolkit. On an L4 (sm_89) with Dataflow's R535 / CUDA 12.2 driver,
JITing this package's CUDA 12.x PTX fails at kernel launch with
cudaErrorUnsupportedPtxVersion (code 222).

Add sm_89 (Ada: L4, RTX 40-series) and sm_90 (Hopper: H100/H200) so those GPUs
load native SASS with no JIT, independent of driver version. Native cubins are
gated only by hardware support, so an R535-class driver runs them fine.

Verified locally: ptxas compiles this package's PTX to both sm_89 and sm_90,
and a beamform on sm_89 hardware (RTX 4090) produces bit-identical output to
the JIT'd sm_75+PTX path.
@qodo-code-review
Copy link
Copy Markdown

qodo-code-review Bot commented Jun 2, 2026

Code Review by Qodo

🐞 Bugs (3) 📘 Rule violations (0) 📎 Requirement gaps (0) 🎨 UX issues (0)

Grey Divider


Action required

1. Unpinned CMake for sm_90 🐞 Bug ☼ Reliability
Description
The PR adds sm_89/sm_90 to CMAKE_CUDA_ARCHITECTURES, but the wheel build configuration doesn’t
install/pin CMake, so the build can fail depending on the builder image’s preinstalled CMake
version. This makes wheel/CI builds fragile and environment-dependent after this change.
Code

CMakeLists.txt[67]

Evidence
The PR change requires CMake to handle additional CUDA architectures, but the repo’s wheel build
configuration does not ensure CMake is present/up-to-date (it’s neither in PEP517 build requirements
nor installed in cibuildwheel’s pre-build step). The repository’s developer build path explicitly
installs CMake, indicating it’s a required build tool that CI should also pin/install.

CMakeLists.txt[55-68]
pyproject.toml[1-3]
pyproject.toml[230-252]
Makefile[42-50]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The project now hard-codes `CMAKE_CUDA_ARCHITECTURES` to include `89` and `90`, but the wheel build flow does not explicitly install CMake (nor a minimum version). As a result, the wheel build may succeed or fail depending on the CMake version baked into the manylinux/builder image.
## Issue Context
- CMake is a build-time dependency and the repo already treats it as such in developer workflows (e.g., `make compile`).
- `cibuildwheel` installs `scikit-build-core`, `nanobind`, and `ninja`, but not `cmake`.
## Fix Focus Areas
- Add `cmake` (with a minimum version) to the build environment used for wheel builds (and ideally to PEP517 build requirements too).
### Recommended changes
1. Update `pyproject.toml`:
- Add `cmake>=<min_version>` (and optionally `ninja`) to `[build-system].requires`, **or**
- Add `cmake>=<min_version>` to `[tool.cibuildwheel].before-build` (since you already install other build tools there).
2. Keep versions aligned with the CUDA builder images used in CI.
## Fix Focus Areas (exact locations)
- pyproject.toml[1-3]
- pyproject.toml[230-252]
- CMakeLists.txt[55-68]
- Makefile[42-50]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools



Remediation recommended

2. Spurious driver-age warning 🐞 Bug ◔ Observability
Description
Even after adding sm_89/sm_90 cubins specifically to avoid PTX JIT on older drivers, the extension
still warns at import-time whenever driver < NVCC version, which can be misleading/noisy when the
native cubin path works. This can also become an actual failure in environments that treat warnings
as errors.
Code

CMakeLists.txt[R55-67]

Evidence
The PR change documents that native cubins are being added specifically to avoid PTX JIT on older
drivers, but the runtime code still warns solely based on driver vs NVCC version and runs the check
on every import regardless of whether a matching cubin exists.

CMakeLists.txt[55-67]
src/mach/kernel.cu[351-388]
src/mach/kernel.cu[1211-1220]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`checkCudaDriverCompatibility()` currently warns whenever `cudaDriverGetVersion()` reports a driver version older than the NVCC version used to compile the module. With this PR, we now intentionally ship native cubins for the deployed architectures (75/89/90) to avoid PTX JIT on older drivers; the unconditional driver-vs-NVCC warning becomes misleading/noisy.
## Issue Context
- The new CMake comment explicitly states native cubins avoid PTX JIT requirements on older drivers.
- The nanobind module calls `checkCudaDriverCompatibility()` on every import.
## Fix Focus Areas
- src/mach/kernel.cu[351-388]
- src/mach/kernel.cu[1211-1220]
- CMakeLists.txt[55-67]
## Suggested implementation direction
- Only warn about driver < NVCC when the current device does *not* have a matching compiled cubin (i.e., when a PTX JIT is likely/required).
- To do that robustly, pass the compiled architecture list into the binary via `target_compile_definitions` (e.g., `BUILT_CUDA_ARCHS="75;89;90"`) and compare it against `cudaGetDeviceProperties()` major/minor at import time.
- Alternatively, soften the message to explicitly state it only matters when PTX JIT is needed (no matching cubin).

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


3. NVCC min gate mismatched 🐞 Bug ⚙ Maintainability
Description
NVCC_MIN_VERSION is still set to 11.0 even though the project’s CI/wheel builds are standardized
on CUDA 12.x and the PR now targets Ada/Hopper explicitly, so the configure-time guard no longer
reflects the project’s real toolchain expectations. This increases the chance that unsupported build
environments get past configuration and fail later in compilation.
Code

CMakeLists.txt[67]

Evidence
The repo’s own CI and wheel build configuration standardize on CUDA 12.x, while the CMake gate still
permits NVCC 11.0; after adding more modern architecture targets, the mismatch becomes more likely
to confuse builders and defer errors to later compilation steps instead of failing at configure time
with a clear requirement.

CMakeLists.txt[55-80]
.github/actions/setup-cuda-python-env/action.yml[13-21]
pyproject.toml[248-252]
README.md[36-51]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The CMake configure-time NVCC version gate (`NVCC_MIN_VERSION`) is set to `11.0`, but the repo’s CI and wheel builds clearly target CUDA 12.x toolchains. With the PR adding explicit Ada/Hopper targets, it’s better to fail fast with a minimum NVCC version that matches the supported build toolchains.
## Issue Context
- CI installs CUDA 12.6.3 for GPU tests.
- Wheel builds use a manylinux image tagged with CUDA 12.3.
- CMake currently allows NVCC 11.0.
## Fix Focus Areas
- Update `NVCC_MIN_VERSION` to match the minimum CUDA toolkit used/supported by the project’s build pipelines (e.g., 12.3 if that’s the wheel image baseline).
- Keep README/build docs consistent with the enforced minimum.
## Fix Focus Areas (exact locations)
- CMakeLists.txt[55-80]
- .github/actions/setup-cuda-python-env/action.yml[13-21]
- pyproject.toml[248-252]
- README.md[36-51]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Grey Divider

Qodo Logo

Comment thread CMakeLists.txt
# 90 = Hopper (H100/H200)
# https://developer.nvidia.com/cuda-gpus
set(CMAKE_CUDA_ARCHITECTURES 75)
set(CMAKE_CUDA_ARCHITECTURES 75 89 90)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Action required

1. Unpinned cmake for sm_90 🐞 Bug ☼ Reliability

The PR adds sm_89/sm_90 to CMAKE_CUDA_ARCHITECTURES, but the wheel build configuration doesn’t
install/pin CMake, so the build can fail depending on the builder image’s preinstalled CMake
version. This makes wheel/CI builds fragile and environment-dependent after this change.
Agent Prompt
## Issue description
The project now hard-codes `CMAKE_CUDA_ARCHITECTURES` to include `89` and `90`, but the wheel build flow does not explicitly install CMake (nor a minimum version). As a result, the wheel build may succeed or fail depending on the CMake version baked into the manylinux/builder image.

## Issue Context
- CMake is a build-time dependency and the repo already treats it as such in developer workflows (e.g., `make compile`).
- `cibuildwheel` installs `scikit-build-core`, `nanobind`, and `ninja`, but not `cmake`.

## Fix Focus Areas
- Add `cmake` (with a minimum version) to the build environment used for wheel builds (and ideally to PEP517 build requirements too).

### Recommended changes
1. Update `pyproject.toml`:
   - Add `cmake>=<min_version>` (and optionally `ninja`) to `[build-system].requires`, **or**
   - Add `cmake>=<min_version>` to `[tool.cibuildwheel].before-build` (since you already install other build tools there).
2. Keep versions aligned with the CUDA builder images used in CI.

## Fix Focus Areas (exact locations)
- pyproject.toml[1-3]
- pyproject.toml[230-252]
- CMakeLists.txt[55-68]
- Makefile[42-50]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unhelpful

@qodo-code-review
Copy link
Copy Markdown

qodo-code-review Bot commented Jun 2, 2026

Code review by qodo was updated up to the latest commit f1bbe33

@cmcneil cmcneil requested review from charlesbmi and dperdios June 2, 2026 18:36
Copy link
Copy Markdown
Collaborator

@charlesbmi charlesbmi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

@cmcneil cmcneil removed the request for review from dperdios June 2, 2026 19:05
@cmcneil cmcneil merged commit e21b5fe into main Jun 2, 2026
8 checks passed
@cmcneil cmcneil deleted the build-native-sm89-sm90 branch June 2, 2026 19:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants