Merge/upstream main 2026 05 08 #31
Open
npoulad1 wants to merge 142 commits into amd-integration from
Conversation
…enesis-Embodied-AI#425) Co-authored-by: v01dxyz <v01dxyz@v01d.xyz>
Co-authored-by: v01dxyz <v01dxyz@v01d.xyz>
…I#473) Co-authored-by: Hugh Perkins <hughperkins@gmail.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
…died-AI#623) Co-authored-by: Cursor <cursoragent@cursor.com>
… compute encoder, LLVM-GPU async memory ops (Part 1/2) (Genesis-Embodied-AI#619)
…esis-Embodied-AI#622) Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
…ge runs (Genesis-Embodied-AI#629) Co-authored-by: Cursor <cursoragent@cursor.com>
…ackPushes pass + leaf extensions (Genesis-Embodied-AI#621)
Co-authored-by: Johnny Nunez and Hugh Perkins
…mbodied-AI#618) Co-authored-by: Cursor <cursoragent@cursor.com>
…; per-snode / DeviceAllocation invalidation (Part 2/2) (Genesis-Embodied-AI#620)
…-SNode chain-leaf fix (Genesis-Embodied-AI#633)
…ad_top consumer-aware guard (Genesis-Embodied-AI#634)
Bring in 134 upstream commits from Genesis-Embodied-AI/quadrants main
into the amd-integration fork. Conflicts resolved across 28 files;
the resolution preserves AMD-specific work (force_inline hint, i64
ndarray indexing, AMDGPU-specific codegen overrides, HSACo caching,
fn_attrs cache key) while adopting upstream features (loop_name on
RangeFor / OffloadedStmt, use_graph rename of use_cuda_graph, bit-
pointer / quant-type GlobalLoad path, autodiff stack runtime, per-
handle kernel-launcher buffers, hipHostMalloc/Free wiring, ndarray
allocation failure handling, supports_mem_pool plumbing).
Notable resolutions:
- quadrants/codegen/amdgpu/codegen_amdgpu.cpp: keep branchless float
sgn select, keep `optimized_reduction` returning nullptr (forces
the base CAS path with addrspace preserved), keep
`gpu_parallel_range_for_fixed_config` launch path, fold upstream's
bit-pointer load into the AMDGPU GlobalLoadStmt override, keep
`kernel_argument_struct_in_kernarg`, retain fixed-config grid_dim
for range_for / listgen.
- quadrants/runtime/amdgpu/kernel_launcher.{h,cpp}: take upstream
wholesale (per-handle persistent buffers + autodiff support is a
superset of the prior AMD launcher).
- quadrants/runtime/amdgpu/jit_amdgpu.cpp: keep HSACo cache and the
`-force-vector-interleave=8` cl-flag injection.
- quadrants/runtime/llvm/llvm_runtime_executor.cpp: include amdgpu
in the GPU rand-state init path; keep mem-pool support detection.
- quadrants/runtime/llvm/llvm_context.{h,cpp}: keep
`cuda_shfl_xor_sync_f32` patch, keep `block_dim` parameter on
`mark_function_as_amdgpu_kernel`, expose `num_instructions`.
- quadrants/codegen/llvm/codegen_llvm.cpp: keep i64 widening for
ndarray runtime shapes, keep i64 size_var for tensor-element
index, keep i64 linear_index passthrough as address_offset; merge
AMD `kernel_argument_struct_in_kernarg` branch with upstream's
per-task adstack reset block.
- quadrants/program/ndarray.cpp: keep size_t accumulator for
nelement_, drop upstream's int32 overflow warning (superseded by
i64 indexing).
- quadrants/ir/{statements.h,frontend_ir.{h,cpp}},
quadrants/transforms/lower_ast.cpp: carry both `loop_name` and
`force_inline` through ForLoopConfig, FrontendForStmt,
RangeForStmt, OffloadedStmt and the AST-lowering construction
sites.
- quadrants/program/extension.cpp: keep amdgpu omitted from `bls`
(sparse SNode codegen is not yet on AMDGPU; comment documents the
follow-up).
- quadrants/analysis/offline_cache_util.cpp,
python/quadrants/lang/{kernel.py,kernel_impl.py,misc.py,
_fast_caching/src_hasher.py}: keep `fn_attrs` plumbing alongside
upstream's `use_graph`/`name=` renames.
- quadrants/rhi/amdgpu/{amdgpu_context.{h,cpp},amdgpu_device.cpp,
amdgpu_driver_functions.inc.h}: take upstream's hipHostMalloc/Free
+ null-alloc handling; keep AMD-side `kernel_arg_pointer_` storage
and the `dynamic_shared_mem_bytes` profiler trace; remove the
duplicate `supports_mem_pool` declaration.
- .github/workflows/scripts_new/linux/4_test.sh: merge upstream's
TEST_EXIT pattern with the AMD-runner-only `-a amdgpu` invocation
(CUDA phased coverage stays in 4_test_cuda.sh).
- tests/test_utils.py: take upstream's set ordering of `archs`
(functionally equivalent).
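One of the resolutions above, the size_t accumulator for nelement_ in ndarray.cpp, guards against a concrete failure mode: a 32-bit product silently wraps on large allocations. A minimal Python sketch of the two accumulator widths (the shape is illustrative, not from the codebase):

```python
def nelement_i32(shape):
    """Accumulate the element count in a wrapping 32-bit signed int."""
    acc = 1
    for dim in shape:
        acc = (acc * dim) & 0xFFFFFFFF           # wrap like int32 multiplication
    return acc - 2**32 if acc >= 2**31 else acc  # reinterpret as signed

def nelement_size_t(shape):
    """Python ints don't overflow, mirroring a size_t-wide accumulator."""
    acc = 1
    for dim in shape:
        acc *= dim
    return acc

shape = (8192, 1024, 1024)        # 2**33 elements
print(nelement_i32(shape))        # wraps to 0
print(nelement_size_t(shape))     # 8589934592
```

This is also why the int32 overflow warning upstream added becomes redundant once indexing itself is i64-wide.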
Co-authored-by: Cursor <cursoragent@cursor.com>
Run pre-commit on the merge branch and apply the same fixes CI would
have applied:
- black: reformat tests/python/test_fn_attrs.py.
- clang-format: re-wrap argument lists / lambda bodies in
quadrants/codegen/amdgpu/codegen_amdgpu.cpp,
quadrants/codegen/llvm/codegen_llvm.cpp,
quadrants/program/{compile_config.cpp,fn_attrs_registry.h,kernel.h},
quadrants/python/export_lang.cpp,
quadrants/rhi/amdgpu/{amdgpu_context.cpp,amdgpu_device.cpp,
amdgpu_driver_functions.inc.h},
quadrants/runtime/amdgpu/jit_amdgpu.cpp,
quadrants/runtime/llvm/{llvm_context.cpp,llvm_context_pass.h,
runtime_module/runtime.cpp}.
- trailing-whitespace: strip trailing spaces in Dockerfile.rocm.
- ruff/pylint F401/W0611: drop unused `import math` from
python/quadrants/lang/_func_base.py (came in via upstream/main; the
module never references math).
No semantic changes.
Co-authored-by: Cursor <cursoragent@cursor.com>
The upstream merge auto-merged into amdgpu_driver.h cleanly because the
same symbols had been added at different line ranges on each side, but
the resulting file declared each symbol twice and failed to compile:

  error: redefinition of 'constexpr const uint32 HIP_MEMPOOL_ATTR_RELEASE_THRESHOLD'
  error: 'void AMDGPUDriver::malloc_async(...)' cannot be overloaded with itself
  error: 'void AMDGPUDriver::mem_free_async(...)' cannot be overloaded with itself

Keep the upstream-style declarations (with the explanatory mem-pool
fallback comment and dev_ptr parameter naming, matching the .cpp
definitions in amdgpu_driver.cpp) and drop the AMD-side duplicates. No
semantic change: both sides declared the same prototypes; this is purely
a textual de-dup so the build succeeds.
Co-authored-by: Cursor <cursoragent@cursor.com>
Author
The overall run of Genesis on top of the new upstream quadrants changes is crashing. Currently investigating.
The merge resolution that brought back HEAD's
`kernel_argument_struct_in_kernarg()` override on AMDGPU caused an
HSA_STATUS_ERROR_ILLEGAL_INSTRUCTION on the very first kernel launch
under the upstream'd kernel_launcher. Upstream's launcher passes
RuntimeContext to the kernel by pointer (`runtime_context_dev_ptr`), but
the override was making codegen emit kernels that receive RuntimeContext
by value via kernarg, producing an ABI mismatch.

Three hunks in codegen_llvm.cpp drove the by-value path:
- early `context_param_type` selection in `init_offloaded_task_function`
- `context_val_alloca_` creation/store in the function entry block
- `get_context()` returning the alloca instead of the kernarg

All three are removed; the kernel signature is now
`void(%RuntimeContext*)` on every backend, matching upstream's
kernel_launcher ABI. The unrelated AMDGPU pieces of the merge resolution
(i64 ndarray indexing widening, addrspace-preserving int-ptr type,
fn_attrs threading, `block_dim`-aware mark_function_as_amdgpu_kernel)
are kept.
Co-authored-by: Cursor <cursoragent@cursor.com>
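The size of the ABI disagreement is easy to see with ctypes. The field layout below is a hypothetical stand-in for RuntimeContext, not the real struct:

```python
import ctypes

# Hypothetical stand-in for RuntimeContext (illustrative fields only).
class RuntimeContext(ctypes.Structure):
    _fields_ = [
        ("args", ctypes.c_void_p),
        ("result_buffer", ctypes.c_void_p),
        ("num_args", ctypes.c_int32),
    ]

# By-pointer ABI: the launcher writes one device pointer into the
# kernarg packet; the kernel dereferences it from HBM.
by_pointer_kernarg = ctypes.sizeof(ctypes.c_void_p)

# By-value ABI: codegen expects the whole struct's bytes in the packet.
by_value_kernarg = ctypes.sizeof(RuntimeContext)

# When launcher and codegen disagree, the kernel reads past the 8 bytes
# the launcher actually wrote, hence the illegal-instruction at launch.
assert by_pointer_kernarg != by_value_kernarg
```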
Brings in amd-integration#29 (chore(lint): fix pre-commit issues;
simplify linters workflow). Conflicts in 13 files were all stylistic:
the merge branch already had its own pre-commit pass at 67b6f7a, so the
conflicts were two parallel format applications of the same code.
Resolved by taking ours (already linted on top of the upstream merge);
re-ran pre-commit afterwards to confirm a clean tree (only a trailing
blank line in codegen_llvm.cpp needed touching up).

Non-conflicting amd-integration changes pulled in cleanly:
- .github/workflows/linters.yml -> pre-commit/action@v3.0.1 (+ ubuntu-24.04 pin)
- .github/workflows/scripts_new/linters.sh -> deleted
- .pre-commit-config.yaml -> dropped python3.10 pins
Co-authored-by: Cursor <cursoragent@cursor.com>
…text
Restores the pre-merge AMDGPU launch fast path that the wholesale upstream
launcher had displaced. Cumulative effect: 1.387M -> 1.570M env*steps/s on
the Genesis G1 8192-env benchmark (+13.2%, recovering ~10pp of the 22.5pp
post-merge regression vs `baseline_today`).
Diagnosis. Upstream's `kernel_launcher.cpp` passes `RuntimeContext` to the
kernel by *pointer*: it allocates a per-handle device buffer and memcpy's
the entire `RuntimeContext` host->device on every kernel launch (`memcpy_
host_to_device_async`, sizeof(RuntimeContext) bytes), and the kernel reads
context fields by dereferencing the pointer in HBM. The pre-merge AMDGPU
fast path placed the `RuntimeContext` bytes directly into the AQL kernarg
packet (the codegen override `kernel_argument_struct_in_kernarg() == true`
plus a launcher payload of `&ctx.get_context() / sizeof(RuntimeContext)`),
which both (a) drops the per-launch H2D memcpy entirely and (b) reads
context fields from the AMDGPU kernarg cache instead of HBM.
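The contrast between the two launch strategies can be sketched in a few lines; all names here are illustrative, not the actual quadrants launcher API:

```python
# Hedged sketch of the two launch paths. CTX_SIZE is an assumed
# sizeof(RuntimeContext); dev_buf stands in for the per-handle device buffer.
CTX_SIZE = 192

def launch_by_pointer(ctx_bytes, dev_buf, h2d_copies):
    """Upstream path: per-launch H2D memcpy, kernel reads context from HBM."""
    h2d_copies.append(len(ctx_bytes))   # models memcpy_host_to_device_async
    return ("kernarg", dev_buf)         # packet carries only an 8-byte pointer

def launch_by_value(ctx_bytes, h2d_copies):
    """Fast path: context bytes land directly in the AQL kernarg packet."""
    return ("kernarg", ctx_bytes)       # no H2D copy at all

h2d = []
launch_by_pointer(b"\0" * CTX_SIZE, dev_buf=0xDEAD0000, h2d_copies=h2d)
launch_by_value(b"\0" * CTX_SIZE, h2d_copies=h2d)
assert h2d == [CTX_SIZE]                # only the pointer path paid the copy
```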
Changes.
- `codegen_llvm.cpp`: re-establish the kernarg-by-value codegen path on
AMDGPU - early `context_param_type` selection in
`init_offloaded_task_function`, alloca + store of the kernarg in the
function entry, and `get_context()` returning the alloca. A previous
iteration of this commit dropped these to chase the post-merge crash;
the actual root cause was the launcher mismatch, not the codegen path,
so it returns intact.
- `kernel_launcher.{h,cpp}`: thread `kernarg_payload` / `kernarg_size`
through `launch_offloaded_tasks{,_with_do_while}` so the AQL kernarg
packet receives the host `RuntimeContext` bytes directly. The
device-side `RuntimeContext` shadow (`runtime_context_dev_ptr`) is now
lazy: only kernels that hit the adstack publish path
(`task.ad_stack.allocas != {}`) allocate it, gated by an `std::any_of`
pre-pass; forward-only kernels (Genesis hot path) skip both the
malloc and the H2D entirely. Adstack-cache invalidation
(`bump_writes_for_kernel_llvm`) now no-ops on programs that have
never seen an autodiff kernel, via a thread-local `any_autodiff_seen`
flag which the same `needs_device_runtime_ctx` pre-pass arms.
- `kernel_launcher.h`: per-handle `resolved_funcs` cache to skip the
per-launch `JITModuleAMDGPU::lookup_function` mutex+hash, plus a
thread-local `cached_set_ctx` short-circuit around `make_current()` so
the HIP driver context setter skips the locked global call when the
context is unchanged. Both mirror pre-merge HEAD's hoists.
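The two launcher-side hoists above can be modeled as follows; function and field names are illustrative, not the real launcher API:

```python
import threading

_tls = threading.local()

def needs_device_runtime_ctx(tasks):
    """Pre-pass: only adstack-publishing tasks need the device-side
    RuntimeContext shadow (and its per-launch malloc + H2D copy)."""
    return any(task.get("ad_stack_allocas") for task in tasks)

def make_current(ctx, driver_set_ctx, calls):
    """Skip the locked driver context setter when the context is unchanged."""
    if getattr(_tls, "cached_ctx", None) == ctx:
        return                       # thread-local short-circuit
    driver_set_ctx(ctx)
    calls.append(ctx)
    _tls.cached_ctx = ctx

calls = []
make_current("ctx0", lambda c: None, calls)
make_current("ctx0", lambda c: None, calls)  # second call short-circuits
assert calls == ["ctx0"]
assert not needs_device_runtime_ctx([{"ad_stack_allocas": []}])
assert needs_device_runtime_ctx([{"ad_stack_allocas": ["s0"]}])
```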
Co-authored-by: Cursor <cursoragent@cursor.com>
The 2026-05-08 upstream merge introduced two AMDGPU codegen regressions
on top of an unchanged Genesis kernel set. Identified by IR-diffing the
same Genesis hot kernel before/after the merge:
1. RuntimeContext kernarg-by-value spill became packed `align 1`.
Adding `cpu_assert_failed` after `result_buffer` left a trailing
4-byte field that clang folded by emitting RuntimeContext as a
`<{ ptr, ptr, i32, ptr, i32 }>` packed struct with `align 1`. The
AMDGPU backend then lowered the kernarg-load → kernarg-store of the
`RuntimeContext` argument as byte-by-byte copies instead of two 8-byte
coalesced stores, regressing every kernel launch.
Fix: reorder the two int32 fields back-to-back so the struct is
`{ ptr, ptr, i32, i32, ptr }` (8-aligned, no trailing tail-padding).
Post-fix the kernarg load is `align 16` and the spill store is
`align 8`.
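The field-ordering effect can be reproduced with ctypes under default (non-packed) layout rules. The fields are illustrative stand-ins, and ctypes pads rather than packing to align 1 as clang did, but the tail-padding arithmetic is the same:

```python
import ctypes

# Trailing i32 after a pointer leaves 4 bytes of tail padding (64-bit ABI).
class CtxTrailing(ctypes.Structure):
    _fields_ = [("args", ctypes.c_void_p),
                ("runtime", ctypes.c_void_p),
                ("num_args", ctypes.c_int32),
                ("result_buffer", ctypes.c_void_p),
                ("cpu_assert_failed", ctypes.c_int32)]

# Reordered: the two i32 fields sit back-to-back, no tail padding.
class CtxReordered(ctypes.Structure):
    _fields_ = [("args", ctypes.c_void_p),
                ("runtime", ctypes.c_void_p),
                ("num_args", ctypes.c_int32),
                ("cpu_assert_failed", ctypes.c_int32),
                ("result_buffer", ctypes.c_void_p)]

print(ctypes.sizeof(CtxTrailing))    # 40 on a 64-bit ABI
print(ctypes.sizeof(CtxReordered))   # 32 on a 64-bit ABI
print(ctypes.alignment(CtxReordered))  # 8
```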
2. range_for body functions (`function_body`) lost `alwaysinline` and
the cost-model inliner refuses to inline them on amdgpu because of
an attribute mismatch — body inherits the conservative
`amdgpu-flat-work-group-size="1,128"` fallback in jit_amdgpu.cpp,
while the calling kernel is "64,64", which blocks the inliner.
Result: an `s_swappc_b64` per loop trip per thread.
Fix: re-add size-gated `mark_inline()` for range_for bodies in
`create_offload_range_for` (force-inline iff body ≤200 IR
instructions, default-on; users can still set
`qd.loop_config(force_inline=-1)` to opt out). Also unconditionally
`alwaysinline` SNode child accessors (`get_ch_from_parent`) — they
are 1-GEP wrappers and the same attribute-mismatch path was leaving
the call inside the inner loop.
Promoted `QuadrantsLLVMContext::num_instructions` to public so the
amdgpu codegen can size-gate without duplicating the helper.
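The size-gated decision described above can be sketched like this; the 200-instruction threshold and the `force_inline=-1` opt-out come from the commit message, while the hint encoding and function itself are illustrative:

```python
# Hedged sketch of the force-inline decision for range_for bodies.
FORCE_INLINE_MAX_INSTRS = 200  # threshold from the commit message

def should_force_inline(num_instructions, hint=0):
    """hint: 0 = default (size-gated), -1 = user opt-out via
    qd.loop_config(force_inline=-1), 1 = explicit opt-in."""
    if hint == -1:
        return False             # user opted out
    if hint == 1:
        return True              # user forced it on
    return num_instructions <= FORCE_INLINE_MAX_INSTRS

assert should_force_inline(150)             # small body: inline by default
assert not should_force_inline(350)         # large body: leave to cost model
assert not should_force_inline(150, hint=-1)
```

Size-gating matters because unconditional `alwaysinline` on a large body can explode compile time, while the cost-model inliner is blocked here by the work-group-size attribute mismatch.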
Validated end-to-end on Genesis G1 8192-env benchmark (CG, 15 iters,
500 steps, MI300X, FP32):
pre-merge baseline: 1,790,770 env·steps/s
post-merge, no fixes: 1,386,055 env·steps/s (-22.6%)
post-merge, with fixes: 1,569,491 env·steps/s (-12.4%)
Recovers ~10pp of the regression. The remaining ~12% gap traces to
amdgpu kernel function attributes (`uniform-work-group-size`,
`amdgpu-waves-per-eu`, `unsafe-fp-math`, `amdgpu-ieee`) whose post-
LLVM-22 defaults flipped; those will be addressed in a follow-up.
Co-authored-by: Cursor <cursoragent@cursor.com>
…eline) Closes the remaining ~12pp post-merge throughput gap on the
Genesis G1 8192-env benchmark by removing the only AMDGPU pipeline
change that the LLVM-22 upgrade actually destabilized.

Diagnosis. After committing #1 (RuntimeContext alignment) and #2
(range_for body / SNode child accessor inlining), the benchmark sat at
1.57M env·steps/s vs. a 1.79M pre-merge baseline. A symbol-level diff
between the post-fix wheel and the last-known-good
`wheels_align_inline_fix` wheel (which was actually built before sccache
served the current jit_amdgpu.cpp.o) showed the pre-merge wheel was
missing every `AMDGPUFlatToGlobalLoadStorePass` symbol, i.e. it had
never run that pass at all. The wheel built from current source crashed
inside `qd.init(arch=qd.amdgpu)` with EXIT=139 in
`compile_module_to_hsaco` running the runtime bitcode.

Root cause. The pass's `originatesFromScratch` walks `Argument` values
back through `Function::users()` to inspect each direct caller,
recursing into the corresponding `CallBase::getArgOperand(ArgNo)`. Each
crossing of a caller boundary resets the visited-set (line 165,
`CallerVisited`), so cycles through the `runtime_*` / `LLVMRuntime_*`
call graph aren't broken. The pre-LLVM-22 runtime bitcode happened to
keep the recursion shallow; post-LLVM-22 (with kernarg-by-value
RuntimeContext, more internal helper functions, and a different
inlining shape) the same pass blows the stack on init. Disabling the
pass entirely is not an option: without it Genesis's solver kernels
emit wrong addresses on the constraint-force path and the simulation
poisons with NaN.

Fix. Skip `AMDGPUFlatToGlobalLoadStorePass` only on the runtime bitcode
module (detected by the unique presence of `runtime_initialize`) and
keep running it on every user-kernel module. The runtime functions are
inlined into user kernels at user-kernel JIT time, so all their
loads/stores still get the flat->global lowering, just in the right
module shape (the user kernel is well-formed; the runtime BC is a graph
of mutually-recursive helpers).

Validated end-to-end on Genesis G1 8192-env benchmark (CG, 15 iters,
500 steps, MI300X, FP32):
pre-merge baseline:          1,790,770 env·steps/s
post-merge, no fixes:        1,386,055 env·steps/s (-22.6%)
fixes #1+#2 only:            1,569,491 env·steps/s (-12.4%)
fixes #1+#2 + this commit:   1,791,235 env·steps/s (+0.03%)

Recovers the full 22.6pp regression. The previously-suspected "kernel
function attributes (uniform-work-group-size, waves-per-eu, fast-math)
flipped under LLVM 22" follow-up is no longer a gap to close:
`compile_module_to_hsaco` already reapplies those defaults to
AMDGPU_KERNEL functions (lines 75-86), and post this commit the
benchmark sits on the baseline regardless. Leaving them as-is.
Co-authored-by: Cursor <cursoragent@cursor.com>
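The visited-set bug behind the stack blow-up can be modeled in a few lines of Python; the call-graph names are illustrative, and Python's recursion limit stands in for the native stack:

```python
# Hedged model of the originatesFromScratch walk. Resetting the visited
# set at each caller boundary means cycles through a mutually recursive
# call graph are never broken, so the walk recurses forever.
callers = {"runtime_helper_a": ["runtime_helper_b"],
           "runtime_helper_b": ["runtime_helper_a"]}

def walk_resetting(fn):
    for caller in callers.get(fn, []):
        walk_resetting(caller)        # fresh "visited" per caller: no cycle break

def walk_shared(fn, visited):
    if fn in visited:
        return                        # shared set breaks the cycle
    visited.add(fn)
    for caller in callers.get(fn, []):
        walk_shared(caller, visited)

walk_shared("runtime_helper_a", set())   # terminates
blew_stack = False
try:
    walk_resetting("runtime_helper_a")   # models the post-LLVM-22 init crash
except RecursionError:
    blew_stack = True
assert blew_stack
```

The committed fix sidesteps the recursion by never handing the pass the one module shape (the mutually recursive runtime bitcode) that triggers it, rather than rewriting the walk.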
Summary
Brings 134 commits from Genesis-Embodied-AI/quadrants main into the
amd-integration fork. 28 files had conflicts; resolved across this
single merge commit.

AMD-specific kept
- force_inline plumbing (ForLoopConfig -> FrontendForStmt ->
  RangeForStmt/OffloadedStmt) and AST-lowering construction sites
- i64 ndarray indexing in codegen_llvm.cpp and ndarray.cpp (drops
  upstream's now-redundant int32 overflow warning)
- optimized_reduction returns nullptr (forces base CAS path with
  addrspace preserved); branchless f32/f64 sgn;
  gpu_parallel_range_for_fixed_config launch path;
  kernel_argument_struct_in_kernarg override; AMDGPU GlobalLoad
  override now also handles bit-pointer / quant-type loads from
  upstream
- -force-vector-interleave=8 cl-flag in jit_amdgpu.cpp; amdgpu kept in
  the GPU rand-state init path in llvm_runtime_executor.cpp
- cuda_shfl_xor_sync_f32 patch and block_dim parameter on
  mark_function_as_amdgpu_kernel in llvm_context.{h,cpp}
- fn_attrs cache key (offline_cache_util.cpp,
  python/quadrants/lang/_fast_caching/src_hasher.py) +
  kernel.py/kernel_impl.py/misc.py plumbing
- bls intentionally omitted in extension.cpp with a follow-up comment
  (sparse SNode codegen is not yet on AMDGPU)
- AMD-runner-only -a amdgpu invocation; merged upstream's TEST_EXIT
  pattern (.github/workflows/scripts_new/linux/4_test.sh)

Upstream features adopted
- loop_name on RangeFor / OffloadedStmt
- use_graph rename of use_cuda_graph
- bit-pointer / quant-type GlobalLoadStmt path (folded into the AMDGPU
  override)
- runtime/amdgpu/kernel_launcher.{h,cpp} taken wholesale: superset of
  prior AMD launcher; experimental exp12_diag + lazy-transfer
  scaffolding dropped
- hipHostMalloc/hipHostFree wiring + ndarray allocation-failure branch
  in rhi/amdgpu/
- supports_mem_pool plumbing and dynamic shared-mem profiler trace
- AGENTS.md, user-guide pages, PR-change reporter, etc.

Conflicted files (28)
quadrants/codegen/amdgpu/codegen_amdgpu.cpp,
quadrants/codegen/llvm/{codegen_llvm.cpp,codegen_llvm.h,struct_llvm.cpp},
quadrants/runtime/amdgpu/{jit_amdgpu.cpp,kernel_launcher.cpp,kernel_launcher.h},
quadrants/runtime/llvm/{llvm_context.cpp,llvm_context.h,llvm_runtime_executor.cpp},
quadrants/rhi/amdgpu/{amdgpu_context.cpp,amdgpu_context.h,amdgpu_device.cpp,amdgpu_driver_functions.inc.h},
quadrants/program/{compile_config.h,extension.cpp,ndarray.cpp},
quadrants/analysis/offline_cache_util.cpp,
quadrants/ir/{frontend_ir.cpp,frontend_ir.h,statements.h},
quadrants/transforms/lower_ast.cpp,
python/quadrants/lang/{kernel.py,kernel_impl.py,misc.py,_fast_caching/src_hasher.py},
tests/test_utils.py,
.github/workflows/scripts_new/linux/4_test.sh.