Merge/upstream main 2026 05 08 #31

Open
npoulad1 wants to merge 142 commits into amd-integration from
merge/upstream-main-2026-05-08

Conversation

@npoulad1 npoulad1 commented May 8, 2026

Summary

Brings 134 commits from Genesis-Embodied-AI/quadrants main into the
amd-integration fork. 28 files had conflicts; resolved across this
single merge commit.

AMD-specific kept

  • force_inline plumbing (ForLoopConfig / FrontendForStmt /
    RangeForStmt / OffloadedStmt) and AST-lowering construction sites
  • i64 ndarray indexing in codegen_llvm.cpp and ndarray.cpp (drops
    upstream's now-redundant int32 overflow warning)
  • AMDGPU optimized_reduction returns nullptr (forces base CAS path
    with addrspace preserved); branchless f32/f64 sgn;
    gpu_parallel_range_for_fixed_config launch path;
    kernel_argument_struct_in_kernarg override; AMDGPU GlobalLoad
    override now also handles bit-pointer / quant-type loads from
    upstream
  • HSACo cache + -force-vector-interleave=8 cl-flag in jit_amdgpu.cpp
  • AMDGPU rand-state init in llvm_runtime_executor.cpp;
    cuda_shfl_xor_sync_f32 patch and block_dim parameter on
    mark_function_as_amdgpu_kernel in llvm_context.{h,cpp}
  • fn_attrs cache key (offline_cache_util.cpp,
    python/quadrants/lang/_fast_caching/src_hasher.py) +
    kernel.py / kernel_impl.py / misc.py plumbing
  • AMDGPU bls intentionally omitted in extension.cpp with a
    follow-up comment (sparse SNode codegen is not yet on AMDGPU)
  • AMDGPU CI script keeps the -a amdgpu invocation; merged upstream's
    TEST_EXIT pattern (.github/workflows/scripts_new/linux/4_test.sh)

Upstream features adopted

  • loop_name on RangeFor / OffloadedStmt
  • use_graph rename of use_cuda_graph
  • bit-pointer / quant-type GlobalLoadStmt path (folded into AMDGPU
    override)
  • autodiff stack runtime support (size-expr eval, per-task heap)
  • per-handle persistent kernel-launcher buffers
    (runtime/amdgpu/kernel_launcher.{h,cpp} taken wholesale — superset
    of prior AMD launcher; experimental exp12_diag + lazy-transfer
    scaffolding dropped)
  • hipHostMalloc / hipHostFree wiring + ndarray allocation-failure
    branch in rhi/amdgpu/
  • supports_mem_pool plumbing and dynamic shared-mem profiler trace
  • new docs / workflow files (AGENTS.md, user-guide pages, PR-change
    reporter, etc.)

Conflicted files (28)

quadrants/codegen/amdgpu/codegen_amdgpu.cpp,
quadrants/codegen/llvm/{codegen_llvm.cpp,codegen_llvm.h,struct_llvm.cpp},
quadrants/runtime/amdgpu/{jit_amdgpu.cpp,kernel_launcher.cpp,kernel_launcher.h},
quadrants/runtime/llvm/{llvm_context.cpp,llvm_context.h,llvm_runtime_executor.cpp},
quadrants/rhi/amdgpu/{amdgpu_context.cpp,amdgpu_context.h,amdgpu_device.cpp,amdgpu_driver_functions.inc.h},
quadrants/program/{compile_config.h,extension.cpp,ndarray.cpp},
quadrants/analysis/offline_cache_util.cpp,
quadrants/ir/{frontend_ir.cpp,frontend_ir.h,statements.h},
quadrants/transforms/lower_ast.cpp,
python/quadrants/lang/{kernel.py,kernel_impl.py,misc.py,_fast_caching/src_hasher.py},
tests/test_utils.py,
.github/workflows/scripts_new/linux/4_test.sh.

v01dXYZ and others added 30 commits March 26, 2026 11:01
Co-authored-by: v01dxyz <v01dxyz@v01d.xyz>
…I#473)

Co-authored-by: Hugh Perkins <hughperkins@gmail.com>
hughperkins and others added 24 commits May 4, 2026 10:59
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Johnny Nunez and Hugh Perkins
Bring in 134 upstream commits from Genesis-Embodied-AI/quadrants main
into the amd-integration fork. Conflicts resolved across 28 files;
the resolution preserves AMD-specific work (force_inline hint, i64
ndarray indexing, AMDGPU-specific codegen overrides, HSACo caching,
fn_attrs cache key) while adopting upstream features (loop_name on
RangeFor / OffloadedStmt, use_graph rename of use_cuda_graph, bit-
pointer / quant-type GlobalLoad path, autodiff stack runtime, per-
handle kernel-launcher buffers, hipHostMalloc/Free wiring, ndarray
allocation failure handling, supports_mem_pool plumbing).

Notable resolutions:
- quadrants/codegen/amdgpu/codegen_amdgpu.cpp: keep branchless float
  sgn select, keep `optimized_reduction` returning nullptr (forces
  the base CAS path with addrspace preserved), keep
  `gpu_parallel_range_for_fixed_config` launch path, fold upstream's
  bit-pointer load into the AMDGPU GlobalLoadStmt override, keep
  `kernel_argument_struct_in_kernarg`, retain fixed-config grid_dim
  for range_for / listgen.
- quadrants/runtime/amdgpu/kernel_launcher.{h,cpp}: take upstream
  wholesale (per-handle persistent buffers + autodiff support is a
  superset of the prior AMD launcher).
- quadrants/runtime/amdgpu/jit_amdgpu.cpp: keep HSACo cache and the
  `-force-vector-interleave=8` cl-flag injection.
- quadrants/runtime/llvm/llvm_runtime_executor.cpp: include amdgpu
  in the GPU rand-state init path; keep mem-pool support detection.
- quadrants/runtime/llvm/llvm_context.{h,cpp}: keep
  `cuda_shfl_xor_sync_f32` patch, keep `block_dim` parameter on
  `mark_function_as_amdgpu_kernel`, expose `num_instructions`.
- quadrants/codegen/llvm/codegen_llvm.cpp: keep i64 widening for
  ndarray runtime shapes, keep i64 size_var for tensor-element
  index, keep i64 linear_index passthrough as address_offset; merge
  AMD `kernel_argument_struct_in_kernarg` branch with upstream's
  per-task adstack reset block.
- quadrants/program/ndarray.cpp: keep size_t accumulator for
  nelement_, drop upstream's int32 overflow warning (superseded by
  i64 indexing).
- quadrants/ir/{statements.h,frontend_ir.{h,cpp}},
  quadrants/transforms/lower_ast.cpp: carry both `loop_name` and
  `force_inline` through ForLoopConfig, FrontendForStmt,
  RangeForStmt, OffloadedStmt and the AST-lowering construction
  sites.
- quadrants/program/extension.cpp: keep amdgpu omitted from `bls`
  (sparse SNode codegen is not yet on AMDGPU; comment documents the
  follow-up).
- quadrants/analysis/offline_cache_util.cpp,
  python/quadrants/lang/{kernel.py,kernel_impl.py,misc.py,
  _fast_caching/src_hasher.py}: keep `fn_attrs` plumbing alongside
  upstream's `use_graph`/`name=` renames.
- quadrants/rhi/amdgpu/{amdgpu_context.{h,cpp},amdgpu_device.cpp,
  amdgpu_driver_functions.inc.h}: take upstream's hipHostMalloc/Free
  + null-alloc handling; keep AMD-side `kernel_arg_pointer_` storage
  and the `dynamic_shared_mem_bytes` profiler trace; remove the
  duplicate `supports_mem_pool` declaration.
- .github/workflows/scripts_new/linux/4_test.sh: merge upstream's
  TEST_EXIT pattern with the AMD-runner-only `-a amdgpu` invocation
  (CUDA phased coverage stays in 4_test_cuda.sh).
- tests/test_utils.py: take upstream's set ordering of `archs`
  (functionally equivalent).

Co-authored-by: Cursor <cursoragent@cursor.com>
Run pre-commit on the merge branch and apply the same fixes CI would
have applied:

- black: reformat tests/python/test_fn_attrs.py.
- clang-format: re-wrap argument lists / lambda bodies in
  quadrants/codegen/amdgpu/codegen_amdgpu.cpp,
  quadrants/codegen/llvm/codegen_llvm.cpp,
  quadrants/program/{compile_config.cpp,fn_attrs_registry.h,kernel.h},
  quadrants/python/export_lang.cpp,
  quadrants/rhi/amdgpu/{amdgpu_context.cpp,amdgpu_device.cpp,
  amdgpu_driver_functions.inc.h},
  quadrants/runtime/amdgpu/jit_amdgpu.cpp,
  quadrants/runtime/llvm/{llvm_context.cpp,llvm_context_pass.h,
  runtime_module/runtime.cpp}.
- trailing-whitespace: strip trailing spaces in Dockerfile.rocm.
- ruff/pylint F401/W0611: drop unused `import math` from
  python/quadrants/lang/_func_base.py (came in via upstream/main; the
  module never references math).

No semantic changes.

Co-authored-by: Cursor <cursoragent@cursor.com>
The upstream merge auto-merged into amdgpu_driver.h cleanly because the
same symbols had been added at different line ranges on each side, but
the resulting file declared each symbol twice and failed to compile:

  error: redefinition of 'constexpr const uint32 HIP_MEMPOOL_ATTR_RELEASE_THRESHOLD'
  error: 'void AMDGPUDriver::malloc_async(...)'  cannot be overloaded with itself
  error: 'void AMDGPUDriver::mem_free_async(...)' cannot be overloaded with itself

Keep the upstream-style declarations (with the explanatory mem-pool
fallback comment and dev_ptr parameter naming, matching the .cpp
definitions in amdgpu_driver.cpp) and drop the AMD-side duplicates.

No semantic change — both sides declared the same prototypes; this is
purely a textual de-dup so the build succeeds.

Co-authored-by: Cursor <cursoragent@cursor.com>

npoulad1 commented May 8, 2026

The overall run of Genesis on top of the new quadrants upstream changes is crashing. Currently investigating.

npoulad1 and others added 5 commits May 8, 2026 18:55
The merge resolution that brought back HEAD's `kernel_argument_struct_in_kernarg()`
override on AMDGPU caused an HSA_STATUS_ERROR_ILLEGAL_INSTRUCTION on the very
first kernel launch under the upstream'd kernel_launcher. Upstream's launcher
passes RuntimeContext to the kernel by pointer (`runtime_context_dev_ptr`), but
the override was making codegen emit kernels that receive RuntimeContext by
value via kernarg, producing an ABI mismatch.

Three hunks in codegen_llvm.cpp drove the by-value path:
  - early `context_param_type` selection in `init_offloaded_task_function`
  - `context_val_alloca_` creation/store in the function entry block
  - `get_context()` returning the alloca instead of the kernarg

All three are removed; the kernel signature is now `void(%RuntimeContext*)`
on every backend, matching upstream's kernel_launcher ABI. The unrelated
AMDGPU pieces of the merge resolution (i64 ndarray indexing widening,
addrspace-preserving int-ptr type, fn_attrs threading, `block_dim`-aware
mark_function_as_amdgpu_kernel) are kept.

Co-authored-by: Cursor <cursoragent@cursor.com>
Brings in amd-integration#29 (chore(lint): fix pre-commit issues; simplify
linters workflow). Conflicts in 13 files were all stylistic — the merge
branch already had its own pre-commit pass at 67b6f7a, so the conflicts
were two parallel format applications of the same code. Resolved by
taking ours (already-linted on top of the upstream merge); re-ran
pre-commit afterwards to confirm a clean tree (only a trailing-blank-line
in codegen_llvm.cpp needed touching up). Non-conflicting amd-integration
changes pulled in cleanly:
  - .github/workflows/linters.yml -> pre-commit/action@v3.0.1 (+ ubuntu-24.04 pin)
  - .github/workflows/scripts_new/linters.sh -> deleted
  - .pre-commit-config.yaml -> dropped python3.10 pins

Co-authored-by: Cursor <cursoragent@cursor.com>
…text

Restores the pre-merge AMDGPU launch fast path that the wholesale upstream
launcher had displaced. Cumulative effect: 1.387M -> 1.570M env*steps/s on
the Genesis G1 8192-env benchmark (+13.2%, recovering ~10pp of the 22.5pp
post-merge regression vs `baseline_today`).

Diagnosis. Upstream's `kernel_launcher.cpp` passes `RuntimeContext` to the
kernel by *pointer*: it allocates a per-handle device buffer and memcpy's
the entire `RuntimeContext` host->device on every kernel launch
(`memcpy_host_to_device_async`, sizeof(RuntimeContext) bytes), and the
kernel reads
context fields by dereferencing the pointer in HBM. The pre-merge AMDGPU
fast path placed the `RuntimeContext` bytes directly into the AQL kernarg
packet (the codegen override `kernel_argument_struct_in_kernarg() == true`
plus a launcher payload of `&ctx.get_context() / sizeof(RuntimeContext)`),
which both (a) drops the per-launch H2D memcpy entirely and (b) reads
context fields from the AMDGPU kernarg cache instead of HBM.

Changes.
- `codegen_llvm.cpp`: re-establish the kernarg-by-value codegen path on
  AMDGPU - early `context_param_type` selection in
  `init_offloaded_task_function`, alloca + store of the kernarg in the
  function entry, and `get_context()` returning the alloca. A previous
  iteration of this commit dropped these to chase the post-merge crash;
  the actual root cause was the launcher mismatch, not the codegen path,
  so it returns intact.
- `kernel_launcher.{h,cpp}`: thread `kernarg_payload` / `kernarg_size`
  through `launch_offloaded_tasks{,_with_do_while}` so the AQL kernarg
  packet receives the host `RuntimeContext` bytes directly. The
  device-side `RuntimeContext` shadow (`runtime_context_dev_ptr`) is now
  lazy: only kernels that hit the adstack publish path
  (`task.ad_stack.allocas != {}`) allocate it, gated by an `std::any_of`
  pre-pass; forward-only kernels (Genesis hot path) skip both the
  malloc and the H2D entirely. Adstack-cache invalidation
  (`bump_writes_for_kernel_llvm`) is now no-ops away on programs that
  have never seen an autodiff kernel via a thread-local
  `any_autodiff_seen` flag, which the same `needs_device_runtime_ctx`
  pre-pass arms.
- `kernel_launcher.h`: per-handle `resolved_funcs` cache to skip the
  per-launch `JITModuleAMDGPU::lookup_function` mutex+hash, plus a
  thread-local `cached_set_ctx` short-circuit around `make_current()` so
  the HIP driver context setter skips the locked global call when the
  context is unchanged. Both mirror pre-merge HEAD's hoists.

Co-authored-by: Cursor <cursoragent@cursor.com>
The 2026-05-08 upstream merge introduced two AMDGPU codegen regressions
on top of an unchanged Genesis kernel set. Identified by IR-diffing the
same Genesis hot kernel before/after the merge:

1. RuntimeContext kernarg-by-value spill became packed `align 1`.
   Adding `cpu_assert_failed` after `result_buffer` left a trailing
   4-byte field that clang folded by emitting RuntimeContext as a
   `<{ ptr, ptr, i32, ptr, i32 }>` packed struct with `align 1`. The
   AMDGPU backend then lowered the kernarg-load → kernarg-store of the
   `RuntimeContext` argument as byte-by-byte copies instead of two 8-byte
   coalesced stores, regressing every kernel launch.

   Fix: reorder the two int32 fields back-to-back so the struct is
   `{ ptr, ptr, i32, i32, ptr }` (8-aligned, no trailing tail-padding).
   Post-fix the kernarg load is `align 16` and the spill store is
   `align 8`.

2. range_for body functions (`function_body`) lost `alwaysinline` and
   the cost-model inliner refuses to inline them on amdgpu because of
   an attribute mismatch — body inherits the conservative
   `amdgpu-flat-work-group-size="1,128"` fallback in jit_amdgpu.cpp,
   while the calling kernel is "64,64", which blocks the inliner.
   Result: an `s_swappc_b64` per loop trip per thread.

   Fix: re-add size-gated `mark_inline()` for range_for bodies in
   `create_offload_range_for` (force-inline iff body ≤200 IR
   instructions, default-on; users can still set
   `qd.loop_config(force_inline=-1)` to opt out). Also unconditionally
   `alwaysinline` SNode child accessors (`get_ch_from_parent`) — they
   are 1-GEP wrappers and the same attribute-mismatch path was leaving
   the call inside the inner loop.
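
   The field-ordering fix in item 1 follows from ordinary struct layout
   rules (field names here are hypothetical and the exact sizes assume
   64-bit pointers): a lone i32 trailing a pointer leaves tail padding,
   while pairing the two i32 fields packs them into one 8-byte slot.

   ```cpp
   #include <cassert>
   #include <cstdint>

   struct BadOrder {   // { ptr, ptr, i32, ptr, i32 }
     void *a;
     void *b;
     int32_t c;        // followed by 4 bytes of padding
     void *d;
     int32_t e;        // trailing i32: 4 bytes of tail padding
   };

   struct GoodOrder {  // { ptr, ptr, i32, i32, ptr }
     void *a;
     void *b;
     int32_t c;
     int32_t e;        // the two i32s share one 8-byte slot
     void *d;
   };

   int main() {
     // Both are naturally 8-aligned as plain C++ structs; the reorder
     // saves the padded slot (40 vs 32 bytes on LP64).
     static_assert(alignof(BadOrder) == alignof(GoodOrder), "");
     assert(sizeof(BadOrder) == sizeof(GoodOrder) + 8);
     return 0;
   }
   ```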

Promoted `QuadrantsLLVMContext::num_instructions` to public so the
amdgpu codegen can size-gate without duplicating the helper.

Validated end-to-end on Genesis G1 8192-env benchmark (CG, 15 iters,
500 steps, MI300X, FP32):

  pre-merge baseline:        1,790,770 env·steps/s
  post-merge, no fixes:      1,386,055 env·steps/s   (-22.6%)
  post-merge, with fixes:    1,569,491 env·steps/s   (-12.4%)

Recovers ~10pp of the regression. The remaining ~12% gap traces to
amdgpu kernel function attributes (`uniform-work-group-size`,
`amdgpu-waves-per-eu`, `unsafe-fp-math`, `amdgpu-ieee`) whose post-
LLVM-22 defaults flipped; those will be addressed in a follow-up.

Co-authored-by: Cursor <cursoragent@cursor.com>
…eline)

Closes the remaining ~12pp post-merge throughput gap on the Genesis G1
8192-env benchmark by removing the only AMDGPU pipeline change that the
LLVM-22 upgrade actually destabilized.

Diagnosis. After committing #1 (RuntimeContext alignment) and #2
(range_for body / SNode child accessor inlining), the benchmark sat at
1.57M env·steps/s vs. a 1.79M pre-merge baseline. A symbol-level diff
between the post-fix wheel and the last-known-good `wheels_align_inline_fix`
wheel (which was actually built before sccache served the current
jit_amdgpu.cpp.o) showed the pre-merge wheel was missing every
`AMDGPUFlatToGlobalLoadStorePass` symbol — i.e. it had never run that
pass, full stop. The wheel built from current source crashed inside
`qd.init(arch=qd.amdgpu)` with EXIT=139 in
`compile_module_to_hsaco` running the runtime bitcode.

Root cause. The pass's `originatesFromScratch` walks `Argument` values
back through `Function::users()` to inspect each direct caller, recursing
into the corresponding `CallBase::getArgOperand(ArgNo)`. Each crossing
of a caller boundary resets the visited-set (line 165, `CallerVisited`),
so cycles through the `runtime_*` / `LLVMRuntime_*` call graph aren't
broken. The pre-LLVM-22 runtime bitcode happened to keep the recursion
shallow; post-LLVM-22 (with kernarg-by-value RuntimeContext, more
internal helper functions, and a different inlining shape) the same
pass blows the stack on init. Disabling the pass entirely is not an
option — without it Genesis's solver kernels emit wrong addresses on
the constraint-force path and the simulation poisons with NaN.

Fix. Skip `AMDGPUFlatToGlobalLoadStorePass` only on the runtime
bitcode module — detected by the unique presence of `runtime_initialize`
— and keep running it on every user-kernel module. The runtime
functions are inlined into user kernels at user-kernel JIT time, so
all their loads/stores still get the flat→global lowering, just in
the right module shape (the user kernel is well-formed; the runtime
BC is a graph of mutually-recursive helpers).

Validated end-to-end on Genesis G1 8192-env benchmark (CG, 15 iters,
500 steps, MI300X, FP32):

  pre-merge baseline:           1,790,770 env·steps/s
  post-merge, no fixes:         1,386,055 env·steps/s   (-22.6%)
  fixes #1+#2 only:             1,569,491 env·steps/s   (-12.4%)
  fixes #1+#2 + this commit:    1,791,235 env·steps/s   (+0.03%)

Recovers the full 22.6pp regression. The previously-suspected
"kernel function attributes (uniform-work-group-size, waves-per-eu,
fast-math) flipped under LLVM 22" follow-up is no longer a gap to
close — `compile_module_to_hsaco` already reapplies those defaults to
AMDGPU_KERNEL functions (lines 75-86), and post this commit the
benchmark sits on the baseline regardless. Leaving them as-is.

Co-authored-by: Cursor <cursoragent@cursor.com>