Merge/upstream main 2026 05 08 #31

Open
npoulad1 wants to merge 142 commits into amd-integration from
merge/upstream-main-2026-05-08

Conversation

@npoulad1 npoulad1 commented May 8, 2026

Summary

Brings 134 commits from Genesis-Embodied-AI/quadrants main into the
amd-integration fork. 28 files had conflicts; resolved across this
single merge commit.

AMD-specific kept

  • force_inline plumbing (ForLoopConfig / FrontendForStmt /
    RangeForStmt / OffloadedStmt) and AST-lowering construction sites
  • i64 ndarray indexing in codegen_llvm.cpp and ndarray.cpp (drops
    upstream's now-redundant int32 overflow warning)
  • AMDGPU optimized_reduction returns nullptr (forces base CAS path
    with addrspace preserved); branchless f32/f64 sgn;
    gpu_parallel_range_for_fixed_config launch path;
    kernel_argument_struct_in_kernarg override; AMDGPU GlobalLoad
    override now also handles bit-pointer / quant-type loads from
    upstream
  • HSACo cache + -force-vector-interleave=8 cl-flag in jit_amdgpu.cpp
  • AMDGPU rand-state init in llvm_runtime_executor.cpp;
    cuda_shfl_xor_sync_f32 patch and block_dim parameter on
    mark_function_as_amdgpu_kernel in llvm_context.{h,cpp}
  • fn_attrs cache key (offline_cache_util.cpp,
    python/quadrants/lang/_fast_caching/src_hasher.py) +
    kernel.py / kernel_impl.py / misc.py plumbing
  • AMDGPU bls intentionally omitted in extension.cpp with a
    follow-up comment (sparse SNode codegen is not yet on AMDGPU)
  • AMDGPU CI script keeps the -a amdgpu invocation; merged upstream's
    TEST_EXIT pattern (.github/workflows/scripts_new/linux/4_test.sh)

Upstream features adopted

  • loop_name on RangeFor / OffloadedStmt
  • use_graph rename of use_cuda_graph
  • bit-pointer / quant-type GlobalLoadStmt path (folded into AMDGPU
    override)
  • autodiff stack runtime support (size-expr eval, per-task heap)
  • per-handle persistent kernel-launcher buffers
    (runtime/amdgpu/kernel_launcher.{h,cpp} taken wholesale — superset
    of prior AMD launcher; experimental exp12_diag + lazy-transfer
    scaffolding dropped)
  • hipHostMalloc / hipHostFree wiring + ndarray allocation-failure
    branch in rhi/amdgpu/
  • supports_mem_pool plumbing and dynamic shared-mem profiler trace
  • new docs / workflow files (AGENTS.md, user-guide pages, PR-change
    reporter, etc.)

Conflicted files (28)

quadrants/codegen/amdgpu/codegen_amdgpu.cpp,
quadrants/codegen/llvm/{codegen_llvm.cpp,codegen_llvm.h,struct_llvm.cpp},
quadrants/runtime/amdgpu/{jit_amdgpu.cpp,kernel_launcher.cpp,kernel_launcher.h},
quadrants/runtime/llvm/{llvm_context.cpp,llvm_context.h,llvm_runtime_executor.cpp},
quadrants/rhi/amdgpu/{amdgpu_context.cpp,amdgpu_context.h,amdgpu_device.cpp,amdgpu_driver_functions.inc.h},
quadrants/program/{compile_config.h,extension.cpp,ndarray.cpp},
quadrants/analysis/offline_cache_util.cpp,
quadrants/ir/{frontend_ir.cpp,frontend_ir.h,statements.h},
quadrants/transforms/lower_ast.cpp,
python/quadrants/lang/{kernel.py,kernel_impl.py,misc.py,_fast_caching/src_hasher.py},
tests/test_utils.py,
.github/workflows/scripts_new/linux/4_test.sh.

v01dXYZ and others added 30 commits March 26, 2026 11:01
Co-authored-by: v01dxyz <v01dxyz@v01d.xyz>
…I#473)

Co-authored-by: Hugh Perkins <hughperkins@gmail.com>
hughperkins and others added 24 commits May 4, 2026 10:59
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Johnny Nunez and Hugh Perkins
Bring in 134 upstream commits from Genesis-Embodied-AI/quadrants main
into the amd-integration fork. Conflicts resolved across 28 files;
the resolution preserves AMD-specific work (force_inline hint, i64
ndarray indexing, AMDGPU-specific codegen overrides, HSACo caching,
fn_attrs cache key) while adopting upstream features (loop_name on
RangeFor / OffloadedStmt, use_graph rename of use_cuda_graph, bit-
pointer / quant-type GlobalLoad path, autodiff stack runtime, per-
handle kernel-launcher buffers, hipHostMalloc/Free wiring, ndarray
allocation failure handling, supports_mem_pool plumbing).

Notable resolutions:
- quadrants/codegen/amdgpu/codegen_amdgpu.cpp: keep branchless float
  sgn select, keep `optimized_reduction` returning nullptr (forces
  the base CAS path with addrspace preserved), keep
  `gpu_parallel_range_for_fixed_config` launch path, fold upstream's
  bit-pointer load into the AMDGPU GlobalLoadStmt override, keep
  `kernel_argument_struct_in_kernarg`, retain fixed-config grid_dim
  for range_for / listgen.
- quadrants/runtime/amdgpu/kernel_launcher.{h,cpp}: take upstream
  wholesale (per-handle persistent buffers + autodiff support is a
  superset of the prior AMD launcher).
- quadrants/runtime/amdgpu/jit_amdgpu.cpp: keep HSACo cache and the
  `-force-vector-interleave=8` cl-flag injection.
- quadrants/runtime/llvm/llvm_runtime_executor.cpp: include amdgpu
  in the GPU rand-state init path; keep mem-pool support detection.
- quadrants/runtime/llvm/llvm_context.{h,cpp}: keep
  `cuda_shfl_xor_sync_f32` patch, keep `block_dim` parameter on
  `mark_function_as_amdgpu_kernel`, expose `num_instructions`.
- quadrants/codegen/llvm/codegen_llvm.cpp: keep i64 widening for
  ndarray runtime shapes, keep i64 size_var for tensor-element
  index, keep i64 linear_index passthrough as address_offset; merge
  AMD `kernel_argument_struct_in_kernarg` branch with upstream's
  per-task adstack reset block.
- quadrants/program/ndarray.cpp: keep size_t accumulator for
  nelement_, drop upstream's int32 overflow warning (superseded by
  i64 indexing).
- quadrants/ir/{statements.h,frontend_ir.{h,cpp}},
  quadrants/transforms/lower_ast.cpp: carry both `loop_name` and
  `force_inline` through ForLoopConfig, FrontendForStmt,
  RangeForStmt, OffloadedStmt and the AST-lowering construction
  sites.
- quadrants/program/extension.cpp: keep amdgpu omitted from `bls`
  (sparse SNode codegen is not yet on AMDGPU; comment documents the
  follow-up).
- quadrants/analysis/offline_cache_util.cpp,
  python/quadrants/lang/{kernel.py,kernel_impl.py,misc.py,
  _fast_caching/src_hasher.py}: keep `fn_attrs` plumbing alongside
  upstream's `use_graph`/`name=` renames.
- quadrants/rhi/amdgpu/{amdgpu_context.{h,cpp},amdgpu_device.cpp,
  amdgpu_driver_functions.inc.h}: take upstream's hipHostMalloc/Free
  + null-alloc handling; keep AMD-side `kernel_arg_pointer_` storage
  and the `dynamic_shared_mem_bytes` profiler trace; remove the
  duplicate `supports_mem_pool` declaration.
- .github/workflows/scripts_new/linux/4_test.sh: merge upstream's
  TEST_EXIT pattern with the AMD-runner-only `-a amdgpu` invocation
  (CUDA phased coverage stays in 4_test_cuda.sh).
- tests/test_utils.py: take upstream's set ordering of `archs`
  (functionally equivalent).

Co-authored-by: Cursor <cursoragent@cursor.com>
Run pre-commit on the merge branch and apply the same fixes CI would
have applied:

- black: reformat tests/python/test_fn_attrs.py.
- clang-format: re-wrap argument lists / lambda bodies in
  quadrants/codegen/amdgpu/codegen_amdgpu.cpp,
  quadrants/codegen/llvm/codegen_llvm.cpp,
  quadrants/program/{compile_config.cpp,fn_attrs_registry.h,kernel.h},
  quadrants/python/export_lang.cpp,
  quadrants/rhi/amdgpu/{amdgpu_context.cpp,amdgpu_device.cpp,
  amdgpu_driver_functions.inc.h},
  quadrants/runtime/amdgpu/jit_amdgpu.cpp,
  quadrants/runtime/llvm/{llvm_context.cpp,llvm_context_pass.h,
  runtime_module/runtime.cpp}.
- trailing-whitespace: strip trailing spaces in Dockerfile.rocm.
- ruff/pylint F401/W0611: drop unused `import math` from
  python/quadrants/lang/_func_base.py (came in via upstream/main; the
  module never references math).

No semantic changes.

Co-authored-by: Cursor <cursoragent@cursor.com>
The upstream merge auto-merged into amdgpu_driver.h cleanly because the
same symbols had been added at different line ranges on each side, but
the resulting file declared each symbol twice and failed to compile:

  error: redefinition of 'constexpr const uint32 HIP_MEMPOOL_ATTR_RELEASE_THRESHOLD'
  error: 'void AMDGPUDriver::malloc_async(...)'  cannot be overloaded with itself
  error: 'void AMDGPUDriver::mem_free_async(...)' cannot be overloaded with itself

Keep the upstream-style declarations (with the explanatory mem-pool
fallback comment and dev_ptr parameter naming, matching the .cpp
definitions in amdgpu_driver.cpp) and drop the AMD-side duplicates.

No semantic change — both sides declared the same prototypes; this is
purely a textual de-dup so the build succeeds.

Co-authored-by: Cursor <cursoragent@cursor.com>

npoulad1 commented May 8, 2026

The overall run of Genesis on top of the new quadrants upstream changes is crashing. Currently investigating.

npoulad1 and others added 5 commits May 8, 2026 18:55
The merge resolution that brought back HEAD's `kernel_argument_struct_in_kernarg()`
override on AMDGPU caused an HSA_STATUS_ERROR_ILLEGAL_INSTRUCTION on the very
first kernel launch under the upstream'd kernel_launcher. Upstream's launcher
passes RuntimeContext to the kernel by pointer (`runtime_context_dev_ptr`), but
the override was making codegen emit kernels that receive RuntimeContext by
value via kernarg, producing an ABI mismatch.

Three hunks in codegen_llvm.cpp drove the by-value path:
  - early `context_param_type` selection in `init_offloaded_task_function`
  - `context_val_alloca_` creation/store in the function entry block
  - `get_context()` returning the alloca instead of the kernarg

All three are removed; the kernel signature is now `void(%RuntimeContext*)`
on every backend, matching upstream's kernel_launcher ABI. The unrelated
AMDGPU pieces of the merge resolution (i64 ndarray indexing widening,
addrspace-preserving int-ptr type, fn_attrs threading, `block_dim`-aware
mark_function_as_amdgpu_kernel) are kept.

Co-authored-by: Cursor <cursoragent@cursor.com>
Brings in amd-integration#29 (chore(lint): fix pre-commit issues; simplify
linters workflow). Conflicts in 13 files were all stylistic — the merge
branch already had its own pre-commit pass at 67b6f7a, so the conflicts
were two parallel format applications of the same code. Resolved by
taking ours (already-linted on top of the upstream merge); re-ran
pre-commit afterwards to confirm a clean tree (only a trailing-blank-line
in codegen_llvm.cpp needed touching up). Non-conflicting amd-integration
changes pulled in cleanly:
  - .github/workflows/linters.yml -> pre-commit/action@v3.0.1 (+ ubuntu-24.04 pin)
  - .github/workflows/scripts_new/linters.sh -> deleted
  - .pre-commit-config.yaml -> dropped python3.10 pins

Co-authored-by: Cursor <cursoragent@cursor.com>
…text

Restores the pre-merge AMDGPU launch fast path that the wholesale upstream
launcher had displaced. Cumulative effect: 1.387M -> 1.570M env*steps/s on
the Genesis G1 8192-env benchmark (+13.2%, recovering ~10pp of the 22.5pp
post-merge regression vs `baseline_today`).

Diagnosis. Upstream's `kernel_launcher.cpp` passes `RuntimeContext` to the
kernel by *pointer*: it allocates a per-handle device buffer and memcpy's
the entire `RuntimeContext` host->device on every kernel launch
(`memcpy_host_to_device_async`, sizeof(RuntimeContext) bytes), and the
kernel reads
context fields by dereferencing the pointer in HBM. The pre-merge AMDGPU
fast path placed the `RuntimeContext` bytes directly into the AQL kernarg
packet (the codegen override `kernel_argument_struct_in_kernarg() == true`
plus a launcher payload of `&ctx.get_context() / sizeof(RuntimeContext)`),
which both (a) drops the per-launch H2D memcpy entirely and (b) reads
context fields from the AMDGPU kernarg cache instead of HBM.

Changes.
- `codegen_llvm.cpp`: re-establish the kernarg-by-value codegen path on
  AMDGPU - early `context_param_type` selection in
  `init_offloaded_task_function`, alloca + store of the kernarg in the
  function entry, and `get_context()` returning the alloca. A previous
  iteration of this commit dropped these to chase the post-merge crash;
  the actual root cause was the launcher mismatch, not the codegen path,
  so it returns intact.
- `kernel_launcher.{h,cpp}`: thread `kernarg_payload` / `kernarg_size`
  through `launch_offloaded_tasks{,_with_do_while}` so the AQL kernarg
  packet receives the host `RuntimeContext` bytes directly. The
  device-side `RuntimeContext` shadow (`runtime_context_dev_ptr`) is now
  lazy: only kernels that hit the adstack publish path
  (`task.ad_stack.allocas != {}`) allocate it, gated by an `std::any_of`
  pre-pass; forward-only kernels (Genesis hot path) skip both the
  malloc and the H2D entirely. Adstack-cache invalidation
  (`bump_writes_for_kernel_llvm`) is now no-ops away on programs that
  have never seen an autodiff kernel via a thread-local
  `any_autodiff_seen` flag, which the same `needs_device_runtime_ctx`
  pre-pass arms.
- `kernel_launcher.h`: per-handle `resolved_funcs` cache to skip the
  per-launch `JITModuleAMDGPU::lookup_function` mutex+hash, plus a
  thread-local `cached_set_ctx` short-circuit around `make_current()` so
  the HIP driver context setter skips the locked global call when the
  context is unchanged. Both mirror pre-merge HEAD's hoists.

Co-authored-by: Cursor <cursoragent@cursor.com>
The 2026-05-08 upstream merge introduced two AMDGPU codegen regressions
on top of an unchanged Genesis kernel set. Identified by IR-diffing the
same Genesis hot kernel before/after the merge:

1. RuntimeContext kernarg-by-value spill became packed `align 1`.
   Adding `cpu_assert_failed` after `result_buffer` left a trailing
   4-byte field that clang folded by emitting RuntimeContext as a
   `<{ ptr, ptr, i32, ptr, i32 }>` packed struct with `align 1`. The
   AMDGPU backend then lowered the kernarg-load → kernarg-store of the
   `RuntimeContext` argument as byte-by-byte copies instead of two 8-byte
   coalesced stores, regressing every kernel launch.

   Fix: reorder the two int32 fields back-to-back so the struct is
   `{ ptr, ptr, i32, i32, ptr }` (8-aligned, no trailing tail-padding).
   Post-fix the kernarg load is `align 16` and the spill store is
   `align 8`.

2. range_for body functions (`function_body`) lost `alwaysinline` and
   the cost-model inliner refuses to inline them on amdgpu because of
   an attribute mismatch — body inherits the conservative
   `amdgpu-flat-work-group-size="1,128"` fallback in jit_amdgpu.cpp,
   while the calling kernel is "64,64", which blocks the inliner.
   Result: an `s_swappc_b64` per loop trip per thread.

   Fix: re-add size-gated `mark_inline()` for range_for bodies in
   `create_offload_range_for` (force-inline iff body ≤200 IR
   instructions, default-on; users can still set
   `qd.loop_config(force_inline=-1)` to opt out). Also unconditionally
   `alwaysinline` SNode child accessors (`get_ch_from_parent`) — they
   are 1-GEP wrappers and the same attribute-mismatch path was leaving
   the call inside the inner loop.
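
   The field-ordering fix in item 1 follows from ordinary struct layout
   rules (field names here are hypothetical and the exact sizes assume
   64-bit pointers): a lone i32 trailing a pointer leaves tail padding,
   while pairing the two i32 fields packs them into one 8-byte slot.

   ```cpp
   #include <cassert>
   #include <cstdint>

   struct BadOrder {   // { ptr, ptr, i32, ptr, i32 }
     void *a;
     void *b;
     int32_t c;        // followed by 4 bytes of padding
     void *d;
     int32_t e;        // trailing i32: 4 bytes of tail padding
   };

   struct GoodOrder {  // { ptr, ptr, i32, i32, ptr }
     void *a;
     void *b;
     int32_t c;
     int32_t e;        // the two i32s share one 8-byte slot
     void *d;
   };

   int main() {
     // Both are naturally 8-aligned as plain C++ structs; the reorder
     // saves the padded slot (40 vs 32 bytes on LP64).
     static_assert(alignof(BadOrder) == alignof(GoodOrder), "");
     assert(sizeof(BadOrder) == sizeof(GoodOrder) + 8);
     return 0;
   }
   ```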

Promoted `QuadrantsLLVMContext::num_instructions` to public so the
amdgpu codegen can size-gate without duplicating the helper.

Validated end-to-end on Genesis G1 8192-env benchmark (CG, 15 iters,
500 steps, MI300X, FP32):

  pre-merge baseline:        1,790,770 env·steps/s
  post-merge, no fixes:      1,386,055 env·steps/s   (-22.6%)
  post-merge, with fixes:    1,569,491 env·steps/s   (-12.4%)

Recovers ~10pp of the regression. The remaining ~12% gap traces to
amdgpu kernel function attributes (`uniform-work-group-size`,
`amdgpu-waves-per-eu`, `unsafe-fp-math`, `amdgpu-ieee`) whose post-
LLVM-22 defaults flipped; those will be addressed in a follow-up.

Co-authored-by: Cursor <cursoragent@cursor.com>
…eline)

Closes the remaining ~12pp post-merge throughput gap on the Genesis G1
8192-env benchmark by removing the only AMDGPU pipeline change that the
LLVM-22 upgrade actually destabilized.

Diagnosis. After committing #1 (RuntimeContext alignment) and #2
(range_for body / SNode child accessor inlining), the benchmark sat at
1.57M env·steps/s vs. a 1.79M pre-merge baseline. A symbol-level diff
between the post-fix wheel and the last-known-good `wheels_align_inline_fix`
wheel (which was actually built before sccache served the current
jit_amdgpu.cpp.o) showed the pre-merge wheel was missing every
`AMDGPUFlatToGlobalLoadStorePass` symbol — i.e. it had never run that
pass, full stop. The wheel built from current source crashed inside
`qd.init(arch=qd.amdgpu)` with EXIT=139 in
`compile_module_to_hsaco` running the runtime bitcode.

Root cause. The pass's `originatesFromScratch` walks `Argument` values
back through `Function::users()` to inspect each direct caller, recursing
into the corresponding `CallBase::getArgOperand(ArgNo)`. Each crossing
of a caller boundary resets the visited-set (line 165, `CallerVisited`),
so cycles through the `runtime_*` / `LLVMRuntime_*` call graph aren't
broken. The pre-LLVM-22 runtime bitcode happened to keep the recursion
shallow; post-LLVM-22 (with kernarg-by-value RuntimeContext, more
internal helper functions, and a different inlining shape) the same
pass blows the stack on init. Disabling the pass entirely is not an
option — without it Genesis's solver kernels emit wrong addresses on
the constraint-force path and the simulation poisons with NaN.

Fix. Skip `AMDGPUFlatToGlobalLoadStorePass` only on the runtime
bitcode module — detected by the unique presence of `runtime_initialize`
— and keep running it on every user-kernel module. The runtime
functions are inlined into user kernels at user-kernel JIT time, so
all their loads/stores still get the flat→global lowering, just in
the right module shape (the user kernel is well-formed; the runtime
BC is a graph of mutually-recursive helpers).

Validated end-to-end on Genesis G1 8192-env benchmark (CG, 15 iters,
500 steps, MI300X, FP32):

  pre-merge baseline:           1,790,770 env·steps/s
  post-merge, no fixes:         1,386,055 env·steps/s   (-22.6%)
  fixes #1+#2 only:             1,569,491 env·steps/s   (-12.4%)
  fixes #1+#2 + this commit:    1,791,235 env·steps/s   (+0.03%)

Recovers the full 22.6pp regression. The previously-suspected
"kernel function attributes (uniform-work-group-size, waves-per-eu,
fast-math) flipped under LLVM 22" follow-up is no longer a gap to
close — `compile_module_to_hsaco` already reapplies those defaults to
AMDGPU_KERNEL functions (lines 75-86), and post this commit the
benchmark sits on the baseline regardless. Leaving them as-is.

Co-authored-by: Cursor <cursoragent@cursor.com>