Fix Warp Builds on The Rock 7.12 #7

Open
rtmadduri wants to merge 3 commits into ROCm:amd-integration from
rtmadduri:build/fix-warp-builds-on-the-rock-712

Conversation

@rtmadduri
Collaborator

Environment

TheRock SDK: 7.12.0a20260309 (tarball: therock-dist-linux-gfx94X-dcgpu-7.12.0a20260309.tar.gz)

HIP version: 7.12.60610-3937beba96

AMD clang: 22.0.0git (ROCm/llvm-project.git c849bc16b0e49951d313756f20b73c2b28d321d7+PATCHED:9a6ac45c97a1e511db838c5b46257324d2de1780)

hipCUB: 4.3.0

rocPRIM: 4.3.0

OS: Ubuntu 24.04.4 LTS (Docker), kernel 6.8.0-31-generic

Target architecture: gfx942

Description

When compiling Warp's reduce.cu with hipcc from TheRock 7.12 targeting gfx942, the device-side compilation stage (clang-22 -cc1 -triple amdgcn-amd-amdhsa) takes 10+ minutes at any optimization level ≥ -O1, effectively appearing to hang. The same file compiles successfully in ~2 seconds with TheRock 7.10.

The issue is isolated to the AMDGCN device code compilation phase (step 1 of the hipcc pipeline). The process is not deadlocked — it consumes 100% CPU and ~1 GB RSS — but the LLVM optimizer runs for an excessive duration due to an interaction between the function inlining pass and the heavily-templated hipcub/rocPRIM reduction kernel code.

Trigger Code

The file reduce.cu instantiates hipcub::DeviceReduce::Sum (backed by rocprim::device_reduce) with two custom iterator types:

  1. cub_strided_iterator<T> — a strided random-access iterator
  2. cub_inner_product_iterator<ElemT, ScalarT> — a dual-pointer inner-product iterator

These are instantiated across multiple types: float, double, wp::vec_t<2,T>, wp::vec_t<3,T>, wp::vec_t<4,T>. The combination of deep template nesting in rocPRIM's reduction implementation and the custom iterators produces a large amount of device IR that triggers the pathological inlining behavior; a minimal sketch of the instantiation pattern follows.
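
The sketch below assumes hipCUB's standard two-phase temp-storage API; the iterator is an illustrative stand-in for Warp's cub_strided_iterator, not its actual definition:

// Illustrative sketch only: a strided random-access iterator fed to
// hipcub::DeviceReduce::Sum, mirroring the shape of Warp's instantiations.
#include <cstddef>
#include <iterator>
#include <hip/hip_runtime.h>
#include <hipcub/hipcub.hpp>

template <typename T>
struct strided_iter // hypothetical stand-in for cub_strided_iterator<T>
{
    using iterator_category = std::random_access_iterator_tag;
    using value_type        = T;
    using difference_type   = std::ptrdiff_t;
    using pointer           = const T*;
    using reference         = T;

    const T* ptr;
    std::ptrdiff_t stride;

    __host__ __device__ T operator*() const { return *ptr; }
    __host__ __device__ T operator[](std::ptrdiff_t i) const { return ptr[i * stride]; }
    __host__ __device__ strided_iter operator+(std::ptrdiff_t i) const { return {ptr + i * stride, stride}; }
    __host__ __device__ strided_iter& operator++() { ptr += stride; return *this; }
};

template <typename T>
void sum_strided(const T* d_in, T* d_out, int n, std::ptrdiff_t stride)
{
    strided_iter<T> in{d_in, stride};
    void*  d_temp     = nullptr;
    size_t temp_bytes = 0;
    // First call queries the temp storage size, second performs the reduction.
    hipcub::DeviceReduce::Sum(d_temp, temp_bytes, in, d_out, n);
    hipMalloc(&d_temp, temp_bytes);
    hipcub::DeviceReduce::Sum(d_temp, temp_bytes, in, d_out, n);
    hipFree(d_temp);
}

// reduce.cu instantiates this pattern for float, double, and wp::vec_t<2/3/4,T>.

Each such instantiation pulls rocPRIM's full device_reduce template stack into the translation unit, which is why the device IR volume scales with the number of type/iterator combinations.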

# Hangs (>10 min) — full -O3 device compilation
hipcc -x hip -std=c++17 -O3 -fPIC --offload-arch=gfx942 \
  -DWP_ENABLE_CUDA=1 -I"<warp>/warp/native" -DWP_ENABLE_MATHDX=0 \
  -o reduce.cu.o -c "<warp>/warp/native/reduce.cu"
# Isolate to device compilation only (same hang):
# Extract the cc1 invocation with: hipcc <same flags> -### 2>&1
# Then run the first clang-22 -cc1 -triple amdgcn-amd-amdhsa ... command directly
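
Concretely, the device cc1 invocation can be pulled out of the driver output like so (same flags as above):

hipcc -x hip -std=c++17 -O3 -fPIC --offload-arch=gfx942 \
  -DWP_ENABLE_CUDA=1 -I"<warp>/warp/native" -DWP_ENABLE_MATHDX=0 \
  -o reduce.cu.o -c "<warp>/warp/native/reduce.cu" -### 2>&1 | grep amdgcn-amd-amdhsa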

Root Cause Analysis

The compilation time is caused by the device-side function inlining pass in the AMDGCN backend of AMD clang 22. When the optimizer inlines the heavily-templated rocPRIM reduction kernels (instantiated via hipcub::DeviceReduce::Sum with custom iterators), the resulting IR size explodes, and the downstream optimization passes, whose cost grows superlinearly with function size, run for an excessive amount of time.
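
One way to confirm where the time goes (a diagnosis step not in the original report) is to append -ftime-report to the extracted amdgcn cc1 command from the Trigger Code section and let it run to completion:

<extracted clang-22 -cc1 -triple amdgcn-amd-amdhsa ... command> -ftime-report 2> time-report.txt
# The per-pass timing summary should show the inliner and the passes
# running after it dominating the wall-clock time.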

Workaround

Add -Xarch_device -fno-inline to the hipcc invocation for reduce.cu. This disables device-side function inlining while preserving all other -O3 optimizations (constant propagation, loop optimizations, dead code elimination, etc.). Host-side code remains at full -O3.

hipcc -x hip -std=c++17 -O3 -Xarch_device -fno-inline -fPIC --offload-arch=gfx942 \
  -DWP_ENABLE_CUDA=1 -I"<warp>/warp/native" -DWP_ENABLE_MATHDX=0 \
  -o reduce.cu.o -c "<warp>/warp/native/reduce.cu"
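
A quick sanity check of the workaround (hypothetical, using the same flags as the two commands above) is to time both variants:

time hipcc -x hip -std=c++17 -O3 <flags as above> -c "<warp>/warp/native/reduce.cu"                            # >10 min
time hipcc -x hip -std=c++17 -O3 -Xarch_device -fno-inline <flags as above> -c "<warp>/warp/native/reduce.cu"  # seconds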

@rtmadduri rtmadduri self-assigned this Mar 26, 2026
@jamesETsmith jamesETsmith added the bug Something isn't working label Mar 27, 2026
Collaborator

@jamesETsmith jamesETsmith left a comment

Running the tests locally now, I'll report back soon

Comment thread warp/native/warp.cu
Comment on lines +68 to +75
{ \
do { \
bool out = (check_any(code)); \
if(!out) { \
return out; \
} \
} while(0); \
}
Collaborator

There are a lot of formatting changes here. Are you using pre-commit?

Collaborator Author

Yes, I ran pre-commit

Comment thread warp/native/warp.cu
Comment on lines +3966 to +3974
std::string clang_res_include
= rocm_path + "/lib/llvm/lib/clang/" + std::to_string(__clang_major__) + "/include";
stored_options.push_back(std::string("-I") + clang_res_include);
opts.push_back(stored_options.back().c_str());

// ROCm include directory for HIP runtime headers
std::string rocm_include = rocm_path + "/include";
stored_options.push_back(std::string("-I") + rocm_include);
opts.push_back(stored_options.back().c_str());
Collaborator

Couldn't we also set ENV variables at runtime to point HIPRTC at these directories? Is this a problem with TheRock if we have to do this?

Collaborator Author

We can do that. But for some reason the issue shows up only with TheRock 7.11+.

@jamesETsmith
Collaborator

jamesETsmith commented Mar 27, 2026

@rtmadduri how are you installing pytorch here? Because if you're pip installing it from the nightly index, it will pull in its own rocm (and since it's nightly it'll be 7.13). I don't think it's behind the problem, but it's worth keeping an eye on

I was able to build and warp/tests/test_modules_lite.py passed for me

Here's a reproducer:

# build with:
# docker buildx build --progress plain --build-context warp_src=$(pwd) -f docker/rocm_ci/Dockerfile.test --target warp_build -t warp:test .
ARG BASE_IMAGE=ubuntu:24.04
FROM ${BASE_IMAGE} AS warp_build

WORKDIR /root/
ENV MAX_JOBS=128
ARG THE_ROCK_VERSION=7.12.0a20260309
ARG GFX_FAMILY="gfx94X-dcgpu"
ARG TORCH_VERSION=2.9.1
ARG PY_VERSION=3.12

# Install Ubuntu dependencies
ENV DEBIAN_FRONTEND=noninteractive
RUN apt update \
    && apt install -y \
    git \
    git-lfs \
    libtool \
    libegl1-mesa-dev \
    g++-14 \
    wget \
    sudo \
    curl \
    libstdc++-14-dev \
    libdw1 \
    libdrm-dev \
    ccache \
    && apt clean \
    && rm -rf /var/lib/apt/lists/* \
    && update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-14 100 \
    && update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-14 100

# python deps
COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/
RUN uv venv -p ${PY_VERSION} /opt/venv
ENV VIRTUAL_ENV=/opt/venv
ENV PATH="/opt/venv/bin:$PATH"
RUN uv pip install cmake pybind11 build ninja scikit-build-core setuptools-scm numpy pytest pytest-xdist Pillow && \
    uv pip install --index-url https://rocm.nightlies.amd.com/v2/${GFX_FAMILY}/ "rocm[libraries,devel]"==${THE_ROCK_VERSION} torch==${TORCH_VERSION} && \
    rocm-sdk init

# build + install
# COPY . /root/warp/
RUN git clone https://github.com/rtmadduri/warp.git -b build/fix-warp-builds-on-the-rock-712 /root/warp
WORKDIR /root/warp

RUN cat <<'EOF' > build.sh
set -exuo pipefail
export ROCM_PATH=$(rocm-sdk path --root)
export LD_LIBRARY_PATH=$(rocm-sdk path --root)/lib
python build_lib.py --jobs $(nproc) 2>&1 | tee build_lib_${THE_ROCK_VERSION}.log
python -m build --wheel -C--build-option=-Plinux-x86_64
uv pip install dist/warp_lang-*.whl
python warp/tests/test_modules_lite.py
uv pip list
EOF
RUN /bin/bash build.sh

@jamesETsmith
Collaborator

jamesETsmith commented Mar 27, 2026

@rtmadduri have you tried other options to limit the inlining rather than turning it off completely? e.g. -mllvm -inline-threshold=100?

Godbolt example: https://godbolt.org/z/zKjjajv6K
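
Applied to the failing invocation from the description, that suggestion would look something like this (untested sketch; -mllvm forwards the threshold to both the host and device LLVM pipelines):

hipcc -x hip -std=c++17 -O3 -mllvm -inline-threshold=100 -fPIC --offload-arch=gfx942 \
  -DWP_ENABLE_CUDA=1 -I"<warp>/warp/native" -DWP_ENABLE_MATHDX=0 \
  -o reduce.cu.o -c "<warp>/warp/native/reduce.cu"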
