Fix Warp Builds on The Rock 7.12 #7

Open
rtmadduri wants to merge 3 commits into ROCm:amd-integration from
rtmadduri:build/fix-warp-builds-on-the-rock-712

Conversation

@rtmadduri
Collaborator

Environment

TheRock SDK: 7.12.0a20260309 (tarball: therock-dist-linux-gfx94X-dcgpu-7.12.0a20260309.tar.gz)

HIP version: 7.12.60610-3937beba96

AMD clang: 22.0.0git (ROCm/llvm-project.git c849bc16b0e49951d313756f20b73c2b28d321d7+PATCHED:9a6ac45c97a1e511db838c5b46257324d2de1780)

hipCUB: 4.3.0

rocPRIM: 4.3.0

OS: Ubuntu 24.04.4 LTS (Docker), kernel 6.8.0-31-generic

Target architecture: gfx942

Description

When compiling Warp's reduce.cu with hipcc from TheRock 7.12 targeting gfx942, the device-side compilation stage (clang-22 -cc1 -triple amdgcn-amd-amdhsa) takes 10+ minutes at any optimization level ≥ -O1, effectively appearing to hang. The same file compiles successfully in ~2 seconds with TheRock 7.10.

The issue is isolated to the AMDGCN device code compilation phase (step 1 of the hipcc pipeline). The process is not deadlocked — it consumes 100% CPU and ~1 GB RSS — but the LLVM optimizer runs for an excessive duration due to an interaction between the function inlining pass and the heavily-templated hipcub/rocPRIM reduction kernel code.

Trigger Code

The file reduce.cu instantiates hipcub::DeviceReduce::Sum (backed by rocprim::device_reduce) with two custom iterator types:

  1. cub_strided_iterator<T> — a strided random-access iterator
  2. cub_inner_product_iterator<ElemT, ScalarT> — a dual-pointer inner-product iterator

These are instantiated across multiple types: float, double, wp::vec_t<2,T>, wp::vec_t<3,T>, wp::vec_t<4,T>. The combination of deep template nesting in rocPRIM's reduction implementation and the custom iterators produces a large amount of device IR that triggers the pathological inlining behavior; a minimal sketch of the instantiation pattern follows.
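
The sketch below assumes hipCUB's standard two-phase temp-storage API; the iterator is an illustrative stand-in for Warp's cub_strided_iterator, not its actual definition:

// Illustrative sketch only: a strided random-access iterator fed to
// hipcub::DeviceReduce::Sum, mirroring the shape of Warp's instantiations.
#include <cstddef>
#include <iterator>
#include <hip/hip_runtime.h>
#include <hipcub/hipcub.hpp>

template <typename T>
struct strided_iter // hypothetical stand-in for cub_strided_iterator<T>
{
    using iterator_category = std::random_access_iterator_tag;
    using value_type        = T;
    using difference_type   = std::ptrdiff_t;
    using pointer           = const T*;
    using reference         = T;

    const T* ptr;
    std::ptrdiff_t stride;

    __host__ __device__ T operator*() const { return *ptr; }
    __host__ __device__ T operator[](std::ptrdiff_t i) const { return ptr[i * stride]; }
    __host__ __device__ strided_iter operator+(std::ptrdiff_t i) const { return {ptr + i * stride, stride}; }
    __host__ __device__ strided_iter& operator++() { ptr += stride; return *this; }
};

template <typename T>
void sum_strided(const T* d_in, T* d_out, int n, std::ptrdiff_t stride)
{
    strided_iter<T> in{d_in, stride};
    void*  d_temp     = nullptr;
    size_t temp_bytes = 0;
    // First call queries the temp storage size, second performs the reduction.
    hipcub::DeviceReduce::Sum(d_temp, temp_bytes, in, d_out, n);
    hipMalloc(&d_temp, temp_bytes);
    hipcub::DeviceReduce::Sum(d_temp, temp_bytes, in, d_out, n);
    hipFree(d_temp);
}

// reduce.cu instantiates this pattern for float, double, and wp::vec_t<2/3/4,T>.

Each such instantiation pulls rocPRIM's full device_reduce template stack into the translation unit, which is why the device IR volume scales with the number of type/iterator combinations.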

# Hangs (>10 min) — full -O3 device compilation
hipcc -x hip -std=c++17 -O3 -fPIC --offload-arch=gfx942 \
  -DWP_ENABLE_CUDA=1 -I"<warp>/warp/native" -DWP_ENABLE_MATHDX=0 \
  -o reduce.cu.o -c "<warp>/warp/native/reduce.cu"
# Isolate to device compilation only (same hang):
# Extract the cc1 invocation with: hipcc <same flags> -### 2>&1
# Then run the first clang-22 -cc1 -triple amdgcn-amd-amdhsa ... command directly
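
Concretely, the device cc1 invocation can be pulled out of the driver output like so (same flags as above):

hipcc -x hip -std=c++17 -O3 -fPIC --offload-arch=gfx942 \
  -DWP_ENABLE_CUDA=1 -I"<warp>/warp/native" -DWP_ENABLE_MATHDX=0 \
  -o reduce.cu.o -c "<warp>/warp/native/reduce.cu" -### 2>&1 | grep amdgcn-amd-amdhsa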

Root Cause Analysis

The compilation time is caused by the device-side function inlining pass in the AMDGCN backend of AMD clang 22. When the optimizer inlines the heavily-templated rocPRIM reduction kernels (instantiated via hipcub::DeviceReduce::Sum with custom iterators), the resulting IR size explodes, and the downstream optimization passes, whose cost grows superlinearly with function size, run for an excessive amount of time.
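
One way to confirm where the time goes (a diagnosis step not in the original report) is to append -ftime-report to the extracted amdgcn cc1 command from the Trigger Code section and let it run to completion:

<extracted clang-22 -cc1 -triple amdgcn-amd-amdhsa ... command> -ftime-report 2> time-report.txt
# The per-pass timing summary should show the inliner and the passes
# running after it dominating the wall-clock time.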

Workaround

Add -Xarch_device -fno-inline to the hipcc invocation for reduce.cu. This disables device-side function inlining while preserving all other -O3 optimizations (constant propagation, loop optimizations, dead code elimination, etc.). Host-side code remains at full -O3.

hipcc -x hip -std=c++17 -O3 -Xarch_device -fno-inline -fPIC --offload-arch=gfx942 \
  -DWP_ENABLE_CUDA=1 -I"<warp>/warp/native" -DWP_ENABLE_MATHDX=0 \
  -o reduce.cu.o -c "<warp>/warp/native/reduce.cu"
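
A quick sanity check of the workaround (hypothetical, using the same flags as the two commands above) is to time both variants:

time hipcc -x hip -std=c++17 -O3 <flags as above> -c "<warp>/warp/native/reduce.cu"                            # >10 min
time hipcc -x hip -std=c++17 -O3 -Xarch_device -fno-inline <flags as above> -c "<warp>/warp/native/reduce.cu"  # seconds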

@rtmadduri rtmadduri self-assigned this Mar 26, 2026
@jamesETsmith jamesETsmith added the bug Something isn't working label Mar 27, 2026
Collaborator

@jamesETsmith jamesETsmith left a comment

Running the tests locally now, I'll report back soon

Comment thread warp/native/warp.cu
Comment on lines +68 to +75
{ \
do { \
bool out = (check_any(code)); \
if(!out) { \
return out; \
} \
} while(0); \
}
Collaborator

There are a lot of formatting changes here. Are you using pre-commit?

Collaborator Author

Yes, I ran pre-commit

Comment thread warp/native/warp.cu
Comment on lines +3966 to +3974
std::string clang_res_include
= rocm_path + "/lib/llvm/lib/clang/" + std::to_string(__clang_major__) + "/include";
stored_options.push_back(std::string("-I") + clang_res_include);
opts.push_back(stored_options.back().c_str());

// ROCm include directory for HIP runtime headers
std::string rocm_include = rocm_path + "/include";
stored_options.push_back(std::string("-I") + rocm_include);
opts.push_back(stored_options.back().c_str());
Collaborator

Couldn't we also set ENV variables at runtime to point HIPRTC at these directories? Is this a problem with TheRock if we have to do this?

Collaborator Author

We can do that. But for some reason the issue shows up only with TheRock 7.11+.

@jamesETsmith
Collaborator

jamesETsmith commented Mar 27, 2026

@rtmadduri how are you installing pytorch here? Because if you're pip installing it from the nightly index, it will pull in its own rocm (and since it's nightly it'll be 7.13). I don't think it's behind the problem, but it's worth keeping an eye on

I was able to build and warp/tests/test_modules_lite.py passed for me

Here's a reproducer:

# build with:
# docker buildx build --progress plain --build-context warp_src=$(pwd) -f docker/rocm_ci/Dockerfile.test --target warp_build -t warp:test .
ARG BASE_IMAGE=ubuntu:24.04
FROM ${BASE_IMAGE} AS warp_build

WORKDIR /root/
ENV MAX_JOBS=128
ARG THE_ROCK_VERSION=7.12.0a20260309
ARG GFX_FAMILY="gfx94X-dcgpu"
ARG TORCH_VERSION=2.9.1
ARG PY_VERSION=3.12

# Install Ubuntu dependencies
ENV DEBIAN_FRONTEND=noninteractive
RUN apt update \
    && apt install -y \
    git \
    git-lfs \
    libtool \
    libegl1-mesa-dev \
    g++-14 \
    wget \
    sudo \
    curl \
    libstdc++-14-dev \
    libdw1 \
    libdrm-dev \
    ccache \
    && apt clean \
    && rm -rf /var/lib/apt/lists/* \
    && update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-14 100 \
    && update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-14 100

# python deps
COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/
RUN uv venv -p ${PY_VERSION} /opt/venv
ENV VIRTUAL_ENV=/opt/venv
ENV PATH="/opt/venv/bin:$PATH"
RUN uv pip install cmake pybind11 build ninja scikit-build-core setuptools-scm numpy pytest pytest-xdist Pillow && \
    uv pip install --index-url https://rocm.nightlies.amd.com/v2/${GFX_FAMILY}/ "rocm[libraries,devel]"==${THE_ROCK_VERSION} torch==${TORCH_VERSION} && \
    rocm-sdk init

# build + install
# COPY . /root/warp/
RUN git clone https://github.com/rtmadduri/warp.git -b build/fix-warp-builds-on-the-rock-712 /root/warp
WORKDIR /root/warp

RUN cat <<'EOF' > build.sh
set -exuo pipefail
export ROCM_PATH=$(rocm-sdk path --root)
export LD_LIBRARY_PATH=$(rocm-sdk path --root)/lib
python build_lib.py --jobs $(nproc) 2>&1 | tee build_lib_${THE_ROCK_VERSION}.log
python -m build --wheel -C--build-option=-Plinux-x86_64
uv pip install dist/warp_lang-*.whl
python warp/tests/test_modules_lite.py
uv pip list
EOF
RUN /bin/bash build.sh

@jamesETsmith
Collaborator

jamesETsmith commented Mar 27, 2026

@rtmadduri have you tried other options to limit the inlining rather than turning it off completely? e.g. -mllvm -inline-threshold=100?

Godbolt example: https://godbolt.org/z/zKjjajv6K
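
Applied to the failing invocation from the description, that suggestion would look something like this (untested sketch; -mllvm forwards the threshold to both the host and device LLVM pipelines):

hipcc -x hip -std=c++17 -O3 -mllvm -inline-threshold=100 -fPIC --offload-arch=gfx942 \
  -DWP_ENABLE_CUDA=1 -I"<warp>/warp/native" -DWP_ENABLE_MATHDX=0 \
  -o reduce.cu.o -c "<warp>/warp/native/reduce.cu"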
