Fix Warp Builds on The Rock 7.12 #7
Conversation
```cpp
{ \
    do { \
        bool out = (check_any(code)); \
        if(!out) { \
            return out; \
        } \
    } while(0); \
}
```
There are a lot of formatting changes here. Are you using pre-commit?
Yes, I ran pre-commit
```cpp
std::string clang_res_include
    = rocm_path + "/lib/llvm/lib/clang/" + std::to_string(__clang_major__) + "/include";
stored_options.push_back(std::string("-I") + clang_res_include);
opts.push_back(stored_options.back().c_str());

// ROCm include directory for HIP runtime headers
std::string rocm_include = rocm_path + "/include";
stored_options.push_back(std::string("-I") + rocm_include);
opts.push_back(stored_options.back().c_str());
```
Couldn't we also set environment variables at runtime to point HIPRTC at these directories? Is this a problem with TheRock if we have to do this?
We can do that. But for some reason the issue only shows up with TheRock 7.11+.
@rtmadduri how are you installing pytorch here? Because if you're pip installing it from the nightly index, it will pull in its own ROCm (and since it's nightly it'll be 7.13). I don't think it's behind the problem, but it's worth keeping an eye on. I was able to build, and here's a reproducer:

```dockerfile
# build with:
# docker buildx build --progress plain --build-context warp_src=$(pwd) -f docker/rocm_ci/Dockerfile.test --target warp_build -t warp:test .
ARG BASE_IMAGE=ubuntu:24.04
FROM ${BASE_IMAGE} AS warp_build
WORKDIR /root/
ENV MAX_JOBS=128
ARG THE_ROCK_VERSION=7.12.0a20260309
ARG GFX_FAMILY="gfx94X-dcgpu"
ARG TORCH_VERSION=2.9.1
ARG PY_VERSION=3.12

# Install Ubuntu dependencies
ENV DEBIAN_FRONTEND=noninteractive
RUN apt update \
    && apt install -y \
        git \
        git-lfs \
        libtool \
        libegl1-mesa-dev \
        g++-14 \
        wget \
        sudo \
        curl \
        libstdc++-14-dev \
        libdw1 \
        libdrm-dev \
        ccache \
    && apt clean \
    && rm -rf /var/lib/apt/lists/* \
    && update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-14 100 \
    && update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-14 100

# python deps
COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/
RUN uv venv -p ${PY_VERSION} /opt/venv
ENV VIRTUAL_ENV=/opt/venv
ENV PATH="/opt/venv/bin:$PATH"
RUN uv pip install cmake pybind11 build ninja scikit-build-core setuptools-scm numpy pytest pytest-xdist Pillow && \
    uv pip install --index-url https://rocm.nightlies.amd.com/v2/${GFX_FAMILY}/ "rocm[libraries,devel]"==${THE_ROCK_VERSION} torch==${TORCH_VERSION} && \
    rocm-sdk init

# build + install
# COPY . /root/warp/
RUN git clone https://github.com/rtmadduri/warp.git -b build/fix-warp-builds-on-the-rock-712 /root/warp
WORKDIR /root/warp
RUN cat <<'EOF' > build.sh
set -exuo pipefail
export ROCM_PATH=$(rocm-sdk path --root)
export LD_LIBRARY_PATH=$(rocm-sdk path --root)/lib
python build_lib.py --jobs $(nproc) 2>&1 | tee build_lib_${THE_ROCK_VERSION}.log
python -m build --wheel -C--build-option=-Plinux-x86_64
uv pip install dist/warp_lang-*.whl
python warp/tests/test_modules_lite.py
uv pip list
EOF
RUN /bin/bash build.sh
```
@rtmadduri have you tried other options to limit the inlining rather than turning it off completely? e.g. Godbolt example: https://godbolt.org/z/zKjjajv6K
Environment
Description
When compiling Warp's reduce.cu with hipcc from TheRock 7.12 targeting gfx942, the device-side compilation stage (clang-22 -cc1 -triple amdgcn-amd-amdhsa) takes 10+ minutes at any optimization level ≥ -O1, effectively appearing to hang. The same file compiles in ~2 seconds with TheRock 7.10.
The issue is isolated to the AMDGCN device code compilation phase (step 1 of the hipcc pipeline). The process is not deadlocked — it consumes 100% CPU and ~1 GB RSS — but the LLVM optimizer runs for an excessive duration due to an interaction between the function inlining pass and the heavily-templated hipcub/rocPRIM reduction kernel code.
Trigger Code
The file `reduce.cu` instantiates `hipcub::DeviceReduce::Sum` (backed by `rocprim::device_reduce`) with two custom iterator types:

- `cub_strided_iterator<T>` — a strided random-access iterator
- `cub_inner_product_iterator<ElemT, ScalarT>` — a dual-pointer inner-product iterator

These are instantiated across multiple types: `float`, `double`, `wp::vec_t<2,T>`, `wp::vec_t<3,T>`, `wp::vec_t<4,T>`. The combination of deep template nesting in rocPRIM's reduction implementation and the custom iterators produces a large amount of device IR that triggers the inlining behavior.
Root Cause Analysis
The compilation time is caused by the device-side function inlining pass in the AMDGCN backend of AMD clang 22. When the optimizer inlines the heavily-templated rocPRIM reduction kernels (instantiated via `hipcub::DeviceReduce::Sum` with custom iterators), the resulting IR size explodes, causing downstream optimization passes to run for an excessive amount of time.
Workaround
Add `-Xarch_device -fno-inline` to the hipcc invocation for `reduce.cu`. This disables device-side function inlining while preserving all other -O3 optimizations (constant propagation, loop optimizations, dead code elimination, etc.). Host-side code remains at full -O3.