diff --git a/.claude/settings.json b/.claude/settings.json
new file mode 100644
index 000000000..27d702496
--- /dev/null
+++ b/.claude/settings.json
@@ -0,0 +1,10 @@
+{
+  "extraKnownMarketplaces": {
+    "amd-claude-marketplace": {
+      "source": {
+        "source": "github",
+        "repo": "ROCm/amd-claude-marketplace"
+      }
+    }
+  }
+}
diff --git a/CLAUDE.md b/CLAUDE.md
new file mode 100644
index 000000000..f7bfdc21f
--- /dev/null
+++ b/CLAUDE.md
@@ -0,0 +1,112 @@
+# Agent instructions for TransformerEngine (ROCm fork)
+
+## Docker containers
+- We work in Docker containers for reproducibility.
+- Run build/test commands **only** inside the designated container (not on host).
+- If container is unspecified, ask for the exact image/tag and launch command **before** running anything expensive.
+- Prefer editable installs (`pip install -e .`).
+- Before debugging, record: container image/tag, ROCm version, GPU arch, TE commit, submodule state.
+- If results are suspicious, first verify you are in the expected container and that GPU devices/libs are exposed correctly.
+
+## Architecture
+- One core C++/HIP library + optional framework bindings:
+  - core: `transformer_engine/common` → `libtransformer_engine.so`
+  - PyTorch: `transformer_engine/pytorch` + `transformer_engine/pytorch/csrc`
+  - JAX: `transformer_engine/jax` + `transformer_engine/jax/csrc/extensions`
+- Python import flow:
+  - framework selection: `transformer_engine/__init__.py` (`NVTE_FRAMEWORK` = `pytorch|jax|all|none`)
+  - `.so` loading: `transformer_engine/common/__init__.py` (`load_framework_extension`)
+- Build orchestration: `setup.py` + `build_tools/*.py` + CMake.
+  - `build_tools/utils.py::rocm_build()` auto-detects ROCm first, then CUDA, unless `NVTE_USE_ROCM` is set.
+- 3rdparty submodules: `aiter`, `aotriton`, `cudnn-frontend`, `cutlass`, `googletest`, `hipify_torch`.
+
+## Hipify convention
+The build auto-generates HIP files from CUDA sources via `hipify_torch`. Generated files are marked with `// !!! This is a file automatically generated by hipify!!!` at line 1. **Never edit generated files directly** — edit the CUDA source instead.
+
+File extension mapping:
+
+| CUDA source | Generated HIP file |
+|---|---|
+| `.cu` | `.hip` |
+| `.cuh` | `_hip.cuh` |
+| `.cpp` | `_hip.cpp` |
+| `.h` | `_hip.h` |
+
+The following directories are **excluded** from hipify (native ROCm code — edit directly):
+- `transformer_engine/common/ck_fused_attn/` — CK kernel wrappers
+- `transformer_engine/common/amd_detail/` — AMD-specific utilities
+- `transformer_engine/common/rocshmem_api/` — ROCshmem wrappers
+
+Framework bindings (`pytorch/csrc`, `jax/csrc`) are hipified separately via `build_tools/pytorch.py` and `build_tools/jax.py`.
+
+## Fused attention backends
+Backends are gated by env vars (set to `0` to disable, unset or `1` to enable):
+
+| Env var | Controls | Default |
+|---|---|---|
+| `NVTE_FUSED_ATTN` | Master toggle for all fused attention | `1` |
+| `NVTE_FUSED_ATTN_CK` | CK backend | inherits `NVTE_FUSED_ATTN` |
+| `NVTE_FUSED_ATTN_AOTRITON` | AOTriton backend | inherits `NVTE_FUSED_ATTN` |
+| `NVTE_FLASH_ATTN` | Flash attention | `1` |
+
+CI backend configs (`ci/_utils.sh::configure_fused_attn_env`): `auto`, `ck`, `aotriton`, `flash`, `unfused`.
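These toggles compose; as a short sketch (illustrative values only, variable names from the table above), isolating the AOTriton backend while triaging a failure might look like:

```shell
# Isolate the AOTriton backend: leave the master toggle on, disable the
# CK backend, and optionally rule out flash attention for this run.
# Unset variables inherit the defaults described in the table above.
export NVTE_FUSED_ATTN=1           # master toggle (also the default)
export NVTE_FUSED_ATTN_CK=0        # disable the CK backend
export NVTE_FUSED_ATTN_AOTRITON=1  # keep AOTriton enabled
export NVTE_FLASH_ATTN=0           # rule out flash attention
```

Run the reproducer in the same shell afterwards; the debug logging variables below can confirm which backend was actually selected.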
+
+### ROCm fused-attn file layout
+- **Runtime backend selection/dispatch**: `transformer_engine/common/fused_attn_rocm/fused_attn.cpp` (hipified)
+- **CK dispatch glue**: `transformer_engine/common/fused_attn_rocm/fused_attn_ck.cpp` (hipified)
+- **AOTriton dispatch glue**: `transformer_engine/common/fused_attn_rocm/fused_attn_aotriton.cpp` (hipified)
+- **CK kernel wrappers** (native, not hipified):
+  - `transformer_engine/common/ck_fused_attn/src/ck_fused_attn_{fwd,bwd,utils}.cpp`
+  - `transformer_engine/common/ck_fused_attn/include/ck_fused_attn/ck_fused_attn.hpp`
+
+### Debug logging env vars
+- `NVTE_DEBUG=1` + `NVTE_DEBUG_LEVEL={0,1,2}` — Python-level attention debug output
+- `NVTE_LOG_FUSED_ATTN_CONFIG=1` — C++ backend selection logging
+- `NVTE_LOG_CK_CONFIG=1` — CK-specific config logging
+- `NVTE_LOG_AOTRITON_CONFIG=1` — AOTriton-specific config logging
+- `CK_FUSED_ATTN_LOG_CONFIG=1` — CK kernel wrapper logging
+
+## Developer workflows
+- Always init submodules first: `git submodule update --init --recursive`.
+- Source install: `pip install . --no-build-isolation`.
+- C++ tests: `ci/core.sh`.
+- Framework CI tests (shell scripts, not bare pytest):
+  - PyTorch: `ci/pytorch.sh` | JAX: `ci/jax.sh`
+  - Control via `TEST_LEVEL`, `TEST_SGPU`, `TEST_MGPU`, `TEST_FILTER` (from `ci/_utils.sh`).
+
+## Code conventions
+- Edit `transformer_engine/*`, `build_tools/*`, `tests/*`, `ci/*`; avoid `3rdparty/*` unless explicitly required.
+- Keep env-var behavior stable; tests toggle flags intentionally.
+- Python: Black, line length 100. C/C++: cpplint + `.clang-format`.
+- **Preserve the existing style of each file you edit.** Much of the codebase originates from upstream, and style can vary file-to-file (naming conventions, comment style, control flow patterns, etc.). Before writing new code in a file, read enough of it to understand how similar logic is already written, and follow that style. Consistency within a file matters more than imposing a uniform style across the project.
+
+## Copyright headers
+When you modify a file, update its copyright header so the end-year reflects the current year.
+
+This repo carries **two** copyright lines — AMD and NVIDIA. Follow these rules:
+
+1. **Files with an existing AMD copyright line** — update the AMD end-year to the current year (e.g. `2025` → `2026`). Leave the NVIDIA line untouched.
+2. **Files with only an NVIDIA copyright line** — add an AMD line **above** the NVIDIA line:
+   - Python: `# Copyright (c) <year>, Advanced Micro Devices, Inc. All rights reserved.`
+   - C/C++/HIP: `/* Copyright (c) <year>, Advanced Micro Devices, Inc. All rights reserved. */` (or use the `*`-block style matching the file).
+   - `<year>` is the current year (single year) for newly-added lines, e.g. `2026`.
+3. **New files you create** — include both AMD and NVIDIA headers with the current year, followed by a blank comment line and `See LICENSE for license information.`
+4. **Never change the NVIDIA copyright year range** — those dates are updated during IFU (integrate from upstream) merges.
+
+AMD headers are our addition and should stay consistent with the patterns already in the codebase.
+
+## Memory management
+When writing or updating memories in the project memory directory, follow these guidelines:
+
+- **Scope**: only save information that will be useful in future conversations. Do not save ephemeral task details, debugging breadcrumbs, or things derivable from the code/git history.
+- **Check before writing**: read `MEMORY.md` and check for an existing memory on the same topic before creating a new file. Update the existing memory instead of duplicating.
+- **File naming**: use short, descriptive, snake_case names (e.g. `aiter_build.md`, `container_setup.md`). Group by topic, not by date.
+- **Frontmatter**: every memory file must have the standard `name`, `description`, and `type` frontmatter fields.
+- **Index maintenance**: after creating or removing a memory file, update `MEMORY.md` to keep the index in sync. Each entry should be a single line under 150 characters.
+- **Staleness**: memories are point-in-time observations. When recalling a memory, verify it against current code/state before acting on it. Update or delete memories that are no longer accurate.
+
+## Troubleshooting pointers
+- **Missing `.so` on import**: check path resolution in `transformer_engine/common/__init__.py`.
+- **Framework extension won't build on ROCm**: check `build_tools/utils.py::get_frameworks()`.
+- **Fused-attn regression**: reproduce under multiple backend configs (`auto`, `ck`, `aotriton`, `unfused`).
+- **CK/AITER kernel failures**: use the `ck-debugging` skill for structured triage and isolation.
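The "Docker containers" section above asks for a recorded debug context (image/tag, ROCm version, GPU arch, TE commit, submodule state). A minimal sketch of capturing it; the `/opt/rocm/.info/version` path and the availability of `rocminfo` are assumptions that vary by container image:

```shell
# Record the debugging context suggested under "Docker containers" above.
# Assumed paths/tools: /opt/rocm/.info/version, rocminfo (verify per image).
rocm_ver=$(cat /opt/rocm/.info/version 2>/dev/null)
gpu_arch=$(rocminfo 2>/dev/null | grep -o 'gfx[0-9a-f]*' | head -n 1)
te_commit=$(git rev-parse --short HEAD 2>/dev/null)
{
  echo "rocm:      ${rocm_ver:-unknown}"
  echo "gpu arch:  ${gpu_arch:-unknown}"
  echo "te commit: ${te_commit:-unknown}"
  echo "submodules:"
  git submodule status 2>/dev/null || echo "  (not a git checkout)"
} | tee debug-context.txt
```

Missing tools degrade to `unknown` rather than failing, so the same snippet is safe to run while still verifying which container you are in.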