ROCm · Micky774 · Feb 12, 2026 · Mar 5, 2026 · Mar 26, 2026 · Mar 31, 2026
@@ -0,0 +1,10 @@
+{
+  "extraKnownMarketplaces": {
+    "amd-claude-marketplace": {
+      "source": {
+        "source": "github",
+        "repo": "ROCm/amd-claude-marketplace"
+      }
+    }
+  }
+}
@@ -0,0 +1,112 @@
+# Agent instructions for TransformerEngine (ROCm fork)
+
+## Docker containers
+- We work in Docker containers for reproducibility.
+- Run build/test commands **only** inside the designated container (not on host).
+- If container is unspecified, ask for the exact image/tag and launch command **before** running anything expensive.
+- Prefer editable installs (`pip install -e .`).
+- Before debugging, record: container image/tag, ROCm version, GPU arch, TE commit, submodule state.
+- If results are suspicious, first verify you are in the expected container and that GPU devices/libs are exposed correctly.
+
+## Architecture
+- One core C++/HIP library + optional framework bindings:
+  - core: `transformer_engine/common` → `libtransformer_engine.so`
+  - PyTorch: `transformer_engine/pytorch` + `transformer_engine/pytorch/csrc`
+  - JAX: `transformer_engine/jax` + `transformer_engine/jax/csrc/extensions`
+- Python import flow:
+  - framework selection: `transformer_engine/__init__.py` (`NVTE_FRAMEWORK` = `pytorch|jax|all|none`)
+  - `.so` loading: `transformer_engine/common/__init__.py` (`load_framework_extension`)
+- Build orchestration: `setup.py` + `build_tools/*.py` + CMake.
+  - `build_tools/utils.py::rocm_build()` auto-detects ROCm first, then CUDA, unless `NVTE_USE_ROCM` is set.
+- 3rdparty submodules: `aiter`, `aotriton`, `cudnn-frontend`, `cutlass`, `googletest`, `hipify_torch`.
+
+## Hipify convention
+The build auto-generates HIP files from CUDA sources via `hipify_torch`. Generated files are marked with `// !!! This is a file automatically generated by hipify!!!` at line 1. **Never edit generated files directly** — edit the CUDA source instead.
+
+File extension mapping:
+
+| CUDA source | Generated HIP file |
+|---|---|
+| `.cu` | `.hip` |
+| `.cuh` | `_hip.cuh` |
+| `.cpp` | `_hip.cpp` |
+| `.h` | `_hip.h` |
+
+The following directories are **excluded** from hipify (native ROCm code — edit directly):
+- `transformer_engine/common/ck_fused_attn/` — CK kernel wrappers
+- `transformer_engine/common/amd_detail/` — AMD-specific utilities
+- `transformer_engine/common/rocshmem_api/` — ROCshmem wrappers
+
+Framework bindings (`pytorch/csrc`, `jax/csrc`) are hipified separately via `build_tools/pytorch.py` and `build_tools/jax.py`.
+
+## Fused attention backends
+Backends are gated by env vars (set to `0` to disable; leave unset or set to `1` to enable):
+
+| Env var | Controls | Default |
+|---|---|---|
+| `NVTE_FUSED_ATTN` | Master toggle for all fused attention | `1` |
+| `NVTE_FUSED_ATTN_CK` | CK backend | inherits `NVTE_FUSED_ATTN` |
+| `NVTE_FUSED_ATTN_AOTRITON` | AOTriton backend | inherits `NVTE_FUSED_ATTN` |
+| `NVTE_FLASH_ATTN` | Flash attention | `1` |
+
+CI backend configs (`ci/_utils.sh::configure_fused_attn_env`): `auto`, `ck`, `aotriton`, `flash`, `unfused`.
+
+### ROCm fused-attn file layout
+- **Runtime backend selection/dispatch**: `transformer_engine/common/fused_attn_rocm/fused_attn.cpp` (hipified)
+- **CK dispatch glue**: `transformer_engine/common/fused_attn_rocm/fused_attn_ck.cpp` (hipified)
+- **AOTriton dispatch glue**: `transformer_engine/common/fused_attn_rocm/fused_attn_aotriton.cpp` (hipified)
+- **CK kernel wrappers** (native, not hipified):
+  - `transformer_engine/common/ck_fused_attn/src/ck_fused_attn_{fwd,bwd,utils}.cpp`
+  - `transformer_engine/common/ck_fused_attn/include/ck_fused_attn/ck_fused_attn.hpp`
+
+### Debug logging env vars
+- `NVTE_DEBUG=1` + `NVTE_DEBUG_LEVEL={0,1,2}` — Python-level attention debug output
+- `NVTE_LOG_FUSED_ATTN_CONFIG=1` — C++ backend selection logging
+- `NVTE_LOG_CK_CONFIG=1` — CK-specific config logging
+- `NVTE_LOG_AOTRITON_CONFIG=1` — AOTriton-specific config logging
+- `CK_FUSED_ATTN_LOG_CONFIG=1` — CK kernel wrapper logging
+
+## Developer workflows
+- Always init submodules first: `git submodule update --init --recursive`.
+- Source install: `pip install . --no-build-isolation`.
+- C++ tests: `ci/core.sh`.
+- Framework CI tests (shell scripts, not bare pytest):
+  - PyTorch: `ci/pytorch.sh` | JAX: `ci/jax.sh`
+  - Control via `TEST_LEVEL`, `TEST_SGPU`, `TEST_MGPU`, `TEST_FILTER` (from `ci/_utils.sh`).
+
+## Code conventions
+- Edit `transformer_engine/*`, `build_tools/*`, `tests/*`, `ci/*`; avoid `3rdparty/*` unless explicitly required.
+- Keep env-var behavior stable; tests toggle flags intentionally.
+- Python: Black, line length 100. C/C++: cpplint + `.clang-format`.
+- **Preserve the existing style of each file you edit.** Much of the codebase originates from upstream, and style can vary file-to-file (naming conventions, comment style, control flow patterns, etc.). Before writing new code in a file, read enough of it to understand how similar logic is already written, and follow that style. Consistency within a file matters more than imposing a uniform style across the project.
+
+## Copyright headers
+When you modify a file, update its copyright header so the end-year reflects the current year.
+
+This repo carries **two** copyright lines — AMD and NVIDIA. Follow these rules:
+
+1. **Files with an existing AMD copyright line** — update the AMD end-year to the current year (e.g. `2025` → `2026`). Leave the NVIDIA line untouched.
+2. **Files with only an NVIDIA copyright line** — add an AMD line **above** the NVIDIA line:
+   - Python: `# Copyright (c) <YEAR>, Advanced Micro Devices, Inc. All rights reserved.`
+   - C/C++/HIP: `/* Copyright (c) <YEAR>, Advanced Micro Devices, Inc. All rights reserved. */` (or use the `*`-block style matching the file).
+   - `<YEAR>` is the current year (single year) for newly added lines, e.g. `2026`.
+3. **New files you create** — include both AMD and NVIDIA headers with the current year, followed by a blank comment line and `See LICENSE for license information.`
+4. **Never change the NVIDIA copyright year range** — those dates are updated during IFU (integrate from upstream) merges.
+
+AMD headers are our addition and should stay consistent with the patterns already in the codebase.
+
+## Memory management
+When writing or updating memories in the project memory directory, follow these guidelines:
+
+- **Scope**: only save information that will be useful in future conversations. Do not save ephemeral task details, debugging breadcrumbs, or things derivable from the code/git history.
+- **Check before writing**: read `MEMORY.md` and check for an existing memory on the same topic before creating a new file. Update the existing memory instead of duplicating.
+- **File naming**: use short, descriptive, snake_case names (e.g. `aiter_build.md`, `container_setup.md`). Group by topic, not by date.
+- **Frontmatter**: every memory file must have the standard `name`, `description`, and `type` frontmatter fields.
+- **Index maintenance**: after creating or removing a memory file, update `MEMORY.md` to keep the index in sync. Each entry should be a single line under 150 characters.
+- **Staleness**: memories are point-in-time observations. When recalling a memory, verify it against current code/state before acting on it. Update or delete memories that are no longer accurate.
+
+## Troubleshooting pointers
+- **Missing `.so` on import**: check path resolution in `transformer_engine/common/__init__.py`.
+- **Framework extension won't build on ROCm**: check `build_tools/utils.py::get_frameworks()`.
+- **Fused-attn regression**: reproduce under multiple backend configs (`auto`, `ck`, `aotriton`, `unfused`).
+- **CK/AITER kernel failures**: use the `ck-debugging` skill for structured triage and isolation.
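
The file extension mapping in the "Hipify convention" section can be sketched as a small lookup. This is an illustrative sketch only, not part of the diff and not how `hipify_torch` is implemented: the helper names `expected_hip_path` and `is_generated` are hypothetical, and the code assumes the generated file sits next to its CUDA source, which the instructions do not state. It also does not check the excluded directories listed in that section.

```python
from pathlib import PurePosixPath

# Suffix mapping copied from the "File extension mapping" table in the diff.
HIPIFY_SUFFIX_MAP = {
    ".cu": ".hip",
    ".cuh": "_hip.cuh",
    ".cpp": "_hip.cpp",
    ".h": "_hip.h",
}

# Marker that the instructions say appears at line 1 of generated files.
GENERATED_MARKER = "// !!! This is a file automatically generated by hipify!!!"


def expected_hip_path(cuda_source: str) -> str:
    """Return the HIP filename the table implies for a CUDA source path.

    Hypothetical helper: assumes the generated file lives in the same
    directory as the source.
    """
    path = PurePosixPath(cuda_source)
    if path.suffix not in HIPIFY_SUFFIX_MAP:
        raise ValueError(f"no hipify mapping for suffix {path.suffix!r}")
    return str(path.with_name(path.stem + HIPIFY_SUFFIX_MAP[path.suffix]))


def is_generated(first_line: str) -> bool:
    """True if a file's first line starts with the hipify marker."""
    return first_line.startswith(GENERATED_MARKER)
```

A check like `is_generated` could run before an edit: if the first line of the target file carries the marker, the change belongs in the CUDA source instead.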