
draft: remove failing assertions #6

Open
jmellorcrummey wants to merge 5 commits into ROCm:amd-mainline from jmellorcrummey:amd-mainline

@jmellorcrummey

We compiled the 'develop' versions of clr and hip as Ben recommended. The HIP and HSA assertions

ROCP_FATAL_IF(external_corr_ids.size() < (callback_contexts.size() + buffered_contexts.size()))

that I commented out trip when external_corr_ids.size(), callback_contexts.size(), and buffered_contexts.size() are each equal to 1. With them omitted, our incomplete draft of rocprofiler-sdk support in hpctoolkit is being exercised as expected.

remove comment containing 'johnmc'
remove comment containing 'johnmc'
Contributor

@jrmadsen jrmadsen left a comment

Hi @jmellorcrummey, I think these should just move down below the tracing::populate_external_correlation_ids(…) function call that is a little further down in the code.

@jrmadsen
Contributor

Actually, I was mistaken. It appears this condition can arise if you are using both callback and buffer tracing of an API in the same context:

extern_corr_ids.emplace(itr, empty_user_data);

extern_corr_ids.emplace(itr, empty_user_data);

which ends up with one external correlation id map entry, one callback entry, and one buffer entry. Are you intentionally doing this?
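
The counting behind this condition can be reproduced in isolation. The sketch below is not the rocprofiler-sdk code: `context_t`, `user_data_t`, `counts_t`, and both helper functions are hypothetical stand-ins, and it assumes the external correlation ids are stored in a map keyed by context. Under that assumption, the second `emplace` with the same key does not add an entry, so the map holds one element while the callback and buffer lists hold one each.

```cpp
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical stand-ins; the real rocprofiler-sdk types differ.
struct context_t
{
    int id;
};

struct user_data_t
{
    void* ptr = nullptr;
};

struct counts_t
{
    std::size_t external_corr_ids;
    std::size_t callback_contexts;
    std::size_t buffered_contexts;
};

// Simulates a single context that enables both callback and buffer
// tracing for the same API.
inline counts_t
simulate_shared_context()
{
    static context_t ctx{1};

    std::vector<const context_t*> callback_contexts{&ctx};
    std::vector<const context_t*> buffered_contexts{&ctx};

    // Assumption: external correlation ids are keyed by context.
    std::unordered_map<const context_t*, user_data_t> external_corr_ids;

    const user_data_t empty_user_data{};
    for(const auto* itr : callback_contexts)
        external_corr_ids.emplace(itr, empty_user_data);
    for(const auto* itr : buffered_contexts)
        external_corr_ids.emplace(itr, empty_user_data);  // same key: no new entry

    return {external_corr_ids.size(), callback_contexts.size(), buffered_contexts.size()};
}

// Mirrors the condition inside the ROCP_FATAL_IF quoted earlier in this thread.
inline bool
assertion_would_trip(const counts_t& c)
{
    return c.external_corr_ids < (c.callback_contexts + c.buffered_contexts);
}
```

With one context registered for both tracing modes, `simulate_shared_context()` yields sizes 1, 1, and 1, and `1 < 1 + 1` holds, so the check fires even though nothing is actually missing.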

@jmellorcrummey
Author

We are intentionally doing this at present. It should be allowed. That's why I think the assertion should be deleted.

For the future, we aim to switch to the external correlation id support rather than using the callback.

avoid the need to patch AMD's amdgpu driver for AFAR V
@jayhawk-commits
Contributor

This pull request has merge conflicts that need to be resolved. It cannot be imported to the ROCm/rocm-systems repo automatically.

ihhethan added a commit that referenced this pull request Mar 27, 2026
…rocprofv3 --pmc + roccap play

Problem:
Running 'rocprofv3 -A absolute --pmc SQ_WAVES -- roccap play <trace>' on MI300X
(gfx942, ROCm 7.1) caused SSH disconnect and node destabilization.

Root Cause (3 layers):
1. rocplaycap exec chain: librocprofiler-sdk-tool.so (LD_PRELOAD) re-initialized
   in every child process spawned by roccap, opening /dev/kfd multiple times
   under the same PID with different mm_struct.

2. KFD NULL ptr deref: duplicate /dev/kfd open caused kfd_create_process to fail,
   leaving a partially-initialized kfd_process struct. kfd_process_wq_release
   then dereferenced NULL, triggering use-after-free and GPU context corruption.

3. rocprofiler-sdk teardown crash: HSA runtime tears down before rocprofiler
   finalization completes, causing:
   - queue_controller_fini() calling hsa_signal_wait on freed HSA runtime -> SIGABRT
   - AQL completion callbacks firing after output buffers destroyed -> ROCP_FATAL
   - Signal handler deadlock: parent/child waiting on each other -> hang

Fixes:
- tool.cpp: skip rocprofiler_configure() when ROCPROFV3_PLAYBACK_CHILD env is set,
  preventing /dev/kfd open in rocplaycap scan-only child processes (Fix ROCM-1214 #1)

- tool.cpp: guard initialize_rocprofv3() against null client_identifier when
  running as scan-only child (Fix ROCM-1214 #2)

- config.hpp: resolve merge conflict in enable_process_sync default value,
  use false to prevent process sync hang during roccap replay (Fix ROCM-1214 #3)

- queue_controller.cpp: skip Queue::sync() in queue_controller_fini() to avoid
  calling hsa_signal_wait after HSA runtime teardown (Fix ROCM-1214 #4)

- device_counting.cpp: replace ROCP_FATAL with ROCP_WARNING when output buffer
  is destroyed before AQL completion callback fires (Fix ROCM-1214 #5)

- sample_processing.cpp: replace CHECK_NOTNULL with explicit null check and
  graceful return when buffer is destroyed before sample callback (Fix ROCM-1214 #6)

- tool.cpp: add 5-second timeout to wait_pid() to break parent/child deadlock
  in signal handler (Fix ROCM-1214 #7)

- tool.cpp: skip chained SIGABRT handler to prevent recursive abort (Fix ROCM-1214 #8)

- tool.cpp: flush buffers and call generate_output() before _exit() on SIGABRT
  path to ensure CSV output is written (Fix ROCM-1214 #9)

- counters/sample_consumer.hpp: change cv.wait() to cv.wait_for(5s) in
  consumer_thread_t::exit() to prevent infinite hang on GPU context corruption
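
For readers unfamiliar with the last item, the general shape of a bounded condition-variable wait can be sketched as follows. This is an illustration only: `exit_latch` and its members are hypothetical and are not the actual `consumer_thread_t` implementation. The point is that `wait_for` with a predicate returns `false` when the timeout expires without the predicate becoming true, so the caller can give up instead of blocking forever.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical helper, not the real consumer_thread_t.
class exit_latch
{
public:
    // Called by the worker when it has finished.
    void notify_done()
    {
        {
            std::lock_guard<std::mutex> lk{m_mutex};
            m_done = true;
        }
        m_cv.notify_all();
    }

    // Returns true if completion was signalled, false if the timeout
    // elapsed first. An unbounded cv.wait() would never return in the
    // second case.
    bool wait_done(std::chrono::milliseconds timeout)
    {
        std::unique_lock<std::mutex> lk{m_mutex};
        return m_cv.wait_for(lk, timeout, [this] { return m_done; });
    }

private:
    std::mutex              m_mutex;
    std::condition_variable m_cv;
    bool                    m_done = false;
};
```

A caller that receives `false` from `wait_done(std::chrono::seconds{5})` knows the worker never signalled and can proceed with a best-effort shutdown.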

Validation:
  Machine: banff-ccs-aus-g14-14
  GPU: AMD Instinct MI300X (gfx942)
  ROCm: 7.1.0 (amdgpu 6.12.12-2208839)
  Command: rocprofv3 -A absolute --pmc SQ_WAVES --output-format csv -- roccap play trace.cap
  Result: SQ_WAVES=32768.000000, node stable, clean dmesg, exits in ~12 sec

Note: rocplaycap companion fixes required in playback_main.cpp and
playback_wrapper.cpp (separate Perforce changelist for roccap team).
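
The wait_pid() timeout listed among the fixes can be approximated with a polling loop around POSIX `waitpid()` and `WNOHANG`. The sketch below is an assumption about the general technique, not the code in tool.cpp: `wait_pid_with_timeout` is a hypothetical name, the 10 ms poll interval is arbitrary, and the helper has not been vetted for use inside a signal handler.

```cpp
#include <chrono>
#include <thread>

#include <sys/types.h>
#include <sys/wait.h>

// Hypothetical helper: polls waitpid() without blocking until the child
// exits or the timeout elapses. Returns true only if the child was reaped.
inline bool
wait_pid_with_timeout(pid_t pid, std::chrono::milliseconds timeout, int* status = nullptr)
{
    const auto deadline     = std::chrono::steady_clock::now() + timeout;
    int        local_status = 0;

    while(true)
    {
        const pid_t ret = waitpid(pid, &local_status, WNOHANG);
        if(ret == pid)
        {
            if(status != nullptr) *status = local_status;
            return true;  // child state changed and was collected
        }
        if(ret < 0) return false;  // waitpid error (for example, no such child)
        if(std::chrono::steady_clock::now() >= deadline) return false;  // timed out

        // Arbitrary poll interval chosen for illustration.
        std::this_thread::sleep_for(std::chrono::milliseconds{10});
    }
}
```

A parent that gets `false` back can stop waiting and continue its own shutdown, which is how a bounded wait breaks a parent/child deadlock of the kind described under Root Cause.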