draft: remove failing assertions #6
remove comment containing 'johnmc'
jrmadsen
left a comment
Hi @jmellorcrummey, I think these should just move down below the tracing::populate_external_correlation_ids(…) function call that is a little further down in the code.
Actually, I was mistaken. It appears this condition can arise if you are using both callback and buffer tracing of an API in the same context, which ends up with one external correlation id map entry, one callback entry, and one buffer entry. Are you intentionally doing this?
We are intentionally doing this at present. It should be allowed, which is why I think the assertion should be deleted. In the future, we aim to switch to the external correlation id support rather than using the callback.
avoid the need to patch AMD's amdgpu driver for AFAR V
This pull request has merge conflicts that need to be resolved. It cannot be imported to the ROCm/rocm-systems repo automatically.
…rocprofv3 --pmc + roccap play

Problem:
Running 'rocprofv3 -A absolute --pmc SQ_WAVES -- roccap play <trace>' on MI300X (gfx942, ROCm 7.1) caused SSH disconnect and node destabilization.

Root Cause (3 layers):
1. rocplaycap exec chain: librocprofiler-sdk-tool.so (LD_PRELOAD) re-initialized in every child process spawned by roccap, opening /dev/kfd multiple times under the same PID with different mm_struct.
2. KFD NULL ptr deref: duplicate /dev/kfd open caused kfd_create_process to fail, leaving a partially-initialized kfd_process struct. kfd_process_wq_release then dereferenced NULL, triggering use-after-free and GPU context corruption.
3. rocprofiler-sdk teardown crash: HSA runtime tears down before rocprofiler finalization completes, causing:
   - queue_controller_fini() calling hsa_signal_wait on freed HSA runtime -> SIGABRT
   - AQL completion callbacks firing after output buffers destroyed -> ROCP_FATAL
   - Signal handler deadlock: parent/child waiting on each other -> hang

Fixes:
- tool.cpp: skip rocprofiler_configure() when ROCPROFV3_PLAYBACK_CHILD env is set, preventing /dev/kfd open in rocplaycap scan-only child processes (Fix ROCM-1214 #1)
- tool.cpp: guard initialize_rocprofv3() against null client_identifier when running as scan-only child (Fix ROCM-1214 #2)
- config.hpp: resolve merge conflict in enable_process_sync default value, use false to prevent process sync hang during roccap replay (Fix ROCM-1214 #3)
- queue_controller.cpp: skip Queue::sync() in queue_controller_fini() to avoid calling hsa_signal_wait after HSA runtime teardown (Fix ROCM-1214 #4)
- device_counting.cpp: replace ROCP_FATAL with ROCP_WARNING when output buffer is destroyed before AQL completion callback fires (Fix ROCM-1214 #5)
- sample_processing.cpp: replace CHECK_NOTNULL with explicit null check and graceful return when buffer is destroyed before sample callback (Fix ROCM-1214 #6)
- tool.cpp: add 5-second timeout to wait_pid() to break parent/child deadlock in signal handler (Fix ROCM-1214 #7)
- tool.cpp: skip chained SIGABRT handler to prevent recursive abort (Fix ROCM-1214 #8)
- tool.cpp: flush buffers and call generate_output() before _exit() on SIGABRT path to ensure CSV output is written (Fix ROCM-1214 #9)
- counters/sample_consumer.hpp: change cv.wait() to cv.wait_for(5s) in consumer_thread_t::exit() to prevent infinite hang on GPU context corruption

Validation:
  Machine: banff-ccs-aus-g14-14
  GPU: AMD Instinct MI300X (gfx942)
  ROCm: 7.1.0 (amdgpu 6.12.12-2208839)
  Command: rocprofv3 -A absolute --pmc SQ_WAVES --output-format csv -- roccap play trace.cap
  Result: SQ_WAVES=32768.000000, node stable, clean dmesg, exits in ~12 sec

Note: rocplaycap companion fixes required in playback_main.cpp and playback_wrapper.cpp (separate Perforce changelist for roccap team).
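The last fix in the commit message above replaces an unbounded cv.wait() with cv.wait_for(5s). A minimal sketch of that general pattern follows, using only the standard library. The names consumer_state, wait_for_exit, and signal_finished are hypothetical and are not taken from sample_consumer.hpp; the actual change in consumer_thread_t::exit() may look different.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical consumer state; illustrative only, not rocprofiler-sdk code.
struct consumer_state
{
    std::mutex              mtx;
    std::condition_variable cv;
    bool                    finished = false;
};

// Blocks until `finished` is set or `timeout` elapses.
// Returns true if the consumer finished, false if the wait timed out,
// so the caller can give up instead of hanging forever.
inline bool wait_for_exit(consumer_state& state, std::chrono::milliseconds timeout)
{
    std::unique_lock<std::mutex> lock{state.mtx};
    return state.cv.wait_for(lock, timeout, [&state] { return state.finished; });
}

// Called by the consumer thread once it has drained its work.
inline void signal_finished(consumer_state& state)
{
    {
        std::lock_guard<std::mutex> lock{state.mtx};
        state.finished = true;
    }
    state.cv.notify_all();
}
```

The predicate overload of wait_for returns the predicate's value when it stops waiting, so a false result here means the timeout expired with the consumer still unfinished. What the caller should do in that case (log, skip cleanup, abort) is a design decision this sketch leaves open.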
We compiled the 'develop' versions of clr and hip as Ben recommended. The HIP and HSA assertions
ROCP_FATAL_IF(external_corr_ids.size() < (callback_contexts.size() + buffered_contexts.size()))
that I commented out trip when external_corr_ids.size(), callback_contexts.size(), and buffered_contexts.size() are each equal to 1, since 1 < 1 + 1. With the assertions omitted, our incomplete draft of rocprofiler-sdk support in hpctoolkit is being exercised as expected.
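To make the failing case concrete, here is a small sketch of the comparison inside the ROCP_FATAL_IF quoted above, evaluated for the sizes reported in this PR. The function assertion_trips is a hypothetical stand-in that takes plain sizes; it is not part of rocprofiler-sdk and only mirrors the arithmetic of the condition.

```cpp
#include <cstddef>

// Hypothetical stand-in for the ROCP_FATAL_IF condition: returns true when
// the check would abort. Arguments are the three container sizes.
constexpr bool assertion_trips(std::size_t external_corr_ids,
                               std::size_t callback_contexts,
                               std::size_t buffered_contexts)
{
    return external_corr_ids < (callback_contexts + buffered_contexts);
}

// Case from this PR: callback and buffer tracing of an API in the same
// context, with a single external correlation id map entry.
static_assert(assertion_trips(1, 1, 1), "1 < 1 + 1, so the check fires");

// With only one kind of tracing enabled, the same check passes.
static_assert(!assertion_trips(1, 1, 0), "1 < 1 is false");
static_assert(!assertion_trips(1, 0, 1), "1 < 1 is false");
```

Under this reading, the condition assumes one external correlation id entry per callback or buffered context, which does not hold when a single context uses both tracing modes.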