Fix: add AICore task execution timeout mechanism #718
ChaoZheng109 merged 1 commit into hw-native-sys:main
Code Review
This pull request introduces timeout mechanisms for AICore operation execution and stream synchronization to prevent potential hangs. Key changes include the addition of timeout constants, the integration of aclrtSetOpExecuteTimeOutV2 and aclrtSynchronizeStreamWithTimeout, and the implementation of a timeout loop within platform_deinit_aicore_regs. Feedback suggests using RAII for more robust device context management, optimizing the busy-wait loop in the deinit function to reduce bus contention, and addressing the risk of cumulative per-core timeouts exceeding the host-side synchronization limit.
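The RAII suggestion above could look roughly like the following. This is a minimal sketch, not the PR's code: `DeviceContextGuard` and `run_with_device` are hypothetical names, and the release callback stands in for whatever device-context teardown the real code performs.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical guard (not part of the PR): runs `release` exactly once when
// the scope exits, including on early returns and exceptions.
class DeviceContextGuard {
public:
    explicit DeviceContextGuard(std::function<void()> release)
        : release_(std::move(release)) {}
    ~DeviceContextGuard() {
        if (release_) {
            release_();
        }
    }
    // Non-copyable so the release action cannot run twice.
    DeviceContextGuard(const DeviceContextGuard&) = delete;
    DeviceContextGuard& operator=(const DeviceContextGuard&) = delete;

private:
    std::function<void()> release_;
};

// Hypothetical caller: `release_count` stands in for the real teardown so the
// effect of the guard is observable.
int run_with_device(bool fail_early, int& release_count) {
    DeviceContextGuard guard([&release_count] { ++release_count; });
    if (fail_early) {
        return -1;  // teardown still runs here
    }
    return 0;       // and here
}
```

With a guard like this, an error path that returns early cannot skip the teardown, which is the robustness concern the review raises.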
/gemini review
Code Review
This pull request introduces timeout mechanisms for AICore op execution, stream synchronization, and AICore deinitialization to prevent potential hangs. Key changes include the adoption of aclrtSetOpExecuteTimeOutV2 and aclrtSynchronizeStreamWithTimeout. Feedback highlights that aclrtSetOpExecuteTimeOutV2 expects the timeout value in seconds rather than microseconds, and that this setting should be applied to every thread attach because it is a per-context configuration.
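If the timeout constant is kept in microseconds and the API takes seconds, as the review states, a conversion at the call site avoids passing a value that is off by a factor of 10^6. The sketch below is an illustration only: `timeout_us_to_seconds` and `attach_thread_with_timeout` are hypothetical helpers, and the setter is passed in as a callback rather than calling the ACL function directly, since its exact signature is not shown in this conversation.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: converts a microsecond budget to whole seconds,
// rounding up so that a small nonzero budget does not collapse to 0.
constexpr std::uint64_t timeout_us_to_seconds(std::uint64_t timeout_us) {
    return timeout_us / 1000000u + (timeout_us % 1000000u != 0 ? 1u : 0u);
}

// Hypothetical attach hook: applies the timeout on every call instead of
// guarding it with a process-wide "already configured" flag, because the
// review describes the setting as per-context.
// `set_timeout_seconds` is an assumed callback wrapping the real setter.
template <typename SetTimeoutFn>
int attach_thread_with_timeout(std::uint64_t timeout_us,
                               SetTimeoutFn set_timeout_seconds) {
    return set_timeout_seconds(timeout_us_to_seconds(timeout_us));
}
```

Rounding up is a design choice in this sketch; truncation would turn any budget below one second into zero, which some APIs treat as "no timeout".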
Three-layer timeout chain to detect and recover from stuck AICore ops:
- Layer 1 (STARS): call aclrtSetOpExecuteTimeOutV2 in attach_current_thread() to enable hardware-level AICore op execution monitoring (1s requested)
- Layer 2 (AICPU): add 1s timeout to platform_deinit_aicore_regs() to prevent infinite wait when AICore is unresponsive (e.g. STARS-killed op); shutdown() logs and continues deiniting remaining cores
- Layer 3 (Host): replace rtStreamSynchronize with aclrtSynchronizeStreamWithTimeout (2s) for both AICPU and AICore streams; return 507046 on timeout

Timeout budget: scheduler idle (~200ms) + per-core deinit (1s each). Single-core case completes within host 2s timeout; multi-core case (block_dim > 1, all cores stuck) is backstopped by host timeout, with aclrtResetDevice handling final hardware cleanup.
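The Layer 2 behaviour and the timeout budget can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: `wait_core_idle`, `shutdown_cores`, and `worst_case_deinit` are hypothetical names, each core is modelled as a callback that reports whether it has gone idle, and the `sleep_for` back-off is one possible way to reduce polling pressure that may not suit the actual AICPU environment.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

using Millis = std::chrono::milliseconds;

// Hypothetical: polls `core_idle` until it reports true or `timeout` elapses.
// Returns false on timeout instead of waiting forever.
bool wait_core_idle(const std::function<bool()>& core_idle, Millis timeout,
                    std::chrono::microseconds poll_interval =
                        std::chrono::microseconds(50)) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (!core_idle()) {
        if (std::chrono::steady_clock::now() >= deadline) {
            return false;
        }
        std::this_thread::sleep_for(poll_interval);  // back off between polls
    }
    return true;
}

// Hypothetical shutdown loop: a timed-out core is counted (real code would
// log it) and the loop continues with the remaining cores.
std::size_t shutdown_cores(const std::vector<std::function<bool()>>& cores,
                           Millis per_core_timeout) {
    std::size_t timed_out = 0;
    for (const auto& core_idle : cores) {
        if (!wait_core_idle(core_idle, per_core_timeout)) {
            ++timed_out;
        }
    }
    return timed_out;
}

// Worst-case deinit time when `stuck_cores` cores each exhaust their timeout.
constexpr Millis worst_case_deinit(std::size_t stuck_cores,
                                   Millis scheduler_idle, Millis per_core) {
    return scheduler_idle + per_core * static_cast<Millis::rep>(stuck_cores);
}
```

Plugging in the figures from the commit message, one stuck core gives 200 ms + 1 s = 1.2 s, inside the 2 s host timeout, while two stuck cores give 2.2 s, which is why the multi-core case relies on the host timeout as the backstop.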