Skip to content

Fix: add AICore task execution timeout mechanism#718

Merged
ChaoZheng109 merged 1 commit intohw-native-sys:mainfrom
indigo1973:prof_0507
May 9, 2026
Merged

Fix: add AICore task execution timeout mechanism#718
ChaoZheng109 merged 1 commit intohw-native-sys:mainfrom
indigo1973:prof_0507

Conversation

@indigo1973
Copy link
Copy Markdown
Contributor

@indigo1973 indigo1973 commented May 7, 2026

Three-layer timeout chain to detect and recover from stuck AICore ops:

  • Layer 1 (STARS): call aclrtSetOpExecuteTimeOutV2 in attach_current_thread()
    to enable hardware-level AICore op execution monitoring (1s requested)
  • Layer 2 (AICPU): add 1s timeout to platform_deinit_aicore_regs() to prevent
    infinite wait when AICore is unresponsive (e.g. STARS-killed op); shutdown()
    logs and continues deiniting remaining cores
  • Layer 3 (Host): replace rtStreamSynchronize with aclrtSynchronizeStreamWithTimeout
    (2s) for both AICPU and AICore streams; return 507046 on timeout

Timeout budget: scheduler idle (~200ms) + per-core deinit (1s each).
Single-core case completes within host 2s timeout; multi-core case
(block_dim > 1, all cores stuck) is backstopped by host timeout,
with aclrtResetDevice handling final hardware cleanup.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces timeout mechanisms for AICore operation execution and stream synchronization to prevent potential hangs. Key changes include the addition of timeout constants, the integration of aclrtSetOpExecuteTimeOutV2 and aclrtSynchronizeStreamWithTimeout, and the implementation of a timeout loop within platform_deinit_aicore_regs. Feedback suggests using RAII for more robust device context management, optimizing the busy-wait loop in the deinit function to reduce bus contention, and addressing the risk of cumulative per-core timeouts exceeding the host-side synchronization limit.

Comment thread src/a2a3/platform/onboard/host/device_runner.cpp
Comment thread src/a2a3/platform/src/aicpu/platform_regs.cpp
@indigo1973
Copy link
Copy Markdown
Contributor Author

/gemini review

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces timeout mechanisms for AICore op execution, stream synchronization, and AICore deinitialization to prevent potential hangs. Key changes include the adoption of aclrtSetOpExecuteTimeOutV2 and aclrtSynchronizeStreamWithTimeout. Feedback highlights that aclrtSetOpExecuteTimeOutV2 expects the timeout value in seconds rather than microseconds, and that this setting should be applied to every thread attach because it is a per-context configuration.

Comment thread src/a2a3/platform/include/common/platform_config.h
Comment thread src/a2a3/platform/onboard/host/device_runner.cpp Outdated
Three-layer timeout chain to detect and recover from stuck AICore ops:

- Layer 1 (STARS): call aclrtSetOpExecuteTimeOutV2 in attach_current_thread()
  to enable hardware-level AICore op execution monitoring (1s requested)
- Layer 2 (AICPU): add 1s timeout to platform_deinit_aicore_regs() to prevent
  infinite wait when AICore is unresponsive (e.g. STARS-killed op); shutdown()
  logs and continues deiniting remaining cores
- Layer 3 (Host): replace rtStreamSynchronize with aclrtSynchronizeStreamWithTimeout
  (2s) for both AICPU and AICore streams; return 507046 on timeout

Timeout budget: scheduler idle (~200ms) + per-core deinit (1s each).
Single-core case completes within host 2s timeout; multi-core case
(block_dim > 1, all cores stuck) is backstopped by host timeout,
with aclrtResetDevice handling final hardware cleanup.
@ChaoZheng109 ChaoZheng109 merged commit 61cad74 into hw-native-sys:main May 9, 2026
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants