fix: stabilize A5 validation samples #642

Draft
HecreReed wants to merge 1 commit into hw-native-sys:main from HecreReed:a5-six-fixes

Conversation

@HecreReed
Collaborator

Summary

  • fix A5 quant/quant_asym sample shapes and goldens to match per-row scale semantics
  • make mgather/mscatter validation generation deterministic and compare mscatter outputs by indices
  • skip partarg in remote validation when the vendored pto-isa lacks TPARTARG intrinsics
  • simplify the A5 abs sample to avoid unsupported dynamic partition dims in board validation
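The "compare mscatter outputs by indices" item can be illustrated with a small sketch. It assumes that a scatter only defines the output elements named by the index buffer, so positions that were never written should not take part in the comparison. The function name `compare_at_indices` and the flat-offset index layout are assumptions made for illustration; this is not the helper in `generate_testcase.py`.

```python
from array import array


def compare_at_indices(golden, actual, indices):
    """Return the scattered positions where golden and actual differ.

    Only positions listed in `indices` are checked; everything else is
    treated as undefined output and ignored.
    """
    mismatches = []
    for idx in sorted(set(indices)):
        if idx < 0 or idx >= len(golden) or idx >= len(actual):
            raise IndexError(f"index {idx} is outside the compared buffers")
        if golden[idx] != actual[idx]:
            mismatches.append(idx)
    return mismatches


# Positions 1 and 3 were never scattered to, so their contents are ignored.
golden = array("i", [10, 0, 30, 0, 50])
actual = array("i", [10, 99, 30, 77, 50])
indices = [0, 2, 4]
print(compare_at_indices(golden, actual, indices))  # []
```

Duplicate indices are collapsed before comparing. If two source elements target the same position, which value wins depends on the hardware, which is one reason a deterministic index generator would likely want unique indices.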

Validation

  • python3 -m py_compile test/npu_validation/scripts/generate_testcase.py test/samples/Abs/abs.py test/samples/Mgather/mgather.py test/samples/Mscatter/mscatter.py test/samples/Quant/quant.py test/samples/Quant/quant_asym.py test/samples/Quant/quant_golden.py test/samples/Quant/quant_asym_golden.py
  • bash -n test/npu_validation/scripts/run_remote_npu_validation.sh
  • A5 board on 192.168.1.52: abs / quant / quant_asym passed earlier in the same validation workspace
  • A5 board on 192.168.1.52: mgather passed after regenerating testcase from runop output
  • A5 board on 192.168.1.52: mscatter passed after fixing indexed compare generation
  • A5 board on 192.168.1.52: partarg skips as expected because vendored pto-isa is missing TPARTARGMAX/TPARTARGMIN
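The partarg skip amounts to a capability probe: look for the required intrinsic names in the vendored headers and skip the case when any is absent. The PR does this in `run_remote_npu_validation.sh`; the Python below is only a sketch of the same idea. The function name, the header suffixes, and the stub file name `pto_instr.hpp` are hypothetical, and only the two symbol names come from the PR description.

```python
import tempfile
from pathlib import Path

# Symbol names taken from the PR description.
REQUIRED_SYMBOLS = ("TPARTARGMAX", "TPARTARGMIN")


def missing_symbols(include_root, symbols=REQUIRED_SYMBOLS, suffixes=(".h", ".hpp")):
    """Return the symbols that appear in no header under include_root."""
    remaining = set(symbols)
    for path in Path(include_root).rglob("*"):
        if not remaining:
            break
        if path.is_file() and path.suffix in suffixes:
            text = path.read_text(errors="ignore")
            remaining -= {s for s in remaining if s in text}
    return sorted(remaining)


# Demo against a throwaway directory containing one stub header.
with tempfile.TemporaryDirectory() as root:
    Path(root, "pto_instr.hpp").write_text("#define TPARTARGMAX(...) /* stub */\n")
    absent = missing_symbols(root)
    print("SKIP partarg" if absent else "RUN partarg", absent)
```

A plain substring search is deliberately loose. It can report a symbol as present when it only occurs in a comment, so a stricter probe might compile a one-line test program instead.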

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request adds support for MSCATTER and MGATHER operations in the NPU validation framework, including logic for index generation and result comparison. It also updates quantization samples to utilize per-row scaling and offsets, simplifying the golden reference generation. Furthermore, the remote validation script now includes a check for required ISA symbols to skip unsupported test cases. Feedback was provided regarding the heuristic used to identify scatter indices, noting an inconsistency with existing logic that could lead to incorrect operand identification.
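To make "per-row scaling and offsets" concrete, here is a minimal golden-reference sketch. It assumes one scale (and optionally one offset) per row, a quantized value of `round(x / scale) + offset`, and clamping to the integer range. The exact rounding mode, the sign convention of the offset, and the output range used by the A5 quant samples are not stated in this thread, so treat all three as assumptions.

```python
def quantize_per_row(rows, scales, offsets=None, qmin=-128, qmax=127):
    """Quantize each row with its own scale and offset, then clamp.

    Python's round() rounds ties to even; real hardware may round differently.
    """
    if len(rows) != len(scales):
        raise ValueError("need exactly one scale per row")
    if offsets is None:
        offsets = [0] * len(rows)
    if len(offsets) != len(rows):
        raise ValueError("need exactly one offset per row")
    out = []
    for row, scale, offset in zip(rows, scales, offsets):
        out.append([min(qmax, max(qmin, round(v / scale) + offset)) for v in row])
    return out


x = [[4.0, -2.0, 300.0], [0.5, 0.25, -0.75]]

# Symmetric case: per-row scale only, signed 8-bit range.
print(quantize_per_row(x, scales=[2.0, 0.25]))
# [[2, -1, 127], [2, 1, -3]]

# Asymmetric case: per-row scale and offset, unsigned 8-bit range.
print(quantize_per_row(x, scales=[2.0, 0.25], offsets=[128, 10], qmin=0, qmax=255))
# [[130, 127, 255], [12, 11, 7]]
```

The shape consequence is the point here: for an input of shape (rows, cols), the scale and offset buffers hold `rows` elements, not `rows * cols`.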

Comment on lines +1886 to +1892
for p in reversed(init_ptrs):
    p_dtype = _np_dtype_for_cpp(p["cpp_type"])
    if p.get("role") == "input" and (
        p_dtype.startswith("np.int") or p_dtype.startswith("np.uint")
    ):
        mscatter_indices_input = p
        break

medium

The heuristic for identifying mscatter_indices_input uses reversed(init_ptrs), which selects the last integer input as the indices. This is inconsistent with the tscatter logic (which selects the first integer input) and may be incorrect depending on the operand order of the MSCATTER operation. In mscatter.py, arg0 appears to be the indices and arg1 the data; if so, this heuristic will misidentify arg1 as the indices, potentially breaking the compare_bin_at_indices logic later. Consider if init_ptrs (picking the first) would be more appropriate or if a more robust identification method is needed.
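One possible shape for the "more robust identification method" is to refuse to guess when more than one integer input exists, and to let the caller name the indices operand explicitly. The mangled kernel name in the board logs, `_Z18mscatter_kernel_2dPiS_S_`, demangles to three `int*` parameters, which suggests that dtype alone may not separate data from indices for this sample. In the sketch below, `CPP_TO_NP` is a hypothetical stand-in for the repository's `_np_dtype_for_cpp`, the `name` field and the `indices_name` parameter are assumptions, and treating `arg0` as the indices follows the reviewer's reading rather than a confirmed fact.

```python
# Hypothetical stand-in for the repository's _np_dtype_for_cpp helper.
CPP_TO_NP = {"int32_t": "np.int32", "uint32_t": "np.uint32", "float": "np.float32"}


def is_integer_input(ptr):
    """True for pointer records that are inputs with an integer element type."""
    dtype = CPP_TO_NP.get(ptr["cpp_type"], "")
    return ptr.get("role") == "input" and dtype.startswith(("np.int", "np.uint"))


def pick_indices_input(init_ptrs, indices_name=None):
    """Pick the scatter-indices operand, refusing to guess when it is ambiguous."""
    candidates = [p for p in init_ptrs if is_integer_input(p)]
    if indices_name is not None:
        candidates = [p for p in candidates if p.get("name") == indices_name]
    if len(candidates) != 1:
        names = [p.get("name") for p in candidates]
        raise ValueError(f"expected exactly one indices operand, found {names}")
    return candidates[0]


# Two integer inputs: neither "first" nor "last" is safe without more information.
ptrs = [
    {"name": "arg0", "cpp_type": "int32_t", "role": "input"},
    {"name": "arg1", "cpp_type": "int32_t", "role": "input"},
    {"name": "arg2", "cpp_type": "int32_t", "role": "output"},
]
print(pick_indices_input(ptrs, indices_name="arg0")["name"])  # arg0
```

Calling `pick_indices_input(ptrs)` without a name raises `ValueError` for this operand list, which turns a silent misidentification into a visible generation-time failure.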

@HecreReed
Collaborator Author

/run a5 abs quant quant_asym mgather mscatter partarg

@reedhecre

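Received /run a5 abs quant quant_asym mgather mscatter partarg; the A5 board tester will process this request.

The page refreshes automatically, so you can check the current stage, queue status, and latest results directly.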

@reedhecre

Codex Review

This comment is updated automatically by the review bot.

  • PR: #642 fix: stabilize A5 validation samples
  • Author: HecreReed
  • Base/Head: main / a5-six-fixes
  • Head SHA: d4494c8a22eb
  • Trigger: new open PR detected
  • Generated At: 2026-05-08T02:30:15Z
  • Status: failed at codex-review (exit=1)

Summary

Review failed at stage codex-review: exit=1

Findings

No structured findings were generated because the review failed early.

Log Tail

git fetch origin 'refs/pull/642/head:pr-642' --depth 50
git fetch origin 'main' --depth 50 || true
git checkout -f 'pr-642'
git rev-parse HEAD
git diff --stat 'origin/main...HEAD' || true
Cloning into '/tmp/ptoas-pr-review-monitor/runs/20260508_103002_pr642/repo'...
From https://github.com/hw-native-sys/PTOAS
 * [new ref]         refs/pull/642/head -> pr-642
From https://github.com/hw-native-sys/PTOAS
 * branch            main       -> FETCH_HEAD
Switched to branch 'pr-642'
d4494c8a22eb4e4471989391710856bafdccc52f
 test/npu_validation/scripts/generate_testcase.py   | 67 ++++++++++++++++++++--
 .../scripts/run_remote_npu_validation.sh           | 15 +++++
 test/samples/Abs/abs.py                            |  5 +-
 test/samples/Mgather/mgather.py                    |  9 +--
 test/samples/Mscatter/mscatter.py                  |  5 +-
 test/samples/Quant/quant.py                        | 40 +++++++++----
 test/samples/Quant/quant_asym.py                   | 50 +++++++++++-----
 test/samples/Quant/quant_asym_golden.py            | 27 +++------
 test/samples/Quant/quant_golden.py                 | 24 ++------
 9 files changed, 157 insertions(+), 85 deletions(-)
===== END STAGE clone rc=0 @ 2026-05-08 10:30:06 =====

===== STAGE codex-review @ 2026-05-08 10:30:06 =====
set -euo pipefail
cd '/tmp/ptoas-pr-review-monitor/runs/20260508_103002_pr642/repo'
'codex' exec -C '/tmp/ptoas-pr-review-monitor/runs/20260508_103002_pr642/repo' -s read-only -c 'model_provider="codereview"' -c 'model="gpt-5.4"' -c 'model_reasoning_effort="xhigh"' --output-schema '/tmp/ptoas-pr-review-monitor/runs/20260508_103002_pr642/review_schema.json' -o '/tmp/ptoas-pr-review-monitor/runs/20260508_103002_pr642/codex_last_message.json' --color never - < '/tmp/ptoas-pr-review-monitor/runs/20260508_103002_pr642/review_prompt.txt'
OpenAI Codex v0.115.0 (research preview)
--------
workdir: /tmp/ptoas-pr-review-monitor/runs/20260508_103002_pr642/repo
model: gpt-5.4
provider: codereview
approval: never
sandbox: read-only
reasoning effort: xhigh
reasoning summaries: none
session id: 019e056b-be5d-78e0-a4e2-8edae96c8c2a
--------
user
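You are now reviewing a GitHub PR.

Repository: hw-native-sys/PTOAS
PR: #642 fix: stabilize A5 validation samples
Author: HecreReed
base branch: origin/main
head branch: HEAD (already checked out to the PR head)

Requirements:
1. Review only this PR's changes relative to origin/main; look at context files when necessary.
2. Focus on real correctness / regression / contract mismatch / CI / runtime / compatibility problems.
3. Do not raise purely stylistic suggestions or low-value speculation.
4. Output strictly by priority:
   - P1: highly likely to cause wrong results, compile/run failures, severe regressions, or release blockers
   - P2: significant defects, behavior regressions, missing validation/tests, or major compatibility problems
   - P3: minor but clearly fixable problems
5. If there are no problems, write the summary as exactly "No issues found in PR #642" and return findings=[].
6. If there are problems, summarize them concisely in summary, and give each entry in findings:
   - severity
   - title
   - body (explain why it is a problem, as specifically as possible)
   - file (a relative path where possible)
   - line (an integer if it can be determined, otherwise null)

Suggested first steps:
- git status --short
- git diff --stat origin/main...HEAD
- git diff --unified=80 origin/main...HEAD

The final output must strictly match the JSON schema.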

mcp startup: no servers
Reconnecting... 1/5 (unexpected status 403 Forbidden: {"code":"INSUFFICIENT_BALANCE","message":"Insufficient account balance"}, url: https://codex.0u0o.com/responses, request id: e20bf131-961b-453b-8318-cca56e2dedac)
Reconnecting... 2/5 (unexpected status 403 Forbidden: {"code":"INSUFFICIENT_BALANCE","message":"Insufficient account balance"}, url: https://codex.0u0o.com/responses, request id: 07612248-2784-4e1c-8514-15f293c68233)
Reconnecting... 3/5 (unexpected status 403 Forbidden: {"code":"INSUFFICIENT_BALANCE","message":"Insufficient account balance"}, url: https://codex.0u0o.com/responses, request id: fe6addee-bd56-4c67-a709-085787a8ea9a)
Reconnecting... 4/5 (unexpected status 403 Forbidden: {"code":"INSUFFICIENT_BALANCE","message":"Insufficient account balance"}, url: https://codex.0u0o.com/responses, request id: 260ba66d-f088-4f0f-b363-5c6a45388a75)
Reconnecting... 5/5 (unexpected status 403 Forbidden: {"code":"INSUFFICIENT_BALANCE","message":"Insufficient account balance"}, url: https://codex.0u0o.com/responses, request id: 9a5e8b20-c9f4-4a02-9974-a4a1a4d4b10d)
ERROR: unexpected status 403 Forbidden: {"code":"INSUFFICIENT_BALANCE","message":"Insufficient account balance"}, url: https://codex.0u0o.com/responses, request id: 82553bf1-2195-4790-aa45-072ca6c9d662
Warning: no last agent message; wrote empty content to /tmp/ptoas-pr-review-monitor/runs/20260508_103002_pr642/codex_last_message.json
===== END STAGE codex-review rc=1 @ 2026-05-08 10:30:15 =====

@reedhecre

A5 board test failed

  • Trigger mode: manual
  • Source commit: 9fab97bfdd30
  • Result summary: OK 4 / FAIL 1 / SKIP 1
  • Log: /root/ptoas-board-monitor-a5/logs/20260508_102905_manual_pr642.log
  • Manual command: /run a5 abs quant quant_asym mgather mscatter partarg
  • Triggered by: HecreReed
  • Selected cases: abs,quant,quant_asym,mgather,mscatter,partarg
  • Trigger comment: fix: stabilize A5 validation samples #642 (comment)
  • Failed stage: board-validation / exit=1

Failed cases

  • mscatter (run, exit=1)

@reedhecre

A5 board test failure details: PR #642

mscatter

stage=run info=exit=1

[ERROR] aclrtSynchronizeStream(stream) failed: 507035 (/tmp/ptoas-board-monitor-a5/runs/20260508_102905_manual_pr642/npu_validation/Mscatter/mscatter/main.cpp:99)
[ERROR] RecentErrMsg: EZ9999: Inner Error!
EZ9999[PID: 1448460] 2026-05-08-10:34:09.272.332 (EZ9999):  The error from device(chipId:0, dieId:0), serial number is 30, there is an aivec error exception, core id is 0, error code = 334, dump info: pc start: 0x100040800000, current: 0x1000408000f0, sc error info: 0xffffffffffff, su error info: 0xe6f7d23d139c7bd7,0xcc3fd0e410009bfd, mte error info: 0x1fd3f5c60007eff1, vec error info: 0x408001e000390037, cube error info: 0, l1 error info: 0, aic error mask: 0x395856, para base: 0x100040200000, mte error: 0.[FUNC:ProcessDavidStarsCoreErrorInfo][FILE:device_error_proc_c.cc][LINE:580]
        TraceBack (most recent call last):
       The extend info: errcode:(334) errorStr: The data returned by the BIU to the VEC is incorrect. subErrType: 0x4.[FUNC:ProcessDavidStarsCoreErrorInfo][FILE:device_error_proc_c.cc][LINE:583]
       Kernel task happen error, retCode=0x31, [vector core exception].[FUNC:PreCheckTaskErr][FILE:davinci_kernel_task.cc][LINE:1728]
       AIV Kernel happen error, retCode=0x31.[FUNC:GetError][FILE:stream.cc][LINE:1478]
       [AIC_INFO] after execute:args print end[FUNC:GetError][FILE:stream.cc][LINE:1478]
       [DFX_INFO]Aicore kernel execute failed, device_id=1, stream_id=61, report_stream_id=61, task_id=0, flip_num=0, fault kernel_name=_Z18mscatter_kernel_2dPiS_S_, fault kernel info ext=_Z18mscatter_kernel_2dPiS_S_, program id=0, hash=279618682955286547.[FUNC:GetError][FILE:stream.cc][LINE:1478]
       rtStreamSynchronize execution failed, reason=vector core exception[FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:65]
       synchronize stream failed, runtime result = 507035[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:148]
[2026-05-08 10:34:44] ERROR: testcase failed (exit 1): mscatter

@HecreReed
Collaborator Author

/run a5 abs quant quant_asym mgather mscatter partarg

@reedhecre

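Received /run a5 abs quant quant_asym mgather mscatter partarg; the A5 board tester will process this request.

The page refreshes automatically, so you can check the current stage, queue status, and latest results directly.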

@reedhecre

A5 board test failed

  • Trigger mode: manual
  • Source commit: 9fab97bfdd30
  • Result summary: OK 4 / FAIL 1 / SKIP 1
  • Log: /root/ptoas-board-monitor-a5/logs/20260508_143304_manual_pr642.log
  • Manual command: /run a5 abs quant quant_asym mgather mscatter partarg
  • Triggered by: HecreReed
  • Selected cases: abs,quant,quant_asym,mgather,mscatter,partarg
  • Trigger comment: fix: stabilize A5 validation samples #642 (comment)
  • Failed stage: board-validation / exit=1

Failed cases

  • mscatter (run, exit=1)

@reedhecre

A5 board test failure details: PR #642

mscatter

stage=run info=exit=1

[ERROR] aclrtSynchronizeStream(stream) failed: 507035 (/tmp/ptoas-board-monitor-a5/runs/20260508_143304_manual_pr642/npu_validation/Mscatter/mscatter/main.cpp:99)
[ERROR] RecentErrMsg: EZ9999: Inner Error!
EZ9999[PID: 177569] 2026-05-08-14:36:02.662.566 (EZ9999):  The error from device(chipId:0, dieId:0), serial number is 1, there is an aivec error exception, core id is 0, error code = 334, dump info: pc start: 0x100040800000, current: 0x1000408000f0, sc error info: 0xffffffffffff, su error info: 0xe6f7d23d139c7b97,0xcc3fd0e410009bf5, mte error info: 0x1fd3f5c600076ff1, vec error info: 0x408001e000390037, cube error info: 0, l1 error info: 0, aic error mask: 0x395856, para base: 0x100040200000, mte error: 0.[FUNC:ProcessDavidStarsCoreErrorInfo][FILE:device_error_proc_c.cc][LINE:580]
        TraceBack (most recent call last):
       The extend info: errcode:(334) errorStr: The data returned by the BIU to the VEC is incorrect. subErrType: 0x4.[FUNC:ProcessDavidStarsCoreErrorInfo][FILE:device_error_proc_c.cc][LINE:583]
       Kernel task happen error, retCode=0x31, [vector core exception].[FUNC:PreCheckTaskErr][FILE:davinci_kernel_task.cc][LINE:1728]
       AIV Kernel happen error, retCode=0x31.[FUNC:GetError][FILE:stream.cc][LINE:1478]
       [AIC_INFO] after execute:args print end[FUNC:GetError][FILE:stream.cc][LINE:1478]
       [DFX_INFO]Aicore kernel execute failed, device_id=1, stream_id=62, report_stream_id=62, task_id=0, flip_num=0, fault kernel_name=_Z18mscatter_kernel_2dPiS_S_, fault kernel info ext=_Z18mscatter_kernel_2dPiS_S_, program id=0, hash=279618682955286547.[FUNC:GetError][FILE:stream.cc][LINE:1478]
       rtStreamSynchronize execution failed, reason=vector core exception[FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:65]
       synchronize stream failed, runtime result = 507035[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:148]
[2026-05-08 14:36:08] ERROR: testcase failed (exit 1): mscatter

@HecreReed
Collaborator Author

/run a5 abs quant quant_asym mgather mscatter partarg

@reedhecre

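Received /run a5 abs quant quant_asym mgather mscatter partarg; the A5 board tester will process this request.

The page refreshes automatically, so you can check the current stage, queue status, and latest results directly.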

@reedhecre

A5 board test failed

  • Trigger mode: manual
  • Source commit: d8e11d0e28f1
  • Result summary: OK 0 / FAIL 5 / SKIP 1
  • Log: /root/ptoas-board-monitor-a5/logs/20260509_101705_manual_pr642.log
  • Manual command: /run a5 abs quant quant_asym mgather mscatter partarg
  • Triggered by: HecreReed
  • Selected cases: abs,quant,quant_asym,mgather,mscatter,partarg
  • Trigger comment: fix: stabilize A5 validation samples #642 (comment)
  • Failed stage: board-validation / exit=1

Failed cases

  • quant (run, exit=1)
  • quant_asym (run, exit=1)
  • mscatter (run, exit=1)
  • mgather (run, exit=1)
  • abs (run, exit=1)

@reedhecre

A5 board test failure details: PR #642

quant

stage=run info=exit=1

[ERROR] aclrtSetDevice(deviceId) failed: 507033 (/tmp/ptoas-board-monitor-a5/runs/20260509_101705_manual_pr642/npu_validation/Quant/quant/main.cpp:79)
[ERROR] RecentErrMsg: [PID: 220619] 2026-05-09-10:19:24.696.599 Invalid_Argument(EE1001): The argument is invalid.Reason: rtGetDevMsg execution failed, the context is a null pointer.
        Solution: 1.Check the input parameter range of the function. 2.Check the function invocation relationship.
        TraceBack (most recent call last):
        TsdOpen failed. devId=1, tdt error=1[FUNC:PrintfTsdError][FILE:runtime.cc][LINE:2618]
        Check param failed, dev can not be NULL![FUNC:DeviceRetain][FILE:runtime.cc][LINE:3536]
        Check param failed, dev can not be NULL![FUNC:PrimaryContextRetain][FILE:runtime.cc][LINE:3153]
        Check param failed, ctx can not be NULL![FUNC:PrimaryContextRetain][FILE:runtime.cc][LINE:3184]
        Check param failed, context can not be null.[FUNC:SetDevice][FILE:api_impl.cc][LINE:3321]
        rtSetDevice execution failed, reason=device retain error[FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:65]
        open device 1 failed, runtime result = 507033.[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:148]
        ctx is NULL![FUNC:GetDevErrMsg][FILE:api_impl.cc][LINE:6120]
        The argument is invalid.Reason: rtGetDevMsg execution failed, the context is a null pointer.
[2026-05-09 10:19:25] ERROR: testcase failed (exit 1): quant

quant_asym

stage=run info=exit=1

[ERROR] aclrtSetDevice(deviceId) failed: 507033 (/tmp/ptoas-board-monitor-a5/runs/20260509_101705_manual_pr642/npu_validation/Quant/quant_asym/main.cpp:83)
[ERROR] RecentErrMsg: [PID: 221534] 2026-05-09-10:19:27.933.285 Invalid_Argument(EE1001): The argument is invalid.Reason: rtGetDevMsg execution failed, the context is a null pointer.
        Solution: 1.Check the input parameter range of the function. 2.Check the function invocation relationship.
        TraceBack (most recent call last):
        TsdOpen failed. devId=1, tdt error=1[FUNC:PrintfTsdError][FILE:runtime.cc][LINE:2618]
        Check param failed, dev can not be NULL![FUNC:DeviceRetain][FILE:runtime.cc][LINE:3536]
        Check param failed, dev can not be NULL![FUNC:PrimaryContextRetain][FILE:runtime.cc][LINE:3153]
        Check param failed, ctx can not be NULL![FUNC:PrimaryContextRetain][FILE:runtime.cc][LINE:3184]
        Check param failed, context can not be null.[FUNC:SetDevice][FILE:api_impl.cc][LINE:3321]
        rtSetDevice execution failed, reason=device retain error[FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:65]
        open device 1 failed, runtime result = 507033.[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:148]
        ctx is NULL![FUNC:GetDevErrMsg][FILE:api_impl.cc][LINE:6120]
        The argument is invalid.Reason: rtGetDevMsg execution failed, the context is a null pointer.
[2026-05-09 10:19:28] ERROR: testcase failed (exit 1): quant_asym
[2026-05-09 10:19:28] SKIP: partarg (pto-isa missing TPARTARG intrinsics)

mscatter

stage=run info=exit=1

[ERROR] aclrtSetDevice(deviceId) failed: 507033 (/tmp/ptoas-board-monitor-a5/runs/20260509_101705_manual_pr642/npu_validation/Mscatter/mscatter/main.cpp:79)
[ERROR] RecentErrMsg: [PID: 222041] 2026-05-09-10:19:31.142.318 Invalid_Argument(EE1001): The argument is invalid.Reason: rtGetDevMsg execution failed, the context is a null pointer.
        Solution: 1.Check the input parameter range of the function. 2.Check the function invocation relationship.
        TraceBack (most recent call last):
        TsdOpen failed. devId=1, tdt error=1[FUNC:PrintfTsdError][FILE:runtime.cc][LINE:2618]
        Check param failed, dev can not be NULL![FUNC:DeviceRetain][FILE:runtime.cc][LINE:3536]
        Check param failed, dev can not be NULL![FUNC:PrimaryContextRetain][FILE:runtime.cc][LINE:3153]
        Check param failed, ctx can not be NULL![FUNC:PrimaryContextRetain][FILE:runtime.cc][LINE:3184]
        Check param failed, context can not be null.[FUNC:SetDevice][FILE:api_impl.cc][LINE:3321]
        rtSetDevice execution failed, reason=device retain error[FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:65]
        open device 1 failed, runtime result = 507033.[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:148]
        ctx is NULL![FUNC:GetDevErrMsg][FILE:api_impl.cc][LINE:6120]
        The argument is invalid.Reason: rtGetDevMsg execution failed, the context is a null pointer.
[2026-05-09 10:19:31] ERROR: testcase failed (exit 1): mscatter

mgather

stage=run info=exit=1

[ERROR] aclrtSetDevice(deviceId) failed: 507033 (/tmp/ptoas-board-monitor-a5/runs/20260509_101705_manual_pr642/npu_validation/Mgather/mgather/main.cpp:79)
[ERROR] RecentErrMsg: [PID: 222537] 2026-05-09-10:19:34.322.085 Invalid_Argument(EE1001): The argument is invalid.Reason: rtGetDevMsg execution failed, the context is a null pointer.
        Solution: 1.Check the input parameter range of the function. 2.Check the function invocation relationship.
        TraceBack (most recent call last):
        TsdOpen failed. devId=1, tdt error=1[FUNC:PrintfTsdError][FILE:runtime.cc][LINE:2618]
        Check param failed, dev can not be NULL![FUNC:DeviceRetain][FILE:runtime.cc][LINE:3536]
        Check param failed, dev can not be NULL![FUNC:PrimaryContextRetain][FILE:runtime.cc][LINE:3153]
        Check param failed, ctx can not be NULL![FUNC:PrimaryContextRetain][FILE:runtime.cc][LINE:3184]
        Check param failed, context can not be null.[FUNC:SetDevice][FILE:api_impl.cc][LINE:3321]
        rtSetDevice execution failed, reason=device retain error[FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:65]
        open device 1 failed, runtime result = 507033.[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:148]
        ctx is NULL![FUNC:GetDevErrMsg][FILE:api_impl.cc][LINE:6120]
        The argument is invalid.Reason: rtGetDevMsg execution failed, the context is a null pointer.
[2026-05-09 10:19:34] ERROR: testcase failed (exit 1): mgather

abs

stage=run info=exit=1

[ERROR] aclrtSetDevice(deviceId) failed: 507033 (/tmp/ptoas-board-monitor-a5/runs/20260509_101705_manual_pr642/npu_validation/Abs/abs/main.cpp:75)
[ERROR] RecentErrMsg: [PID: 223031] 2026-05-09-10:19:37.513.638 Invalid_Argument(EE1001): The argument is invalid.Reason: rtGetDevMsg execution failed, the context is a null pointer.
        Solution: 1.Check the input parameter range of the function. 2.Check the function invocation relationship.
        TraceBack (most recent call last):
        TsdOpen failed. devId=1, tdt error=1[FUNC:PrintfTsdError][FILE:runtime.cc][LINE:2618]
        Check param failed, dev can not be NULL![FUNC:DeviceRetain][FILE:runtime.cc][LINE:3536]
        Check param failed, dev can not be NULL![FUNC:PrimaryContextRetain][FILE:runtime.cc][LINE:3153]
        Check param failed, ctx can not be NULL![FUNC:PrimaryContextRetain][FILE:runtime.cc][LINE:3184]
        Check param failed, context can not be null.[FUNC:SetDevice][FILE:api_impl.cc][LINE:3321]
        rtSetDevice execution failed, reason=device retain error[FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:65]
        open device 1 failed, runtime result = 507033.[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:148]
        ctx is NULL![FUNC:GetDevErrMsg][FILE:api_impl.cc][LINE:6120]
        The argument is invalid.Reason: rtGetDevMsg execution failed, the context is a null pointer.
[2026-05-09 10:19:37] ERROR: testcase failed (exit 1): abs
[2026-05-09 10:19:38] === SUMMARY ===
[2026-05-09 10:19:38] OK=0 FAIL=5 SKIP=1
[2026-05-09 10:19:38] RESULTS_TSV=/tmp/ptoas-board-monitor-a5/runs/20260509_101705_manual_pr642/remote_npu_validation_results.tsv
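The run above fails differently from the earlier ones: every case stops in `aclrtSetDevice` with 507033 before any kernel launches, whereas the mscatter-only failures stop in `aclrtSynchronizeStream` with 507035. A rough triage helper could separate the two patterns automatically. The sketch below is hypothetical: the regular expressions are fitted to the log lines quoted in this thread, and the rule "only device-setup calls failed, so the board is the likely culprit" is an assumption, not documented runtime behavior.

```python
import re

ERROR_RE = re.compile(r"\[ERROR\] (?P<call>\w+)\(.*?\) failed: (?P<code>\d+)")
SUMMARY_RE = re.compile(r"OK=(\d+) FAIL=(\d+) SKIP=(\d+)")

# Assumption: a failure in device setup points at the board, not the kernel under test.
SETUP_CALLS = {"aclrtSetDevice"}


def triage(log_text):
    """Classify a board log as 'environment', 'kernel', or 'clean'."""
    calls = [(m["call"], int(m["code"])) for m in ERROR_RE.finditer(log_text)]
    summary = SUMMARY_RE.search(log_text)
    counts = tuple(int(g) for g in summary.groups()) if summary else None
    if calls and all(call in SETUP_CALLS for call, _ in calls):
        verdict = "environment"
    elif calls:
        verdict = "kernel"
    else:
        verdict = "clean"
    return {"verdict": verdict, "errors": calls, "counts": counts}


sample = (
    "[ERROR] aclrtSetDevice(deviceId) failed: 507033 "
    "(/tmp/ptoas-board-monitor-a5/runs/20260509_101705_manual_pr642/"
    "npu_validation/Quant/quant/main.cpp:79)\n"
    "[2026-05-09 10:19:38] OK=0 FAIL=5 SKIP=1\n"
)
print(triage(sample))
# {'verdict': 'environment', 'errors': [('aclrtSetDevice', 507033)], 'counts': (0, 5, 1)}
```

A log that mixes setup failures with later kernel failures is reported as "kernel", since at least one case reached the device.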

@HecreReed
Collaborator Author

/run a5 abs quant quant_asym mgather mscatter partarg

@reedhecre

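Received /run a5 abs quant quant_asym mgather mscatter partarg; the A5 board tester will process this request.

The page refreshes automatically, so you can check the current stage, queue status, and latest results directly.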

@reedhecre

A5 board test failed

  • Trigger mode: manual
  • Source commit: d8e11d0e28f1
  • Result summary: OK 4 / FAIL 1 / SKIP 1
  • Log: /root/ptoas-board-monitor-a5/logs/20260509_141705_manual_pr642.log
  • Manual command: /run a5 abs quant quant_asym mgather mscatter partarg
  • Triggered by: HecreReed
  • Selected cases: abs,quant,quant_asym,mgather,mscatter,partarg
  • Trigger comment: fix: stabilize A5 validation samples #642 (comment)
  • Failed stage: board-validation / exit=1

Failed cases

  • mscatter (run, exit=1)

@reedhecre

A5 board test failure details: PR #642

mscatter

stage=run info=exit=1

[ERROR] aclrtSynchronizeStream(stream) failed: 507035 (/tmp/ptoas-board-monitor-a5/runs/20260509_141705_manual_pr642/npu_validation/Mscatter/mscatter/main.cpp:99)
[ERROR] RecentErrMsg: EZ9999: Inner Error!
EZ9999[PID: 79475] 2026-05-09-14:20:03.840.387 (EZ9999):  The error from device(chipId:0, dieId:0), serial number is 4, there is an aivec error exception, core id is 0, error code = 334, dump info: pc start: 0x100040800000, current: 0x1000408000f0, sc error info: 0xffffffffffff, su error info: 0xe6f7d23d139c5bb7,0xcc3fd0e410009bfd, mte error info: 0x2051a, vec error info: 0x408001e000390037, cube error info: 0, l1 error info: 0, aic error mask: 0x395856, para base: 0x100040200000, mte error: 0.[FUNC:ProcessDavidStarsCoreErrorInfo][FILE:device_error_proc_c.cc][LINE:580]
        TraceBack (most recent call last):
       The extend info: errcode:(334) errorStr: The data returned by the BIU to the VEC is incorrect. subErrType: 0x4.[FUNC:ProcessDavidStarsCoreErrorInfo][FILE:device_error_proc_c.cc][LINE:583]
       Kernel task happen error, retCode=0x31, [vector core exception].[FUNC:PreCheckTaskErr][FILE:davinci_kernel_task.cc][LINE:1728]
       AIV Kernel happen error, retCode=0x31.[FUNC:GetError][FILE:stream.cc][LINE:1478]
       [AIC_INFO] after execute:args print end[FUNC:GetError][FILE:stream.cc][LINE:1478]
       [DFX_INFO]Aicore kernel execute failed, device_id=1, stream_id=62, report_stream_id=62, task_id=0, flip_num=0, fault kernel_name=_Z18mscatter_kernel_2dPiS_S_, fault kernel info ext=_Z18mscatter_kernel_2dPiS_S_, program id=0, hash=279618682955286547.[FUNC:GetError][FILE:stream.cc][LINE:1478]
       rtStreamSynchronize execution failed, reason=vector core exception[FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:65]
       synchronize stream failed, runtime result = 507035[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:148]
[2026-05-09 14:20:09] ERROR: testcase failed (exit 1): mscatter

Labels

None yet

Projects

None yet

2 participants