Add FA PTO lit regression cases #609
Conversation
|
/run a3 test/lit/pto/fa_perf.pto |
|
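Received
The page refreshes automatically; you can check the current stage, queue status, and latest results directly. |
A3 board test succeeded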
|
|
/run a3 test/lit/pto/fa.pto |
|
Received
The page refreshes automatically; you can check the current stage, queue status, and latest results directly. |
Code Review
This pull request introduces new test files for the pto service. The review identified critical issues regarding incorrect pipe initialization parameters, specifically the dir_mask and tensor view shapes. Additionally, several opportunities for code cleanup were noted, such as removing redundant constant definitions and moving loop-invariant constants outside of loops.
%qk_slot_desc = pto.make_tensor_view %21, shape = [%c128, %c256], strides = [%c256, %c1] : !pto.tensor_view<128x256xf32>
pto.aiv_initialize_pipe{id = 25, dir_mask = 1, slot_size = 131072} (gm_slot_tensor = %qk_slot_desc : !pto.tensor_view<128x256xf32>)
%pv_slot_desc = pto.make_tensor_view %22, shape = [%c128, %c128_0], strides = [%c128_0, %c1] : !pto.tensor_view<128x128xf32>
pto.aiv_initialize_pipe{id = 27, dir_mask = 1, slot_size = 65536} (gm_slot_tensor = %pv_slot_desc : !pto.tensor_view<128x128xf32>)
The dir_mask for pipe 27 in vector_kernel appears to be incorrect. This kernel acts as a consumer for pipe 27 (as shown by tpop_from_aic), so the dir_mask should be 2 (consumer), not 1 (producer).
pto.aiv_initialize_pipe{id = 27, dir_mask = 2, slot_size = 65536} (gm_slot_tensor = %pv_slot_desc : !pto.tensor_view<128x128xf32>)
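The rule applied in this comment can be written down as a small consistency check. The sketch below is in Python rather than PTO IR. The mask values (1 = producer, 2 = consumer) come from the review comment; the PipeInit record and the "push"/"pop" labels are hypothetical stand-ins for illustration, not part of any PTO API.

```python
from dataclasses import dataclass

PRODUCER = 1  # dir_mask value the review comment calls "producer"
CONSUMER = 2  # dir_mask value the review comment calls "consumer"


@dataclass
class PipeInit:
    """Hypothetical record of one initialize_pipe call inside a kernel."""
    pipe_id: int
    dir_mask: int


def check_dir_mask(init: PipeInit, ops_on_pipe: set[str]) -> list[str]:
    """Compare the declared dir_mask with how the kernel uses the pipe.

    ops_on_pipe holds the hypothetical labels "push" and/or "pop".
    A kernel that both pushes and pops on one pipe is not modeled here.
    """
    problems = []
    if "pop" in ops_on_pipe and init.dir_mask != CONSUMER:
        problems.append(
            f"pipe {init.pipe_id}: kernel pops from it, "
            f"expected dir_mask = {CONSUMER}, got {init.dir_mask}"
        )
    if "push" in ops_on_pipe and init.dir_mask != PRODUCER:
        problems.append(
            f"pipe {init.pipe_id}: kernel pushes to it, "
            f"expected dir_mask = {PRODUCER}, got {init.dir_mask}"
        )
    return problems


# The case flagged in the review: pipe 27 is popped but declared with mask 1.
for problem in check_dir_mask(PipeInit(pipe_id=27, dir_mask=1), {"pop"}):
    print(problem)
```

With dir_mask = 2, as in the suggested change, the same call returns an empty list.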
%pv_slot_desc = pto.make_tensor_view %22, shape = [%c64, %c128_0], strides = [%c128_0, %c1] : !pto.tensor_view<64x128xf32>
pto.aiv_initialize_pipe{id = 27, dir_mask = 1, slot_size = 65536} (gm_slot_tensor = %pv_slot_desc : !pto.tensor_view<64x128xf32>)
There appear to be two inconsistencies in the initialization of pipe 27:
- The dir_mask should be 2 (consumer), not 1, because this kernel consumes from pipe 27 (see tpop_from_aic calls).
- The gm_slot_tensor shape is 64x128xf32, which mismatches the producer's (cube_kernel) shape of 128x128xf32 for the same pipe. The global memory layout of a pipe slot should be consistent.
%pv_slot_desc = pto.make_tensor_view %22, shape = [%c128, %c128_0], strides = [%c128_0, %c1] : !pto.tensor_view<128x128xf32>
pto.aiv_initialize_pipe{id = 27, dir_mask = 2, slot_size = 65536} (gm_slot_tensor = %pv_slot_desc : !pto.tensor_view<128x128xf32>)
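The shape mismatch can also be seen from simple arithmetic. The Python sketch below assumes that slot_size is a byte count; this thread does not state that, but it is consistent with the numbers in the diff (128x256xf32 next to slot_size = 131072, 128x128xf32 next to slot_size = 65536). The helper names are made up for illustration.

```python
from math import prod

# Element sizes in bytes for the two element types that appear in the diff.
ELEM_BYTES = {"f32": 4, "f16": 2}


def parse_view_type(text: str) -> tuple[tuple[int, ...], str]:
    """Split a static shape string such as "128x128xf32" into dims and dtype."""
    *dims, dtype = text.split("x")
    return tuple(int(d) for d in dims), dtype


def view_bytes(text: str) -> int:
    """Number of bytes covered by a dense view of the given static type."""
    shape, dtype = parse_view_type(text)
    return prod(shape) * ELEM_BYTES[dtype]


print(view_bytes("128x256xf32"))  # 131072
print(view_bytes("128x128xf32"))  # 65536
print(view_bytes("64x128xf32"))   # 32768
```

Under that assumption, the 64x128xf32 view covers only half of the declared slot_size = 65536, while the 128x128xf32 view matches it exactly.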
%c256_13 = arith.constant 256 : index
%c0_14 = arith.constant 0 : index
%51 = arith.addi %c256_13, %c0_14 : index
%52 = pto.partition_view %41, offsets = [%c0, %51], sizes = [%c128_0, %c128_1] : !pto.tensor_view<?x?xf16> -> !pto.partition_tensor_view<128x128xf16>
The constant %c256_13 is redefined inside the loop, and then used in a redundant addition with zero. You can simplify this by using the existing %c256 constant (defined at line 9) directly in the pto.partition_view operation.
%52 = pto.partition_view %41, offsets = [%c0, %c256], sizes = [%c128_0, %c128_1] : !pto.tensor_view<?x?xf16> -> !pto.partition_tensor_view<128x128xf16>
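The simplification suggested here combines two steps: adding zero is an identity, and a constant redefined inside the loop can be replaced by an equal constant from the enclosing scope. A minimal Python sketch of that reasoning follows. It models values only by name and integer value and ignores types, and the function names are hypothetical rather than taken from any MLIR pass.

```python
from typing import Optional


def simplify_addi(lhs: str, rhs: str, constants: dict[str, int]) -> Optional[str]:
    """Return the operand that lhs + rhs equals when the other operand is 0."""
    if constants.get(rhs) == 0:
        return lhs
    if constants.get(lhs) == 0:
        return rhs
    return None  # not an add-with-zero; nothing to simplify


def reuse_outer_constant(name: str, inner: dict[str, int], outer: dict[str, int]) -> str:
    """Prefer an outer-scope constant that has the same value as `name`."""
    value = inner[name]
    for outer_name, outer_value in outer.items():
        if outer_value == value:
            return outer_name
    return name


# Constants from the diff: two defined inside the loop, two assumed outside.
inner = {"%c256_13": 256, "%c0_14": 0}
outer = {"%c0": 0, "%c256": 256}

folded = simplify_addi("%c256_13", "%c0_14", inner)
print(reuse_outer_constant(folded, inner, outer))  # %c256
```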
%42 = pto.make_tensor_view %arg1, shape = [%c2048, %c128_0], strides = [%c128_0, %c1] : !pto.tensor_view<?x?xf32>
scf.for %arg2 = %14 to %18 step %c1 {
%43 = arith.muli %arg2, %c128 : index
%c394752_i64 = arith.constant 394752 : i64
A3 board test succeeded
|
Codex Review
This comment is automatically updated by the review bot.
Summary
PR #609 adds FA lit cases, but they only assert successful compilation and do not actually guard the FA preload/split lowering behavior that the PR is meant to cover with regression tests.
Findings
Both |
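The summary above separates "the tool exited successfully" from "the output contains the expected lowering". The Python sketch below illustrates the second kind of check, in the spirit of the ordered CHECK lines used with LLVM's FileCheck. The sample output and the patterns are invented placeholders, not real ptoas output.

```python
def missing_in_order(output: str, patterns: list[str]) -> list[str]:
    """Return the patterns that cannot be found in `output` in the given order.

    Each pattern is searched for after the end of the previous match, so a
    pattern that only appears earlier in the text counts as missing.
    """
    missing = []
    pos = 0
    for pattern in patterns:
        idx = output.find(pattern, pos)
        if idx < 0:
            missing.append(pattern)
        else:
            pos = idx + len(pattern)
    return missing


# Invented placeholder text standing in for generated code.
sample_output = "kernel_begin\n  init_pipe 27\n  pop 27\nkernel_end\n"

# Passes: both steps appear, in this order.
print(missing_in_order(sample_output, ["init_pipe 27", "pop 27"]))  # []

# Fails: a test that only checks the exit code would not notice this.
print(missing_in_order(sample_output, ["pop 27", "init_pipe 27"]))  # ['init_pipe 27']
```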
|
/run a3 ../lit/pto/fa.pto |
|
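Received
The page refreshes automatically; you can check the current stage, queue status, and latest results directly. |
A3 board test succeeded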
|
|
/run a3 test/lit/pto/fa_perf.pto |
|
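Received
The page refreshes automatically; you can check the current stage, queue status, and latest results directly. |
A3 board test failed
Log tail |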
|
/run a3 test/lit/pto/fa_perf.pto --pto-level=level3 |
|
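Received
The page refreshes automatically; you can check the current stage, queue status, and latest results directly. |
A3 board test failed
Log tail |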
What is the driver program (C++ main entry) used to run this on-board test? I tried launching the kernel with torch-npu (see ir_ref/launch_kernel) but got a run-time error: (I am using the ptoas 0.36 release to generate the cpp). In comparison, the manual C++ version runs fine (cpp_ref/split_pipe). My test environment is this Dockerfile, as used by huawei-csl/pto-dsl#130 |
|
/run a3 test/lit/pto/fa_perf.pto --pto-level=level3 |
|
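Received
The page refreshes automatically; you can check the current stage, queue status, and latest results directly. |
The run just now was not successful, but that was because of the GitHub robot |
A3 board test failed
Log tail |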
|
/run a3 test/lit/pto/fa.pto --pto-level=level3 |
|
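Received
The page refreshes automatically; you can check the current stage, queue status, and latest results directly. |
A3 board test failed
Log tail |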
|
/run a3 test/lit/pto/fa.pto --pto-level=level3 |
|
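Received
The page refreshes automatically; you can check the current stage, queue status, and latest results directly. |
A3 board test failed
Failed cases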
|
A3 board test failure details: PR #609fa
|
|
pto-isa-feature-subtile/tests/npu/a2a3/src/st/testcase/fa_ptoas_gm_pipe_smoke/main.cpp #include "acl/acl.h" using namespace std; template <int32_t tilingKey> class FaPtoasGmPipeTest : public testing::Test { static std::string GetGoldenDir() TEST_F(FaPtoasGmPipeTest, case_half_128x4096) } |
|
pto-isa-feature-subtile/tests/npu/a2a3/src/st/testcase/fa_ptoas_gm_pipe_smoke/fa_ptoas_gm_pipe_smoke_kernel.cpp #include "fa_perf_smoke_c220.inc" template <int32_t tilingKey> template void LaunchFaPtoasGmPipe<1>(uint8_t *ffts, uint8_t *q, uint8_t *kt, uint8_t *v, uint8_t *pFifo, |
|
/run a5 |
|
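Received
The page refreshes automatically; you can check the current stage, queue status, and latest results directly. |
A5 board test failed
Log tail |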
|
/run a5 |
|
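Received
The page refreshes automatically; you can check the current stage, queue status, and latest results directly. |
A5 board test failed
Log tail |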
No description provided.