feat(Davinci-v2): Block-ROB for block-granularity precise exception support (DSP-003) #53
zhoubot wants to merge 12 commits into LinxISA:main
- vector4k.md: dataflow (TRegFile, control, crossbar, 128 groups, ping-pong Acc), N_run/N_tree/#W scheduling, simplified §9.7 shape table, Acc RMW vs bypass-to-DFF
- PTOISA/: tile ISA reference (linked from vector4k.md)

Made-with: Cursor
Made-with: Cursor
Davinci-v2 superscalar core spec extending v1 with three architectural upgrades:

1. TRegFile-4K with per-port `is_transpose` flag (per tregfile4k.md §7): row-mode or col-mode chunk-grid delivery at full 512 B/cy, no SRAM duplication, no extra latency. Eliminates most TILE.TRANSPOSE predecessors.
2. Vector unit re-architected to VEC-4K-v2 (per vector4k_v2.md): 3 source tile operands (A/B value + C bitmask), 2 dest (D0/D1), per-element predication with zero fetch-phase cost, tile metadata (shape.x/shape.y/format), SRAM-based staging, and three new ops: TINV (matrix inverse up to 128x128 FP32), TROWRANGE_MUL, TMRGSORT (reconfigurable bitonic sort up to N=8192).
3. Branch-prediction-driven speculative execution with ROB-less recovery (§11). Proves that ROB's three bundled services unbundle in the AI-kernel envelope: precise exceptions are out of envelope, in-order resource freeing uses refcount, and speculative-memory recovery is handled by a 24-entry Speculative Store Buffer + 8-entry Speculative Tile-Store Queue + 5K-gate Branch-Tag Speculation Tracker (8 tags + 8x8 ancestry bitmap). Total ~110K gate (~3.5% of v1 core area). Mispredict penalty stays at 6-7 cycles.

Net core area: ~3.41 mm^2 (+5% vs v1's ~3.26 mm^2).

Made-with: Cursor
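The "8 tags + 8x8 ancestry bitmap" tracker in item 3 can be pictured with a small model. The sketch below is only an illustration under assumptions the commit message does not state: branches allocate tags in program order, row `t` of the bitmap records which tags were in flight when tag `t` was allocated, and a mispredict squashes the offending tag plus every tag that lists it as an ancestor. The class and method names are hypothetical.

```python
NUM_TAGS = 8  # from the commit message: 8 tags, 8x8 ancestry bitmap


class BranchTagTracker:
    """Toy model of a branch-tag speculation tracker (assumed semantics)."""

    def __init__(self):
        self.live = 0                   # bit t set => tag t is in flight
        self.ancestry = [0] * NUM_TAGS  # row t: bitmask of older in-flight tags

    def allocate(self):
        """Give the next predicted branch a free tag, or None if all are busy."""
        for t in range(NUM_TAGS):
            if not (self.live >> t) & 1:
                self.ancestry[t] = self.live  # assumption: every live tag is an ancestor
                self.live |= 1 << t
                return t
        return None

    def resolve_correct(self, t):
        """Branch t was predicted correctly: free the tag and clear its column."""
        self.live &= ~(1 << t)
        for row in range(NUM_TAGS):
            self.ancestry[row] &= ~(1 << t)

    def mispredict(self, t):
        """Branch t was mispredicted: squash t and every tag descended from it."""
        squashed = [
            d for d in range(NUM_TAGS)
            if (self.live >> d) & 1 and (d == t or (self.ancestry[d] >> t) & 1)
        ]
        for d in squashed:
            self.live &= ~(1 << d)
            self.ancestry[d] = 0
        return squashed
```

With three nested predictions in flight, mispredicting the middle one squashes it and the youngest while leaving the oldest live; the real tracker's allocation and release rules are specified in §11 of the spec, not here.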
tregfile4k.md (~+450 lines):
- Diagonal/skewed bank map: bank_id = 8*g + ((l + g) mod 8)
- New per-read-port `is_transpose` bit (row-mode vs col-mode delivery)
- §7 transposed-read enhancement: full 512 B/cy in either order, bank-conflict-free, no SRAM duplication
- §6 rule R2: all 8 active reads of one epoch share is_transpose

vector4k.md (~-200 lines): trimmed to reflect VEC-4K v1 baseline that defers transpose / 3-operand / mask features to v2.

vector4k_v2.md (new, 1805 lines): full VEC-4K-v2 spec
- 3 source tile operands (A, B value + C 1-bit-per-element bitmask)
- 2 dest tiles (D0, D1) with dual retire
- Tile-register metadata (32 b: shape.x, shape.y, format)
- Explicit staging registers (SA, SB, SC, SX, SY, SOP): 1R1W SRAM baseline (24 macros), FF alternative for FPGA prototypes
- Variable-length operand-fetch prologue (8 / 16 cy)
- Microcode-driven beat machine; per-beat tilelet_xpose
- Unified ALU + Acc feedback (no dedicated Acc RMW adder)
- §7.5 novel ops: TINV (matrix inverse up to 128x128 FP32 / 16 tiles), TROWRANGE_MUL (column-wise product over dynamic row sub-range), TMRGSORT (reconfigurable 256-lane shuffle+CAS, any N=2^p up to 8192)
- §10 area & routing comparison vs v1: ~27% smaller (SRAM staging)

vector512.md (new): VEC-512 sibling spec for 512 B tiles (S=1).

Made-with: Cursor
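The skewed bank map quoted above can be checked numerically. This is a minimal sketch assuming `g` and `l` each range over 0..7 (64 banks in total); the commit message does not define the index ranges, so those bounds and the helper names are assumptions. It verifies two properties a diagonal skew is normally expected to have: the map is one-to-one, and for a fixed `l` the within-group slot `(l + g) mod 8` is different in every group.

```python
GROUPS = 8  # assumed range of g
LANES = 8   # assumed range of l


def bank_id(g, l):
    """Bank index from the commit message: 8*g + ((l + g) mod 8)."""
    return 8 * g + ((l + g) % 8)


def slot(g, l):
    """Position inside group g that holds index l (the skewed part)."""
    return (l + g) % 8


# One-to-one: 64 (g, l) pairs land on 64 distinct banks.
all_banks = {bank_id(g, l) for g in range(GROUPS) for l in range(LANES)}

# Diagonal property: walking down the groups at a fixed l visits every slot once.
column_slots = [slot(g, 3) for g in range(GROUPS)]
```

Whether these two properties are exactly what makes row-mode and col-mode reads conflict-free is argued in tregfile4k.md §7; the sketch only confirms the arithmetic of the formula.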
Minor updates to nv_shuffle.py and plot_tile16_vector_datapath.py; regenerate elementwise/expand/mergesort/nv_shuffle/reduce figures under tile16_figures/. Made-with: Cursor
Adds a real 3-operand TFMA/VFMA/VFNMA/VLERP instruction family to the Davinci-v2.1 vector ISA, motivated by FMA指令场景说明.md (LayerNorm γ·x̂+β, Welford updates, gelu/swiglu/sin/cos polynomials).

Changes:
- Promote operand C to a dual role (mask | value) via a new c_role bit.
- Bind a 3rd VEC-side TRegFile read port (R1=Port C); TRegFile-4K already has 8R, so this is a binding allocation only (~0 SRAM).
- 3-port parallel fetch keeps native TFMA at 8 cy fetch / ~10-12 cy end-to-end, same throughput as a binary VADD: ~2× speedup over emulated VMUL+VADD, single-rounding FMA precision.
- Hardware delta vs. v2.0: ~6 K gate (~0.2 % of VEC-4K-v2 area).

vector4k_v2.md (v0.18): §1 features, §3.1 ports (3R), §3.3a/c rewritten for dual role, §6.2/6.3 fetch cycle table extended for N_val=3, new §7.6 with full ISA semantics + LayerNorm worked example + hardware cost breakdown, §8/§10 updated, Document History 0.18 entry.

Davinci_superscalar_v2.md (v2.1): §2.2.2 operand model, §2.2.3 encoding, new §2.2.6a (VFMA/VFNMA/VLERP), §2.2.8 Category O, §8.3.7 latency table updated, Document History v2.1 entry.

Also adds:
- tregfile4k_v2.md: self-contained v2 spec with explicit v1/v2 versioning markers (companion to vector4k_v2.md).
- FMA指令场景说明.md: source motivation document.

Backward compatibility preserved: c_role defaults to MASK in v1/v2.0 binaries; R1 stays idle and clock-gated when no 3-source op is in flight.

Made-with: Cursor
Introduce LinxCore BCC-style scalar frontend/rename/issue pipeline:
- Split rename into D1/D2/D3 three-stage pipeline (decode → rename request → rename complete)
- Replace centralized Scalar RS with three physical IQs: alu_iq (48, 4-wide), bru_iq (16, 1-wide), lsu_iq (32, 2-wide)
- Replace CDB comparators (384) with Ready Table (128-bit bitmap): O(1) ptag lookup
- Age-matrix issue picker using RID-based sub-head age (mod 64, wrap-friendly)
- Replace RAT checkpoints with MapQ (12-entry speculative rename increment log)
- Adopt atag/ptag/MapQ naming throughout
- Preserve: 128 physical GPRs, multi-latency FUs, no-ROB recovery, SSB/STQ, AI kernel envelope

Change point LinxISA#1: adopt LinxCore BCC scalar pipeline into Davinci-v2.

Co-authored-by: Cursor <cursoragent@cursor.com>
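Two of the mechanisms listed above lend themselves to a short model: the Ready Table as a 128-bit bitmap indexed by ptag, and age comparison using RID distance from the queue head modulo 64. The sketch below is an illustration only. The reset state, the method names, and the exact definition of "sub-head age" as `(rid - head_rid) mod 64` are assumptions, not taken from the spec.

```python
NUM_PTAGS = 128  # 128 physical GPRs, one ready bit each
RID_MOD = 64     # RID wraps modulo 64 per the commit message


class ReadyTable:
    """128-bit ready bitmap: one bit test per source ptag, no CAM compare."""

    def __init__(self):
        self.bits = (1 << NUM_PTAGS) - 1  # assumption: all registers ready at reset

    def mark_pending(self, ptag):
        self.bits &= ~(1 << ptag)  # a producer for ptag was renamed

    def mark_ready(self, ptag):
        self.bits |= 1 << ptag     # the producer wrote back

    def is_ready(self, ptag):
        return bool((self.bits >> ptag) & 1)

    def can_issue(self, src_ptags):
        return all(self.is_ready(p) for p in src_ptags)


def sub_head_age(rid, head_rid):
    """Distance of rid from the current head, tolerant of RID wrap-around."""
    return (rid - head_rid) % RID_MOD


def is_older(rid_a, rid_b, head_rid):
    """True if rid_a precedes rid_b in program order relative to head_rid."""
    return sub_head_age(rid_a, head_rid) < sub_head_age(rid_b, head_rid)
```

For example, with the head at RID 60, RID 62 compares as older than RID 1 even though 62 > 1 numerically, which is the wrap-friendly behaviour the bullet refers to.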
…ame, Ready Table

Major architectural updates drawn from the LinxCore BCC design and the feature/davinci-bcc-scalar-pipeline PR, merged into the primary spec:
- Pipeline: extend from 12 → 17+ stages (F0→F1→F2→F3→IB→F4→D1→D2→D3→S1→S2→P1→I1→I2→E1→EX_n→W1)
- Rename: RAT checkpoints → CMAP/SMAP/MapQ three-table model; architectural tag (atag) / physical tag (ptag) terminology
- Issue queues: unified RS → 3 physical IQs (alu_iq 48, bru_iq 16, lsu_iq 32)
- Wakeup: CDB comparator arrays → Ready Table (128-bit bitmap, O(1) lookup)
- Issue picker: wide CAM → age-matrix cascaded pick (RID-based sub-head age)
- §1 params, §3 block diagram, §4 pipeline, §6 decode/rename, §7 dispatch/issue, §10 OoO model, §11 recovery all updated

Co-authored-by: Cursor <cursoragent@cursor.com>
…xecution (DSP-002)

Change Point LinxISA#2: adds warp-grouped VTG (Vector Thread Group) execution model on top of VEC-4K-v2, with a pre-allocated micro-instruction buffer in the vector ALU.

New files:
- Davinci_vtg_vector_micro_instructions_v1.md: standalone feature design doc

Davinci_superscalar_v2.md patches:
- §1 Key Parameters: GVIQ (32 entries), VTG count (16x256B/8x512B), micro-instruction buffer depth, SIMD lane counts
- §1 BCC Overlay: VTG execution/scheduling rows + new "BCC-Style Vector Pipeline Deltas" subsection
- §2.2.6 VTG Vector Micro-Instructions: 38-instruction families, GVIQ entry prefix, full-tile vs. VTG coexistence
- §3 Block Diagram: Vector Micro Block Builder, micro-instruction buffer, GVIQ, VTG Metadata Table, Group Read/Write Adapters
- §7.4 GVIQ: entry format (prefix + operand fields), micro-instruction buffer (16-entry, 2-way set assoc), VTG Ready Table, rotation scheduler, issue rules
- §8.3.10 VTG execution: staging register reuse, SIMD lane mapping, Group Read/Write Adapters, paired G256 issue
- §9.2.5 VTG: byte mapping tables, VTG Metadata Table (16/p_tile), rename policy
- §10.5.1 VTG dependency: two-level model (ptag + VTG ready bits), VTG Ready Table, write policies, no-VWAIT ordering
- §12.5.1 VTG memory: VLD/VST flow, inactive-lane fault suppression, strided/gather forms

Key concepts added:
- SIMD group: 128-lane VEC-4K-v2 internal execution unit
- VTG: warp-like scheduling context (256B or 512B)
- Micro-instruction buffer: pre-allocated in vector ALU, shared by VTGs
- GVIQ: 32-entry grouped vector IQ with block_id/pc_index/iter counters

Co-authored-by: Cursor <cursoragent@cursor.com>
Major revision to align VTG design with VEC-4K-v2 and TRegFile-4K-v2 hardware reality. Found and fixed 13 inconsistencies:

FATAL FIXES:
- F1: VTG Group Read Adapter no longer claims independent R0/R4 ports. VTG now operates BEHIND VEC-4K-v2 staging (SA/SB/SC), reading sub-ranges from already-fetched tiles at the ALU input mux.
- F2: Micro-instruction buffer now stores pre-decoded VEC beat-word sequences (same format as VEC-4K-v2 SOP), not V*-level instructions. VTG microassembler generates beat-word sequences at decode time.
- F3: Beat-level control fully specified: each VTG micro-op = 1-N VECBeatWord entries; SOP drives ALU per cycle.

HIGH-SEVERITY FIXES:
- H1: VTG prologue model added: 8-cycle TRegFile epoch (512 B/cy x 8) before sub-range selection.
- H2: VTG latency revised from 9 cycles to 25-32 cycles minimum (8-15 cy prologue + 1 cy compute + 16 cy RMW writeback).
- H3: Group Write Adapter now performs full-tile read-modify-write (16 cy minimum), not direct partial-write.
- H4: VEC-domain arbitration matrix defined: Vector RS > GVIQ > MTE RS.
- H5: VTG staging reuse clarified: reads from SA/SB, not new 256/512 B.

MEDIUM/LOW FIXES:
- Metadata unified: VTG metadata overlays Tile Metadata RAT entry (elem_type = format, no duplication).
- elem_type field removed; format from Tile Metadata RAT used directly.
- Combined throughput model added for shared VEC ALU.

Files changed:
- designs/outerCube/Davinci_vtg_vector_micro_instructions_v1.md: full rewrite to v1.1 with hardware-correct model
- designs/outerCube/Davinci_superscalar_v2.md: VTG sections updated in §2.2.6, §7.4, §8.3.10, §9.2.5, §10.5.1, §12.5.1

Co-authored-by: Cursor <cursoragent@cursor.com>
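The H4 ordering (Vector RS > GVIQ > MTE RS) reads as a fixed-priority arbiter for the shared VEC domain. A minimal sketch of that reading follows; treating the "arbitration matrix" as a strict fixed priority, and the requester names used as dictionary keys, are assumptions made for illustration.

```python
# Highest priority first, per H4: Vector RS > GVIQ > MTE RS.
PRIORITY = ("vector_rs", "gviq", "mte_rs")


def arbitrate(requests):
    """Return the winning requester for this cycle, or None if nobody asks.

    `requests` maps a requester name to True when it wants the VEC domain.
    Assumption: strict fixed priority with no aging or fairness term.
    """
    for name in PRIORITY:
        if requests.get(name, False):
            return name
    return None
```

Under this model the GVIQ wins only in cycles where the Vector RS is idle, so any starvation bound would have to come from rules in the spec itself.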
- Added §8.3.11: two concrete TSOFTMAX instantiations
  (A) Full-tile TSOFTMAX driven by VEC-4K-v2 microcode ROM
  (B) VTG variant TSOFTMAX_VTG driven by micro-instruction buffer
- Beat-word sequences for all 5 passes (42 beats) in full-tile case
- GVIQ dispatch / microassembler parameterization / GVIQ execution flow
- Comparison table and 5 key architectural takeaways
- Concrete example of format = TileMetadataRAT.format (no separate elem_type)

Co-authored-by: Cursor <cursoragent@cursor.com>
…on support (DSP-003)

DSP-003: Block Reorder Buffer (BROB) for Davinci-v2

Key changes:
- Add 128-entry BROB tracking instruction block lifetimes (BSTART->BSTOP)
- Block ID (BID): 64-bit, 8-bit slot + 56-bit sequence
- Block SSB (32 entries) and Block STQ (16 entries) for in-block store commit
- BSTOP retire gate: scalar_done && engine_done before block can retire
- Block-granularity precise exception: faulting block identified, younger blocks squashed
- MapQ reverse replay from faulting RID for instruction-precise P-reg recovery
- Full integration with existing MapQ, branch-tag tracker, RAT checkpoints, SSB/STQ, GVIQ
- Total v2.3 hardware cost: ~381 K gate (~0.085 mm² @ 5 nm)

Standalone design doc: designs/outerCube/Davinci_BlockROB_v1.md

§ updates: §1 (params), §3 (block diagram), §4.1 (pipeline stages), §10.1 (§10.2 (lifecycle), §11 intro, §11.7 (exception item), §11.8 (comparison), §11.9 (note), §11.10 (cost), §11.11 (new 12-subsection BROB specification)

Co-authored-by: Cursor <cursoragent@cursor.com>
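The BID layout and the BSTOP retire gate described above can be sketched as follows. This is a hedged illustration: the commit message gives only the field widths (8-bit slot, 56-bit sequence) and the gate condition `scalar_done && engine_done`. Placing the slot in the low bits, retiring only from the BROB head, and holding back a block with `has_exception` set are assumptions, and the function names are hypothetical.

```python
SLOT_BITS = 8   # from the commit message
SEQ_BITS = 56   # from the commit message; 8 + 56 = 64-bit BID


def pack_bid(slot, seq):
    """Build a 64-bit BID. Assumption: slot occupies the low 8 bits."""
    assert 0 <= slot < (1 << SLOT_BITS)
    assert 0 <= seq < (1 << SEQ_BITS)
    return (seq << SLOT_BITS) | slot


def unpack_bid(bid):
    """Split a BID back into (slot, seq) under the same layout assumption."""
    return bid & ((1 << SLOT_BITS) - 1), bid >> SLOT_BITS


def can_retire(entry, is_head):
    """BSTOP retire gate: scalar_done && engine_done.

    Assumptions: blocks retire in order from the head, and a block with
    has_exception set goes to the exception path instead of retiring.
    """
    return (
        is_head
        and entry["scalar_done"]
        and entry["engine_done"]
        and not entry["has_exception"]
    )
```

A block whose scalar side has finished but whose engine work is still outstanding stays at the head and blocks younger blocks from retiring, which is what makes the exception point block-precise in this model.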
Code Review
This pull request introduces several design documents for the Davinci-v2 architecture, including the Block-ROB design, VTG vector micro-instructions, and various PTO ISA instruction specifications. My review identified several inconsistencies in the Block-ROB and VTG designs, such as incorrect bit-width specifications for the BID field and mismatched latency summaries. Additionally, I found documentation issues in the PTO ISA files, including redundant sections, typos, leftover development artifacts, and incorrect file encoding (BOM). Please address these technical inconsistencies and clean up the documentation files as suggested.
| 8. Set needs_scalar = 1, needs_engine = 0 | ||
| 9. Set scalar_done = 0, engine_done = 0, has_exception = 0 | ||
| 10. Push MapQ entry: {checkpoint_id, RID of BSTART, ...} | ||
| 11. Stamp all uops in block with bid (3 bits in iROB entry) |
The design specifies a 128-entry BROB (line 147), but the iROB entry only allocates 3 bits for the bid field (lines 251, 262, 715). A 3-bit field can only distinguish 8 unique blocks, which will lead to aliasing and incorrect squashing logic when more than 8 blocks are in flight. To support 128 entries, the bid field in the iROB should be widened to at least 7 bits. Additionally, line 133 incorrectly specifies BID[7:0] (8 bits) for a 128-entry range (0..127), which should be 7 bits (BID[6:0]).
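The width argument in this comment is plain index arithmetic and can be checked in a few lines. The helper names below are illustrative and do not come from the design doc.

```python
def index_bits(entries):
    """Minimum number of bits needed to index `entries` distinct slots."""
    return max(1, (entries - 1).bit_length())


def distinct_ids(bits):
    """Number of distinct values a field of `bits` bits can hold."""
    return 1 << bits


brob_entries = 128
needed = index_bits(brob_entries)  # width required to name every BROB entry
with_3_bits = distinct_ids(3)      # what the current 3-bit iROB bid field covers
```

Here `needed` evaluates to 7 and `with_3_bits` to 8, so a 3-bit `bid` aliases once more than 8 blocks are in flight, while an 8-bit field would cover 256 values for a 0..127 range.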
| head_rid: 7 b -- RID of first uop in block (iROB index) | ||
| tail_rid: 7 b -- RID of last uop in block |
| | F2 | Micro-instruction format mismatch: `MicroOpEntry` {opcode, elem_type} does not match VEC-4K-v2's 64-bit beat-word format {src_*, s_*, xp_*, alu_op, acc_op, ...}. | Micro-instruction buffer stores **pre-decoded VEC beat-word sequences** rather than V*-level instructions. VTG microassembler generates beat words from V* operands. | | ||
| | F3 | Beat-level control undefined: VTG microcode is V*-level (38 opcodes) but VEC-4K-v2 ALU is driven beat-by-beat. | VTG micro-instructions are expanded into **per-beat word sequences** by the VTG microassembler. Each VTG op = 1–N beat words. | | ||
| | H1 | TRegFile epoch timing: "TRegFile read at I1" implied full tile immediately; ignores 8-cycle epoch. | Revised lifecycle with **prologue model**: VTG submits TRegFile read request; full tile delivered over 8-cycle epoch; sub-range selection happens after prologue. | | ||
| | H2 | VTG latency (9 cycles) ignored prologue (8–15 cy) and writeback RMW (16 cy). | Revised to **T_fetch + 1 + T_writeback** = 9–23 cy minimum, plus prologue penalty for alignment. | |
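The latency mismatch flagged in the review can be reproduced from the component figures in the commit message for this revision (8-15 cy prologue, 1 cy compute, 16 cy RMW writeback). The sketch below simply adds them; it assumes the three components are serial and that T_fetch in the table row corresponds to the prologue, which the excerpt does not confirm.

```python
def vtg_latency(prologue_cy, compute_cy=1, writeback_cy=16):
    """Serial sum of the three phases named in the H2 fix (assumed non-overlapping)."""
    return prologue_cy + compute_cy + writeback_cy


best_case = vtg_latency(8)    # shortest prologue quoted in the commit message
worst_case = vtg_latency(15)  # longest prologue quoted in the commit message
```

With those inputs the range comes out as 25 to 32 cycles, matching the commit message, whereas the "9–23 cy minimum" in the H2 table row cannot be obtained from the same three terms.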
| ```cpp | ||
| template <typename TileDataDst, typename TileDataSrc, typename... WaitEvents> | ||
| PTO_INST RecordEvent TGET_SCALE_ADDR(TileDataDst &dst, TileDataSrc &src, aitEvents&... events); |
| > wa | ||
| using namespace pto; |
| > wa | ||
| using namespace pto; | ||
| template <typename T, int ARows, int ACols, BRows, BCols> |
| </task_progress> | ||
| - [x] Explore existing docs/isa for documentation style and format | ||
| - [x] Read tcolargmax and tcolargmin A2A3 implementation in include/ | ||
| - [x] Read tcolargmax and tcolargmin A5 implementation in include/ | ||
| - [x] Read test cases for tcolargmax and tcolargmin | ||
| - [x] Understand A2A3 vs A5 differences and tmp handling | ||
| - [x] Write tcolargmax English documentation (docs/isa/TCOLARGMAX.md) | ||
| - [ ] Write tcolargmax Chinese documentation (docs/isa/TCOLARGMAX_zh.md) | ||
| - [ ] Verify documentation completeness and accuracy | ||
| </task_progress> | ||
| </write_to_file> No newline at end of file |
| ### IR Level 1 (SSA) | ||
| ```text | ||
| %dst = pto.tabs %src : !pto.tile<...> -> !pto.tile<...> | ||
| ``` | ||
| ### IR Level 2 (DPS) | ||
| ```text | ||
| pto.tabs ins(%src : !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>) | ||
| ``` |
| @@ -0,0 +1,133 @@ | |||
| # TABS | |||
Summary
designs/outerCube/Davinci_BlockROB_v1.md
Key Changes to Davinci_superscalar_v2.md
New Hardware Structures
Block Lifecycle
Precise Exception Mechanism
Hardware Cost
Sections Updated
Test Plan
Related PRs
Made with Cursor