[feat] FU-level fusion #244
HardwarePattern(int64_t i, const std::string& n, int64_t f);
};

struct HardwareSlot {
Please define/comment HardwareSlot with an example: FUs in the same slot cannot execute at the same time. Also explain which cases slots are good for.
Added. You can check it.
// Execution plan for a pattern on a hardware template.
struct PatternExecutionPlan {
  int64_t patternId;
Please refactor all variable names to snake_case, e.g. patternId -> pattern_id.
Updated.
// RUN: --fold-constant \
// RUN: --transform-ctrl-to-data-flow \
// RUN: --fold-constant \
// RUN: --iter-merge-pattern="min-support=3 max-iter=4" \
Please remind me: after merging, is the II improved?
Yes! In test.mlir, rec_mii decreases from 9 to 8 and res_mii decreases from 5 to 3.
With the same mapping strategy (customize), the mapping II decreases from 12 to 9.
// CHECK-HARDWARE-MERGE: "template_id": 0,
// CHECK-HARDWARE-MERGE: "instance_count": 2,
// CHECK-HARDWARE-MERGE: "supported_single_ops": ["neura.gep", "neura.load", "neura.phi_start", "neura.store"],
// CHECK-HARDWARE-MERGE: "supported_composite_ops": [
// CHECK-HARDWARE-MERGE: {"pattern_id": 10, "name": "phi_start->fused_op:gep->load"},
// CHECK-HARDWARE-MERGE: {"pattern_id": 0, "name": "gep->load"}
// CHECK-HARDWARE-MERGE: ],
// CHECK-HARDWARE-MERGE: "slots": [
// CHECK-HARDWARE-MERGE: {"slot_id": 0, "supported_ops": ["neura.phi_start"]},
// CHECK-HARDWARE-MERGE: {"slot_id": 1, "supported_ops": ["neura.gep"]},
// CHECK-HARDWARE-MERGE: {"slot_id": 2, "supported_ops": ["neura.load"]}
// CHECK-HARDWARE-MERGE: ],
// CHECK-HARDWARE-MERGE: "slot_connections": {
// CHECK-HARDWARE-MERGE: "connections": [{"from": 0, "to": 1}, {"from": 1, "to": 2}]
// CHECK-HARDWARE-MERGE: },
// CHECK-HARDWARE-MERGE: "pattern_execution_plans": [
// CHECK-HARDWARE-MERGE: {
// CHECK-HARDWARE-MERGE: "pattern_id": 10,
// CHECK-HARDWARE-MERGE: "pattern_name": "phi_start->fused_op:gep->load",
// CHECK-HARDWARE-MERGE: "slot_mapping": [0, 1, 2],
// CHECK-HARDWARE-MERGE: "execution_stages": [
// CHECK-HARDWARE-MERGE: {
// CHECK-HARDWARE-MERGE: "stage": 0,
// CHECK-HARDWARE-MERGE: "parallel_slots": [0],
// CHECK-HARDWARE-MERGE: "parallel_ops": ["neura.phi_start"]
// CHECK-HARDWARE-MERGE: },
// CHECK-HARDWARE-MERGE: {
// CHECK-HARDWARE-MERGE: "stage": 1,
// CHECK-HARDWARE-MERGE: "parallel_slots": [1],
// CHECK-HARDWARE-MERGE: "parallel_ops": ["neura.gep"]
// CHECK-HARDWARE-MERGE: },
// CHECK-HARDWARE-MERGE: {
// CHECK-HARDWARE-MERGE: "stage": 2,
// CHECK-HARDWARE-MERGE: "parallel_slots": [2],
// CHECK-HARDWARE-MERGE: "parallel_ops": ["neura.load"]
// CHECK-HARDWARE-MERGE: }
// CHECK-HARDWARE-MERGE: ]
// CHECK-HARDWARE-MERGE: },
// CHECK-HARDWARE-MERGE: {
// CHECK-HARDWARE-MERGE: "pattern_id": 0,
// CHECK-HARDWARE-MERGE: "pattern_name": "gep->load",
// CHECK-HARDWARE-MERGE: "slot_mapping": [1, 2],
// CHECK-HARDWARE-MERGE: "execution_stages": [
// CHECK-HARDWARE-MERGE: {
// CHECK-HARDWARE-MERGE: "stage": 0,
// CHECK-HARDWARE-MERGE: "parallel_slots": [1],
// CHECK-HARDWARE-MERGE: "parallel_ops": ["neura.gep"]
// CHECK-HARDWARE-MERGE: },
// CHECK-HARDWARE-MERGE: {
// CHECK-HARDWARE-MERGE: "stage": 1,
// CHECK-HARDWARE-MERGE: "parallel_slots": [2],
// CHECK-HARDWARE-MERGE: "parallel_ops": ["neura.load"]
// CHECK-HARDWARE-MERGE: }
// CHECK-HARDWARE-MERGE: ]
// CHECK-HARDWARE-MERGE: }
// CHECK-HARDWARE-MERGE: ]
// CHECK-HARDWARE-MERGE: },
// CHECK-HARDWARE-MERGE: {
// CHECK-HARDWARE-MERGE: "template_id": 1,
// CHECK-HARDWARE-MERGE: "instance_count": 3,
// CHECK-HARDWARE-MERGE: "supported_single_ops": ["neura.grant_once", "neura.grant_predicate", "neura.icmp"],
Can we use this example to show what each field means (maybe draw it), and put it into the PR's description?
Field Explanations

| Field | Description |
|---|---|
| template_id | Unique identifier for this template |
| instance_count | Number of instances of this template |
| supported_single_ops | Individual operations this template can execute standalone |
| supported_composite_ops | Fused patterns this template can execute |
| slots | Array of slot definitions with their supported operations |
| slot_connections | Data routing paths between slots |
| pattern_execution_plans | Detailed execution schedules for each pattern |
Execution Plan Fields

| Field | Description |
|---|---|
| pattern_id | ID of the pattern being executed |
| pattern_name | Human-readable pattern name |
| slot_mapping | Maps operation index to slot ID: [op0→slot0, op1→slot1, ...]; e.g. [1, 2] means slot 1 executes op0 and slot 2 executes op1 of this pattern |
| execution_stages | Ordered stages of execution |
| parallel_slots | Slots executing in this stage (can be multiple for parallel ops) |
| parallel_ops | Operations executing in this stage |
Note that parallel_ops corresponds element-wise to parallel_slots, e.g.:
Example
Template 1 (instance_count: 3)
══════════════════════════════════════════════════════════════════
Pipeline Structure:
┌────────────────┐ ┌─────────────────────┐ ┌─────────────────────┐
│ Slot 0 │ ──── │ Slot 1 │ ──── │ Slot 2 │
│ icmp │ │ grant_predicate │ │ grant_predicate │
└────────────────┘ └─────────────────────┘ └─────────────────────┘
Supported Patterns:
- Pattern 1: icmp->grant_predicate->grant_predicate (full pipeline)
- Pattern 3: icmp->grant_predicate (bypass slot 1)
- Pattern 2: grant_predicate->grant_predicate (bypass slot 0)
Pattern 1 (icmp->grant_predicate->grant_predicate) shows parallel execution:
Stage 0: Slot 0 executes icmp
Stage 1: Slot 1 AND Slot 2 execute grant_predicate IN PARALLEL
(both depend only on icmp output)
{
"pattern_id": 1,
"pattern_name": "fused_op:icmp->grant_predicate->grant_predicate",
"slot_mapping": [0, 1, 2],
"execution_stages": [
{
"stage": 0,
"parallel_slots": [0],
"parallel_ops": ["neura.icmp"]
},
{
"stage": 1,
"parallel_slots": [1, 2],
"parallel_ops": ["neura.grant_predicate", "neura.grant_predicate"]
}
]
}

Note: In Stage 1, slots 1 and 2 execute simultaneously because both grant_predicate operations have the same topological level (both depend only on icmp).
This pass implements FU-level fusion after DFG-level fusion (PR 194). It aims to find the minimum-cost set of FU templates that covers all patterns extracted by the previous passes (wrapped in fused_op). The algorithm is depicted below: