
Conversation

@HobbitQia (Collaborator):

This pass implements FU-level fusion after DFG-level fusion (PR 194); it aims to find the minimum-cost set of FUs that covers all patterns extracted in previous passes (wrapped in `fused_op`).

The algorithm proceeds as follows:

  1. Pattern Extraction: Extracts fused operation patterns from the module and linearizes them via topological sort.
  2. Standalone Operation Extraction: Collects standalone operations not inside fused patterns for hardware coverage.
  3. Template Creation: Greedily merges patterns into shared hardware templates using cost-based accommodation with DFS mapping search.
  4. Connection Generation: Generates optimized slot connections based on pattern dependencies with bypass support.
  5. Execution Plan Generation: Creates parallel execution stages by grouping operations at the same topological level.
  6. JSON Output: Writes hardware configuration including templates, connections, and execution plans to JSON file.
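Steps 1 and 5 above can be sketched in a few lines of Python (an illustrative model, not the actual MLIR pass, which is implemented in C++): a pattern DAG is linearized by topological sort, and ops at the same topological depth are grouped into one parallel execution stage.

```python
# Sketch: compute topological levels of a pattern DAG and group ops at the
# same level into parallel execution stages (simplified model of steps 1 & 5).
from collections import defaultdict, deque

def topological_levels(num_ops, edges):
    """edges: list of (src, dst) dependencies. Returns ops grouped by level."""
    indegree = [0] * num_ops
    succs = defaultdict(list)
    for src, dst in edges:
        succs[src].append(dst)
        indegree[dst] += 1
    level = [0] * num_ops
    queue = deque(op for op in range(num_ops) if indegree[op] == 0)
    while queue:
        op = queue.popleft()
        for nxt in succs[op]:
            # An op's level is one past its deepest predecessor.
            level[nxt] = max(level[nxt], level[op] + 1)
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)
    stages = defaultdict(list)
    for op, lvl in enumerate(level):
        stages[lvl].append(op)
    return [stages[lvl] for lvl in sorted(stages)]

# icmp (op 0) feeding two grant_predicate ops (1, 2): both land in stage 1.
print(topological_levels(3, [(0, 1), (0, 2)]))  # [[0], [1, 2]]
```

Ops sharing a stage have no dependency between them, which is what allows the pass to schedule them on different slots in the same cycle.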

@HobbitQia HobbitQia requested a review from tancheng January 22, 2026 13:34
HardwarePattern(int64_t i, const std::string& n, int64_t f);
};

struct HardwareSlot {
Contributor:

Please define/comment the HW slot with an example. FUs in the same slot cannot be executed at the same time. Which cases are slots good for?

Collaborator (Author):

Added. You can check it.
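For illustration, the slot semantics discussed here can be modeled in a minimal Python sketch (a hypothetical helper, not part of the pass): FUs in the same slot share hardware and cannot fire in the same cycle, so a stage is only valid if each of its ops maps to a distinct slot.

```python
# Sketch of the slot constraint: ops mapped to the same slot conflict;
# ops mapped to different slots may execute in the same cycle.
def has_slot_conflict(stage_slots):
    """stage_slots: slot IDs used by the ops scheduled in one stage.
    The stage is valid only if every op uses a distinct slot."""
    return len(stage_slots) != len(set(stage_slots))

print(has_slot_conflict([1, 2]))  # False: different slots, parallel OK
print(has_slot_conflict([1, 1]))  # True: same slot, ops must serialize
```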


// Execution plan for a pattern on a hardware template.
struct PatternExecutionPlan {
int64_t patternId;
Contributor:

Please refactor all the variable naming: patternId -> pattern_id.

Collaborator (Author):

Updated.

// RUN: --fold-constant \
// RUN: --transform-ctrl-to-data-flow \
// RUN: --fold-constant \
// RUN: --iter-merge-pattern="min-support=3 max-iter=4" \
Contributor:

Please remind me: after merging, would the II be improved?

Collaborator (Author):

Yes! In test.mlir, rec_mii decreases from 9 to 8 and res_mii decreases from 5 to 3.

If we use the same mapping strategy (customize), the mapping II will decrease from 12 to 9.
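The ResMII trend can be reproduced with back-of-envelope arithmetic (a simplified model; the op and tile counts below are hypothetical, chosen only to illustrate the 5 -> 3 direction reported above): fusing several ops into one `fused_op` reduces the number of ops the mapper must place, which lowers ResMII = ceil(num_ops / num_tiles).

```python
# Back-of-envelope ResMII model. The counts are illustrative, not taken
# from test.mlir: fusion shrinks the schedulable op count, so the
# resource-constrained lower bound on II drops.
from math import ceil

def res_mii(num_ops, num_tiles):
    return ceil(num_ops / num_tiles)

print(res_mii(20, 4))  # before fusion: 20 standalone ops
print(res_mii(12, 4))  # after fusion: fused ops count once each
```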

Comment on lines +139 to +199
// CHECK-HARDWARE-MERGE: "template_id": 0,
// CHECK-HARDWARE-MERGE: "instance_count": 2,
// CHECK-HARDWARE-MERGE: "supported_single_ops": ["neura.gep", "neura.load", "neura.phi_start", "neura.store"],
// CHECK-HARDWARE-MERGE: "supported_composite_ops": [
// CHECK-HARDWARE-MERGE: {"pattern_id": 10, "name": "phi_start->fused_op:gep->load"},
// CHECK-HARDWARE-MERGE: {"pattern_id": 0, "name": "gep->load"}
// CHECK-HARDWARE-MERGE: ],
// CHECK-HARDWARE-MERGE: "slots": [
// CHECK-HARDWARE-MERGE: {"slot_id": 0, "supported_ops": ["neura.phi_start"]},
// CHECK-HARDWARE-MERGE: {"slot_id": 1, "supported_ops": ["neura.gep"]},
// CHECK-HARDWARE-MERGE: {"slot_id": 2, "supported_ops": ["neura.load"]}
// CHECK-HARDWARE-MERGE: ],
// CHECK-HARDWARE-MERGE: "slot_connections": {
// CHECK-HARDWARE-MERGE: "connections": [{"from": 0, "to": 1}, {"from": 1, "to": 2}]
// CHECK-HARDWARE-MERGE: },
// CHECK-HARDWARE-MERGE: "pattern_execution_plans": [
// CHECK-HARDWARE-MERGE: {
// CHECK-HARDWARE-MERGE: "pattern_id": 10,
// CHECK-HARDWARE-MERGE: "pattern_name": "phi_start->fused_op:gep->load",
// CHECK-HARDWARE-MERGE: "slot_mapping": [0, 1, 2],
// CHECK-HARDWARE-MERGE: "execution_stages": [
// CHECK-HARDWARE-MERGE: {
// CHECK-HARDWARE-MERGE: "stage": 0,
// CHECK-HARDWARE-MERGE: "parallel_slots": [0],
// CHECK-HARDWARE-MERGE: "parallel_ops": ["neura.phi_start"]
// CHECK-HARDWARE-MERGE: },
// CHECK-HARDWARE-MERGE: {
// CHECK-HARDWARE-MERGE: "stage": 1,
// CHECK-HARDWARE-MERGE: "parallel_slots": [1],
// CHECK-HARDWARE-MERGE: "parallel_ops": ["neura.gep"]
// CHECK-HARDWARE-MERGE: },
// CHECK-HARDWARE-MERGE: {
// CHECK-HARDWARE-MERGE: "stage": 2,
// CHECK-HARDWARE-MERGE: "parallel_slots": [2],
// CHECK-HARDWARE-MERGE: "parallel_ops": ["neura.load"]
// CHECK-HARDWARE-MERGE: }
// CHECK-HARDWARE-MERGE: ]
// CHECK-HARDWARE-MERGE: },
// CHECK-HARDWARE-MERGE: {
// CHECK-HARDWARE-MERGE: "pattern_id": 0,
// CHECK-HARDWARE-MERGE: "pattern_name": "gep->load",
// CHECK-HARDWARE-MERGE: "slot_mapping": [1, 2],
// CHECK-HARDWARE-MERGE: "execution_stages": [
// CHECK-HARDWARE-MERGE: {
// CHECK-HARDWARE-MERGE: "stage": 0,
// CHECK-HARDWARE-MERGE: "parallel_slots": [1],
// CHECK-HARDWARE-MERGE: "parallel_ops": ["neura.gep"]
// CHECK-HARDWARE-MERGE: },
// CHECK-HARDWARE-MERGE: {
// CHECK-HARDWARE-MERGE: "stage": 1,
// CHECK-HARDWARE-MERGE: "parallel_slots": [2],
// CHECK-HARDWARE-MERGE: "parallel_ops": ["neura.load"]
// CHECK-HARDWARE-MERGE: }
// CHECK-HARDWARE-MERGE: ]
// CHECK-HARDWARE-MERGE: }
// CHECK-HARDWARE-MERGE: ]
// CHECK-HARDWARE-MERGE: },
// CHECK-HARDWARE-MERGE: {
// CHECK-HARDWARE-MERGE: "template_id": 1,
// CHECK-HARDWARE-MERGE: "instance_count": 3,
// CHECK-HARDWARE-MERGE: "supported_single_ops": ["neura.grant_once", "neura.grant_predicate", "neura.icmp"],
Contributor:

Can we use this example to show what each field means (maybe draw it), and put it into the PR's description?

Collaborator (Author):

Field Explanations

| Field | Description |
| --- | --- |
| `template_id` | Unique identifier for this template |
| `instance_count` | Number of instances of this template |
| `supported_single_ops` | Individual operations this template can execute standalone |
| `supported_composite_ops` | Fused patterns this template can execute |
| `slots` | Array of slot definitions with their supported operations |
| `slot_connections` | Data routing paths between slots |
| `pattern_execution_plans` | Detailed execution schedule for each pattern |

Execution Plan Fields

| Field | Description |
| --- | --- |
| `pattern_id` | ID of the pattern being executed |
| `pattern_name` | Human-readable name of the pattern |
| `slot_mapping` | Maps operation index to slot ID: `[op0→slot0, op1→slot1, ...]`; e.g. `[1, 2]` means slot 1 and slot 2 execute op0 and op1 of this pattern |
| `execution_stages` | Ordered stages of execution |
| `parallel_slots` | Slots executing in this stage (can be multiple for parallel ops) |
| `parallel_ops` | Operations executing in this stage |

Note that `parallel_ops` corresponds element-wise to `parallel_slots`. For example:

Example

Template 1 (instance_count: 3)
══════════════════════════════════════════════════════════════════

Pipeline Structure:
┌────────────────┐      ┌─────────────────────┐      ┌─────────────────────┐
│    Slot 0      │ ──── │       Slot 1        │ ──── │       Slot 2        │
│     icmp       │      │  grant_predicate    │      │  grant_predicate    │
└────────────────┘      └─────────────────────┘      └─────────────────────┘

Supported Patterns:
  - Pattern 1: icmp->grant_predicate->grant_predicate (full pipeline)
  - Pattern 3: icmp->grant_predicate (bypass slot 1)
  - Pattern 2: grant_predicate->grant_predicate (bypass slot 0)

Pattern 1 (icmp->grant_predicate->grant_predicate) shows parallel execution:

Stage 0: Slot 0 executes icmp
Stage 1: Slot 1 AND Slot 2 execute grant_predicate IN PARALLEL
         (both depend only on icmp output)
{
  "pattern_id": 1,
  "pattern_name": "fused_op:icmp->grant_predicate->grant_predicate",
  "slot_mapping": [0, 1, 2],
  "execution_stages": [
    {
      "stage": 0,
      "parallel_slots": [0],
      "parallel_ops": ["neura.icmp"]
    },
    {
      "stage": 1,
      "parallel_slots": [1, 2],
      "parallel_ops": ["neura.grant_predicate", "neura.grant_predicate"]
    }
  ]
}

Note: In Stage 1, slots 1 and 2 execute simultaneously because both grant_predicate operations have the same topological level (both depend on icmp).
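A consumer of this JSON can check those invariants directly. Below is a sketch (with the example plan inlined; field names follow the output shown above, and `check_plan` is a hypothetical helper, not part of the pass):

```python
# Sketch: walk a pattern_execution_plans entry and verify the stage/slot
# invariants -- slots within a stage are distinct (same-slot ops cannot run
# in the same cycle), parallel_ops pairs one-to-one with parallel_slots, and
# every referenced slot comes from slot_mapping.
import json

plan = json.loads("""
{
  "pattern_id": 1,
  "pattern_name": "fused_op:icmp->grant_predicate->grant_predicate",
  "slot_mapping": [0, 1, 2],
  "execution_stages": [
    {"stage": 0, "parallel_slots": [0], "parallel_ops": ["neura.icmp"]},
    {"stage": 1, "parallel_slots": [1, 2],
     "parallel_ops": ["neura.grant_predicate", "neura.grant_predicate"]}
  ]
}
""")

def check_plan(plan):
    for stage in plan["execution_stages"]:
        slots, ops = stage["parallel_slots"], stage["parallel_ops"]
        assert len(slots) == len(set(slots)), "slot reused within a stage"
        assert len(slots) == len(ops), "slots/ops must pair one-to-one"
    used = {s for st in plan["execution_stages"] for s in st["parallel_slots"]}
    assert used <= set(plan["slot_mapping"]), "stage uses an unmapped slot"
    return True

print(check_plan(plan))  # True
```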
