Implement Galerkin Transformer for Operator Learning #116

@ChrisRackauckas-Claude

Summary

Implement the Galerkin Transformer, which replaces softmax attention with linear attention inspired by Petrov-Galerkin projection for PDE operator learning.

Reference

  • Cao, "Choose a Transformer: Fourier or Galerkin," NeurIPS 2021. arXiv:2105.14995

Description

The Galerkin Transformer removes softmax normalization from scaled dot-product attention. The paper proposes two softmax-free variants: Galerkin-type attention, Q(K^T V), which mimics a Petrov-Galerkin projection as used in finite element methods, and Fourier-type attention, (QK^T)V. Cao reports significant improvements in both training cost and evaluation accuracy over softmax-normalized counterparts on operator learning tasks.

Key features:

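  • Softmax-free attention; for sequence length n and feature dimension d, the Galerkin-type variant costs O(n d^2), which is linear in n, while the Fourier-type variant remains O(n^2 d)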
  • Galerkin-type: Q(K^T V) — analogous to Petrov-Galerkin projection
  • Fourier-type: (QK^T)V — analogous to Fourier integral operator
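
As a rough illustration of the two variants listed above, here is a minimal single-head NumPy sketch. It is not the planned implementation: the function names are hypothetical, there are no learned projections or multiple heads, and the parameter-free layer normalization and the 1/n scaling follow my reading of the paper and should be treated as assumptions to check against it.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Normalize each row (token) to zero mean and unit variance.

    Assumption: no learnable scale or shift, to keep the sketch short.
    """
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def galerkin_attention(q, k, v):
    """Galerkin-type: Q (K^T V) / n.

    K^T V is a (d, d) matrix, so the cost is O(n d^2), linear in n.
    """
    n = q.shape[0]
    k, v = layer_norm(k), layer_norm(v)  # assumed normalization of K and V
    return q @ (k.T @ v) / n


def fourier_attention(q, k, v):
    """Fourier-type: (Q K^T) V / n.

    Q K^T is an (n, n) matrix, so the cost is O(n^2 d), quadratic in n.
    """
    n = q.shape[0]
    q, k = layer_norm(q), layer_norm(k)  # assumed normalization of Q and K
    return (q @ k.T) @ v / n


rng = np.random.default_rng(0)
n, d = 64, 8  # sequence length (grid points) and feature dimension
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))

out_g = galerkin_attention(q, k, v)
out_f = fourier_attention(q, k, v)
print(out_g.shape, out_f.shape)  # (64, 8) (64, 8)
```

Without softmax, the product is associative, so Q(K^T V) and (QK^T)V agree when no normalization is applied; the variants differ in which factors are normalized and in the order of multiplication, which is what determines the cost.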
