Implement Galerkin Transformer for Operator Learning

## Summary

Implement the Galerkin Transformer, which replaces softmax attention with linear attention inspired by Petrov-Galerkin projection for PDE operator learning.

## Reference

- Cao, "Choose a Transformer: Fourier or Galerkin," *NeurIPS 2021*. [arXiv:2105.14995](https://arxiv.org/abs/2105.14995)

## Description

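The Galerkin Transformer removes softmax normalization from scaled dot-product attention. The paper proposes two softmax-free variants: Galerkin-type attention, Q(K^T V), which is analogous to a Petrov-Galerkin projection in finite element methods, and Fourier-type attention, (QK^T)V, which is analogous to a Fourier-type integral operator. In place of softmax, each variant applies layer normalization to two of the three latent matrices (K and V for Galerkin-type, Q and K for Fourier-type) and scales the result by 1/n, where n is the sequence length. The paper reports lower training cost and better accuracy than softmax-normalized counterparts on operator learning tasks.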

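Key features (n is the sequence length, d the per-head feature dimension):
- Softmax-free attention, with layer normalization taking the place of softmax
- Galerkin-type: Q(K^T V), analogous to Petrov-Galerkin projection; linear in sequence length, O(n d^2)
- Fourier-type: (QK^T)V, analogous to a Fourier integral operator; still quadratic in sequence length, O(n^2 d)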
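The two attention variants can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not the reference implementation: the function names `layer_norm`, `galerkin_attention`, and `fourier_attention` are hypothetical, the layer norm here has no learnable scale or shift, and the Q/K/V projections, multi-head split, and positional encodings are omitted.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each position over the feature axis (no learnable parameters)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def galerkin_attention(q, k, v):
    """Galerkin-type: Q (K^T V) / n, with layer norm on K and V.

    The (d, d) product K^T V is formed first, so cost is O(n d^2).
    """
    n = q.shape[0]
    kv = layer_norm(k).T @ layer_norm(v)  # shape (d, d)
    return q @ kv / n                     # shape (n, d)

def fourier_attention(q, k, v):
    """Fourier-type: (Q K^T) V / n, with layer norm on Q and K.

    The (n, n) score matrix is formed first, so cost is O(n^2 d).
    """
    n = q.shape[0]
    scores = layer_norm(q) @ layer_norm(k).T  # shape (n, n)
    return scores @ v / n                     # shape (n, d)

# Toy inputs: n = 64 grid points, d = 8 features per head (assumed sizes).
rng = np.random.default_rng(0)
n, d = 64, 8
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))

out_g = galerkin_attention(q, k, v)
out_f = fourier_attention(q, k, v)
print(out_g.shape, out_f.shape)
```

Because there is no softmax between the two matrix products, the Galerkin-type output can be computed in either association order; multiplying K^T V first is what avoids the n-by-n intermediate.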
