[ICLR2025, ICML2025, NeurIPS2025 Spotlight] Quantized Attention achieves a 2-5x speedup over FlashAttention without degrading end-to-end metrics across language, image, and video models.
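The core idea is to quantize Q and K to low precision before the score matmul and dequantize before the softmax, so the expensive inner product runs on INT8 tensor cores while accuracy is preserved. The sketch below is a minimal illustration of that idea, not the repo's CUDA kernel; the per-tensor symmetric scaling scheme and function names are assumptions.

```python
# Minimal sketch of quantized attention (illustrative, not the repo's kernel):
# quantize Q and K to INT8 with per-tensor scales, compute scores, dequantize
# before softmax, and keep the P @ V product in higher precision.
import torch
import torch.nn.functional as F

def int8_quantize(x):
    """Symmetric per-tensor INT8 quantization; returns int8 tensor and its scale."""
    scale = x.abs().amax().clamp(min=1e-6) / 127.0
    q = torch.clamp(torch.round(x / scale), -127, 127).to(torch.int8)
    return q, scale

def quantized_attention(q, k, v):
    # q, k, v: (batch, heads, seq, dim) in fp16/fp32
    q_i8, q_scale = int8_quantize(q)
    k_i8, k_scale = int8_quantize(k)
    # Integer matmul emulated in fp32 here; a real kernel would use INT8 tensor cores.
    scores = (q_i8.float() @ k_i8.float().transpose(-1, -2)) * (q_scale * k_scale)
    scores = scores / q.shape[-1] ** 0.5
    probs = F.softmax(scores, dim=-1)
    return probs.to(v.dtype) @ v  # V kept in higher precision
```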
PyTorch implementation of Efficient Infinite Context Transformers with Infini-attention, plus a QwenMoE implementation, a training script, and 1M-context passkey retrieval.
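Infini-attention processes the sequence segment by segment: each segment runs standard local softmax attention, retrieves long-range context from a compressive linear-attention memory, and mixes the two with a learned per-head gate before the memory is updated with the segment's keys and values. A minimal single-segment sketch follows; the variable names (`memory`, `norm_term`, `gate`) are illustrative, not the repo's API.

```python
# Minimal single-segment sketch of the Infini-attention mechanism (assumed names).
import torch
import torch.nn.functional as F

def infini_attention_step(q, k, v, memory, norm_term, gate):
    # q, k, v: (batch, heads, seq, dim); memory: (batch, heads, dim, dim)
    # norm_term: (batch, heads, dim, 1); gate: (heads,) learned scalar per head
    sigma_q, sigma_k = F.elu(q) + 1.0, F.elu(k) + 1.0

    # Retrieve long-term context from the compressive memory (linear-attention read).
    mem_out = (sigma_q @ memory) / (sigma_q @ norm_term).clamp(min=1e-6)

    # Standard causal softmax attention over the current segment.
    local_out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

    # Learned gate mixes long-term (memory) and local context.
    g = torch.sigmoid(gate).view(1, -1, 1, 1)
    out = g * mem_out + (1.0 - g) * local_out

    # Update memory and normalizer with the current segment's keys/values.
    memory = memory + sigma_k.transpose(-1, -2) @ v
    norm_term = norm_term + sigma_k.sum(dim=-2, keepdim=True).transpose(-1, -2)
    return out, memory, norm_term
```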
The official PyTorch implementation for CascadedGaze: Efficiency in Global Context Extraction for Image Restoration, TMLR'24.
Unofficial PyTorch implementation of the paper "cosFormer: Rethinking Softmax In Attention".
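cosFormer replaces the softmax with ReLU feature maps plus a cosine positional re-weighting cos((π/2)(i−j)/M), which decomposes via cos(a−b) = cos(a)cos(b) + sin(a)sin(b) so attention stays linear in sequence length. Below is a minimal non-causal sketch following the paper's formulation, not the unofficial repo's exact code.

```python
# Minimal non-causal cosFormer sketch: ReLU kernel + cos re-weighting, O(N * d^2).
import math
import torch
import torch.nn.functional as F

def cosformer_attention(q, k, v):
    # q, k, v: (batch, heads, seq, dim)
    seq_len = q.shape[-2]
    q, k = F.relu(q), F.relu(k)

    idx = torch.arange(seq_len, device=q.device, dtype=q.dtype)
    angle = (math.pi / 2) * idx / seq_len              # (pi/2) * i / M
    cos_w = torch.cos(angle).view(1, 1, -1, 1)
    sin_w = torch.sin(angle).view(1, 1, -1, 1)

    q_cos, q_sin = q * cos_w, q * sin_w
    k_cos, k_sin = k * cos_w, k * sin_w

    # Linear attention on each decomposed term, then sum.
    num = q_cos @ (k_cos.transpose(-1, -2) @ v) + q_sin @ (k_sin.transpose(-1, -2) @ v)
    den = q_cos @ k_cos.sum(-2, keepdim=True).transpose(-1, -2) \
        + q_sin @ k_sin.sum(-2, keepdim=True).transpose(-1, -2)
    return num / den.clamp(min=1e-6)
```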
PyTorch implementation of "Compact Global Descriptor for Neural Networks" (CGD).
Implementation of: Hydra Attention: Efficient Attention with Many Heads (https://arxiv.org/abs/2209.07484)
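Hydra attention takes multi-head attention to the extreme of one head per feature channel; with a cosine-similarity kernel the whole operation collapses to elementwise products and a single sum over tokens, giving O(N·d) cost. A minimal sketch of the global (non-causal) form, following the paper rather than this repo's code:

```python
# Minimal sketch of Hydra attention's global form: heads == feature channels.
import torch
import torch.nn.functional as F

def hydra_attention(q, k, v):
    # q, k, v: (batch, seq, dim)
    q = F.normalize(q, dim=-1)                 # cosine-similarity feature map
    k = F.normalize(k, dim=-1)
    kv = (k * v).sum(dim=-2, keepdim=True)     # (batch, 1, dim): global mixing
    return q * kv                              # (batch, seq, dim)
```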
Official Implementation of SEA: Sparse Linear Attention with Estimated Attention Mask (ICLR 2024)
Nonparametric Modern Hopfield Models
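For context, the dense retrieval rule that modern Hopfield models (and the nonparametric variants here) build on is a single attention-like update over the stored patterns, ξ_new = X softmax(β Xᵀ ξ). A minimal sketch with an assumed inverse temperature `beta`:

```python
# Minimal sketch of the modern Hopfield retrieval update (illustrative shapes).
import torch

def hopfield_retrieve(patterns, query, beta=8.0, steps=1):
    # patterns: (num_stored, dim); query: (dim,)
    xi = query
    for _ in range(steps):
        attn = torch.softmax(beta * (patterns @ xi), dim=0)  # (num_stored,)
        xi = attn @ patterns                                  # retrieved pattern
    return xi
```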
Official repository for "SSA: Sparse Sparse Attention by Aligning Full and Sparse Attention Outputs in Feature Space"
O(N) attention with a bounded inference KV cache. D4 Daubechies wavelet field + content-gated Q·K gather at dyadic offsets.
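As a heavily simplified illustration of the "gather at dyadic offsets" part (the D4 wavelet field and bounded KV cache of the repo are omitted, and all names are assumptions): each position attends only to positions 1, 2, 4, 8, ... steps behind it, so the number of gathered keys per token is logarithmic rather than linear.

```python
# Simplified sketch: content-gated softmax over keys gathered at dyadic offsets.
import torch

def dyadic_gather_attention(q, k, v):
    # q, k, v: (batch, seq, dim)
    batch, seq, dim = q.shape
    offsets = [1 << p for p in range((seq - 1).bit_length()) if (1 << p) < seq]
    idx = torch.arange(seq, device=q.device)
    gathered_k, gathered_v, masks = [], [], []
    for off in offsets:
        src = (idx - off).clamp(min=0)                  # position `off` steps back
        gathered_k.append(k[:, src, :])
        gathered_v.append(v[:, src, :])
        masks.append((idx - off >= 0).view(1, seq, 1))
    K = torch.stack(gathered_k, dim=2)                  # (batch, seq, n_off, dim)
    V = torch.stack(gathered_v, dim=2)
    mask = torch.stack(masks, dim=2)                    # (1, seq, n_off, 1)
    scores = (K * q.unsqueeze(2)).sum(-1, keepdim=True) / dim ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    probs = torch.nan_to_num(torch.softmax(scores, dim=2))
    return (probs * V).sum(dim=2)
```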
Minimal implementation of Samba by Microsoft in PyTorch
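Samba interleaves Mamba (state-space) layers with sliding-window attention and MLP blocks. The sketch below covers only the sliding-window attention half, as a minimal illustration; the window size is an assumed hyperparameter and this is not the repo's module.

```python
# Minimal sketch of the sliding-window causal attention used in Samba-style stacks.
import torch
import torch.nn.functional as F

def sliding_window_attention(q, k, v, window=2048):
    # q, k, v: (batch, heads, seq, dim)
    seq = q.shape[-2]
    i = torch.arange(seq, device=q.device)
    delta = i.view(-1, 1) - i.view(1, -1)
    # Token i may attend to tokens j with i - window < j <= i.
    mask = (delta >= 0) & (delta < window)
    return F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
```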
Resources and references on solved and unsolved problems in attention mechanisms.
🤖 Build a customizable, reliable Discord bot with Sage, designed for flexibility to enhance your server's interaction and engagement.