Enhanced fork of ActionFormer with multi-GPU training and modern transformer architecture.
| Feature | Description | Docs |
|---|---|---|
| Multi-GPU Training | DDP, AMP, gradient accumulation | Guide |
| Transformer v2 | Flash Attention, RoPE, RMSNorm, SwiGLU | Guide |
| Detection Quality | EIoU, DIoU-NMS, temperature scaling | Guide |
| SnapFormer | Heatmap-based point detection for instant events | Config |
| TBTFormer | Boundary Distribution Regression for noisy labels | Config |
| Framework Utils | Config validation, adaptive ranges, post-processing | Guide |
| Use Cases | Configuration recommendations | Guide |
```shell
# Multi-GPU training with all optimizations
torchrun --nproc_per_node=4 train_ddp.py configs/thumos_i3d.yaml --amp --output my_exp

# Test the improvements
python test_improvements.py
```

- DistributedDataParallel for efficient multi-GPU training via torchrun
- Automatic Mixed Precision (AMP) for ~2x faster training
- Gradient Accumulation for larger effective batch sizes
- Distributed Validation during training
- Multi-node Support for cluster training
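As a back-of-the-envelope illustration of how the GPU count and gradient accumulation compose (a pure-Python sketch; the helper names are ours, not part of the repo):

```python
# Effective batch size under DDP with gradient accumulation:
# effective = per_gpu_batch * world_size * accum_steps
def effective_batch_size(per_gpu_batch, world_size, accum_steps):
    return per_gpu_batch * world_size * accum_steps

def accumulation_schedule(num_batches, accum_steps):
    """Yield (batch_idx, do_step): the optimizer only steps every
    `accum_steps` micro-batches, so gradients sum across them."""
    for i in range(num_batches):
        yield i, (i + 1) % accum_steps == 0

print(effective_batch_size(2, 4, 2))  # 4 GPUs x batch 2 x accum 2 -> 16
print([step for _, step in accumulation_schedule(6, 2)])
```

With `--accum-steps 2` on 4 GPUs, a per-GPU batch of 2 behaves like a single-step batch of 16 (gradients are averaged across ranks by DDP and summed across micro-batches).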
| Component | Description | Benefit |
|---|---|---|
| Flash Attention | PyTorch 2.0+ optimized attention | 2-4x faster, O(T) memory |
| RoPE | Rotary Position Embeddings | Better length generalization |
| RMSNorm | Root Mean Square normalization | 1.8x faster than LayerNorm |
| SwiGLU | Gated Linear Unit FFN | Better quality per parameter |
| GQA | Grouped Query Attention | Faster inference |
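Of the components above, RMSNorm is the simplest to show in a few lines. This is an illustrative sketch (not the repo's implementation) that makes clear where the speedup over LayerNorm comes from:

```python
import math

def rms_norm(x, gain, eps=1e-6):
    # RMSNorm rescales features by their root mean square. Unlike
    # LayerNorm, it skips mean subtraction and the bias term, which
    # is why it is cheaper per call (sketch for a single vector).
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gain, x)]

out = rms_norm([3.0, -4.0], gain=[1.0, 1.0])
print([round(v, 3) for v in out])  # direction preserved, unit-RMS scale
```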
```yaml
model:
  backbone_type: convTransformerv2
  backbone:
    use_rope: true
    use_flash_attn: true
    use_swiglu: true
    use_rms_norm: true
```

| Component | Description | Benefit |
|---|---|---|
| EIoU Loss | Enhanced IoU with aspect ratio penalty | Better convergence |
| QualityFocal Loss | IoU-aware classification | Aligns confidence with localization |
| DIoU-NMS | Distance-aware suppression | Better handling of overlapping events |
| Temperature Scaling | Score calibration | Improved probability estimates |
| Class-specific NMS | Per-class suppression params | Tailored for event types |
| Deeper Heads (v2) | Skip connections | Better gradient flow |
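A minimal sketch of how temperature scaling calibrates confidence scores. We assume the temperature divides the logit before the sigmoid, which is the usual formulation; the fork's test_cfg implementation is the authoritative reference:

```python
import math

def calibrate(logit, temperature=1.3):
    # Temperature scaling (sketch): dividing the logit by T > 1
    # before the sigmoid softens over-confident predictions while
    # preserving the ranking of detections.
    return 1.0 / (1.0 + math.exp(-logit / temperature))

raw = calibrate(3.0, temperature=1.0)  # plain sigmoid, no calibration
cal = calibrate(3.0)                   # softened with T = 1.3
print(round(raw, 3), round(cal, 3))    # cal < raw, same ordering
```

Because the mapping is monotonic, calibration changes probability estimates (useful for thresholding) without changing which detections rank highest.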
```yaml
test_cfg:
  use_diou_nms: true
  score_temperature: 1.3
  class_sigma:
    0: 0.7  # short events
    1: 0.4  # medium events
```

For detecting instant/point events (e.g., snap detection in football, keystroke detection), SnapFormer uses heatmap regression instead of segment boundary regression.
| Component | Description |
|---|---|
| Gaussian Heatmap | Targets are Gaussian peaks centered at event frames |
| Focal Loss | CornerNet-style focal loss for class imbalance |
| Cross-Scale FPN | Bidirectional PANet-style feature fusion |
| Peak Detection | Inference via local maxima finding |
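The heatmap target and peak-detection inference can be sketched as follows (a toy illustration with our own function names, not SnapFormer's API; sigma is illustrative):

```python
import math

def gaussian_heatmap(num_frames, event_frames, sigma=2.0):
    # Training target: pointwise max of Gaussian peaks centered
    # at the annotated event frames.
    heat = [0.0] * num_frames
    for c in event_frames:
        for t in range(num_frames):
            heat[t] = max(heat[t], math.exp(-((t - c) ** 2) / (2 * sigma ** 2)))
    return heat

def detect_peaks(heat, threshold=0.5):
    # Inference: local maxima above a score threshold.
    return [t for t in range(1, len(heat) - 1)
            if heat[t] >= threshold and heat[t] >= heat[t - 1] and heat[t] > heat[t + 1]]

heat = gaussian_heatmap(16, [4, 11])
print(detect_peaks(heat))  # -> [4, 11], the annotated event frames
```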
When to use SnapFormer vs ActionFormer:
- SnapFormer: Events with near-zero duration (snaps, clicks, impacts)
- ActionFormer: Events with meaningful duration (diving, running, speaking)
```yaml
# SnapFormer config for point detection
model:
  meta_arch: "SnapFormer"
  fpn_type: "cs_fpn"  # Cross-scale FPN (bidirectional)
  # ... other params same as ActionFormer
```

```yaml
# Standard ActionFormer for segment detection
model:
  meta_arch: "LocPointTransformer"
  fpn_type: "fpn"  # Standard top-down FPN
```

For action segmentation with noisy annotations or uncertain boundaries, TBTFormer uses probabilistic boundary regression instead of direct offset prediction.
| Component | Description |
|---|---|
| BDR Head | Predicts probability distribution over boundary locations |
| Distribution Focal Loss | Cross-entropy over adjacent bins for smooth learning |
| Scaled Backbone | Optional 16 heads, 6x MLP for larger capacity |
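Distribution Focal Loss can be sketched in a few lines. This is our own simplified version following the standard DFL formulation (cross-entropy over the two bins adjacent to a continuous target), not necessarily this repo's exact code:

```python
import math

def distribution_focal_loss(probs, target):
    # DFL (sketch): the continuous boundary target falls between two
    # bins; take cross-entropy on both, weighted by proximity, so the
    # predicted distribution learns to peak around the true boundary.
    lo = int(target)        # left bin index
    hi = lo + 1             # right bin index
    w_hi = target - lo      # closer to hi -> more weight on hi
    w_lo = 1.0 - w_hi
    return -(w_lo * math.log(probs[lo]) + w_hi * math.log(probs[hi]))

# reg_max = 16 bins; most mass already on bins 3 and 4, target 3.3
p = [0.002] * 16
p[3], p[4] = 0.7, 0.272
print(round(distribution_focal_loss(p, 3.3), 3))
```

Because the target is spread over adjacent bins rather than snapped to one, gradients stay smooth even when annotated boundaries are noisy.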
When to use TBTFormer vs ActionFormer:
- TBTFormer: Datasets with annotation noise, ambiguous boundaries
- ActionFormer: Clean annotations, well-defined action boundaries
```yaml
# TBTFormer config
model:
  meta_arch: "TBTFormer"
  fpn_type: "cs_fpn"  # Cross-scale FPN recommended
  reg_max: 16         # Distribution bins (default: 16)
  dfl_weight: 0.25    # DFL loss weight
  # Optional: Scaled backbone (TBT-Former paper settings)
  backbone:
    n_head: 16        # Default: 8
    n_hidden: 1536    # 6x embed dim (default: 4x = 1024)
```

| Component | Description |
|---|---|
| Config Validation | Catch misconfigurations early (FPN alignment, sequence divisibility) |
| Adaptive Ranges | Compute regression ranges from annotation distribution |
| Loss Logger | Moving averages with IoU stats for debugging |
| Temporal Constraints | Post-process with min gap, duration limits, overlap rules |
```python
from actionformer import validate_config, compute_adaptive_ranges, TemporalConstraints

validate_config(cfg)  # Raises on invalid config
ranges = compute_adaptive_ranges('annotations.json', num_levels=6)
filtered = TemporalConstraints(min_gap={0: 5.0}).apply(results)
```

This fork is based on the original ActionFormer by Zhang et al. (ECCV 2022):
- Original Repository: happyharrycn/actionformer_release
- Paper: ActionFormer: Localizing Moments of Actions with Transformers
- Authors: Chen-Lin Zhang, Jianxin Wu, Yin Li
If you use this code, please cite the original paper:
```bibtex
@inproceedings{zhang2022actionformer,
  title={ActionFormer: Localizing Moments of Actions with Transformers},
  author={Zhang, Chen-Lin and Wu, Jianxin and Li, Yin},
  booktitle={European Conference on Computer Vision},
  pages={492--510},
  year={2022}
}
```

ActionFormer is one of the first Transformer-based models for temporal action localization: detecting the onsets and offsets of action instances and recognizing their action categories. Without bells and whistles, ActionFormer achieves 71.0% mAP at tIoU=0.5 on THUMOS14, outperforming the best prior model by 14.1 absolute percentage points. Further, ActionFormer demonstrates strong results on ActivityNet 1.3 (36.56% average mAP) and EPIC-Kitchens 100 (+13.5% average mAP over prior works).
The model adapts local self-attention to model temporal context in untrimmed videos, classifies every moment in an input video, and regresses their corresponding action boundaries. The result is a deep model trained using standard classification and regression loss that can localize moments of actions in a single shot.
Related projects:
SnAG: Scalable and Accurate Video Grounding
Fangzhou Mu*, Sicheng Mo*, Yin Li
CVPR 2024
- 11/18/2022: We have released the tech report for our submission to the Ego4D Moment Queries (MQ) Challenge. The code repo now includes config files, pre-trained models and results on the Ego4D MQ benchmark.
- 08/29/2022: Updated arXiv version.
- 08/01/2022: Updated code repo with latest results on ActivityNet.
- 07/08/2022: The paper is accepted to ECCV 2022.
- 05/09/2022: Pre-trained models have been updated.
- 05/08/2022: We have updated the code repo based on community feedback and our code review, leading to significantly better average mAP on THUMOS14 (>66.0%) and slightly improved results on ActivityNet and EPIC-Kitchens 100.
The structure of this code repo is heavily inspired by Detectron2. Some of the main components are
- ./libs/core: Parameter configuration module.
- ./libs/datasets: Data loader and IO module.
- ./libs/modeling: Our main model with all its building blocks.
- ./libs/utils: Utility functions for training, inference, and postprocessing.
- Follow INSTALL.md for installing necessary dependencies and compiling the code.
- See FAQ.md.
Download Features and Annotations
- Download thumos.tar.gz (md5sum: 375f76ffbf7447af1035e694971ec9b2) from this Box link, this Google Drive link, or this BaiduYun link.
- The file includes I3D features, action annotations in json format (similar to ActivityNet annotation format), and external classification scores.
Details: The features are extracted from two-stream I3D models pretrained on Kinetics using clips of 16 frames at the video frame rate (~30 fps) and a stride of 4 frames. This gives one feature vector per 4/30 ~= 0.1333 seconds.
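The stride arithmetic above can be checked with a tiny helper (a hypothetical function of ours; the repo's data loader defines the exact clip-center convention):

```python
def feature_timestamps(num_feats, stride=4, clip_len=16, fps=30.0):
    # Map the k-th I3D feature to a timestamp: the k-th clip covers
    # frames [k*stride, k*stride + clip_len), so we take its center.
    return [(k * stride + clip_len / 2) / fps for k in range(num_feats)]

ts = feature_timestamps(4)
print([round(t, 3) for t in ts])  # successive features are 4/30 ≈ 0.133 s apart
```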
Unpack Features and Annotations
- Unpack the file under ./data (or elsewhere and link to ./data).
- The folder structure should look like
```
This folder
│ README.md
│ ...
│
└───data/
│ └───thumos/
│ │ └───annotations
│ │ └───i3d_features
│ └───...
│
└───libs
│
│ ...
```
Training and Evaluation
- Train our ActionFormer with I3D features. This will create an experiment folder under ./ckpt that stores training config, logs, and checkpoints.
```shell
python ./train.py ./configs/thumos_i3d.yaml --output reproduce
```

[Optional] Multi-GPU Training with DDP
For faster training on multiple GPUs, use DistributedDataParallel (DDP):
```shell
# Basic 4-GPU training
torchrun --nproc_per_node=4 ./train_ddp.py ./configs/thumos_i3d.yaml --output reproduce

# With AMP (mixed precision) for ~2x faster training
torchrun --nproc_per_node=4 ./train_ddp.py ./configs/thumos_i3d.yaml --amp --output reproduce

# Full example with all features
torchrun --nproc_per_node=4 ./train_ddp.py ./configs/thumos_i3d.yaml \
    --amp --accum-steps 2 --eval-freq 5 --output reproduce
```

[Optional] Modernized Transformer (v2)
To use the v2 backbone with Flash Attention, RoPE, RMSNorm, and SwiGLU:
```yaml
model:
  backbone_type: convTransformerv2
  backbone:
    use_rope: true
    use_flash_attn: true
    use_swiglu: true
    use_rms_norm: true
```

- [Optional] Monitor the training using TensorBoard

```shell
tensorboard --logdir=./ckpt/thumos_i3d_reproduce/logs
```

- Evaluate the trained model. The expected average mAP should be around 62.6(%) as in Table 1 of our main paper. With recent commits, the expected average mAP should be higher than 66.0(%).

```shell
python ./eval.py ./configs/thumos_i3d.yaml ./ckpt/thumos_i3d_reproduce
```

- Training our model on THUMOS requires ~4.5GB GPU memory, yet the inference might require over 10GB GPU memory. We recommend using a GPU with at least 12 GB of memory.
[Optional] Evaluating Our Pre-trained Model
We also provide a pre-trained model for THUMOS 14. The model with all training logs can be downloaded from this Google Drive link. To evaluate the pre-trained model, please follow the steps listed below.
- Create a folder ./pretrained and unpack the file under ./pretrained (or elsewhere and link to ./pretrained).
- The folder structure should look like
```
This folder
│ README.md
│ ...
│
└───pretrained/
│ └───thumos_i3d_reproduce/
│ │ └───thumos_reproduce_log.txt
│ │ └───thumos_reproduce_results.txt
│ │ └───...
│ └───...
│
└───libs
│
│ ...
```
- The training config is recorded in ./pretrained/thumos_i3d_reproduce/config.txt.
- The training log is located at ./pretrained/thumos_i3d_reproduce/thumos_reproduce_log.txt and also ./pretrained/thumos_i3d_reproduce/logs.
- The pre-trained model is ./pretrained/thumos_i3d_reproduce/epoch_034.pth.tar.
- Evaluate the pre-trained model.
```shell
python ./eval.py ./configs/thumos_i3d.yaml ./pretrained/thumos_i3d_reproduce/
```

- The results (mAP at tIoUs) should be
| Method | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | Avg |
|---|---|---|---|---|---|---|
| ActionFormer | 82.13 | 77.80 | 70.95 | 59.40 | 43.87 | 66.83 |
Download Features and Annotations
- Download anet_1.3.tar.gz (md5sum: c415f50120b9425ee1ede9ac3ce11203) from this Box link, this Google Drive Link, or this BaiduYun Link.
- The file includes TSP features, action annotations in json format (similar to ActivityNet annotation format), and external classification scores.
Details: The features are extracted from the R(2+1)D-34 model pretrained with TSP on ActivityNet using clips of 16 frames at a frame rate of 15 fps and a stride of 16 frames (i.e., non-overlapping clips). This gives one feature vector per 16/15 ~= 1.067 seconds. The features are converted into numpy files for our code.
Unpack Features and Annotations
- Unpack the file under ./data (or elsewhere and link to ./data).
- The folder structure should look like
```
This folder
│ README.md
│ ...
│
└───data/
│ └───anet_1.3/
│ │ └───annotations
│ │ └───tsp_features
│ └───...
│
└───libs
│
│ ...
```
Training and Evaluation
- Train our ActionFormer with TSP features. This will create an experiment folder under ./ckpt that stores training config, logs, and checkpoints.
```shell
python ./train.py ./configs/anet_tsp.yaml --output reproduce
```

- [Optional] Monitor the training using TensorBoard

```shell
tensorboard --logdir=./ckpt/anet_tsp_reproduce/logs
```

- Evaluate the trained model. The expected average mAP should be around 36.5(%) as in Table 1 of our main paper.

```shell
python ./eval.py ./configs/anet_tsp.yaml ./ckpt/anet_tsp_reproduce
```

- Training our model on ActivityNet requires ~4.6GB GPU memory, yet the inference might require over 10GB GPU memory. We recommend using a GPU with at least 12 GB of memory.
[Optional] Evaluating Our Pre-trained Model
We also provide a pre-trained model for ActivityNet 1.3. The model with all training logs can be downloaded from this Google Drive link. To evaluate the pre-trained model, please follow the steps listed below.
- Create a folder ./pretrained and unpack the file under ./pretrained (or elsewhere and link to ./pretrained).
- The folder structure should look like
```
This folder
│ README.md
│ ...
│
└───pretrained/
│ └───anet_tsp_reproduce/
│ │ └───anet_tsp_reproduce_log.txt
│ │ └───anet_tsp_reproduce_results.txt
│ │ └───...
│ └───...
│
└───libs
│
│ ...
```
- The training config is recorded in ./pretrained/anet_tsp_reproduce/config.txt.
- The training log is located at ./pretrained/anet_tsp_reproduce/anet_tsp_reproduce_log.txt and also ./pretrained/anet_tsp_reproduce/logs.
- The pre-trained model is ./pretrained/anet_tsp_reproduce/epoch_014.pth.tar.
- Evaluate the pre-trained model.
```shell
python ./eval.py ./configs/anet_tsp.yaml ./pretrained/anet_tsp_reproduce/
```

- The results (mAP at tIoUs) should be
| Method | 0.5 | 0.75 | 0.95 | Avg |
|---|---|---|---|---|
| ActionFormer | 54.67 | 37.81 | 8.36 | 36.56 |
[Optional] Reproducing Our Results with I3D Features
- Download anet_1.3_i3d.tar.gz (md5sum: e649425954e0123401650312dd0d56a7) from this Google Drive Link.
Details: The features are extracted from the I3D model pretrained on Kinetics using clips of 16 frames at a frame rate of 25 fps and a stride of 16 frames. This gives one feature vector per 16/25 = 0.64 seconds. The features are converted into numpy files for our code.
- Unpack the file under ./data (or elsewhere and link to ./data), similar to TSP features.
- Train our ActionFormer with I3D features. This will create an experiment folder under ./ckpt that stores training config, logs, and checkpoints.

```shell
python ./train.py ./configs/anet_i3d.yaml --output reproduce
```

- Evaluate the trained model. The expected average mAP should be around 36.0(%). This is slightly improved from our paper, owing to a better training scheme / hyperparameters (see comments in the config file).

```shell
python ./eval.py ./configs/anet_i3d.yaml ./ckpt/anet_i3d_reproduce
```

- The pre-trained model with all training logs can be downloaded from this Google Drive link. To reproduce the results, create a folder ./pretrained, unpack the file under ./pretrained (or elsewhere and link to ./pretrained), and run

```shell
python ./eval.py ./configs/anet_i3d.yaml ./pretrained/anet_i3d_reproduce/
```

- The results (mAP at tIoUs) with I3D features should be
| Method | 0.5 | 0.75 | 0.95 | Avg |
|---|---|---|---|---|
| ActionFormer | 54.29 | 36.71 | 8.24 | 36.03 |
Download Features and Annotations
- Download epic_kitchens.tar.gz (md5sum: add9803756afd9a023bc9a9c547e0229) from this Box link, this Google Drive Link, or this BaiduYun Link.
- The file includes SlowFast features as well as action annotations in json format (similar to ActivityNet annotation format).
Details: The features are extracted from the SlowFast model pretrained on the training set of EPIC Kitchens 100 (action classification) using clips of 32 frames at a frame rate of 30 fps and a stride of 16 frames. This gives one feature vector per 16/30 ~= 0.5333 seconds.
Unpack Features and Annotations
- Unpack the file under ./data (or elsewhere and link to ./data).
- The folder structure should look like
```
This folder
│ README.md
│ ...
│
└───data/
│ └───epic_kitchens/
│ │ └───annotations
│ │ └───features
│ └───...
│
└───libs
│
│ ...
```
Training and Evaluation
- On EPIC Kitchens, we train separate models for nouns and verbs.
- To train our ActionFormer on verbs with SlowFast features, use
```shell
python ./train.py ./configs/epic_slowfast_verb.yaml --output reproduce
```

- To train our ActionFormer on nouns with SlowFast features, use

```shell
python ./train.py ./configs/epic_slowfast_noun.yaml --output reproduce
```

- Evaluate the trained model for verbs. The expected average mAP should be around 23.4(%) as in Table 2 of our main paper.

```shell
python ./eval.py ./configs/epic_slowfast_verb.yaml ./ckpt/epic_slowfast_verb_reproduce
```

- Evaluate the trained model for nouns. The expected average mAP should be around 21.9(%) as in Table 2 of our main paper.

```shell
python ./eval.py ./configs/epic_slowfast_noun.yaml ./ckpt/epic_slowfast_noun_reproduce
```

- Training our model on EPIC Kitchens requires ~4.5GB GPU memory, yet the inference might require over 10GB GPU memory. We recommend using a GPU with at least 12 GB of memory.
[Optional] Evaluating Our Pre-trained Model
We also provide a pre-trained model for EPIC-Kitchens 100. The model with all training logs can be downloaded from this Google Drive link (verb), and from this Google Drive link (noun). To evaluate the pre-trained model, please follow the steps listed below.
- Create a folder ./pretrained and unpack the file under ./pretrained (or elsewhere and link to ./pretrained).
- The folder structure should look like
```
This folder
│ README.md
│ ...
│
└───pretrained/
│ └───epic_slowfast_verb_reproduce/
│ │ └───epic_slowfast_verb_reproduce_log.txt
│ │ └───epic_slowfast_verb_reproduce_results.txt
│ │ └───...
│ └───epic_slowfast_noun_reproduce/
│ │ └───epic_slowfast_noun_reproduce_log.txt
│ │ └───epic_slowfast_noun_reproduce_results.txt
│ │ └───...
│ └───...
│
└───libs
│
│ ...
```
- The training config is recorded in ./pretrained/epic_slowfast_(verb|noun)_reproduce/config.txt.
- The training log is located at ./pretrained/epic_slowfast_(verb|noun)_reproduce/epic_slowfast_(verb|noun)_reproduce_log.txt and also ./pretrained/epic_slowfast_(verb|noun)_reproduce/logs.
- The pre-trained model is ./pretrained/epic_slowfast_(verb|noun)_reproduce/epoch_020.pth.tar.
- Evaluate the pre-trained model for verbs.
```shell
python ./eval.py ./configs/epic_slowfast_verb.yaml ./pretrained/epic_slowfast_verb_reproduce/
```

- Evaluate the pre-trained model for nouns.

```shell
python ./eval.py ./configs/epic_slowfast_noun.yaml ./pretrained/epic_slowfast_noun_reproduce/
```

- The results (mAP at tIoUs) should be
| Method | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | Avg |
|---|---|---|---|---|---|---|
| ActionFormer (verb) | 26.58 | 25.42 | 24.15 | 22.29 | 19.09 | 23.51 |
| ActionFormer (noun) | 25.21 | 24.11 | 22.66 | 20.47 | 16.97 | 21.88 |
Download Features and Annotations
- Download the official SlowFast and Omnivore features from the Ego4D website and the official EgoVLP features from this link. Please note that we are not authorized to release the features and annotations. Instead, we provide our script for feature and annotation conversion at ./tools/convert_ego4d_trainval.py.
Details: All features are extracted at 1.875 fps from videos at 30 fps. This gives one feature vector per ~0.5333 seconds. Please refer to Ego4D and EgoVLP's documentation for more details on feature extraction.
Unpack Features and Annotations
- Unpack the file under ./data (or elsewhere and link to ./data).
- The folder structure should look like
```
This folder
│ README.md
│ ...
│
└───data/
│ └───ego4d/
│ │ └───annotations
│ │ └───slowfast_features
│ │ └───omnivore_features
│ │ └───egovlp_features
│ └───...
│
└───libs
│
│ ...
```
Training and Evaluation
- We provide config files for training ActionFormer with different feature combinations. For example, training on Omnivore and EgoVLP features will create an experiment folder under ./ckpt that stores training config, logs, and checkpoints.
```shell
python ./train.py ./configs/ego4d_omnivore_egovlp.yaml --output reproduce
```

- [Optional] Monitor the training using TensorBoard

```shell
tensorboard --logdir=./ckpt/ego4d_omnivore_egovlp_reproduce/logs
```

- Evaluate the trained model. The expected average mAP and Recall@1x at tIoU=0.5 should be around 22.0(%) and 40.0(%), respectively.

```shell
python ./eval.py ./configs/ego4d_omnivore_egovlp.yaml ./ckpt/ego4d_omnivore_egovlp_reproduce
```

- Training our model on Ego4D with all three features requires ~4.5GB GPU memory, yet the inference might require over 10GB GPU memory. We recommend using a GPU with at least 12 GB of memory.
[Optional] Evaluating Our Pre-trained Model
We also provide pre-trained models for Ego4D trained with all feature combinations. The models with all training logs can be downloaded from this Google Drive link. To evaluate the pre-trained model, please follow the steps listed below.
- Create a folder ./pretrained and unpack the file under ./pretrained (or elsewhere and link to ./pretrained).
- An example of the folder structure should look like
```
This folder
│ README.md
│ ...
│
└───pretrained/
│ └───ego4d_omnivore_egovlp_reproduce/
│ │ └───ego4d_omnivore_egovlp_reproduce_log.txt
│ │ └───ego4d_omnivore_egovlp_reproduce_results.txt
│ │ └───...
│ └───...
│
└───libs
│
│ ...
```
- The training config is recorded in ./pretrained/ego4d_omnivore_egovlp_reproduce/config.txt.
- The training log is located at ./pretrained/ego4d_omnivore_egovlp_reproduce/ego4d_omnivore_egovlp_reproduce_log.txt and also ./pretrained/ego4d_omnivore_egovlp_reproduce/logs.
- The pre-trained model is ./pretrained/ego4d_omnivore_egovlp_reproduce/epoch_010.pth.tar.
- Evaluate the pre-trained model.
```shell
python ./eval.py ./configs/ego4d_omnivore_egovlp.yaml ./pretrained/ego4d_omnivore_egovlp_reproduce/
```

- The results (mAP at tIoUs) should be
| Method | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | Avg |
|---|---|---|---|---|---|---|
| ActionFormer (S) | 20.09 | 17.45 | 14.44 | 12.46 | 10.00 | 14.89 |
| ActionFormer (O) | 23.87 | 20.78 | 18.39 | 15.33 | 12.65 | 18.20 |
| ActionFormer (E) | 26.84 | 23.86 | 20.57 | 17.19 | 14.54 | 20.60 |
| ActionFormer (S+E) | 27.98 | 24.46 | 21.21 | 18.56 | 15.60 | 21.56 |
| ActionFormer (O+E) | 27.99 | 24.94 | 21.94 | 19.05 | 15.98 | 21.98 |
| ActionFormer (S+O+E) | 28.26 | 24.69 | 21.88 | 19.35 | 16.28 | 22.09 |
- The results (Recall@1x at tIoUs) should be
| Method | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | Avg |
|---|---|---|---|---|---|---|
| ActionFormer (S) | 52.25 | 45.84 | 40.60 | 36.58 | 31.33 | 41.32 |
| ActionFormer (O) | 54.63 | 48.72 | 43.03 | 37.76 | 33.57 | 43.54 |
| ActionFormer (E) | 59.53 | 54.39 | 48.97 | 42.75 | 37.12 | 48.55 |
| ActionFormer (S+E) | 59.96 | 53.75 | 48.76 | 44.00 | 38.96 | 49.09 |
| ActionFormer (O+E) | 61.03 | 54.15 | 49.79 | 45.17 | 39.88 | 49.99 |
| ActionFormer (S+O+E) | 60.85 | 54.16 | 49.60 | 45.12 | 39.87 | 49.92 |
Work in progress. Stay tuned.
- Original Authors: Yin Li (yin.li@wisc.edu)
- This Fork: sreevadde/actionformer_release
Please cite the original ActionFormer paper:
```bibtex
@inproceedings{zhang2022actionformer,
  title={ActionFormer: Localizing Moments of Actions with Transformers},
  author={Zhang, Chen-Lin and Wu, Jianxin and Li, Yin},
  booktitle={European Conference on Computer Vision},
  pages={492--510},
  year={2022}
}
```

If you cite results on Ego4D, please also cite the tech report:
```bibtex
@article{mu2022actionformerego4d,
  title={Where a Strong Backbone Meets Strong Features -- ActionFormer for Ego4D Moment Queries Challenge},
  author={Mu, Fangzhou and Mo, Sicheng and Wang, Gillian and Li, Yin},
  journal={arXiv e-prints},
  year={2022}
}
```
If you are using TSP features, please cite
```bibtex
@inproceedings{alwassel2021tsp,
  title={{TSP}: Temporally-sensitive pretraining of video encoders for localization tasks},
  author={Alwassel, Humam and Giancola, Silvio and Ghanem, Bernard},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops},
  pages={3173--3183},
  year={2021}
}
```
