A hardware-software co-design system for event-based object detection using Dynamic Vision Sensors (DVS). The software trains a recurrent encoder to compress raw event streams into compact spatial representations, which are fed into a YOLOX-based object detector. The hardware implements the encoder on a Xilinx FPGA for edge deployment.
RecRepEvent/
├── SW/                       # Software component
│   ├── config/               # YAML configuration files (general, gen1, gen4)
│   ├── data/                 # Dataset loaders and preprocessing
│   ├── models/               # Encoder, decoder, detection models
│   ├── representations/      # Event representation implementations
│   ├── utils/                # Packing, quantization, scheduling utilities
│   ├── YOLOX/                # Integrated YOLOX detection framework
│   ├── checkpoints/          # Saved model weights
│   ├── train_encoder.py      # Autoencoder training (float32 + QAT)
│   ├── train.py              # Object detection training
│   ├── test.py               # Evaluation script
│   ├── preprocess_rep.py     # Offline representation preprocessing
│   ├── visualisation.py      # Encoder embedding visualization
│   └── Dockerfile            # CUDA 12.8 + cuDNN container
│
└── HW/                       # Hardware component
    ├── EventRecRep.xpr       # Xilinx Vivado project
    └── EventRecRep.srcs/     # HDL source files and IP cores
Raw DVS events (t, x, y, p) are packed per-pixel and processed by a 3-layer stacked MyGRUCell:
Raw Events (t, x, y, p)
→ pack_events_parallel() # Sort by (pixel, time); shape: (L, K, 2)
→ MyGRU Encoder (3 layers, hidden=12)
→ unpack_embeddings() # Sparse → dense; shape: (12, H, W)
The 12-channel embedding is the input to the detection backbone.
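The packing step can be illustrated with a small plain-Python sketch. The function below is a hypothetical stand-in, not the repository's `pack_events_parallel()`. It assumes that `L` is the largest number of events at any active pixel, that `K` is the number of active pixels, and that the two features per event are timestamp and polarity. Zero padding is also an assumption.

```python
from collections import defaultdict

def pack_events_sketch(events, width):
    """Group (t, x, y, p) events by pixel, sort each group by time, zero-pad.

    Returns (packed, pixels): packed[l][k] == [t, p] of the l-th event at the
    k-th active pixel (assumed layout (L, K, 2)); pixels[k] == (x, y).
    """
    per_pixel = defaultdict(list)
    for t, x, y, p in events:
        per_pixel[y * width + x].append((t, p))   # flat pixel index as key

    keys = sorted(per_pixel)                      # deterministic pixel order
    for key in keys:
        per_pixel[key].sort()                     # time order within a pixel

    seq_len = max((len(per_pixel[k]) for k in keys), default=0)
    packed = [[[0.0, 0.0] for _ in keys] for _ in range(seq_len)]
    for k, key in enumerate(keys):
        for l, (t, p) in enumerate(per_pixel[key]):
            packed[l][k] = [float(t), float(p)]
    pixels = [(key % width, key // width) for key in keys]
    return packed, pixels


# Three events on two pixels; pixel (1, 0) receives two events out of order.
events = [(30, 1, 0, 1), (10, 1, 0, 0), (20, 2, 1, 1)]
packed, pixels = pack_events_sketch(events, width=304)
print(len(packed), len(packed[0]), pixels)   # 2 2 [(1, 0), (2, 1)]
```

Keeping the pixel coordinates next to the packed tensor is what would let an unpacking step scatter the per-pixel embeddings back into a dense (12, H, W) map.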
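GRU Cell Equations (standard GRU formulation, as in PyTorch's `torch.nn.GRUCell`; assumed here for MyGRUCell):

$$
\begin{aligned}
r_t &= \sigma(W_{ir} x_t + b_{ir} + W_{hr} h_{t-1} + b_{hr}) \\
z_t &= \sigma(W_{iz} x_t + b_{iz} + W_{hz} h_{t-1} + b_{hz}) \\
n_t &= \tanh\big(W_{in} x_t + b_{in} + r_t \odot (W_{hn} h_{t-1} + b_{hn})\big) \\
h_t &= (1 - z_t) \odot n_t + z_t \odot h_{t-1}
\end{aligned}
$$

Where $x_t$ is the input for the current event, $h_{t-1}$ is the previous hidden state, $r_t$, $z_t$ and $n_t$ are the reset gate, update gate and candidate state, $\sigma$ is the sigmoid function, and $\odot$ is the element-wise product.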
A ResNet18/34/50 backbone (modified for 12-channel input) feeds into a YOLOX detection head trained to detect car and pedestrian classes on the Gen1 dataset.
After float32 pretraining, the encoder is fine-tuned with 8-bit quantization-aware training (QAT) for integer-only FPGA inference. Observer modules track activation statistics, and lookup tables (LUTs) replace sigmoid and tanh so that the activations map directly to hardware.
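To show how a LUT can stand in for sigmoid in an integer-only pipeline, here is a minimal sketch. The input range of about [-8, 8], the unsigned 8-bit output, and all names (`quantize`, `SIGMOID_LUT`, `lut_sigmoid`) are assumptions chosen for illustration; the repository's scales and table layout may differ.

```python
import math

NUM_BITS = 8
QMAX = 2 ** (NUM_BITS - 1) - 1          # 127 for signed 8-bit
IN_SCALE = 8.0 / QMAX                   # assumed: int8 input covers about [-8, 8]
OUT_SCALE = 1.0 / 255                   # assumed: uint8 output covers [0, 1]

def quantize(x, scale, lo, hi):
    """Round x / scale to the nearest integer and clamp to [lo, hi]."""
    return max(lo, min(hi, round(x / scale)))

# One table entry per possible int8 input code (-128 .. 127).
SIGMOID_LUT = [
    quantize(1.0 / (1.0 + math.exp(-q * IN_SCALE)), OUT_SCALE, 0, 255)
    for q in range(-QMAX - 1, QMAX + 1)
]

def lut_sigmoid(q_in):
    """Integer-only sigmoid: a single table lookup on an int8 code."""
    return SIGMOID_LUT[q_in + QMAX + 1]

x = 1.5
q = quantize(x, IN_SCALE, -QMAX - 1, QMAX)   # float -> int8 code
approx = lut_sigmoid(q) * OUT_SCALE          # uint8 code -> float, for comparison only
exact = 1.0 / (1.0 + math.exp(-x))
print(q, round(approx, 3), round(exact, 3))
```

At inference time only the lookup runs on the device; the floating-point sigmoid is evaluated once, offline, when the table is generated.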
The representations/ module supports multiple event-to-frame conversion strategies:
| Name | Channels | Description |
|---|---|---|
| Encoder | 12 | Recurrent GRU embedding (trained) |
| ToImage | 2 | Polarity-separated accumulation |
| ToVoxelGrid | 12 | Temporal binning |
| MixedDensityEventStack | 12 | Multi-window aggregation (max, sum, mean, var) |
| ToTimesurface | 2 | Exponential decay time surface |
| TORE | - | Temporal Ordering Representation |
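As a point of comparison for the learned encoder, the simplest entry in the table, polarity-separated accumulation, can be sketched as below. `to_image_sketch` is a hypothetical helper written for illustration, not the repository's `ToImage` class, and the choice of channel 0 for negative and channel 1 for positive polarity is an assumption.

```python
def to_image_sketch(events, width, height):
    """Count events per pixel in two channels: index 0 for p == 0, 1 for p == 1.

    Returns a nested list with shape (2, height, width).
    """
    frame = [[[0] * width for _ in range(height)] for _ in range(2)]
    for _t, x, y, p in events:
        frame[1 if p > 0 else 0][y][x] += 1
    return frame


events = [(10, 0, 0, 1), (20, 0, 0, 1), (30, 2, 1, 0)]
frame = to_image_sketch(events, width=3, height=2)
print(frame[1][0][0], frame[0][1][2])   # 2 1
```

This representation discards all timing within the window, which is the information the recurrent encoder is trained to retain.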
cd SW
docker build -t rec-rep .
docker run -it --rm --gpus all -v $(pwd):/workspace -w /workspace rec-rep
conda create -n dvs_rec python=3.10
conda activate dvs_rec
# PyTorch nightly with CUDA 12.8
pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu128
# HDF5 support
conda install h5py
conda install blosc-hdf5-plugin -c conda-forge
# Other dependencies
pip install omegaconf opencv-python matplotlib psutil wandb lightning tonic numba pycocotools
# YOLOX
git clone https://github.com/Megvii-BaseDetection/YOLOX
cd YOLOX && pip install -v -e .
train_encoder.py trains a recurrent autoencoder on raw event streams, then fine-tunes it with QAT:
python train_encoder.py --config config/gen1.yaml
Training proceeds in two stages:
- Stage 1: Float32 training for 100 epochs (MSE loss on reconstructed timestamps and polarity)
- Stage 2: QAT fine-tuning for 10 epochs with 8-bit quantization
| Parameter | Value |
|---|---|
| Optimizer | Adam |
| Learning rate | 0.001 |
| Weight decay | 0.0001 |
| Validation interval | 2 epochs |
| Loss | MSE (time ×1.0 + polarity ×0.1) |
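The weighted loss in the table can be written out as a short sketch. The function names are illustrative, and the sketch ignores any masking of padded positions, which the actual training code may apply.

```python
TIME_WEIGHT = 1.0       # weight on the timestamp reconstruction term
POLARITY_WEIGHT = 0.1   # weight on the polarity reconstruction term

def mse(pred, target):
    """Mean squared error over two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(pred, target)) / len(pred)

def reconstruction_loss(pred_t, true_t, pred_p, true_p):
    """MSE(time) * 1.0 + MSE(polarity) * 0.1."""
    return TIME_WEIGHT * mse(pred_t, true_t) + POLARITY_WEIGHT * mse(pred_p, true_p)


loss = reconstruction_loss([0.1, 0.5], [0.0, 0.5], [0.8, 0.2], [1.0, 0.0])
print(round(loss, 6))   # 0.009
```

With these weights, a given squared error on a timestamp costs ten times as much as the same squared error on polarity.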
Pre-compute and cache event representations to disk for faster detection training:
python preprocess_rep.py --config config/gen1.yaml
train.py trains YOLOX with a ResNet backbone on the selected representation:
python train.py --config config/gen1.yaml
| Parameter | Value |
|---|---|
| Epochs | 100 |
| Warmup epochs | 5 |
| No-augmentation epochs | last 15 |
| Precision | 16-mixed |
| EMA decay | 0.9998 |
| Gradient clip | 1.0 |
| Metrics | mAP, mAP50, mAP75 |
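For readers unfamiliar with the EMA decay setting, the sketch below shows a plain exponential moving average of weights with the decay from the table. It is a generic formulation under the assumption of a constant decay; the training code may ramp the decay up during early iterations, as some EMA implementations do.

```python
EMA_DECAY = 0.9998

def ema_update(ema_weights, weights, decay=EMA_DECAY):
    """One EMA step: ema <- decay * ema + (1 - decay) * current weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]


ema = [0.0, 1.0]
for _ in range(3):                      # three optimizer steps with fixed weights
    ema = ema_update(ema, [1.0, 1.0])
print(ema)
```

With a decay this close to 1, the averaged weights move very slowly, so the average effectively spans several thousand optimizer steps.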
Configs are in config/ using OmegaConf YAML:
# config/gen1.yaml (example)
name: Gen1
sensor_size: {W: 304, H: 240}
image_size: {W: 224, H: 224}
time_window: 200000 # microseconds
num_events: 50000
encoder:
  rnn_type: MyGRU
  hidden_size: [12, 12, 12]
  num_bits: 8 # quantization bits
representation: Encoder # or ToImage, ToVoxelGrid, etc.
batch_size: 8
num_workers: 8
The HW/ directory contains a Xilinx Vivado project implementing the quantized GRU encoder in RTL. The design uses:
- Distributed memory and arithmetic IP cores
- 8-bit fixed-point arithmetic matching the QAT model
- LUT-based sigmoid/tanh approximations
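The 8-bit fixed-point arithmetic can be pictured as an integer multiply-accumulate followed by rescaling. The sketch below is a software model written for illustration, not a description of the RTL: the power-of-two rescaling via `shift` and the function name are assumptions.

```python
def int8_dot_requant(weights, activations, shift):
    """Integer dot product with a wide accumulator, then scale back to int8.

    `shift` (>= 1) stands in for the rescaling factor as a power of two.
    """
    acc = 0                                        # wide accumulator
    for w, a in zip(weights, activations):
        acc += w * a                               # int8 x int8 product
    rounded = (acc + (1 << (shift - 1))) >> shift  # round to nearest, then shift
    return max(-128, min(127, rounded))            # saturate to signed 8-bit


print(int8_dot_requant([127, -64, 3], [100, 50, -7], shift=7))   # 74
```

Saturating instead of wrapping on overflow is the behaviour the software model would have to share with the hardware for the two to agree bit for bit.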
The code targets the Gen1 Prophesee event camera dataset (304×240 resolution). Data is stored in HDF5 format and loaded lazily via h5py. Experiment tracking is handled by Weights & Biases.
If you find these resources useful, please cite the paper:
@inproceedings{jeziorek2025self,
title={Self-supervised event representations: towards accurate, real-time perception on SoC FPGAs},
author={Jeziorek, Kamil and Kryjak, Tomasz},
booktitle={Real-time Processing of Image, Depth, and Video Information 2025},
volume={13526},
pages={1352602},
year={2025},
organization={SPIE}
}