A proof-of-concept implementation of BLIP-2's vision-language alignment method adapted for video captioning. This project demonstrates how a frozen vision encoder (V-JEPA 2) and frozen language model (Qwen2-1.5B) can be efficiently aligned using a lightweight Q-Former module trained on the Something-Something-V2 dataset.
Here are some sample GIFs with their predicted captions:
This approach:
- Uses Meta's V-JEPA 2 as the frozen vision encoder
- Uses Qwen2-1.5B as the frozen language model
- Trains only a Q-Former module (~40M parameters) to bridge vision and language modalities
- Achieves strong video captioning results with minimal trainable parameters
The key insight from BLIP-2 is that by keeping both the vision and language models frozen, we can learn an efficient alignment with far fewer parameters and compute than end-to-end training.
Architecture diagram adapted from BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
This project adapts the BLIP-2 architecture for video captioning. The key adaptation is replacing the single-image input with 32-frame video sequences:

- Video Encoder (Frozen): Instead of processing a single image through a frozen image encoder, we process 32-frame video clips through V-JEPA 2, Meta's video understanding model. This produces spatiotemporal features that capture both appearance and motion.
- Q-Former (Trainable): The Q-Former module remains the core trainable component. It uses learned queries to extract the most relevant visual information from the video features, compressing the 32-frame sequence into a fixed set of query embeddings that bridge the vision and language modalities.
- LLM Decoder (Frozen): The Q-Former outputs are fed into a frozen Qwen2-1.5B language model, which generates natural language captions describing the video content.
By keeping both the video encoder and language model frozen, we only need to train the lightweight Q-Former (~40M parameters) to learn the vision-language alignment. This makes the approach highly parameter-efficient while leveraging the strong representations from both pretrained models.
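Here is a minimal sketch of that wiring, assuming illustrative module names (`vision_encoder`, `qformer`, and `llm` are placeholders, not the actual classes in this repo):

```python
import torch
import torch.nn as nn


class VideoCaptioner(nn.Module):
    """Illustrative wiring: frozen video encoder -> trainable Q-Former -> frozen LLM."""

    def __init__(self, vision_encoder, qformer, llm):
        super().__init__()
        self.vision_encoder = vision_encoder
        self.qformer = qformer  # the only module that receives gradients
        self.llm = llm

        # Freeze both pretrained models; only the Q-Former is trained.
        for p in self.vision_encoder.parameters():
            p.requires_grad = False
        for p in self.llm.parameters():
            p.requires_grad = False

    def forward(self, video_frames, caption_embeds):
        # Spatiotemporal features for a 32-frame clip, computed without gradients.
        with torch.no_grad():
            video_feats = self.vision_encoder(video_frames)

        # Learned queries cross-attend to the video features and are projected
        # into the LLM embedding space, acting as a soft visual prefix.
        query_embeds = self.qformer(video_feats)

        # Prepend the visual prefix to the caption token embeddings and let the
        # frozen LLM predict the caption.
        inputs_embeds = torch.cat([query_embeds, caption_embeds], dim=1)
        return self.llm(inputs_embeds=inputs_embeds)


def count_trainable(model: nn.Module) -> int:
    """Sanity check that only the Q-Former's ~40M parameters require gradients."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
```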
```bash
uv sync
# or
pip install -e .
```

Download the Something-Something-V2 dataset from the official website.
Since V-JEPA 2 is frozen during training, precompute the video features ahead of time; this keeps the vision model out of VRAM during training and speeds it up:
```bash
python preprocess_something_something_v2.py
```

Update the paths in the script to point to your dataset location. This will extract V-JEPA 2 features for all videos and save them as `.pt` files.
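Conceptually, the preprocessing does something like the following (a simplified sketch; `load_vjepa` and `read_frames` are hypothetical helpers standing in for the actual V-JEPA 2 loading and 32-frame sampling code in the script):

```python
import torch
from pathlib import Path

VIDEO_DIR = Path("/path/to/ssv2/videos")      # point this at your dataset
FEATURE_DIR = Path("/path/to/ssv2/features")  # where the .pt files are written
FEATURE_DIR.mkdir(parents=True, exist_ok=True)

# Hypothetical helpers: load the frozen V-JEPA 2 encoder and sample 32 frames.
encoder = load_vjepa().eval().cuda()

with torch.no_grad():
    for video_path in VIDEO_DIR.glob("*.webm"):
        frames = read_frames(video_path, num_frames=32)  # (32, C, H, W) tensor
        feats = encoder(frames.unsqueeze(0).cuda())       # spatiotemporal features
        torch.save(feats.cpu(), FEATURE_DIR / f"{video_path.stem}.pt")
```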
Download the paraphrase augmentations from somethingsomethingv2-paraphrase and save as ssv2_paraphrase.json in the project root.
This provides additional language variation for training samples, exposing the model to more diverse captions and improving generalization.
Process the annotations to format all captions as lists:
```bash
python prepare_annotations.py
```

This converts captions to list format (validation captions become lists of length 1, training captions include the paraphrases) and simplifies random caption selection in the dataloader.
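For reference, the conversion amounts to something like this sketch (the file names and the paraphrase-file layout are assumptions; the real logic lives in prepare_annotations.py):

```python
import json
import random

# Assumed inputs: the standard SSV2 annotation file, plus the paraphrase file,
# assumed here to map video ids to lists of paraphrased captions.
with open("something-something-v2-train.json") as f:
    train = json.load(f)
with open("ssv2_paraphrase.json") as f:
    paraphrases = json.load(f)

annotations = []
for item in train:
    caption = item["label"]                   # SSV2's templated description
    extras = paraphrases.get(item["id"], [])  # paraphrase augmentations, if any
    annotations.append({"id": item["id"], "captions": [caption, *extras]})

# At training time the dataloader can then pick one caption uniformly at random:
sample_caption = random.choice(annotations[0]["captions"])
```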
```bash
python train.py
```

The model typically converges after 8-10 epochs. Training only the Q-Former (~40M parameters) while keeping V-JEPA 2 and Qwen2 frozen makes this very efficient.
Configuration:
- Batch size: 8
- Gradient accumulation: 16 steps
- Learning rate: 5e-5
- Weight decay: 0.03
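In code, that configuration boils down to roughly the following (a sketch; `model` and `dataloader` are placeholders for the objects built in train.py):

```python
import torch

ACCUM_STEPS = 16  # with a per-step batch size of 8, the effective batch size is 8 * 16 = 128

# Only the Q-Former parameters require gradients, so only they are optimized.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad),
    lr=5e-5,
    weight_decay=0.03,
)

model.train()
optimizer.zero_grad()
for step, batch in enumerate(dataloader):
    # Scale the loss so gradients are averaged over the accumulation window.
    loss = model(**batch).loss / ACCUM_STEPS
    loss.backward()
    if (step + 1) % ACCUM_STEPS == 0:
        optimizer.step()
        optimizer.zero_grad()
```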
The script will:
- Save checkpoints to `checkpoint.pt` after each epoch
- Log losses to `losses.txt`
- Save validation predictions to `results/validation_epoch_N.json`
Run full evaluation with standard captioning metrics:
```bash
python evaluate.py
```

This computes BLEU, METEOR, ROUGE-L, and CIDEr scores on the validation set. Caveat: CIDEr is the only metric loosely worth paying attention to here, since the validation set is not really designed for captioning and its "captions" are not particularly human-like.
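If you want to recompute CIDEr on a saved predictions file yourself, the pycocoevalcap package can be used along these lines (the JSON layout shown is an assumption; the authoritative format is whatever evaluate.py writes):

```python
import json
from pycocoevalcap.cider.cider import Cider

# Assumed layout: video id -> {"prediction": str, "references": [str, ...]}.
with open("results/validation_epoch_8.json") as f:  # hypothetical epoch number
    results = json.load(f)

gts = {vid: [c.lower() for c in r["references"]] for vid, r in results.items()}
res = {vid: [r["prediction"].lower()] for vid, r in results.items()}

cider_score, _ = Cider().compute_score(gts, res)
print(f"CIDEr: {cider_score:.3f}")
```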
Generate captions for test set samples:
```bash
python infer.py --num-samples 10 --checkpoint checkpoint.pt
```

Results are saved to `results/`, with both the generated captions and the source videos.
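Under the hood, generation amounts to handing the Q-Former's output embeddings to the frozen LLM as a visual prefix; roughly (module names are placeholders for the actual ones in this repo):

```python
import torch


@torch.no_grad()
def caption_video(model, tokenizer, video_feats, max_new_tokens=30):
    """Generate a caption from precomputed V-JEPA 2 features (illustrative sketch)."""
    query_embeds = model.qformer(video_feats)  # (1, num_queries, llm_hidden_dim)
    output_ids = model.llm.generate(
        inputs_embeds=query_embeds,  # visual prefix in the LLM's embedding space
        max_new_tokens=max_new_tokens,
        num_beams=3,                 # beam width is an arbitrary choice here
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```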



