SAFFE (Semantic-Alignment Fusion of Frozen Encoders) is a multimodal model
composition framework that combines pretrained encoder models through
bi-modal fusion training.
It provides a streamlined pipeline that trains and evaluates on a
single GPU, making it suitable for resource-constrained environments.
- 🔹 Designed for pretrained frozen encoders\
- 🔹 Bi-modal semantic-alignment fusion\
- 🔹 Single-GPU training & evaluation\
- 🔹 Lightweight and efficient pipeline\
- 🔹 Vector embedding dimension: 768
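
To make the 768-dimensional embedding space and the idea of bi-modal alignment concrete, here is a minimal sketch in plain Python. It is not SAFFE's fusion module: cosine similarity is used as a generic alignment score, concatenation as a stand-in fusion step, and the names `l2_normalize`, `cosine_similarity`, and `fuse_by_concatenation` are hypothetical. Random vectors stand in for the outputs of two frozen encoders.

```python
import math
import random

EMBED_DIM = 768  # embedding dimension stated in the feature list


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_similarity(a, b):
    """Alignment score (roughly in [-1, 1]) between two equal-length embeddings."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


def fuse_by_concatenation(a, b):
    """Stand-in fusion (an assumption, not SAFFE's method):
    normalize each modality, then concatenate."""
    return l2_normalize(a) + l2_normalize(b)


if __name__ == "__main__":
    rng = random.Random(0)
    # Random vectors stand in for the outputs of two frozen encoders.
    emb_a = [rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)]
    emb_b = [rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)]
    print("alignment:", cosine_similarity(emb_a, emb_b))
    print("fused dimension:", len(fuse_by_concatenation(emb_a, emb_b)))
```

In an actual training setup the encoder weights would stay fixed and only the fusion component would be updated; see the paper for the method SAFFE uses.
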
This implementation operates on:
- ImageNet-100 (Kaggle version)
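
Before opening the notebook, it can help to confirm that the dataset is laid out as expected. The sketch below assumes a class-per-folder layout, where each split directory contains one sub-directory per class; that layout, the helper name `index_class_folders`, and the accepted file suffixes are assumptions and may not match the Kaggle download or what `train.ipynb` expects.

```python
from pathlib import Path

# Image suffixes to accept (an assumption; adjust to the actual files).
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def index_class_folders(split_dirs):
    """Collect (image_path, label_index) pairs from class-per-folder directories.

    Each entry of split_dirs is expected to contain one sub-directory per
    class. Class names are merged across all split directories and sorted,
    so label indices are stable between runs.
    """
    split_dirs = [Path(d) for d in split_dirs]
    class_names = sorted(
        {p.name for d in split_dirs for p in d.iterdir() if p.is_dir()}
    )
    class_to_index = {name: i for i, name in enumerate(class_names)}

    samples = []
    for d in split_dirs:
        for class_dir in sorted(p for p in d.iterdir() if p.is_dir()):
            for image_path in sorted(class_dir.iterdir()):
                # Skip anything that is not an image file.
                if image_path.suffix.lower() in IMAGE_SUFFIXES:
                    samples.append((image_path, class_to_index[class_dir.name]))
    return samples, class_names
```

For ImageNet-100, `len(class_names)` should come out as 100 once every training split directory is passed in; a different count suggests a missing or misplaced folder.
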
To begin training SAFFE, run the notebook `train.ipynb`.

If you use SAFFE in your research, please cite:
@article{SAFFE2025,
title={Saffe: Multimodal Model Composition with Semantic-Alignment Fusion of Frozen Encoders},
author={Kulasekara, M. and Ingl{\'e}s-Romero, J.F. and Imbern{\'o}n, B. and others},
journal={The Journal of Supercomputing},
volume={81},
pages={1114},
year={2025},
publisher={Springer},
doi={10.1007/s11227-025-07473-7}
}

🔗 Paper Link: https://doi.org/10.1007/s11227-025-07473-7
This work was supported by:
- MICIU/AEI/10.13039/501100011033\
- European Union NextGenerationEU/PRTR\
- Grants: CNS2023-144241 and RYC2021-031966-I
- Maithri Ranga Kulasekara\
- J.F. Inglés-Romero\
- B. Imbernón\
- José L. Abellán