Latent Diffusion on CelebA

This repo implements a DiT-based conditional flow model heavily inspired by Stable Diffusion 3.

Our image generation pipeline has two main components:

1. A variational autoencoder (VAE) trained to encode 128×128 RGB CelebA images into a compact latent space with 16 channels at 16×16 spatial resolution.
2. A diffusion transformer (DiT) trained with classifier-free guidance to model the distribution of those latent codes, conditioned on CelebA's 40 binary facial attributes. To save on compute, we do not use a text encoder.

At inference time, the DiT generates latent samples, which the VAE decoder maps back to full-resolution faces.
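To make the shapes and the guidance step concrete, here is a minimal pure-Python sketch rather than the notebook's actual code. The tensor shapes and the attribute count come from the description above. Everything else is an assumption for illustration: the helper names (`drop_condition`, `guided_prediction`), the use of `None` as the null condition, the drop probability, and the guidance scale.

```python
import random

# Shapes taken from the description above (channels, height, width).
IMAGE_SHAPE = (3, 128, 128)
LATENT_SHAPE = (16, 16, 16)
NUM_ATTRS = 40  # CelebA binary facial attributes


def numel(shape):
    """Number of scalar values in a tensor of the given shape."""
    n = 1
    for d in shape:
        n *= d
    return n


def compression_ratio():
    """How many image values the VAE packs into each latent value."""
    return numel(IMAGE_SHAPE) / numel(LATENT_SHAPE)


def drop_condition(attrs, p_drop, rng):
    """Training-time conditioning dropout for classifier-free guidance.

    With probability p_drop the attribute vector is replaced by a null
    condition (None here; a real model might use a learned embedding).
    """
    return None if rng.random() < p_drop else attrs


def guided_prediction(pred_cond, pred_uncond, scale):
    """Standard classifier-free guidance combination:
    uncond + scale * (cond - uncond), applied elementwise.
    """
    return [u + scale * (c - u) for c, u in zip(pred_cond, pred_uncond)]


if __name__ == "__main__":
    rng = random.Random(0)
    attrs = [rng.randint(0, 1) for _ in range(NUM_ATTRS)]
    print("compression ratio:", compression_ratio())
    print("condition kept:", drop_condition(attrs, p_drop=0.1, rng=rng) is not None)
    print("guided:", guided_prediction([1.0, 2.0], [0.0, 1.0], scale=2.0))
```

With a scale of 1 the guided prediction equals the conditional one, and with a scale of 0 it equals the unconditional one. Scales above 1 push samples further toward the requested attributes.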

References

William Peebles and Saining Xie. 2023. Scalable Diffusion Models with Transformers. Preprint, arXiv:2212.09748.

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-Resolution Image Synthesis with Latent Diffusion Models. Preprint, arXiv:2112.10752.

Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. 2018. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. Preprint, arXiv:1801.03924.
