KUBIG Contest
📄 Project Presentation (PDF)
| 🎵 Thunder.wav + "a beach" | 🎵 Thunder.wav + "a city" |
|---|---|
| ![]() | ![]() |
Generate different scenes from the same audio by varying the text prompt
Input: Audio file (.wav) + Text prompt
Output: 512x512 image
CLAP2Diffusion is a model that generates images from audio and text inputs. It builds on the SonicDiffusion architecture, adding a 3-stage hierarchical audio decomposition and Norm 60 optimization, so that characteristics of the audio are reflected in the generated image.
- Hierarchical Audio Processing: 3-stage decomposition into foreground/background/atmosphere
- Norm 60 Optimization: A normalization value of 60, found experimentally to work best
- Temperature Annealing: Temperature progressively lowered from 2.0 to 0.5
- 4x Self-Attention: Enhanced audio token generation
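The Norm 60 and temperature annealing items above can be illustrated with a minimal sketch. It assumes that "Norm 60" means rescaling each audio embedding to an L2 norm of 60, that the temperature decays linearly over training steps, and that the temperature is applied inside a softmax. The source states none of these, and all function names here are hypothetical.

```python
import math

TARGET_NORM = 60.0          # "Norm 60" from the feature list
T_START, T_END = 2.0, 0.5   # annealing range from the feature list


def rescale_to_norm(vec, target=TARGET_NORM, eps=1e-8):
    """Rescale a vector so its L2 norm equals `target` (assumed meaning of Norm 60)."""
    norm = math.sqrt(sum(x * x for x in vec))
    scale = target / max(norm, eps)
    return [x * scale for x in vec]


def annealed_temperature(step, total_steps, start=T_START, end=T_END):
    """Linearly interpolate the temperature from `start` to `end` (assumed schedule)."""
    if total_steps <= 1:
        return end
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return start + (end - start) * frac


def softmax(logits, temperature):
    """Temperature-scaled softmax: a lower temperature gives sharper weights."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

Under these assumptions, early training steps (temperature near 2.0) spread weight more evenly across the foreground/background/atmosphere components, while later steps (temperature near 0.5) concentrate it on the dominant one.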
AudioCaps: Audio-text paired dataset extracted from YouTube videos
- Includes various everyday sounds and environmental audio
- Provides natural language captions for each audio
- Train/Val/Test splits for training and evaluation
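A minimal sketch of reading the audio-caption pairs for one split. It assumes each split is a CSV with `youtube_id`, `start_time`, and `caption` columns, which is believed to match the public AudioCaps release. The `AudioCaption` type and `read_split` helper are hypothetical and not part of this repository.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class AudioCaption:
    youtube_id: str
    start_time: int  # clip start, in seconds, within the YouTube video
    caption: str


def read_split(csv_file):
    """Parse one split (train/val/test) into AudioCaption records."""
    reader = csv.DictReader(csv_file)
    return [
        AudioCaption(row["youtube_id"], int(row["start_time"]), row["caption"].strip())
        for row in reader
    ]


# Usage with an in-memory example row (illustrative, not real dataset content)
sample = io.StringIO(
    "audiocap_id,youtube_id,start_time,caption\n"
    "1,abc123XYZ_0,30,Thunder rumbles while rain falls\n"
)
pairs = read_split(sample)
```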
Transforming thunder sound into various scenes:
🔊 Audio Playback: Thunder.wav
| Audio | Text Prompt | Generated Image |
|---|---|---|
| Thunder.wav | "a beach" | ![]() |
| Thunder.wav | "a city" | ![]() |
| Thunder.wav | "a forest" | ![]() |
Limitation: human vocal sounds such as laughter are not rendered properly:
🔊 Audio Samples: laughing_baby.wav | laughing_man.wav
- Audio Projector: 2.2M parameters
- Hierarchical Decomposer: 0.3M parameters
- Inference Speed: ~2 seconds/image (on a GPU)
- Memory: ~6GB VRAM
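As a rough sanity check on the figures above, the combined parameter count of the two listed components and its approximate on-disk size can be estimated. The sketch assumes float32 weights (4 bytes per parameter), which the source does not state.

```python
BYTES_PER_PARAM = 4  # assumption: float32 weights

params = {
    "audio_projector": 2_200_000,
    "hierarchical_decomposer": 300_000,
}

total_params = sum(params.values())
size_mb = total_params * BYTES_PER_PARAM / 1e6
print(f"{total_params / 1e6:.1f}M parameters, ~{size_mb:.0f} MB in float32")
```

Under that assumption the added modules come to about 2.5M parameters and roughly 10 MB.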
Pre-trained models are included in the `checkpoints/` folder:
- `audio_projector_stage1.pth`: Stage 1 model
- `audio_projector_stage2.pth`: Stage 2 model
- `audio_projector_stage3_finetuned.pth`: Final model
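The checkpoint names listed above can be resolved with a small helper. This is a sketch only: `checkpoint_path` and `STAGE_FILES` are hypothetical names, and the repository may select checkpoints differently.

```python
from pathlib import Path

CHECKPOINT_DIR = Path("checkpoints")

STAGE_FILES = {
    1: "audio_projector_stage1.pth",
    2: "audio_projector_stage2.pth",
    3: "audio_projector_stage3_finetuned.pth",  # final model
}


def checkpoint_path(stage=3, root=CHECKPOINT_DIR):
    """Return the checkpoint path for a training stage (defaults to the final model)."""
    if stage not in STAGE_FILES:
        raise ValueError(f"unknown stage {stage!r}; expected one of {sorted(STAGE_FILES)}")
    return Path(root) / STAGE_FILES[stage]
```

The returned path could then be passed to `torch.load(path, map_location="cpu")`. The internal layout of the checkpoint files is not documented here, so inspect the loaded object before using it.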
docker-compose up --build

conda create -n clap2diffusion python=3.10
conda activate clap2diffusion
pip install -r requirements.txt
python app/gradio_app.py

# Stage 1: Audio Projector (3,000 steps)
python scripts/train_stage1.py
# Stage 2: Full model (2,000 steps)
python scripts/train_stage2.py
# Stage 3: Fine-tuning (1,000 steps)
python scripts/train_stage3.py

MIT License
- SonicDiffusion (2023): "SonicDiffusion: Audio-Driven Image Generation and Editing with Pretrained Diffusion Models"
- AudioLDM 2 (2023): "AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining"
- CLAP (2023): "Large-scale Contrastive Language-Audio Pre-training with Feature Fusion and Keyword-to-Caption Augmentation"
- Stable Diffusion (2022): "High-Resolution Image Synthesis with Latent Diffusion Models"
- AudioCaps (2019): "AudioCaps: Generating Captions for Audios in The Wild"





