Traditional lip-syncing methods rely heavily on audio to guide mouth movements. But what happens when audio is missing, corrupted, or unavailable—such as in dubbing, translation, or accessibility scenarios?
LipGANs is my attempt to solve this problem by generating realistic lip movements without using audio at all. Instead, it leverages GANs (Generative Adversarial Networks) to map text or phoneme sequences directly into lip image frames. This makes the project unique and versatile, since no fixed timestamps or speech waveforms are required.
- Deaf and hard-of-hearing users can type words and visually learn how lip shapes look when spoken.
- When creating dubbed movies or multilingual content, we often only have translated text and not clean audio. This system enables generating lip-synced visuals directly from text.
- In cases where recordings are damaged, this approach still allows realistic lip generation without needing the original sound.
Unlike traditional lip-sync models that are audio-first, this project explores a text-to-visual pipeline. It introduces a way to generate synchronized mouth movements even in the absence of audio, bridging accessibility and entertainment needs in a novel way.
This project showcases the potential of generative AI for:
- Inclusive communication
- Cross-language dubbing
- Accessible education
LipGANs aims to reshape how we think about speech visualization—making it more inclusive, adaptable, and resilient.
Text → Phonemes → Predicted Durations → Visemes → GANs → Frames → Video
- Audio-free lip generation → Converts raw text directly into viseme-based animations.
- Phoneme-to-Viseme Mapping → Maps linguistic units to 10 distinct mouth shapes.
- Per-Viseme GAN Training → A separate 3D Convolutional GAN is trained for each viseme class.
- Automatic Dataset Preprocessing → Segmentation, lip ROI extraction, normalization.
- Built on TCD-TIMIT dataset → Aligned audiovisual dataset for speech-driven lip synthesis.
```
lipgans/
├─ README.md                        # Project documentation
├─ requirements.txt                 # Python dependencies
├─ .gitignore                       # Git ignore rules
├─ config/
│  └─ paths.example.yaml            # Example YAML for setting dataset and model paths
├─ src/
│  └─ lipgans/
│     ├─ __init__.py
│     ├─ config.py                  # Config options: paths, latent dims, FPS, frame size
│     ├─ phonemes.py                # Functions to convert word → phonemes → visemes
│     ├─ data/                      # Dataset preprocessing utilities
│     │  ├─ mlf_parser.py           # Parses TCD-TIMIT phoneme MLF files
│     │  ├─ extract_viseme_clips.py # Segments video/audio into per-viseme clips
│     │  ├─ crop_mouth.py           # Crops mouth ROI from frames
│     │  └─ dataset.py              # Dataset helper: load & organize clips for GAN training
│     ├─ models/
│     │  └─ gan3d.py                # 3D convolutional GAN architecture per viseme
│     ├─ train/
│     │  └─ train_viseme.py         # Script to train a single viseme GAN
│     ├─ generate/
│     │  ├─ merge_gans.py           # Load per-viseme GANs, generate frames, save PNG/GIF/MP4
│     │  └─ frontend.py             # Optional GUI / interface to generate words interactively
│     └─ utils/
│        ├─ io.py                   # File I/O helpers
│        ├─ video.py                # Video assembling & frame handling helpers
│        └─ seed.py                 # Random seed initialization for reproducibility
├─ scripts/                         # High-level scripts for batch processing or experiments
│  ├─ extract_all.py                # Slice all videos into per-viseme clips
│  ├─ crop_all.py                   # Crop mouth regions for all dataset videos
│  ├─ train_all.py                  # Train GANs for all viseme classes
│  ├─ generate_word.py              # Generate lip animation for a single word
│  └─ preview_crops.py              # Quick preview of cropped mouth ROIs
└─ examples/                        # Example outputs
   └─ demo_words.txt                # List of example words for demo generation
```
```bash
git clone https://github.com/your-username/lipgans.git
cd lipgans
python -m venv venv
source venv/bin/activate   # Linux/Mac
venv\Scripts\activate      # Windows
pip install -r requirements.txt
```

Dependencies include:
- TensorFlow / Keras
- NumPy, OpenCV, Imageio
- MediaPipe (for lip landmark detection)
- ffmpeg (for slicing & assembling clips)
- NLTK (for CMU Pronouncing Dictionary)
Download the TCD-TIMIT dataset manually and place it under:

```
data/raw/
```
Run preprocessing scripts:
```bash
python src/lipgans/data/extract_viseme_clips.py
python src/lipgans/data/crop_mouth.py
```
This will:
- Segment videos into phoneme-aligned clips.
- Extract mouth regions using MediaPipe FaceMesh.
- Map phonemes → visemes (10 classes).
- Save normalized 3-frame 64×64 sequences into `data/viseme_xx/`.
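The mouth-ROI step boils down to turning a handful of FaceMesh lip landmarks into a padded crop box. A minimal sketch, assuming normalized (x, y) landmarks as MediaPipe returns them; the landmark indices and padding factor below are illustrative, not the repo's actual values:

```python
import numpy as np

# Hypothetical subset of MediaPipe FaceMesh lip landmark indices.
LIP_IDX = [61, 291, 0, 17, 78, 308, 13, 14]

def mouth_bbox(landmarks, img_w, img_h, pad=0.25):
    """Compute a padded mouth bounding box from normalized landmarks.

    `landmarks` is a sequence of (x, y) pairs in [0, 1], as produced by
    MediaPipe FaceMesh after selecting the lip indices.
    """
    pts = np.asarray(landmarks, dtype=np.float32)
    x0, y0 = pts.min(axis=0)
    x1, y1 = pts.max(axis=0)
    # Expand by `pad` on each side and make the box square in
    # normalized coordinates, centered on the lips.
    side = max(x1 - x0, y1 - y0) * (1 + 2 * pad)
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    x0, x1 = cx - side / 2, cx + side / 2
    y0, y1 = cy - side / 2, cy + side / 2
    # Convert to pixel coordinates, clamped to the frame.
    return (max(0, int(x0 * img_w)), max(0, int(y0 * img_h)),
            min(img_w, int(x1 * img_w)), min(img_h, int(y1 * img_h)))
```

In the actual pipeline the resulting box would then be used (e.g. with OpenCV) to crop each frame and resize it to 64×64.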
A viseme is any of several speech sounds that look the same on the lips, for example when lip reading.
Unlike phonemes (the smallest units of sound in language), visemes represent groups of phonemes that appear visually identical on the face when spoken.
👉 Example:
- The phonemes /p/, /b/, and /m/ all map to the same viseme (closed lips).
This is why phoneme-to-viseme mapping is essential for lip animation:
- It reduces complexity.
- It ensures natural-looking articulation.
📌 Example mapping (simplified):
| Viseme Class | Example Phonemes | Lip Shape Description |
|---|---|---|
| Closed Lips | /p/, /b/, /m/ | Lips fully closed |
| Teeth Touching | /t/, /d/ | Tongue touches teeth |
| Open Mouth (wide) | /a/, /aa/ | Jaw dropped, lips open wide |
| Rounded Lips | /oo/, /uw/, /w/ | Lips rounded forward |
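In code, the table above reduces to a small lookup dictionary. A hedged sketch using ARPABET symbols from the CMU dictionary; the repo keeps its full mapping in `viseme_mapping.json`, and the class names here are made up for illustration:

```python
# Simplified phoneme → viseme table (illustrative class names).
PHONEME_TO_VISEME = {
    "P": "01_closed_lips", "B": "01_closed_lips", "M": "01_closed_lips",
    "T": "02_teeth",       "D": "02_teeth",
    "AA": "03_open_wide",  "AE": "03_open_wide",
    "UW": "04_rounded",    "OW": "04_rounded",    "W": "04_rounded",
}

def phonemes_to_visemes(phonemes):
    """Map a phoneme sequence to viseme classes, stripping lexical stress
    digits (CMU entries look like 'AE1') and collapsing adjacent repeats."""
    out = []
    for p in phonemes:
        v = PHONEME_TO_VISEME.get(p.rstrip("012"))
        if v is not None and (not out or out[-1] != v):
            out.append(v)
    return out

# "bat" → B AE1 T → closed lips, open wide, teeth
print(phonemes_to_visemes(["B", "AE1", "T"]))
```

Collapsing adjacent repeats keeps back-to-back phonemes from the same class (e.g. /p/ followed by /b/) from producing two identical clips.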
Train a GAN for a specific viseme class:
```bash
python src/lipgans/train/train_viseme.py --viseme_id 03 --epochs 200
```

- `--viseme_id`: Viseme class (01–10).
- `--epochs`: Number of training epochs (default = 200).
Trained models will be stored in `models/viseme_xx/`.
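For reference, a per-viseme 3D convolutional generator along these lines can be sketched in Keras. The output shape follows the README (3-frame 64×64 RGB clips), but the latent dimension and layer widths are assumptions, not necessarily what `gan3d.py` uses:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_generator(latent_dim=100):
    """Sketch of a per-viseme generator: latent vector → 3×64×64×3 clip."""
    return tf.keras.Sequential([
        layers.Input(shape=(latent_dim,)),
        layers.Dense(3 * 8 * 8 * 128),
        layers.Reshape((3, 8, 8, 128)),  # (frames, H, W, channels)
        # Upsample only the spatial axes; the 3-frame temporal axis is fixed.
        layers.Conv3DTranspose(64, (3, 4, 4), strides=(1, 2, 2),
                               padding="same", activation="relu"),
        layers.Conv3DTranspose(32, (3, 4, 4), strides=(1, 2, 2),
                               padding="same", activation="relu"),
        layers.Conv3DTranspose(3, (3, 4, 4), strides=(1, 2, 2),
                               padding="same", activation="tanh"),
    ])

gen = build_generator()
frames = gen(tf.random.normal((1, 100)))
print(frames.shape)  # (1, 3, 64, 64, 3)
```

Each viseme class gets its own copy of such a generator (and a matching 3D-convolutional discriminator), which is what keeps individual mouth shapes sharp compared to a single multi-class model.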
The output is a sequence of generated frames (PNG), which can also be saved as GIF or MP4.
```bash
python scripts/generate_word.py
```

Steps performed:
- Text → Phonemes (using CMU Pronouncing Dictionary).
- Phonemes → Visemes (via `viseme_mapping.json`).
- GAN Generation: Loads each viseme GAN and generates 3-frame clips.
- Chaining & Smoothing: Concatenates clips with temporal blending.
Output saved in:
```
examples/cat/
├─ cat_01.png
├─ cat_02.png
├─ cat_03.png
├─ ...
├─ cat.gif
└─ cat.mp4
```
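The chaining-and-smoothing step can be illustrated with a simple seam crossfade between consecutive clips (a hypothetical helper; the repo's temporal blending may differ):

```python
import numpy as np

def chain_clips(clips):
    """Concatenate per-viseme clips, blending each seam frame.

    Each clip is an array of shape (frames, H, W, C) with values in [0, 1].
    The last frame of the running sequence is averaged with the first frame
    of the next clip, softening the transition between visemes.
    """
    out = list(clips[0])
    for clip in clips[1:]:
        out[-1] = 0.5 * out[-1] + 0.5 * clip[0]  # crossfade the seam
        out.extend(clip[1:])
    return np.stack(out)

# Two 3-frame clips → 5 frames total (one frame shared at the seam).
clips = [np.zeros((3, 64, 64, 3)), np.ones((3, 64, 64, 3))]
video = chain_clips(clips)
print(video.shape)
```

A longer crossfade window (blending several frames per seam) would trade a shorter output for smoother articulation.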
| Approach | Output Quality |
|---|---|
| Single Multi-Class GAN | Blurry, frequent mode collapse |
| Per-Viseme GANs (ours) | Sharper details, stable articulation |
✅ Generated clips show accurate viseme realization and plausible articulation across unseen speakers.
- 🎭 Virtual Avatars & Chatbots → Realistic mouth articulation in animated characters.
- 🗣 Speech Therapy Tools → Helping learners visualize correct articulation.
- 🦻 Assistive Technology for the Deaf/Hard of Hearing → Deaf children (or learners with hearing difficulties) can type a word or sentence into the UI and see a sequence of lip movements (frames or animation) showing how it would be spoken, bridging the gap between written text and spoken articulation.
- 🎮 Gaming & AR/VR → Lifelike lip-syncing for immersive experiences and animated characters.
- 🎬 Audio Dubbing & Localization → Generate realistic lip movements that match translated text for films, shows, and animations.
- 🔹 Speaker-conditioned GANs (identity preservation).
- 🔹 Variable-length viseme clips for realistic timing.
- 🔹 Quantitative evaluation using FVD, lip-reading accuracy.
- 🔹 Multilingual support (phoneme mappings for other languages).
- 🔹 Real-time integration for virtual avatars and chatbots.
- 🔹 Integration with dubbing & localization pipelines for film and media industries.
Contributions are welcome!
- Fork the repo
- Create a new branch (`feature-xyz`)
- Commit your changes
- Open a Pull Request 🚀
This project is licensed under the MIT License – see LICENSE for details.
If you use this project in your research, please cite:
```bibtex
@misc{lipgans2025,
  author = {Nandita Singh},
  title  = {LipGANs: Text-to-Viseme GAN Framework for Audio-Free Lip Animation Generation},
  year   = {2025},
  url    = {https://github.com/madebynanditaaa/lipgans}
}
```

✨ With LipGANs, we take the first step towards speech-free, text-driven lip animation for next-generation human–computer interaction and accessibility!