This project was developed as part of the OEL AI Lab (Spring 2026). It implements a complete multimodal AI pipeline for video understanding and cinematic trailer generation.
The system performs:
- Character detection using YOLO11
- ROI-based visual transformation (darkening effect)
- High-impact scene analysis on pre-segmented clips
- Machine learning-based clip classification
- Automated trailer generation
- Creepy visual transformation effects
- NLP-based cinematic caption generation with TTS audio
- Atmospheric background music generation
- Final trailer export with mixed audio
The project is implemented using two separate Google Colab notebooks for modular execution.
This project is intended for academic learning and evaluation purposes.
- The code and methodology may be referenced for learning.
- Copying, modifying, or using this project in assignments or submissions requires prior permission from the author.
- Unauthorized use without consent is not permitted.
- Python
- YOLO11 (Ultralytics)
- OpenCV
- Scikit-learn
- MoviePy / FFmpeg
- HuggingFace Transformers (BLIP)
- Coqui TTS (tacotron2-DDC)
- Pillow
- Scipy (WAV audio I/O, pitch shifting, reverb processing)
- Matplotlib / NumPy / Pandas
cinematic-intelligence-system/
│
├── task1_character_detection_videos/
├── task2_trailer_input_videos/
├── 01_yolo_detection.ipynb
├── 02_task2_trailer_generation.ipynb
├── OEL LAB (AI).pdf
└── README.md
Task 1:
- Input: Full-length video files (horizon.mp4, uncharted.mp4, need_for_speed.mp4, lotr.mp4)
- Purpose: Character detection and ROI-based darkening transformation
Task 2:
- Input: Pre-segmented video clips (numbered .mp4 files)
- Purpose: Feature extraction, classification, trailer generation, and cinematic post-processing
This project is designed to run entirely in Google Colab.
Open:
- 01_yolo_detection.ipynb
- 02_task2_trailer_generation.ipynb
Mount Google Drive:
from google.colab import drive
drive.mount('/content/drive')
Install dependencies:
pip install ultralytics opencv-python moviepy transformers scikit-learn matplotlib pandas numpy scipy Pillow TTS
apt-get install -y ffmpeg fonts-dejavu
Task 1 detects characters in video frames, draws bounding boxes, extracts regions of interest (ROIs), applies a darkening transformation, and reconstructs the output video.
- Load yolo11n.pt and run inference on each frame
- Detect all objects with confidence ≥ 0.5
- Draw green bounding boxes around all detected objects
- Crop the detected region (ROI) from each frame
- Apply a darkening effect (pixel values scaled to 50%)
- Reinsert the modified ROI back into the frame
- Save the processed video using OpenCV's VideoWriter
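The crop, darken, and reinsert steps above can be sketched with NumPy alone. This is a minimal illustration, not the notebook's code: the function name `darken_rois` and the `(x1, y1, x2, y2)` box format are assumptions, and in the real pipeline the boxes would come from the YOLO11 detections.

```python
import numpy as np

def darken_rois(frame, boxes, factor=0.5):
    """Return a copy of `frame` with each box region scaled by `factor`.

    frame: HxWx3 uint8 image (the layout OpenCV uses).
    boxes: iterable of (x1, y1, x2, y2) pixel coordinates (assumed format).
    """
    out = frame.copy()
    h, w = out.shape[:2]
    for x1, y1, x2, y2 in boxes:
        # Clamp to the frame so partially off-screen boxes do not fail.
        x1, y1 = max(0, int(x1)), max(0, int(y1))
        x2, y2 = min(w, int(x2)), min(h, int(y2))
        if x2 <= x1 or y2 <= y1:
            continue
        roi = out[y1:y2, x1:x2].astype(np.float32)            # crop the ROI
        out[y1:y2, x1:x2] = (roi * factor).astype(np.uint8)   # darken and reinsert
    return out
```

With `factor=0.5` a pixel value of 200 inside a box becomes 100. Note that overlapping boxes are darkened once per box in this sketch.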
Processed videos:
- output_detection.mp4 (horizon)
- uncharted_output_detection.mp4
- needForSpeed_output_detection.mp4
- lotr_output_detection.mp4
- horizon(1)_output_detection.mp4
Task 2 works on pre-segmented clips, where each clip is independently analyzed and ranked for impact. No scene detection or video splitting is performed.
Each clip is sampled every 5 frames and converted into a feature vector:
- Motion score (mean absolute frame difference)
- Brightness mean and variance
- YOLO11 object count per frame
- Scene cut rate (large brightness jumps between frames)
- Audio energy and MFCC mean (zero-filled, so these columns carry no information for the classifier)
Features are saved to features.csv.
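The visual features listed above could be computed roughly as follows. This is a sketch under assumptions: the function name `clip_features`, the scene-cut threshold of 30 brightness levels, and the choice to take the variance over per-frame mean brightness are all illustrative and not confirmed by the notebook. The YOLO11 object count is omitted because it needs the detector.

```python
import numpy as np

def clip_features(gray_frames, cut_threshold=30.0):
    """Compute simple visual features from sampled grayscale frames.

    gray_frames: non-empty list of 2-D arrays of equal shape,
                 e.g. every 5th frame of a clip.
    cut_threshold: assumed brightness jump that counts as a scene cut.
    """
    frames = [np.asarray(f, dtype=np.float32) for f in gray_frames]
    brightness = np.array([f.mean() for f in frames])
    # Motion: mean absolute difference between consecutive sampled frames.
    diffs = [np.abs(b - a).mean() for a, b in zip(frames, frames[1:])]
    # Scene cuts: fraction of frame transitions with a large brightness jump.
    jumps = np.abs(np.diff(brightness))
    return {
        "motion": float(np.mean(diffs)) if diffs else 0.0,
        "brightness_mean": float(brightness.mean()),
        "brightness_var": float(brightness.var()),
        "cut_rate": float((jumps > cut_threshold).mean()) if len(jumps) else 0.0,
    }
```

One such dictionary per clip can then be collected into a table and written out, for example with `pandas.DataFrame(rows).to_csv("features.csv")`.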
Labels are generated automatically using a weighted composite impact score:
- Motion (40%), Objects (35%), Scene cuts (15%), Brightness variance (10%)
- Median split → +1 (high-impact) / -1 (low-impact)
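The weighted score and median split can be illustrated in plain Python. The weights are the ones stated above; everything else is an assumption made for the sketch: min-max normalisation of each feature before weighting, the feature key names, and assigning clips exactly at the median to the low-impact class.

```python
import statistics

# Weights from the description above.
WEIGHTS = {"motion": 0.40, "objects": 0.35, "cuts": 0.15, "brightness_var": 0.10}

def minmax(values):
    """Scale values to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def impact_labels(features):
    """features: dict mapping each key in WEIGHTS to a list of per-clip values.

    Returns (scores, labels) where labels are +1 (high) or -1 (low).
    """
    n = len(features["motion"])
    norm = {k: minmax(features[k]) for k in WEIGHTS}
    scores = [sum(WEIGHTS[k] * norm[k][i] for k in WEIGHTS) for i in range(n)]
    median = statistics.median(scores)
    labels = [1 if s > median else -1 for s in scores]
    return scores, labels
```

Because the labels are derived from the same features the classifier sees, the model is effectively learning to reproduce this scoring rule.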
Model used: Logistic Regression (L2-regularised, with StandardScaler)
- Stratified K-Fold cross-validation (up to 5 folds)
- Saved as impact_model.pkl and scaler.pkl
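A minimal sketch of this training setup with scikit-learn follows. It assumes "up to 5 folds" means the fold count is capped by the size of the smaller class, and it bundles the scaler and classifier into one pipeline for cross-validation, whereas the project saves them as two separate files. The function name `evaluate_impact_model` is hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def evaluate_impact_model(X, y, max_folds=5, seed=0):
    """Cross-validate a scaled logistic regression, then fit it on all clips.

    Returns (fitted pipeline, per-fold accuracies).
    """
    y = np.asarray(y)
    # Each fold needs at least one sample of every class, so cap the fold count.
    smallest_class = min(np.sum(y == label) for label in np.unique(y))
    n_folds = int(min(max_folds, smallest_class))
    if n_folds < 2:
        raise ValueError("need at least 2 clips per class to cross-validate")
    # LogisticRegression uses L2 regularisation by default.
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    cv = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    scores = cross_val_score(model, X, y, cv=cv)
    model.fit(X, y)
    return model, scores
```

Putting the scaler inside the pipeline means it is refit on each training fold, which keeps validation clips from influencing the scaling statistics.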
- Clips ranked by composite score (motion + objects)
- Top 5 clips selected using a greedy diversity constraint (at least 2 index positions apart)
- Clips ordered low → high impact for narrative suspense build
- Merged using MoviePy → saved as trailer.mp4
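The selection and ordering logic above can be sketched as a short greedy loop. The function name `select_trailer_clips` is hypothetical, and the sketch assumes "at least 2 index positions apart" means an absolute index difference of 2 or more.

```python
def select_trailer_clips(scores, k=5, min_gap=2):
    """Greedily pick up to k clip indices by descending score, skipping any
    clip closer than min_gap positions to one already chosen, then return
    the picks ordered from lowest to highest score."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen = []
    for i in ranked:
        if all(abs(i - j) >= min_gap for j in chosen):
            chosen.append(i)
        if len(chosen) == k:
            break
    # Low -> high impact, so the trailer builds towards its strongest clip.
    return sorted(chosen, key=lambda i: scores[i])
```

If too many high-scoring clips are adjacent, fewer than `k` clips may be returned; the sketch does not relax the gap constraint in that case.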
Applied per-frame to trailer.mp4 → output: creepy_trailer.mp4
Effects include:
- Face darkening (upper 40% of person bounding box shadowed)
- Red glowing eyes (overlaid at estimated eye positions)
- Random blood drip lines (30% probability per person detection)
- Background desaturation (everything outside person boxes goes grayscale)
- Cinematic vignette (edge darkening)
- Brightness flicker (random ±15–20% per frame)
- Glitch shift (6% probability, horizontal pixel band displacement)
- Fog overlay (15% probability per frame)
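Two of the effects above, background desaturation and the vignette, can be sketched with NumPy. The helper names, the `strength` parameter, and the quadratic falloff are assumptions for illustration; the person boxes would come from YOLO11 in the actual notebook.

```python
import numpy as np

def desaturate_background(frame, person_boxes):
    """Grayscale everything outside the person boxes.

    frame: HxWx3 uint8 image; person_boxes: (x1, y1, x2, y2) tuples.
    """
    # Plain channel average as a simple stand-in for a weighted luminance.
    gray = frame.mean(axis=2, keepdims=True).astype(np.uint8)
    out = np.repeat(gray, 3, axis=2)
    for x1, y1, x2, y2 in person_boxes:
        out[y1:y2, x1:x2] = frame[y1:y2, x1:x2]  # restore colour inside each box
    return out

def vignette(frame, strength=0.6):
    """Darken pixels according to their squared distance from the centre."""
    h, w = frame.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    dx = (xs - (w - 1) / 2) / (w / 2)
    dy = (ys - (h - 1) / 2) / (h / 2)
    # Normalised distance: 0 at the centre, close to 1 at the corners.
    d = np.sqrt(dx ** 2 + dy ** 2) / np.sqrt(2)
    mask = 1.0 - strength * np.clip(d, 0, 1) ** 2
    return (frame * mask[..., None]).astype(np.uint8)
```

The probabilistic effects (blood drips, glitch shift, fog) would follow the same per-frame pattern, gated by a random draw against the stated probability.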
Key frames are sampled from each selected clip and captioned using BLIP (Salesforce/blip-image-captioning-base).
Captions are transformed into cinematic horror-style text via:
- Word-level substitution dictionary (e.g. "man" → "figure", "walks" → "lurks")
- Random creepy prefix (e.g. "They were warned...")
- Random creepy suffix (e.g. "Something watches.")
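The three-step caption transform can be illustrated as follows. The dictionary and phrase lists contain only the examples quoted above; the notebook's full lists are not shown here, and the function name `horrorize` is hypothetical.

```python
import random

# Only the examples given in the text; the real lists are presumably longer.
SUBSTITUTIONS = {"man": "figure", "walks": "lurks"}
PREFIXES = ["They were warned..."]
SUFFIXES = ["Something watches."]

def horrorize(caption, rng=random):
    """Apply word substitutions, then wrap the caption in a random prefix and suffix."""
    words = [SUBSTITUTIONS.get(w.lower(), w) for w in caption.split()]
    body = " ".join(words)
    return f"{rng.choice(PREFIXES)} {body}. {rng.choice(SUFFIXES)}"
```

For example, `horrorize("a man walks in the dark")` yields "They were warned... a figure lurks in the dark. Something watches." Words with attached punctuation are not matched in this sketch.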
Captions are rendered as text overlays onto creepy_trailer.mp4 → captioned_trailer.mp4.
Each caption is converted to speech, then processed with:
- Pitch-shifting (lower, creepier tone)
- Reverb effect
- Whisper-style amplitude envelope
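The three voice-processing steps can be approximated with NumPy. This is a stand-in, not the notebook's method (the project lists SciPy for pitch shifting and reverb): pitch is lowered by simple resampling, which also slows the speech, reverb is a few decaying echoes, and the envelope shape and all parameter values are assumptions.

```python
import numpy as np

def creepy_voice(samples, sample_rate, pitch_factor=0.85,
                 echo_delay_s=0.08, echo_decay=0.4, n_echoes=3):
    """Lower the pitch, add echo-style reverb, and apply a soft envelope."""
    x = np.asarray(samples, dtype=np.float64)
    # 1. Pitch down by resampling: stretching the waveform lowers its pitch.
    n_out = int(round(len(x) / pitch_factor))
    x = np.interp(np.linspace(0, len(x) - 1, n_out), np.arange(len(x)), x)
    # 2. Simple reverb: sum delayed, exponentially decaying copies.
    delay = int(echo_delay_s * sample_rate)
    wet = np.zeros(len(x) + delay * n_echoes)
    wet[:len(x)] += x
    for k in range(1, n_echoes + 1):
        wet[k * delay:k * delay + len(x)] += (echo_decay ** k) * x
    # 3. Whisper-style envelope: fade in and out (assumed shape).
    wet *= np.sqrt(np.sin(np.linspace(0, np.pi, len(wet))))
    # Normalise to a peak of 1.0 to avoid clipping when saved.
    peak = np.max(np.abs(wet))
    return (wet / peak if peak > 0 else wet).astype(np.float32)
```

A pitch shift that preserves duration would need a phase-vocoder or similar approach, which is beyond this sketch.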
Atmospheric background music is generated programmatically:
- Dark sine-wave drone (40Hz, 60Hz, 90Hz, 135Hz layers)
- Heartbeat pulse at ~62 BPM
Voice clips and drone are mixed and saved as final_audio.wav.
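The drone, heartbeat, and mix can be sketched as below. The layer frequencies and the 62 BPM rate come from the description above; the per-layer amplitudes, the shape of the heartbeat thump, the sample rate, and the mixing gain are assumptions.

```python
import numpy as np

def drone(duration_s, sample_rate=22050, freqs=(40, 60, 90, 135)):
    """Sum of sine layers, higher layers quieter (assumed 1/(i+1) weighting)."""
    t = np.arange(int(duration_s * sample_rate)) / sample_rate
    layers = sum(np.sin(2 * np.pi * f * t) / (i + 1) for i, f in enumerate(freqs))
    return layers / np.max(np.abs(layers))

def heartbeat(duration_s, sample_rate=22050, bpm=62, thump_hz=50, thump_s=0.12):
    """Place a short decaying low-frequency thump on every beat."""
    n = int(duration_s * sample_rate)
    out = np.zeros(n)
    tt = np.arange(int(thump_s * sample_rate)) / sample_rate
    thump = np.sin(2 * np.pi * thump_hz * tt) * np.exp(-30 * tt)
    step = int(round(sample_rate * 60 / bpm))  # samples between beats
    for start in range(0, n, step):
        end = min(start + len(thump), n)
        out[start:end] += thump[:end - start]
    return out

def mix(voice, bed, bed_gain=0.3):
    """Overlay the bed under the voice; rescale only if the sum would clip."""
    out = np.zeros(max(len(voice), len(bed)))
    out[:len(voice)] += np.asarray(voice)
    out[:len(bed)] += bed_gain * np.asarray(bed)
    peak = np.max(np.abs(out))
    return out / peak if peak > 1.0 else out
```

The mixed float array can be written to disk with `scipy.io.wavfile.write("final_audio.wav", sample_rate, mixed.astype(np.float32))`.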
Video and audio are merged using FFmpeg:
captioned_trailer.mp4 + final_audio.wav → FINAL_TRAILER_PRODUCTION.mp4
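One way to run this merge from Python is to build the FFmpeg command as a list and pass it to `subprocess`. The exact flags used in the notebook are not shown, so the choices here (copying the video stream, encoding audio as AAC, stopping at the shorter input) are assumptions, as are the helper names.

```python
import subprocess

def build_merge_command(video, audio, output):
    """Return an FFmpeg argument list that muxes `audio` onto `video`."""
    return [
        "ffmpeg", "-y",       # overwrite the output if it exists
        "-i", video,
        "-i", audio,
        "-map", "0:v:0",      # video from the first input
        "-map", "1:a:0",      # audio from the second input
        "-c:v", "copy",       # keep the video stream as-is
        "-c:a", "aac",        # encode the WAV as AAC for the MP4 container
        "-shortest",          # stop at the end of the shorter stream
        output,
    ]

def merge(video, audio, output):
    """Run the merge; raises CalledProcessError if FFmpeg fails."""
    subprocess.run(build_merge_command(video, audio, output), check=True)
```

Calling `merge("captioned_trailer.mp4", "final_audio.wav", "FINAL_TRAILER_PRODUCTION.mp4")` would produce the final file, provided FFmpeg is installed.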
Generated outputs include:
- FINAL_TRAILER_PRODUCTION.mp4: final trailer with captions and audio
- creepy_trailer.mp4: visually transformed trailer
- trailer.mp4: raw assembled trailer (5 clips)
- features.csv: extracted clip features and labels
- creepy_music.wav: intensity-driven atmospheric music (per-clip)
- drone_bed.wav: background drone layer mixed under voice
- impact_model.pkl / scaler.pkl: trained classification model
- YOLO detection output videos (Task 1)
All outputs are stored externally due to large file sizes:
🔗 Google Drive Link: https://drive.google.com/drive/folders/1LcevU8MvltBCN0XaVQAyoz4Z5IRDgusD?usp=drive_link
Input Videos / Clips
↓
Task 1: YOLO Detection + ROI Darkening
↓
Output Videos (per input file)
↓
Task 2: Feature Extraction (visual + object count)
↓
Logistic Regression Classification (+1 / -1)
↓
Top 5 High-Impact Clip Selection (diversity-aware)
↓
Trailer Assembly (narrative order: low → high impact)
↓
Creepy Visual Transformations (YOLO-guided, per-frame)
↓
BLIP Caption Generation + Horror NLP Transform
↓
Caption Overlay onto Video
↓
TTS Voice Generation + Atmospheric Music
↓
FFmpeg Audio/Video Merge
↓
FINAL_TRAILER_PRODUCTION.mp4
Romaisa | Maham Anjum | Malaika
AI Lab - Spring 2026 | BS Artificial Intelligence
This system demonstrates a full pipeline combining computer vision, machine learning, NLP, and audio synthesis for intelligent cinematic trailer generation.