A data-driven web application that processes raw subtitles from 40,000 movies to generate interactive "Emotional Seismographs"
We are building CineGraph, a data-driven web application that processes raw subtitles from 40,000 movies to generate interactive "Emotional Seismographs".
Users can explore movies that are emotionally close to one another and select a film to view its sentiment arc over time, tracked across 6 main emotions. The system combines a large-scale scraping pipeline with NLP analysis and clustering techniques to reveal the hidden structure of storytelling.
Main subtitle source: SubsLikeScript
Here is an overview of the core structure and module responsibilities:
CINEGRAPH/
├── .venv/  # Virtual environment
├── src/
│   ├── backend/  # Main Backend application
│   │   ├── api/  # API endpoints and routing
│   │   ├── clustering/  # Graph creation and dataset processing algorithms
│   │   │   ├── graph_creator.py
│   │   │   └── utils.py
│   │   ├── data/  # Local data/storage dumps
│   │   ├── db/  # Database configurations, models, and migrations
│   │   │   ├── base.py
│   │   │   └── session.py
│   │   ├── emotion_analysis/  # NLP analysis models, embeddings, and weights
│   │   │   └── model.py
│   │   ├── experiments/  # Sandbox for testing scripts and models
│   │   ├── preprocessing/  # LangChain agents for data cleaning
│   │   │   └── preprocessing_agent.py
│   │   ├── scraping/  # Selenium pipeline for pulling raw subtitles
│   │   │   ├── scraper.py
│   │   │   └── utils.py
│   │   ├── services/  # Core business logic and external integrations
│   │   ├── .env  # Backend specific environment variables
│   │   ├── dockerfile  # Dockerfile for backend service
│   │   ├── main.py  # Application entry point (Pipeline + FastAPI)
│   │   ├── requirements.txt  # Backend-specific dependencies
│   │   └── settings.py  # App configuration settings
│   └── infra/  # Infrastructure and Orchestration
│       ├── .env  # Infrastructure specific environment variables
│       └── docker-compose.yml  # Docker Compose to spin up app and PostgresDB
├── .gitignore
├── README.md  # Project documentation
└── requirements_full.txt  # Global project dependencies
To get a local copy up and running, follow these simple steps.
Ensure you have the following installed on your machine:
- Docker Desktop
- Python 3.11+ (if running natively)
- Clone the repo:

    git clone https://github.com/Data-Wrangling-and-Visualization-2026/CineGraph.git
    cd CineGraph

- Environment Variables: Create your .env files. You will need to populate both src/backend/.env and src/infra/.env with the following template:

    # List of proxies for scraping
    IP_1=
    IP_2=
    IP_N=
    PROXY_PORT=3128
    DB_URL="postgresql+asyncpg://{user}:{password}@{host}:{port}/movies"
    API_PORT=5555
    FRONT_PORT=5173

- Spin up the infrastructure and application using Docker Compose:

    cd src/infra
    docker-compose up --build

- (Optional) If you wish to run the app natively without Docker:

    python -m venv .venv
    source .venv/bin/activate  # On Windows: .venv\Scripts\activate
    pip install -r requirements_full.txt

  For the frontend:

    cd src/frontend
    npm install
    npm run dev
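As a rough illustration of how the .env template above might be consumed, the following Python sketch collects the numbered IP_* entries into proxy URLs and reads the remaining values. The names `Settings` and `load_settings` are hypothetical and are not taken from settings.py; the `http://` proxy scheme and the fallback defaults are also assumptions.

```python
from __future__ import annotations

import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the values in the .env template."""

    proxies: list[str]
    db_url: str
    api_port: int
    front_port: int


def load_settings(env: dict[str, str] | None = None) -> Settings:
    """Build Settings from a mapping (defaults to the process environment)."""
    env = dict(os.environ) if env is None else env
    port = env.get("PROXY_PORT", "3128")

    # Pick up IP_1, IP_2, ... in numeric order. Keys whose suffix is not a
    # number (such as the literal IP_N placeholder) are ignored.
    ip_keys = sorted(
        (key for key in env if key.startswith("IP_") and key[3:].isdigit()),
        key=lambda key: int(key[3:]),
    )
    # Assumption: proxies are plain HTTP proxies. Blank entries are skipped,
    # so an empty proxy list simply means "scrape without proxies".
    hosts = (env[key].strip() for key in ip_keys)
    proxies = [f"http://{host}:{port}" for host in hosts if host]

    return Settings(
        proxies=proxies,
        db_url=env.get("DB_URL", ""),
        api_port=int(env.get("API_PORT", "5555")),
        front_port=int(env.get("FRONT_PORT", "5173")),
    )
```

This sketch only reads variables that are already present in the environment; loading the .env file itself (for example through Docker Compose or a dotenv loader) is left out.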
- Scraping: The Selenium scraper (src/backend/scraping/scraper.py) uses the proxy list from your .env to scrape scripts from subslikescript.com. Spreading requests across proxies works around the request rate limit (while remaining considerate of the source website). The pipeline also works without proxy servers.
- Preprocessing: Raw subtitles are passed to the LangChain agent (src/backend/preprocessing/preprocessing_agent.py) to clean the original text.
- Emotion Analysis: The NLP model (src/backend/emotion_analysis/model.py) evaluates the emotional trajectory of the subtitle windows. The model outputs an embedding of 6 emotion intensities.
- Clustering & Graphing: The graph modules process the data and save it in a tree-based format to PostgreSQL.
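To make the windowing and emotion steps above more concrete, here is a minimal Python sketch of how a list of cleaned subtitle lines could be turned into a sentiment arc. It is not the code in model.py or graph_creator.py: the window size and step, the `score` callable (a stand-in for the NLP model), and the use of a mean profile with cosine similarity as a notion of "emotionally close" are all assumptions made for illustration.

```python
from __future__ import annotations

import math
from typing import Callable, Sequence

N_EMOTIONS = 6  # the pipeline description mentions 6 emotion intensities


def make_windows(lines: Sequence[str], size: int, step: int) -> list[str]:
    """Join consecutive subtitle lines into (possibly overlapping) windows.

    Windows near the end of the film may contain fewer than `size` lines.
    """
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    return [" ".join(lines[i:i + size]) for i in range(0, len(lines), step)]


def emotion_arc(
    lines: Sequence[str],
    score: Callable[[str], Sequence[float]],
    size: int = 20,  # assumed default, not taken from the project
    step: int = 10,  # assumed default, not taken from the project
) -> list[list[float]]:
    """Score every window and return one 6-value vector per window."""
    arc = []
    for window in make_windows(lines, size, step):
        vector = [float(value) for value in score(window)]
        if len(vector) != N_EMOTIONS:
            raise ValueError(f"expected {N_EMOTIONS} intensities, got {len(vector)}")
        arc.append(vector)
    return arc


def mean_profile(arc: Sequence[Sequence[float]]) -> list[float]:
    """Average the arc over time into a single 6-value profile."""
    if not arc:
        return [0.0] * N_EMOTIONS
    return [sum(column) / len(arc) for column in zip(*arc)]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

With a real model plugged in as `score`, `emotion_arc` yields the per-window series a frontend could plot, and comparing `mean_profile` vectors with `cosine_similarity` is one simple way to rank films by emotional closeness.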
- Build Selenium scraping pipeline for SubsLikeScript
- Integrate LangChain for data preprocessing
- Implement Emotion Analysis with NLP model
- Integrate PostgreSQL
- Complete graph clustering algorithm
- Design & Implement main API
- Build interactive web frontend for the graph representation
- Build interactive web frontend for the "Emotional Seismographs"
See the open issues for a full list of proposed features (and known issues).
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement".
- Fork the Project
- Create your Feature Branch (git checkout -b feat/AmazingFeature)
- Commit your Changes (git commit -m 'Add some AmazingFeature')
- Push to the Branch (git push origin feat/AmazingFeature)
- Open a Pull Request
Additionally, it is highly recommended to follow the Conventional Commits style guide.
Distributed under the MIT License. See LICENSE for more information.
Maybe, we will add it later...
- SubsLikeScript - Main Subtitle Source
- LangChain - LLM Application Framework