
CineGraph

A data-driven web application that processes raw subtitles from 40,000 movies to generate interactive "Emotional Seismographs"
View Demo · Report Bug · Request Feature

Table of Contents

About The Project
- Built With
- Project Structure
Getting Started
- Prerequisites
- Installation
Data pipeline
Roadmap
Contributing
License
Contact
Acknowledgments

About The Project

We are building CineGraph, a data-driven web application that processes raw subtitles from 40,000 movies to generate interactive "Emotional Seismographs".

Users can inspect emotionally close movies and select a film to view its sentiment arc (consisting of 6 main emotions) over time. The system combines a massive scraping pipeline with NLP analysis and advanced clustering techniques to reveal the hidden structure of storytelling.

Main subtitle source: SubsLikeScript

(back to top)

Built With

(back to top)

Project Structure

Here is an overview of the core structure and module responsibilities:

```
CINEGRAPH/
├── .venv/                     # Virtual environment
├── src/
│   ├── backend/               # Main Backend application
│   │   ├── api/               # API endpoints and routing
│   │   ├── clustering/        # Graph creation and dataset processing algorithms
│   │   │   ├── graph_creator.py
│   │   │   └── utils.py
│   │   ├── data/              # Local data/storage dumps
│   │   ├── db/                # Database configurations, models, and migrations
│   │   │   ├── base.py
│   │   │   └── session.py
│   │   ├── emotion_analysis/  # NLP analysis models, embeddings, and weights
│   │   │   └── model.py
│   │   ├── experiments/       # Sandbox for testing scripts and models
│   │   ├── preprocessing/     # LangChain agents for data cleaning
│   │   │   └── preprocessing_agent.py
│   │   ├── scraping/          # Selenium pipeline for pulling raw subtitles
│   │   │   ├── scraper.py
│   │   │   └── utils.py
│   │   ├── services/          # Core business logic and external integrations
│   │   ├── .env               # Backend specific environment variables
│   │   ├── dockerfile         # Dockerfile for backend service
│   │   ├── main.py            # Application entry point (Pipeline + FastAPI)
│   │   ├── requirements.txt   # Backend-specific dependencies
│   │   └── settings.py        # App configuration settings
│   └── infra/                 # Infrastructure and Orchestration
│       ├── .env               # Infrastructure specific environment variables
│       └── docker-compose.yml # Docker Compose to spin up app and PostgresDB
├── .gitignore
├── README.md                  # Project documentation
└── requirements_full.txt      # Global project dependencies
```

(back to top)

Getting Started

To get a local copy up and running, follow these steps.

Prerequisites

Ensure you have the following installed on your machine:

Docker Desktop
Python 3.11+ (if running natively)

Installation

Clone the repo

```
git clone https://github.com/Data-Wrangling-and-Visualization-2026/CineGraph.git
cd CineGraph
```

Environment Variables: Create your .env files. You will need to populate both src/backend/.env and src/infra/.env with the following template:

```
# List of proxies for scraping
IP_1=
IP_2=
IP_N=
PROXY_PORT=3128

DB_URL="postgresql+asyncpg://{user}:{password}@{host}:{port}/movies"

API_PORT=5555
FRONT_PORT=5173
```

Spin up the infrastructure and application using Docker Compose:
```
cd src/infra
docker-compose up --build
```

(Optional) If you wish to run the app natively without Docker:

```
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install -r requirements_full.txt
```

For the frontend:

```
cd src/frontend
npm install
npm run dev
```
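For orientation, here is a minimal sketch of how the .env template from the Installation steps could be read from Python. It is not the project's actual settings.py: the helper names `parse_env`, `proxy_addresses`, and `build_db_url` are hypothetical, and the example IPs and credentials are placeholders. It only assumes that empty `IP_*` entries should be skipped and that the `{user}`, `{password}`, `{host}`, and `{port}` placeholders in `DB_URL` are filled in at runtime.

```python
import re


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    env: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip().strip('"')
    return env


def proxy_addresses(env: dict[str, str]) -> list[str]:
    """Collect non-empty IP_* entries as host:port strings (hypothetical helper)."""
    port = env.get("PROXY_PORT", "3128")
    ips = [v for k, v in sorted(env.items()) if re.fullmatch(r"IP_\w+", k) and v]
    return [f"{ip}:{port}" for ip in ips]


def build_db_url(env: dict[str, str], user: str, password: str, host: str, port: int) -> str:
    """Fill the {user}/{password}/{host}/{port} placeholders of DB_URL."""
    return env["DB_URL"].format(user=user, password=password, host=host, port=port)


# Placeholder values for illustration only.
EXAMPLE_ENV = """
# List of proxies for scraping
IP_1=10.0.0.1
IP_2=10.0.0.2
IP_N=
PROXY_PORT=3128

DB_URL="postgresql+asyncpg://{user}:{password}@{host}:{port}/movies"
"""

if __name__ == "__main__":
    env = parse_env(EXAMPLE_ENV)
    print(proxy_addresses(env))
    print(build_db_url(env, "cine", "secret", "localhost", 5432))
```

With no `IP_*` values set, `proxy_addresses` returns an empty list, which matches the note that the pipeline also works without proxy servers.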

(back to top)

Data pipeline

Scraping: The Selenium scraper (src/backend/scraping/scraper.py) uses the proxy list from your .env to scrape scripts from subslikescript.com. Spreading requests across proxies raises the effective request rate limit (while still being considerate of the source website). The pipeline also works without proxy servers.
Preprocessing: Raw subtitles are passed to the LangChain agent (src/backend/preprocessing/preprocessing_agent.py), which cleans the original text.
Emotion Analysis: The NLP model (src/backend/emotion_analysis/model.py) evaluates the emotional trajectory across subtitle windows. For each window, the model outputs an embedding of 6 emotion intensities.
Clustering & Graphing: The graph modules process the data and save it to PostgreSQL in a tree-based format.
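To make the emotion-analysis and clustering stages more concrete, here is a toy sketch of the idea: split subtitle lines into windows, score each window with 6 emotion intensities, and compare films by their averaged profiles. Everything in it is an assumption for illustration. The six emotion labels, the keyword lexicon standing in for the NLP model, the fixed-size non-overlapping windows, and the use of cosine similarity for "emotional closeness" are not taken from the repository.

```python
import math

# Assumed labels: the README only states that there are 6 main emotions.
EMOTIONS = ("anger", "disgust", "fear", "joy", "sadness", "surprise")

# Toy keyword lexicon standing in for the real NLP model (hypothetical).
TOY_LEXICON = {"happy": "joy", "love": "joy", "afraid": "fear",
               "hate": "anger", "cry": "sadness"}


def windows(lines: list[str], size: int) -> list[list[str]]:
    """Split subtitle lines into consecutive, non-overlapping windows."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def score_window(window: list[str]) -> list[float]:
    """Return 6 intensities that sum to 1, or all zeros if nothing matched."""
    counts = [0.0] * len(EMOTIONS)
    for line in window:
        for word in line.lower().split():
            emotion = TOY_LEXICON.get(word.strip(".,!?"))
            if emotion is not None:
                counts[EMOTIONS.index(emotion)] += 1.0
    total = sum(counts)
    return [c / total for c in counts] if total else counts


def emotion_arc(lines: list[str], size: int) -> list[list[float]]:
    """One 6-dimensional intensity vector per window, in time order."""
    return [score_window(w) for w in windows(lines, size)]


def arc_profile(arc: list[list[float]]) -> list[float]:
    """Average the arc over time into a single 6-dimensional profile."""
    if not arc:
        return [0.0] * len(EMOTIONS)
    return [sum(v[i] for v in arc) / len(arc) for i in range(len(EMOTIONS))]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two profiles; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


if __name__ == "__main__":
    film = ["I love you.", "So happy!", "I hate this.", "I am afraid."]
    arc = emotion_arc(film, size=2)
    print(arc)
    print(arc_profile(arc))
```

In this sketch, the per-window vectors are the curve a "seismograph" view would plot, and the pairwise similarities between profiles are one possible input for a graph of emotionally close films.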

(back to top)

Roadmap

Build Selenium scraping pipeline for SubsLikeScript
Integrate LangChain for data preprocessing
Implement Emotion Analysis with NLP model
Integrate PostgreSQL
Complete graph clustering algorithm
Design & Implement main API
Build interactive web frontend for the graph representation
Build interactive web frontend for the "Emotional Seismographs"

See the open issues for a full list of proposed features (and known issues).

(back to top)

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement".

Fork the Project
Create your Feature Branch (git checkout -b feat/AmazingFeature)
Commit your Changes (git commit -m 'Add some AmazingFeature')
Push to the Branch (git push origin feat/AmazingFeature)
Open a Pull Request

Additionally, following the Conventional Commits style guide is highly recommended.

(back to top)

License

Distributed under the MIT License. See LICENSE for more information.

(back to top)

Contact

Maybe, we will add it later...

(back to top)

Acknowledgments

SubsLikeScript - Main Subtitle Source
LangChain - LLM Application Framework

(back to top)
