CineGraph

A data-driven web application that processes raw subtitles from 40,000 movies to generate interactive "Emotional Seismographs"
View Demo · Report Bug · Request Feature

Table of Contents
  1. About The Project
  2. Getting Started
  3. Data pipeline
  4. Roadmap
  5. Contributing
  6. License
  7. Contact
  8. Acknowledgments

About The Project

We are building CineGraph, a data-driven web application that processes raw subtitles from 40,000 movies to generate interactive "Emotional Seismographs".

Users can explore emotionally similar movies and select a film to view its sentiment arc over time, which tracks 6 main emotions. The system combines a large-scale scraping pipeline with NLP analysis and clustering techniques to reveal the hidden structure of storytelling.
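
As a rough illustration of what "emotionally close" could mean, the sketch below averages a movie's per-window emotion vectors and compares two movies by cosine similarity. This is an assumption made for illustration only: the README does not state which distance or clustering method CineGraph uses, and the helper names `mean_vector` and `cosine_similarity` are hypothetical.

```python
import math


def mean_vector(arc):
    """Average the per-window emotion vectors of one movie (hypothetical helper)."""
    n = len(arc)
    return [sum(window[i] for window in arc) / n for i in range(len(arc[0]))]


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy arcs: each inner list holds 6 emotion intensities for one subtitle window.
arc_a = [[1, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0]]
arc_b = [[0.5, 0.5, 0, 0, 0, 0]]
arc_c = [[0, 0, 1, 0, 0, 0]]

sim_ab = cosine_similarity(mean_vector(arc_a), mean_vector(arc_b))
sim_ac = cosine_similarity(mean_vector(arc_a), mean_vector(arc_c))
```

Averaging discards the order of the windows, so a real comparison of arcs would probably also need to account for their shape over time.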

Main subtitle source: SubsLikeScript

(back to top)

Built With

  • Python
  • LangChain
  • PostgreSQL
  • Selenium
  • Docker

(back to top)

Project Structure

Here is an overview of the core structure and module responsibilities:

CINEGRAPH/
├── .venv/                     # Virtual environment
├── src/
│   ├── backend/               # Main Backend application
│   │   ├── api/               # API endpoints and routing
│   │   ├── clustering/        # Graph creation and dataset processing algorithms
│   │   │   ├── graph_creator.py
│   │   │   └── utils.py
│   │   ├── data/              # Local data/storage dumps
│   │   ├── db/                # Database configurations, models, and migrations
│   │   │   ├── base.py
│   │   │   └── session.py
│   │   ├── emotion_analysis/  # NLP analysis models, embeddings, and weights
│   │   │   └── model.py
│   │   ├── experiments/       # Sandbox for testing scripts and models
│   │   ├── preprocessing/     # LangChain agents for data cleaning
│   │   │   └── preprocessing_agent.py
│   │   ├── scraping/          # Selenium pipeline for pulling raw subtitles
│   │   │   ├── scraper.py
│   │   │   └── utils.py
│   │   ├── services/          # Core business logic and external integrations
│   │   ├── .env               # Backend specific environment variables
│   │   ├── dockerfile         # Dockerfile for backend service
│   │   ├── main.py            # Application entry point (Pipeline + FastAPI)
│   │   ├── requirements.txt   # Backend-specific dependencies
│   │   └── settings.py        # App configuration settings
│   └── infra/                 # Infrastructure and Orchestration
│       ├── .env               # Infrastructure specific environment variables
│       └── docker-compose.yml # Docker Compose file to spin up the app and PostgreSQL
├── .gitignore
├── README.md                  # Project documentation
└── requirements_full.txt      # Global project dependencies

(back to top)

Getting Started

To get a local copy up and running, follow these steps.

Prerequisites

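Ensure you have the following installed on your machine:

  • Git (to clone the repo in step 1)
  • Docker and Docker Compose (used in step 3)
  • Python 3 and Node.js with npm (only for the optional native setup in step 4)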

Installation

  1. Clone the repo

    git clone https://github.com/Data-Wrangling-and-Visualization-2026/CineGraph.git
    cd CineGraph
  2. Environment Variables: Create your .env files. You will need to populate both src/backend/.env and src/infra/.env with the following template:

    # List of proxies for scraping
    IP_1=
    IP_2=
    IP_N=
    PROXY_PORT=3128
    
    DB_URL="postgresql+asyncpg://{user}:{password}@{host}:{port}/movies"
    
    API_PORT=5555
    FRONT_PORT=5173
  3. Spin up the infrastructure and application using Docker Compose:

    cd src/infra
    docker-compose up --build
  4. (Optional) If you wish to run the app natively without Docker:

    python -m venv .venv
    source .venv/bin/activate  # On Windows: .venv\Scripts\activate
    pip install -r requirements_full.txt

    For the frontend:

    cd src/frontend
    npm install
    npm run dev
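
The .env template from step 2 could be consumed along the following lines. This is a minimal sketch, not the project's actual settings.py: the function names `load_proxies` and `build_db_url` are hypothetical, and the `http://` proxy scheme is an assumption (the template only gives IP addresses and a port).

```python
import os
import re


def load_proxies(env=None):
    """Collect IP_<number> entries and pair each with PROXY_PORT."""
    env = os.environ if env is None else env
    port = env.get("PROXY_PORT", "3128")
    # Only numbered keys count; the IP_N placeholder from the template is skipped.
    keys = [k for k in env if re.fullmatch(r"IP_\d+", k)]
    proxies = []
    for key in sorted(keys, key=lambda k: int(k.split("_")[1])):
        ip = env[key].strip()
        if ip:  # ignore entries left blank in the template
            proxies.append(f"http://{ip}:{port}")  # scheme is an assumption
    return proxies


def build_db_url(template, user, password, host, port):
    """Fill the {user}, {password}, {host} and {port} placeholders of DB_URL."""
    return template.format(user=user, password=password, host=host, port=port)


example_env = {
    "IP_1": "10.0.0.1",
    "IP_2": "",
    "IP_10": "10.0.0.10",
    "IP_N": "placeholder",
    "PROXY_PORT": "3128",
}
proxies = load_proxies(example_env)
db_url = build_db_url(
    "postgresql+asyncpg://{user}:{password}@{host}:{port}/movies",
    "user", "secret", "localhost", 5432,
)
```

An empty proxy list is a valid result here, which matches the note in the Data pipeline section that scraping also works without proxy servers.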

(back to top)

Data pipeline

  • Scraping: The Selenium scraper (src/backend/scraping/scraper.py) scrapes scripts from subslikescript.com. It uses the proxy list from your .env to spread requests across several IPs, which raises the overall request rate the pipeline can sustain under the site's rate limiting (while still treating the source website gently). The pipeline also works without proxy servers.
  • Preprocessing: Raw subtitles are passed to the LangChain agent (src/backend/preprocessing/preprocessing_agent.py), which cleans the original text.
  • Emotion Analysis: The NLP model (src/backend/emotion_analysis/model.py) evaluates the emotional trajectory across the subtitle windows. The model outputs an embedding with 6 emotion intensities.
  • Clustering & Graphing: The graph modules process the data and save it to PostgreSQL in a tree-based format.
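
The windowing and emotion-scoring steps above can be sketched as follows. This is a toy stand-in under stated assumptions, not the code in model.py: the six emotion labels, the keyword lexicon, the fixed window size, and the function names are all hypothetical, since the README only says that the model produces 6 emotion intensities.

```python
import re
from collections import Counter

# Hypothetical labels: the README only states that there are 6 main emotions.
EMOTIONS = ("anger", "disgust", "fear", "joy", "sadness", "surprise")

# Toy keyword lexicon standing in for the real NLP model.
LEXICON = {
    "hate": "anger", "gross": "disgust", "afraid": "fear",
    "love": "joy", "cry": "sadness", "wow": "surprise",
}


def split_into_windows(lines, window_size):
    """Group consecutive subtitle lines into fixed-size windows."""
    return [lines[i:i + window_size] for i in range(0, len(lines), window_size)]


def score_window(window):
    """Return 6 intensities, one per emotion, as shares of matched keywords."""
    words = re.findall(r"[a-z']+", " ".join(window).lower())
    hits = Counter(LEXICON[w] for w in words if w in LEXICON)
    total = sum(hits.values())
    return [hits[e] / total if total else 0.0 for e in EMOTIONS]


def emotional_arc(lines, window_size=2):
    """Score every window in order, giving one 6-value vector per window."""
    return [score_window(w) for w in split_into_windows(lines, window_size)]


subtitles = ["I love you.", "I love this!", "I am afraid.", "Do not cry."]
arc = emotional_arc(subtitles, window_size=2)
```

Plotting each of the six columns of `arc` against the window index gives one curve per emotion, which is the kind of per-film trace the "Emotional Seismograph" view is described as showing.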

(back to top)

Roadmap

  • Build Selenium scraping pipeline for SubsLikeScript
  • Integrate LangChain for data preprocessing
  • Implement Emotion Analysis with NLP model
  • Integrate PostgreSQL
  • Complete graph clustering algorithm
  • Design and implement the main API
  • Build interactive web frontend for the graph representation
  • Build interactive web frontend for the "Emotional Seismographs"

See the open issues for a full list of proposed features (and known issues).

(back to top)

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement".

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feat/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feat/AmazingFeature)
  5. Open a Pull Request

Additionally, following the Conventional Commits style guide is highly recommended.

(back to top)

License

Distributed under the MIT License. See LICENSE for more information.

(back to top)

Contact

Contact details may be added later.

(back to top)

Acknowledgments

(back to top)
