ADS_505_Project

Project Description

This project analyzes Amazon customer review data to identify what makes a review helpful. Using data science and machine learning, we build predictive models to surface the most helpful reviews and apply topic modeling to uncover the key themes and characteristics of helpful reviews across product categories and ratings.

Problem Statement

The primary problem addressed in this project is: "What characteristics make an Amazon product review helpful?" By leveraging large-scale review datasets, the project seeks to identify linguistic, structural, and contextual features that correlate with helpfulness votes. The goal is twofold: (1) develop predictive models to identify reviews most likely to be found helpful, and (2) use topic modeling to understand how helpful review characteristics differ by category and rating. This helps users write better reviews and enables companies to highlight valuable feedback and gain insights for product or system improvements.

Getting Started

Follow these steps to set up your environment and get started:

  1. Clone the Repository

    git clone https://github.com/tkbarb10/ADS_505_Project.git
    cd ADS_505_Project
  2. Install uv (Dependency Manager)

    • If you don't have uv installed, run:
      pip install uv
  3. Create and Activate a Virtual Environment

    • Use uv to create and manage a virtual environment:
      uv venv
    • Activate the environment:
      • On Windows:
        .venv\Scripts\activate
      • On Mac/Linux:
        source .venv/bin/activate
  4. Install Project Dependencies

    • With the virtual environment activated, install all required packages:
      uv sync
    • This will use pyproject.toml and uv.lock to create a reproducible environment.

Package Overview & Special Requirements

Some packages require additional setup or downloads beyond a standard uv sync. Please review the following notes:

1. huggingface_hub / datasets

  • The utils/loading_script.py script uses Hugging Face’s datasets library to download Amazon review data for each product category.
  • By default, datasets are cached in the user’s ~/.cache/huggingface/datasets directory. You can change this location by setting the HF_DATASETS_CACHE environment variable.
  • The first run may take a while as the full dataset for each category is downloaded and cached locally.
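Relocating the cache must happen before the `datasets` library is imported. A minimal sketch (the target path below is illustrative; point it at any writable directory):

```python
import os
from pathlib import Path

# Default cache location used by Hugging Face's `datasets` library
default_cache = Path.home() / ".cache" / "huggingface" / "datasets"

# To relocate the cache, set HF_DATASETS_CACHE *before* importing `datasets`
# (illustrative path; adjust to your setup)
os.environ["HF_DATASETS_CACHE"] = str(Path.home() / "hf_datasets_cache")
```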

2. nltk

  • Some functions require NLTK resources (e.g., stopwords) to be downloaded before use.
  • If you encounter errors about missing NLTK data, run the following in a Python shell:
     import nltk
     nltk.download('stopwords')
     nltk.download('punkt')
  • You may need to download additional corpora depending on your use (e.g., wordnet for lemmatization).
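A small guard like the following (a hypothetical helper, not part of this repo) avoids re-downloading NLTK resources that are already cached:

```python
# Map NLTK resource names to their lookup paths inside the data directory
RESOURCE_PATHS = {
    "stopwords": "corpora/stopwords",
    "punkt": "tokenizers/punkt",
    "wordnet": "corpora/wordnet",
}

def ensure_nltk_data(resources=("stopwords", "punkt")):
    """Download each NLTK resource only if it is not already present."""
    import nltk  # deferred so this module imports even without NLTK installed
    for name in resources:
        path = RESOURCE_PATHS.get(name, f"corpora/{name}")
        try:
            nltk.data.find(path)
        except LookupError:
            nltk.download(name, quiet=True)
```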

3. emoji

  • Used for demojizing text in reviews. No special setup required, but ensure the package is installed.

4. streamlit

  • If a Streamlit app is provided, launch it with:
     streamlit run <your_app_file.py>
  • No special setup is needed, but ensure your environment includes all Streamlit dependencies.

5. scikit-learn, xgboost, seaborn, matplotlib, pandas, numpy

  • Standard data science stack. No special setup required beyond installation.

Project Structure Overview

This repository is organized to support efficient data analysis, modeling, and reproducibility. Below is a summary of the main components:

  • main_notebook.ipynb
    The primary Jupyter notebook containing the main analysis, modeling experiments, and results for the project.

  • utils/
    Utility scripts for data loading, preprocessing, HTML removal, topic modeling, and plotting.

    • loading_script.py: Functions for loading and processing review data.
    • remove_html.py: Removes HTML tags from text fields.
    • plot_topics.py: Visualization utilities for topic modeling results.
    • lda_script.py, __init__.py: Additional helpers for topic modeling and package structure.
  • topic_modeling.py
    Core functions for lemmatization and topic modeling workflows.

  • lemma_data/
    Preprocessed review data with lemmatized and HTML-stripped text, optimized for topic modeling.

  • practice_df.csv, trials.log
    Example data and logs from model training or hyperparameter optimization.

  • pyproject.toml, requirements.txt, uv.lock, poetry.lock
    Dependency and environment management files.

  • README.md, LICENSE, notes.txt
    Project documentation, license information, and development notes.
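As an illustration of what an HTML-stripping helper like utils/remove_html.py might do (the repo's actual implementation may differ; this sketch uses only the standard library):

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collect only the text content, discarding all HTML tags."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_html(text: str) -> str:
    """Return `text` with HTML tags removed."""
    parser = _TextExtractor()
    parser.feed(text)
    return "".join(parser.parts)

strip_html("<p>Great <b>product</b>!</p>")  # → "Great product!"
```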

Data Dictionary

| Column Name | Type | Description |
| --- | --- | --- |
| rating | int | Star rating given by the reviewer (1–5). |
| title_x | string | Title of the review (raw/original, may be empty). |
| text | string | Full review text (raw/original, may be empty). |
| images_x | list | List of image URLs included in the review. |
| asin | string | Amazon Standard Identification Number for the product. |
| parent_asin | string | Parent ASIN for product grouping. |
| user_id | string | Unique identifier for the reviewer. |
| timestamp | int | Unix timestamp (milliseconds) of the review. |
| helpful_vote | int | Number of users who found the review helpful (target variable). |
| verified_purchase | bool | Whether the purchase was verified (True/False). |
| main_category | string | Main product category. |
| title_y | string | Product/item title from metadata. |
| average_rating | float | Average rating for the product. |
| rating_number | int | Number of ratings for the product. |
| features | list | List of product features. |
| description | string | Product description. |
| price | float | Price of the product at the time of review. |
| images_y | list | List of product image URLs from metadata. |
| videos | list | List of product video URLs from metadata. |
| store | string | Store or seller name. |
| categories | list | List of product categories. |
| details | dict | Additional product details (may be nested). |
| bought_together | list | List of products frequently bought together. |
| subtitle | string | Product subtitle. |
| author | string | Author or creator (if applicable). |

Note: Many of these columns are dropped or renamed during cleaning and feature engineering.
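For example, the millisecond timestamp column can be converted to a datetime like so (standard-library sketch; in the notebook this would typically be done with pandas instead):

```python
from datetime import datetime, timezone

# `timestamp` values are Unix epoch milliseconds, so divide by 1000 first
ts_ms = 1577836800000  # example value
dt = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
print(dt.isoformat())  # → 2020-01-01T00:00:00+00:00
```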

Review Data Directory Overview

The root directory contains pre-downloaded Amazon review data, prepared primarily to support efficient topic modeling workflows. Two main versions of the dataset are provided:

  • Raw Review Data:
    This is a cleaned version of the original Amazon review dataset, with non-essential columns (such as images, various IDs, and the "verified purchase" flag) removed. It is ready for direct use in topic modeling and can also be repurposed for predictive modeling tasks, as the title and text columns remain intact.

  • lemma_data Folder:
    This folder contains the same review data, but with the title and review text columns preprocessed—lemmatized and stripped of HTML tags. This preprocessed version is optimized for topic modeling, allowing for rapid experimentation without repeated text cleaning. The lemmatized text can also be used for downstream predictive modeling or other NLP tasks.

Key Points:

  • Both raw and preprocessed (lemmatized) versions are available for flexibility in modeling.
  • The main motivation for pre-downloading and preprocessing was to accelerate topic modeling experiments.
  • The title and text columns in both versions can be leveraged for predictive modeling as well.

Main Notebook Overview

The primary analysis for this project is contained in main_notebook.ipynb. This notebook walks through the full data science workflow, including:

  • Data loading and preprocessing (using scripts in the utils/ directory)
  • Exploratory data analysis and visualization
  • Feature engineering and text cleaning (including lemmatization and HTML removal)
  • Predictive modeling (classification of review helpfulness)
  • Topic modeling to uncover key themes in helpful reviews
  • Evaluation of model performance and interpretation of results

The notebook is designed to be reproducible and modular, with clear section headings and code comments. It leverages the utility scripts for efficient data handling and preprocessing, and provides both code and narrative to guide users through the analysis steps.

Authors

This project was developed by:
