This project analyzes Amazon customer review data to identify what makes a review helpful. Using data science and machine learning, we build predictive models to surface the most helpful reviews and apply topic modeling to uncover the key themes and characteristics of helpful reviews across product categories and ratings.
The primary problem addressed in this project is: "What characteristics make an Amazon product review helpful?" By leveraging large-scale review datasets, the project seeks to identify linguistic, structural, and contextual features that correlate with helpfulness votes. The goal is twofold: (1) develop predictive models to identify reviews most likely to be found helpful, and (2) use topic modeling to understand how helpful review characteristics differ by category and rating. This helps users write better reviews and enables companies to highlight valuable feedback and gain insights for product or system improvements.
Follow these steps to set up your environment and get started:

1. **Clone the Repository**

   ```bash
   git clone https://github.com/tkbarb10/ADS_505_Project.git
   cd ADS_505_Project
   ```

2. **Install uv (Dependency Manager)**

   If you don't have `uv` installed, run:

   ```bash
   pip install uv
   ```

3. **Create and Activate a Virtual Environment**

   Use `uv` to create and manage a virtual environment:

   ```bash
   uv venv
   ```

   Then activate the environment:

   - On Windows: `.venv\Scripts\activate`
   - On Mac/Linux: `source .venv/bin/activate`

4. **Install Project Dependencies**

   With the virtual environment activated, install all required packages:

   ```bash
   uv sync
   ```

   This uses `pyproject.toml` and `uv.lock` to create a reproducible environment.
Some packages require additional setup or downloads beyond a standard `uv sync`. Please review the following notes:

**Hugging Face `datasets`**

- `utils/loading_script.py` uses Hugging Face's `datasets` library to download Amazon review data for each category.
- By default, datasets are cached in the `~/.cache/huggingface/datasets` directory. You can change this location by setting the `HF_DATASETS_CACHE` environment variable.
- The first run may take a while as the full dataset for each category is downloaded and cached locally.
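The cache location can also be redirected from Python before the library is imported. The sketch below is illustrative only: the dataset repository name and the `load_category` helper are assumptions, not taken from `loading_script.py`.

```python
import os

# Point the datasets cache at a location with more disk space (example path).
# This must be set before `datasets` is imported for it to take effect.
os.environ["HF_DATASETS_CACHE"] = os.path.expanduser("~/hf_datasets_cache")

def load_category(config_name):
    """Download and cache one review category (network call on first run).

    The dataset repo below is an assumption for illustration; check
    loading_script.py for the names the project actually uses.
    """
    from datasets import load_dataset  # imported lazily, after the cache is set
    return load_dataset("McAuley-Lab/Amazon-Reviews-2023", config_name, split="full")
```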
**NLTK**

- Some functions require NLTK resources (e.g., stopwords) to be downloaded before use.
- If you encounter errors about missing NLTK data, run the following in a Python shell:

  ```python
  import nltk
  nltk.download('stopwords')
  nltk.download('punkt')
  ```

- You may need to download additional corpora depending on your use case (e.g., `wordnet` for lemmatization).
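Once those resources are downloaded, stopword removal looks roughly like this. The function name is hypothetical, for illustration only, and is not one of the repo's utilities.

```python
def remove_stopwords(text):
    """Tokenize a review and drop English stopwords.

    Hypothetical helper for illustration; assumes the 'stopwords' and
    'punkt' resources above have already been downloaded.
    """
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize
    stops = set(stopwords.words("english"))
    return " ".join(w for w in word_tokenize(text) if w.lower() not in stops)
```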
**emoji**

- Used for demojizing text in reviews. No special setup is required, but ensure the package is installed.
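Demojizing converts emoji characters into readable text codes that survive tokenization. A minimal sketch using the `emoji` package's `demojize` function (the wrapper name below is ours, not the project's):

```python
def demojize_text(text):
    """Replace emoji with text codes, e.g. a smiley face becomes
    ':slightly_smiling_face:'. Thin illustrative wrapper around emoji.demojize.
    """
    import emoji
    return emoji.demojize(text)
```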
**Streamlit**

- If you want to use the Streamlit app (if provided), you'll need to run:

  ```bash
  streamlit run <your_app_file.py>
  ```

- No special setup is needed, but ensure your environment includes all Streamlit dependencies.
**Other packages**

- Standard data science stack. No special setup is required beyond installation.
This repository is organized to support efficient data analysis, modeling, and reproducibility. Below is a summary of the main components:
- **`main_notebook.ipynb`**
  The primary Jupyter notebook containing the main analysis, modeling experiments, and results for the project.
- **`utils/`**
  Utility scripts for data loading, preprocessing, HTML removal, topic modeling, and plotting:
  - `loading_script.py`: Functions for loading and processing review data.
  - `remove_html.py`: Removes HTML tags from text fields.
  - `plot_topics.py`: Visualization utilities for topic modeling results.
  - `lda_script.py`, `__init__.py`: Additional helpers for topic modeling and package structure.
- **`topic_modeling.py`**
  Core functions for lemmatization and topic modeling workflows.
- **`lemma_data/`**
  Preprocessed review data with lemmatized and HTML-stripped text, optimized for topic modeling.
- **`practice_df.csv`**, **`trials.log`**
  Example data and logs from model training or hyperparameter optimization.
- **`pyproject.toml`**, **`requirements.txt`**, **`uv.lock`**, **`poetry.lock`**
  Dependency and environment management files.
- **`README.md`**, **`LICENSE`**, **`notes.txt`**
  Project documentation, license information, and development notes.
| Column Name | Type | Description |
|---|---|---|
| rating | int | Star rating given by the reviewer (1–5). |
| title_x | string | Title of the review (raw/original, may be empty). |
| text | string | Full review text (raw/original, may be empty). |
| images_x | list | List of image URLs included in the review. |
| asin | string | Amazon Standard Identification Number for the product. |
| parent_asin | string | Parent ASIN for product grouping. |
| user_id | string | Unique identifier for the reviewer. |
| timestamp | int | Unix timestamp (milliseconds) of the review. |
| helpful_vote | int | Number of users who found the review helpful (target variable). |
| verified_purchase | bool | Whether the purchase was verified (True/False). |
| main_category | string | Main product category. |
| title_y | string | Product/item title from metadata. |
| average_rating | float | Average rating for the product. |
| rating_number | int | Number of ratings for the product. |
| features | list | List of product features. |
| description | string | Product description. |
| price | float | Price of the product at the time of review. |
| images_y | list | List of product image URLs from metadata. |
| videos | list | List of product video URLs from metadata. |
| store | string | Store or seller name. |
| categories | list | List of product categories. |
| details | dict | Additional product details (may be nested). |
| bought_together | list | List of products frequently bought together. |
| subtitle | string | Product subtitle. |
| author | string | Author or creator (if applicable). |
*Note: Many of these columns are dropped or renamed during cleaning and feature engineering.*
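Two schema quirks from the table above come up immediately during preprocessing: `timestamp` is in milliseconds, and `helpful_vote` is a raw count that is typically binarized for classification. A small pandas sketch with toy values (the column names match the schema; the derived columns are illustrative):

```python
import pandas as pd

# Toy rows matching a few of the schema columns above (values are made up).
df = pd.DataFrame({
    "rating": [5, 2],
    "timestamp": [1600000000000, 1650000000000],  # Unix time in milliseconds
    "helpful_vote": [12, 0],
})

# Convert the millisecond Unix timestamp into a proper datetime.
df["review_date"] = pd.to_datetime(df["timestamp"], unit="ms")

# One possible classification target: did the review get any helpful votes?
df["is_helpful"] = (df["helpful_vote"] > 0).astype(int)
```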
The root directory contains pre-downloaded Amazon review data, prepared primarily to support efficient topic modeling workflows. Two main versions of the dataset are provided:
- **Raw Review Data**
  A cleaned version of the original Amazon review dataset, with non-essential columns (such as images, various IDs, and the "verified purchase" flag) removed. It is ready for direct use in topic modeling and can also be repurposed for predictive modeling tasks, as the title and text columns remain intact.
- **`lemma_data` Folder**
  The same review data, but with the title and review text columns preprocessed: lemmatized and stripped of HTML tags. This version is optimized for topic modeling, allowing rapid experimentation without repeated text cleaning. The lemmatized text can also be used for downstream predictive modeling or other NLP tasks.
Key Points:
- Both raw and preprocessed (lemmatized) versions are available for flexibility in modeling.
- The main motivation for pre-downloading and preprocessing was to accelerate topic modeling experiments.
- The title and text columns in both versions can be leveraged for predictive modeling as well.
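The project ships its own topic-modeling helpers (`lda_script.py`, `topic_modeling.py`); as a generic illustration of the LDA workflow they support, here is a minimal scikit-learn sketch on toy documents (the project's actual pipeline and parameters may differ):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy review snippets standing in for the preprocessed review text.
docs = [
    "battery life is great and charges fast",
    "battery died after a week very disappointed",
    "sound quality is great love the bass",
    "poor sound quality and weak bass",
]

# Bag-of-words counts, then a 2-topic LDA fit.
vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)  # one topic distribution per document
```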
The primary analysis for this project is contained in `main_notebook.ipynb`. This notebook walks through the full data science workflow, including:
- Data loading and preprocessing (using scripts in the `utils/` directory)
- Exploratory data analysis and visualization
- Feature engineering and text cleaning (including lemmatization and HTML removal)
- Predictive modeling (classification of review helpfulness)
- Topic modeling to uncover key themes in helpful reviews
- Evaluation of model performance and interpretation of results
The notebook is designed to be reproducible and modular, with clear section headings and code comments. It leverages the utility scripts for efficient data handling and preprocessing, and provides both code and narrative to guide users through the analysis steps.
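The helpfulness-classification step above can be sketched as a simple TF-IDF plus logistic-regression pipeline. The data, labels, and model choice below are toy illustrations; the notebook's actual features and models may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy reviews with made-up labels (1 = received helpful votes).
texts = [
    "Detailed review with pros and cons after a month of daily use",
    "ok",
    "Thorough comparison against two similar products, with photos",
    "bad",
]
labels = [1, 0, 1, 0]

# Vectorize the text and fit a classifier in one pipeline.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)
preds = model.predict(["In-depth review covering pros and cons"])
```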
This project was developed by:
- Taylor Kirk (tkirk@sandiego.edu)
- Sushama Kafle (skafle@sandiego.edu)
- Luigi Salemi (lsalemi@sandiego.edu)