This project analyzes Amazon customer review data to identify what makes a review helpful. Using data science and machine learning, we build predictive models to surface the most helpful reviews and apply topic modeling to uncover the key themes and characteristics of helpful reviews across product categories and ratings.
The primary problem addressed in this project is: "What characteristics make an Amazon product review helpful?" By leveraging large-scale review datasets, the project seeks to identify linguistic, structural, and contextual features that correlate with helpfulness votes. The goal is twofold: (1) develop predictive models to identify reviews most likely to be found helpful, and (2) use topic modeling to understand how helpful review characteristics differ by category and rating. This helps users write better reviews and enables companies to highlight valuable feedback and gain insights for product or system improvements.
Follow these steps to set up your environment and get started:

1. **Clone the Repository**

   ```bash
   git clone https://github.com/tkbarb10/ADS_505_Project.git
   cd ADS_505_Project
   ```

2. **Install uv (Dependency Manager)**

   If you don't have `uv` installed, run:

   ```bash
   pip install uv
   ```

3. **Create and Activate a Virtual Environment**

   Use `uv` to create and manage a virtual environment:

   ```bash
   uv venv
   ```

   Then activate the environment:

   - On Windows: `.venv\Scripts\activate`
   - On Mac/Linux: `source .venv/bin/activate`

4. **Install Project Dependencies**

   With the virtual environment activated, install all required packages:

   ```bash
   uv sync
   ```

   This uses `pyproject.toml` and `uv.lock` to create a reproducible environment.
Some packages require additional setup or downloads beyond a standard `uv sync`. Please review the following notes:

**Hugging Face `datasets`**

- `utils/loading_script.py` uses Hugging Face's `datasets` library to download Amazon review data for each category.
- By default, datasets are cached in the `~/.cache/huggingface/datasets` directory. You can change this location by setting the `HF_DATASETS_CACHE` environment variable.
- The first run may take a while as the full dataset for each category is downloaded and cached locally.
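The cache location can also be redirected from Python before the library is imported. The sketch below is illustrative only: the dataset repository name and the `load_category` helper are assumptions, not taken from `loading_script.py`.

```python
import os

# Point the datasets cache at a location with more disk space (example path).
# This must be set before `datasets` is imported for it to take effect.
os.environ["HF_DATASETS_CACHE"] = os.path.expanduser("~/hf_datasets_cache")

def load_category(config_name):
    """Download and cache one review category (network call on first run).

    The dataset repo below is an assumption for illustration; check
    loading_script.py for the names the project actually uses.
    """
    from datasets import load_dataset  # imported lazily, after the cache is set
    return load_dataset("McAuley-Lab/Amazon-Reviews-2023", config_name, split="full")
```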
**NLTK**

- Some functions require NLTK resources (e.g., stopwords) to be downloaded before use.
- If you encounter errors about missing NLTK data, run the following in a Python shell:

  ```python
  import nltk
  nltk.download('stopwords')
  nltk.download('punkt')
  ```

- You may need to download additional corpora depending on your use case (e.g., `wordnet` for lemmatization).
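Once those resources are downloaded, stopword removal looks roughly like this. The function name is hypothetical, for illustration only, and is not one of the repo's utilities.

```python
def remove_stopwords(text):
    """Tokenize a review and drop English stopwords.

    Hypothetical helper for illustration; assumes the 'stopwords' and
    'punkt' resources above have already been downloaded.
    """
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize
    stops = set(stopwords.words("english"))
    return " ".join(w for w in word_tokenize(text) if w.lower() not in stops)
```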
**emoji**

- Used for demojizing text in reviews. No special setup is required, but ensure the package is installed.
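Demojizing converts emoji characters into readable text codes that survive tokenization. A minimal sketch using the `emoji` package's `demojize` function (the wrapper name below is ours, not the project's):

```python
def demojize_text(text):
    """Replace emoji with text codes, e.g. a smiley face becomes
    ':slightly_smiling_face:'. Thin illustrative wrapper around emoji.demojize.
    """
    import emoji
    return emoji.demojize(text)
```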
**Streamlit**

- If you want to use the Streamlit app (if provided), you'll need to run:

  ```bash
  streamlit run <your_app_file.py>
  ```

- No special setup is needed, but ensure your environment includes all Streamlit dependencies.
**Other packages**

- Standard data science stack. No special setup is required beyond installation.
This repository is organized to support efficient data analysis, modeling, and reproducibility. Below is a summary of the main components:
- **`main_notebook.ipynb`**
  The primary Jupyter notebook containing the main analysis, modeling experiments, and results for the project.
- **`utils/`**
  Utility scripts for data loading, preprocessing, HTML removal, topic modeling, and plotting:
  - `loading_script.py`: Functions for loading and processing review data.
  - `remove_html.py`: Removes HTML tags from text fields.
  - `plot_topics.py`: Visualization utilities for topic modeling results.
  - `lda_script.py`, `__init__.py`: Additional helpers for topic modeling and package structure.
- **`topic_modeling.py`**
  Core functions for lemmatization and topic modeling workflows.
- **`lemma_data/`**
  Preprocessed review data with lemmatized and HTML-stripped text, optimized for topic modeling.
- **`practice_df.csv`**, **`trials.log`**
  Example data and logs from model training or hyperparameter optimization.
- **`pyproject.toml`**, **`requirements.txt`**, **`uv.lock`**, **`poetry.lock`**
  Dependency and environment management files.
- **`README.md`**, **`LICENSE`**, **`notes.txt`**
  Project documentation, license information, and development notes.
| Column Name | Type | Description |
|---|---|---|
| rating | int | Star rating given by the reviewer (1–5). |
| title_x | string | Title of the review (raw/original, may be empty). |
| text | string | Full review text (raw/original, may be empty). |
| images_x | list | List of image URLs included in the review. |
| asin | string | Amazon Standard Identification Number for the product. |
| parent_asin | string | Parent ASIN for product grouping. |
| user_id | string | Unique identifier for the reviewer. |
| timestamp | int | Unix timestamp (milliseconds) of the review. |
| helpful_vote | int | Number of users who found the review helpful (target variable). |
| verified_purchase | bool | Whether the purchase was verified (True/False). |
| main_category | string | Main product category. |
| title_y | string | Product/item title from metadata. |
| average_rating | float | Average rating for the product. |
| rating_number | int | Number of ratings for the product. |
| features | list | List of product features. |
| description | string | Product description. |
| price | float | Price of the product at the time of review. |
| images_y | list | List of product image URLs from metadata. |
| videos | list | List of product video URLs from metadata. |
| store | string | Store or seller name. |
| categories | list | List of product categories. |
| details | dict | Additional product details (may be nested). |
| bought_together | list | List of products frequently bought together. |
| subtitle | string | Product subtitle. |
| author | string | Author or creator (if applicable). |
*Note: Many of these columns are dropped or renamed during cleaning and feature engineering.*
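Two schema quirks from the table above come up immediately during preprocessing: `timestamp` is in milliseconds, and `helpful_vote` is a raw count that is typically binarized for classification. A small pandas sketch with toy values (the column names match the schema; the derived columns are illustrative):

```python
import pandas as pd

# Toy rows matching a few of the schema columns above (values are made up).
df = pd.DataFrame({
    "rating": [5, 2],
    "timestamp": [1600000000000, 1650000000000],  # Unix time in milliseconds
    "helpful_vote": [12, 0],
})

# Convert the millisecond Unix timestamp into a proper datetime.
df["review_date"] = pd.to_datetime(df["timestamp"], unit="ms")

# One possible classification target: did the review get any helpful votes?
df["is_helpful"] = (df["helpful_vote"] > 0).astype(int)
```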
The root directory contains pre-downloaded Amazon review data, prepared primarily to support efficient topic modeling workflows. Two main versions of the dataset are provided:
- **Raw Review Data**
  A cleaned version of the original Amazon review dataset, with non-essential columns (such as images, various IDs, and the "verified purchase" flag) removed. It is ready for direct use in topic modeling and can also be repurposed for predictive modeling tasks, as the title and text columns remain intact.
- **`lemma_data` Folder**
  The same review data, but with the title and review text columns preprocessed: lemmatized and stripped of HTML tags. This version is optimized for topic modeling, allowing rapid experimentation without repeated text cleaning. The lemmatized text can also be used for downstream predictive modeling or other NLP tasks.
Key Points:
- Both raw and preprocessed (lemmatized) versions are available for flexibility in modeling.
- The main motivation for pre-downloading and preprocessing was to accelerate topic modeling experiments.
- The title and text columns in both versions can be leveraged for predictive modeling as well.
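The project ships its own topic-modeling helpers (`lda_script.py`, `topic_modeling.py`); as a generic illustration of the LDA workflow they support, here is a minimal scikit-learn sketch on toy documents (the project's actual pipeline and parameters may differ):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy review snippets standing in for the preprocessed review text.
docs = [
    "battery life is great and charges fast",
    "battery died after a week very disappointed",
    "sound quality is great love the bass",
    "poor sound quality and weak bass",
]

# Bag-of-words counts, then a 2-topic LDA fit.
vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)  # one topic distribution per document
```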
The primary analysis for this project is contained in `main_notebook.ipynb`. This notebook walks through the full data science workflow, including:
- Data loading and preprocessing (using scripts in the `utils/` directory)
- Exploratory data analysis and visualization
- Feature engineering and text cleaning (including lemmatization and HTML removal)
- Predictive modeling (classification of review helpfulness)
- Topic modeling to uncover key themes in helpful reviews
- Evaluation of model performance and interpretation of results
The notebook is designed to be reproducible and modular, with clear section headings and code comments. It leverages the utility scripts for efficient data handling and preprocessing, and provides both code and narrative to guide users through the analysis steps.
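The helpfulness-classification step above can be sketched as a simple TF-IDF plus logistic-regression pipeline. The data, labels, and model choice below are toy illustrations; the notebook's actual features and models may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy reviews with made-up labels (1 = received helpful votes).
texts = [
    "Detailed review with pros and cons after a month of daily use",
    "ok",
    "Thorough comparison against two similar products, with photos",
    "bad",
]
labels = [1, 0, 1, 0]

# Vectorize the text and fit a classifier in one pipeline.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)
preds = model.predict(["In-depth review covering pros and cons"])
```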
This project was developed by:
- Taylor Kirk (tkirk@sandiego.edu)
- Sushama Kafle (skafle@sandiego.edu)
- Luigi Salemi (lsalemi@sandiego.edu)