Big Data 5th Semester Project — Token Count Analysis for Online Retail Data using PySpark
This project performs token count analysis on an online retail dataset using Apache Spark (PySpark).
The goal is to apply Big Data analytics and text processing (NLP) techniques to identify the most common words, products, and sales insights from large-scale retail data.
- Load and clean a large dataset using PySpark.
- Perform tokenization and stopword removal on product descriptions.
- Count word (token) frequencies to identify popular keywords.
- Visualize the results using charts and word clouds.
- Demonstrate Big Data workflows with Spark and AI-inspired analysis.
| Category | Tools / Libraries |
|---|---|
| Programming | Python |
| Big Data Framework | Apache Spark (PySpark) |
| Visualization | Matplotlib, Seaborn, WordCloud |
| Data Handling | Pandas, PySpark DataFrames |
| Environment | Jupyter Notebook / VS Code |
    BigData_Project_TokenCountAnalysis/
    │
    ├── data/
    │   ├── raw/                      # Original Excel dataset
    │   ├── processed/                # Converted CSV and cleaned data
    │   └── README.md
    │
    ├── notebooks/
    │   ├── 1_data_loading.ipynb      # Load raw data (Excel → CSV)
    │   ├── 2_data_cleaning.ipynb     # Data cleaning and preprocessing
    │   ├── 3_tokenization.ipynb      # Text tokenization and cleaning
    │   ├── 4_token_count.ipynb       # Count token frequencies
    │   ├── 5_visualization.ipynb     # Visual analysis and word cloud
    │   └── README.md
    │
    ├── results/
    │   ├── final_cleaned.csv         # Cleaned dataset
    │   ├── token_counts.csv          # Word frequency results
    │   └── README.md
    │
    ├── scripts/
    │   ├── token_count.py            # Script version of token count logic
    │   └── README.md
    │
    ├── requirements.txt
    └── README.md                     # ← This file
Notebook: 1_data_loading.ipynb
- Loads the raw Excel dataset (`OnlineRetail.xlsx`)
- Converts it into CSV format for Spark processing
Notebook: 2_data_cleaning.ipynb
- Removes duplicates, nulls, and cancelled invoices
- Outputs: `final_cleaned.csv`
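The three cleaning rules above can be illustrated with a small plain-Python sketch. This is not the notebook's PySpark code: the column names (`InvoiceNo`, `Description`, `Quantity`) and the rule that a cancelled invoice has an `InvoiceNo` starting with "C" are assumptions based on the UCI dataset description, and the sample rows are made up for illustration.

```python
# Plain-Python sketch of the cleaning rules; the notebook itself uses PySpark.
# Assumed column names: InvoiceNo, Description, Quantity (illustrative subset).

def clean_rows(rows):
    """Drop rows with missing values, cancelled invoices, and exact duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        # Treat None and empty or whitespace-only strings as nulls.
        if any(v is None or str(v).strip() == "" for v in row.values()):
            continue
        # Assumption: cancelled invoices have an InvoiceNo starting with "C".
        if str(row["InvoiceNo"]).upper().startswith("C"):
            continue
        # Use the full row content as the duplicate key.
        key = tuple(sorted(row.items()))
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned


# Made-up sample rows covering each rule.
sample = [
    {"InvoiceNo": "536365", "Description": "WHITE METAL LANTERN", "Quantity": 6},
    {"InvoiceNo": "536365", "Description": "WHITE METAL LANTERN", "Quantity": 6},  # duplicate
    {"InvoiceNo": "C536379", "Description": "DISCOUNT", "Quantity": -1},  # cancelled
    {"InvoiceNo": "536366", "Description": None, "Quantity": 2},  # null
    {"InvoiceNo": "536367", "Description": "HAND WARMER UNION JACK", "Quantity": 3},
]
cleaned = clean_rows(sample)
print(len(cleaned))
```

In PySpark the same steps would typically map to `dropDuplicates()`, `dropna()`, and a `filter` on the invoice column, though the exact calls used in the notebook are not shown here.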
Notebook: 3_tokenization.ipynb
- Converts product descriptions to lowercase
- Removes symbols and punctuation
- Splits text into tokens (words)
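As a rough illustration of these three steps, here is a minimal plain-Python sketch. The function name `tokenize` and the choice to keep digits are assumptions made for this example; the notebook applies the equivalent transformations with PySpark column functions.

```python
import re


def tokenize(description):
    """Lowercase the text, replace symbols and punctuation with spaces, split on whitespace."""
    text = description.lower()
    # Keep only letters, digits, and whitespace (assumption: digits are kept as tokens).
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return text.split()


tokens = tokenize("PAPER CHAIN KIT: Vintage-Christmas!")
print(tokens)  # ['paper', 'chain', 'kit', 'vintage', 'christmas']
```

Replacing punctuation with a space rather than deleting it keeps hyphenated words such as "Vintage-Christmas" from fusing into a single token.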
Notebook: 4_token_count.ipynb
- Removes stopwords
- Counts each token’s frequency
- Outputs: `token_counts.csv`
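The counting step can be sketched in plain Python as follows. The stopword set here is a tiny illustrative list, not the one used in the notebook, and the token lists are made-up examples; in Spark the same result is usually obtained by exploding the token column and grouping by token.

```python
from collections import Counter

# Assumption: a small illustrative stopword list; the notebook's actual list is not shown.
STOPWORDS = {"of", "the", "and", "with", "in", "for", "a"}


def count_tokens(token_lists, stopwords=STOPWORDS):
    """Count how often each non-stopword token appears across all descriptions."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(t for t in tokens if t not in stopwords)
    return counts


# Made-up tokenized descriptions.
token_lists = [
    ["set", "of", "3", "cake", "tins"],
    ["set", "of", "6", "spice", "tins"],
    ["paper", "chain", "kit"],
]
counts = count_tokens(token_lists)
print(counts.most_common(3))
```

Sorting the resulting counts in descending order gives the ranking that is written to `token_counts.csv`.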
Notebook: 5_visualization.ipynb
- Plots:
  - Top product descriptions
  - Orders by country
  - Total sales by country
  - Top tokens (keywords)
  - Word cloud visualization
- Prints insight summary
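The "orders by country" and "total sales by country" plots rest on two different aggregations, which is why the top country can differ between them. The sketch below shows one plausible way to compute both in plain Python. It assumes an order is a distinct `InvoiceNo` and that sales equal `Quantity * UnitPrice`; neither definition is confirmed by the notebook, and all rows are toy values rather than real results.

```python
from collections import defaultdict


def summarize_by_country(rows):
    """Return (orders per country, total sales per country)."""
    invoices = defaultdict(set)
    sales = defaultdict(float)
    for row in rows:
        # Assumption: one order corresponds to one distinct invoice number.
        invoices[row["Country"]].add(row["InvoiceNo"])
        # Assumption: line revenue is quantity times unit price.
        sales[row["Country"]] += row["Quantity"] * row["UnitPrice"]
    orders = {country: len(ids) for country, ids in invoices.items()}
    return orders, dict(sales)


# Toy rows with hypothetical values, chosen only to show the two rankings diverging.
rows = [
    {"Country": "United Kingdom", "InvoiceNo": "A001", "Quantity": 2, "UnitPrice": 2.5},
    {"Country": "United Kingdom", "InvoiceNo": "A001", "Quantity": 4, "UnitPrice": 1.5},
    {"Country": "United Kingdom", "InvoiceNo": "A002", "Quantity": 1, "UnitPrice": 1.0},
    {"Country": "Netherlands", "InvoiceNo": "B001", "Quantity": 10, "UnitPrice": 3.0},
]
orders, sales = summarize_by_country(rows)
top_by_orders = max(orders, key=orders.get)
top_by_sales = max(sales, key=sales.get)
print(top_by_orders, top_by_sales)
```

A country with many small orders can lead the order count while a country with fewer, larger orders leads on revenue.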
Top Product: Paper chain kit vintage christmas
Most Frequent Token: set (appears over 1500 times)
Top Country by Orders: United Kingdom
Top Country by Sales: Netherlands
- `final_cleaned.csv`: Clean dataset (ready for analysis)
- `token_counts.csv`: Word frequency results
- Visual outputs:
  - Bar charts for tokens, countries, and products
  - Word cloud for descriptive insights
- Clone or download the repository.
- Open the folder in VS Code or Jupyter Notebook.
- Run the notebooks in this order:
  - `1_data_loading.ipynb`
  - `2_data_cleaning.ipynb`
  - `3_tokenization.ipynb`
  - `4_token_count.ipynb`
  - `5_visualization.ipynb`
- Check the final results in the `results/` folder.
Install dependencies before running:
`pip install pyspark pandas matplotlib seaborn wordcloud`
📘 References
Dataset Source: UCI Machine Learning Repository - Online Retail Dataset
Apache Spark Documentation: https://spark.apache.org/docs/latest/
🏁 Project Summary
This project demonstrates how Big Data tools (PySpark) can efficiently clean, process, and analyze text-based retail data.
Combining Spark with Python visualization libraries provides a complete pipeline, from data loading through to keyword and sales insights.