Jaykrushna369/BigData_Project_TokenCountAnalysis

BigData_Project_TokenCountAnalysis

Big Data 5th Semester Project — Token Count Analysis for Online Retail Data using PySpark

🧠 Big Data Project — Token Count Analysis on Online Retail Data

📌 Project Overview

This project performs token count analysis on an online retail dataset using Apache Spark (PySpark).
The goal is to apply Big Data analytics and text processing (NLP) techniques to find the most common words and products, and to surface sales insights, in large-scale retail data.


🎯 Objectives

  • Load and clean a large dataset using PySpark.
  • Perform tokenization and stopword removal on product descriptions.
  • Count word (token) frequencies to identify popular keywords.
  • Visualize the results using charts and word clouds.
  • Demonstrate Big Data workflows with Spark and AI-inspired analysis.

⚙️ Tech Stack

| Category | Tools / Libraries |
| --- | --- |
| Programming | Python |
| Big Data Framework | Apache Spark (PySpark) |
| Visualization | Matplotlib, Seaborn, WordCloud |
| Data Handling | Pandas, PySpark DataFrames |
| Environment | Jupyter Notebook / VS Code |

📂 Folder Structure

BigData_Project_TokenCountAnalysis/
│
├── data/
│   ├── raw/                       # Original Excel dataset
│   ├── processed/                 # Converted CSV and cleaned data
│   └── README.md
│
├── notebooks/
│   ├── 1_data_loading.ipynb       # Load raw data (Excel → CSV)
│   ├── 2_data_cleaning.ipynb      # Data cleaning and preprocessing
│   ├── 3_tokenization.ipynb       # Text tokenization and cleaning
│   ├── 4_token_count.ipynb        # Count token frequencies
│   ├── 5_visualization.ipynb      # Visual analysis and word cloud
│   └── README.md
│
├── results/
│   ├── final_cleaned.csv          # Cleaned dataset
│   ├── token_counts.csv           # Word frequency results
│   └── README.md
│
├── scripts/
│   ├── token_count.py             # Script version of token count logic
│   └── README.md
│
├── requirements.txt
└── README.md                      # ← This file


🧩 Step-by-Step Process

1️⃣ Data Loading

Notebook: 1_data_loading.ipynb

  • Loads raw Excel dataset (OnlineRetail.xlsx)
  • Converts it into CSV format for Spark processing
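
The conversion above could look roughly like the following sketch. It is not taken from the notebook: the helper name `excel_to_csv` and the output path in the usage comment are hypothetical, and it assumes pandas is used for the Excel read. Note that `pandas.read_excel` needs an Excel engine such as `openpyxl` for `.xlsx` files, which may have to be installed separately.

```python
from pathlib import Path

import pandas as pd


def excel_to_csv(src, dst):
    """Read an Excel workbook with pandas and write it out as CSV.

    Hypothetical helper for illustration; returns the number of rows written.
    """
    df = pd.read_excel(src)  # reads the first sheet by default
    Path(dst).parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(dst, index=False)  # drop the pandas index so Spark sees only data columns
    return len(df)


# Example call (paths are assumptions based on the folder structure):
# excel_to_csv("data/raw/OnlineRetail.xlsx", "data/processed/OnlineRetail.csv")
```

Writing a CSV once up front means Spark can read the data with its built-in CSV reader in the later notebooks.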

2️⃣ Data Cleaning

Notebook: 2_data_cleaning.ipynb

  • Removes duplicates, nulls, and cancelled invoices
  • Outputs: final_cleaned.csv

3️⃣ Tokenization

Notebook: 3_tokenization.ipynb

  • Converts product descriptions to lowercase
  • Removes symbols and punctuation
  • Splits text into tokens (words)
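
The three rules above can be tried on individual descriptions in plain Python before applying them at scale. The sketch below is illustrative only: the function name `tokenize` and the exact character class (letters and digits are kept, everything else becomes a space) are assumptions, not the notebook's code. In PySpark the same steps map onto `lower`, `regexp_replace`, and `split` from `pyspark.sql.functions`.

```python
import re

# Anything that is not a lowercase letter, digit, or whitespace counts as a symbol.
_NON_ALNUM = re.compile(r"[^a-z0-9\s]")


def tokenize(description):
    """Lowercase a description, replace symbols with spaces, split on whitespace."""
    if description is None:  # missing descriptions yield no tokens
        return []
    text = _NON_ALNUM.sub(" ", description.lower())
    return text.split()  # split() with no argument collapses runs of whitespace


print(tokenize("SET/6 Red-Spotty  Paper Cups!"))
```

Replacing symbols with a space, rather than deleting them, keeps words such as "red-spotty" from being fused into a single token.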

4️⃣ Token Count

Notebook: 4_token_count.ipynb

  • Removes stopwords
  • Counts each token’s frequency
  • Outputs: token_counts.csv

5️⃣ Visualization

Notebook: 5_visualization.ipynb

  • Plots:
    • Top product descriptions
    • Orders by country
    • Total sales by country
    • Top tokens (keywords)
    • Word cloud visualization
  • Prints insight summary
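
As an illustration of the "top tokens" chart, here is a small Matplotlib sketch. The function name `plot_top_tokens`, its parameters, and the input shape (a list of `(token, count)` pairs) are assumptions for this example; the notebook may build its charts differently, for instance with Seaborn.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch also runs without a display
import matplotlib.pyplot as plt


def plot_top_tokens(token_counts, n=10, out_path=None):
    """Draw a horizontal bar chart of the n most frequent tokens."""
    top = sorted(token_counts, key=lambda tc: tc[1], reverse=True)[:n]
    tokens = [t for t, _ in top]
    counts = [c for _, c in top]

    fig, ax = plt.subplots(figsize=(8, 5))
    # Reverse so the most frequent token is drawn at the top of the chart.
    ax.barh(tokens[::-1], counts[::-1])
    ax.set_xlabel("Frequency")
    ax.set_title(f"Top {len(top)} tokens")
    fig.tight_layout()
    if out_path is not None:
        fig.savefig(out_path)
    return ax


# Example call (column names in token_counts.csv are assumptions):
# import pandas as pd
# df = pd.read_csv("results/token_counts.csv")
# plot_top_tokens(list(zip(df["token"], df["count"])), n=20, out_path="top_tokens.png")
```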

📊 Example Insights

Top Product: Paper chain kit vintage christmas
Most Frequent Token: set (appears over 1500 times)
Top Country by Orders: United Kingdom
Top Country by Sales: Netherlands


🧠 Results

  • final_cleaned.csv — Clean dataset (ready for analysis)
  • token_counts.csv — Word frequency results
  • Visual outputs:
    • Bar charts for tokens, countries, and products
    • Word cloud for descriptive insights

🚀 How to Run

  1. Clone or download the repository.
  2. Open the folder in VS Code or Jupyter Notebook.
  3. Run the notebooks in this order:
    • 1_data_loading.ipynb
    • 2_data_cleaning.ipynb
    • 3_tokenization.ipynb
    • 4_token_count.ipynb
    • 5_visualization.ipynb
  4. Check the final results in the results/ folder.

🧰 Requirements

Install dependencies before running:

pip install pyspark pandas matplotlib seaborn wordcloud

📘 References

Dataset Source: UCI Machine Learning Repository - Online Retail Dataset

Apache Spark Documentation: https://spark.apache.org/docs/latest/

🏁 Project Summary

This project demonstrates how Big Data tools (PySpark) can efficiently clean, process, and analyze text-based retail data.
Combining Spark with Python visualization libraries gives a complete pipeline, from data loading through to AI-style insights.
