This project defines five data quality checks for the E-Commerce dataset to ensure reliability for analytics and reporting. Each check includes dimension classification, SQL logic, and severity level.


E-Commerce Data Quality Assessment

👀 Overview

This repository contains the design and implementation of five Data Quality (DQ) checks for the E-Commerce Data dataset. The goal is to ensure the raw transactional data is reliable for downstream analytics, specifically for metrics like Total Revenue, Average Order Value (AOV), and Customer Lifetime Value (CLV).

⚡️ Methodology

The DQ checks were written in standard SQL and tested locally using SQLite. Each query is designed to profile the data by returning the count and percentage of violating records.

🖥 How to Run Locally

In keeping with Git best practices, this repository does not include the raw data or the SQLite database file. To run the checks yourself:

  1. Download the raw CSV from Kaggle E-Commerce Data.
  2. Load the CSV into your preferred local database (e.g., SQLite, DuckDB, or PostgreSQL).
  3. Execute the queries found in checks.sql.
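The loading steps above can be sketched with Python's standard library. The file name data.csv, the table name transactions, the in-memory database, and the one-row sample are stand-ins for the real Kaggle download and the names used in checks.sql; adjust them to match your setup:

```python
import csv
import sqlite3

# Stand-in for the Kaggle download: a one-row CSV with the dataset's columns.
SAMPLE = (
    "InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country\n"
    "536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/1/2010 8:26,2.55,17850,United Kingdom\n"
)
with open("data.csv", "w", encoding="latin-1") as f:
    f.write(SAMPLE)

con = sqlite3.connect(":memory:")  # use a file path instead to persist the database
con.execute(
    "CREATE TABLE transactions (InvoiceNo TEXT, StockCode TEXT, Description TEXT, "
    "Quantity INTEGER, InvoiceDate TEXT, UnitPrice REAL, CustomerID TEXT, Country TEXT)"
)
with open("data.csv", newline="", encoding="latin-1") as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    con.executemany("INSERT INTO transactions VALUES (?,?,?,?,?,?,?,?)", list(reader))

loaded = con.execute("SELECT COUNT(*) FROM transactions").fetchone()[0]
print(loaded)  # 1 row loaded from the sample CSV
```

After loading, the queries from checks.sql can be run against the same connection.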

🎯 Data Quality Checks

1. Inconsistent Quantity

  • What it verifies: Ensures that negative quantities occur only for cancellations (InvoiceNo starts with 'C'), and that normal sales have positive quantities.
  • DQ Dimension: Consistency
  • Severity: Critical (Breaks inventory and sales volume calculations).
  • Result: Found 1,336 rows (0.25%) violating this rule.
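A toy reconstruction of this check's logic, run against a three-row in-memory table (the actual query lives in checks.sql; only the two relevant columns from the Kaggle CSV are modeled here):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE transactions (InvoiceNo TEXT, Quantity INTEGER)")
con.executemany("INSERT INTO transactions VALUES (?, ?)", [
    ("536365", 6),    # normal sale, positive quantity: OK
    ("C536379", -1),  # cancellation, negative quantity: OK
    ("536380", -2),   # normal sale, negative quantity: violation
])
# Count and percentage of rows with a non-positive quantity outside a cancellation.
bad, pct = con.execute("""
    SELECT COUNT(*),
           ROUND(100.0 * COUNT(*) / (SELECT COUNT(*) FROM transactions), 2)
    FROM transactions
    WHERE Quantity <= 0 AND InvoiceNo NOT LIKE 'C%'
""").fetchone()
print(bad, pct)  # 1 violating row out of 3 (33.33%)
```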

2. Non-Positive Unit Price

  • What it verifies: Ensures products are sold for a valid price (> 0).
  • DQ Dimension: Validity / Accuracy
  • Severity: Critical (Directly impacts financial reporting and Revenue).
  • Result: Found 2,517 rows (0.46%) with zero or negative prices.
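A minimal sketch of the same count-and-percentage pattern for prices, using toy data in place of the real table (column names follow the Kaggle CSV):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE transactions (InvoiceNo TEXT, UnitPrice REAL)")
con.executemany("INSERT INTO transactions VALUES (?, ?)", [
    ("536365", 2.55),  # valid price
    ("536367", 0.0),   # zero price: violation
    ("536368", -1.5),  # negative price: violation
])
bad, pct = con.execute("""
    SELECT COUNT(*),
           ROUND(100.0 * COUNT(*) / (SELECT COUNT(*) FROM transactions), 2)
    FROM transactions
    WHERE UnitPrice <= 0
""").fetchone()
print(bad, pct)  # 2 violating rows out of 3 (66.67%)
```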

3. Missing Customer ID

  • What it verifies: Checks if the transaction is linked to a registered customer.
  • DQ Dimension: Completeness
  • Severity: Warning (Guest checkouts legitimately lack a CustomerID and should still count toward total revenue, but must be filtered out for CLV models).
  • Result: Found 135,080 rows (24.93%) missing a Customer ID.
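A sketch of the completeness logic, treating both NULL and blank CustomerID values as missing (how blanks are represented depends on how the CSV was loaded, so the TRIM clause is an assumption):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE transactions (InvoiceNo TEXT, CustomerID TEXT)")
con.executemany("INSERT INTO transactions VALUES (?, ?)", [
    ("536365", "17850"),  # registered customer
    ("536366", None),     # NULL CustomerID: flagged
    ("536367", ""),       # blank CustomerID: flagged
    ("536368", "13047"),  # registered customer
])
bad, pct = con.execute("""
    SELECT COUNT(*),
           ROUND(100.0 * COUNT(*) / (SELECT COUNT(*) FROM transactions), 2)
    FROM transactions
    WHERE CustomerID IS NULL OR TRIM(CustomerID) = ''
""").fetchone()
print(bad, pct)  # 2 flagged rows out of 4 (50.0%)
```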

4. Duplicate Invoice Lines

  • What it verifies: Catches pipeline errors or front-end glitches that duplicated a transaction line entirely.
  • DQ Dimension: Uniqueness
  • Severity: Critical (Duplicates artificially inflate every single metric).
  • Result: Found 4,882 duplicate rows based on exact matches across all core columns.
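One way to count surplus copies is total rows minus fully distinct rows. The sketch below models only three columns; the real check in checks.sql compares all core columns:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE transactions (InvoiceNo TEXT, StockCode TEXT, Quantity INTEGER)")
con.executemany("INSERT INTO transactions VALUES (?, ?, ?)", [
    ("536365", "85123A", 6),
    ("536365", "85123A", 6),  # exact duplicate of the line above: violation
    ("536366", "71053", 6),
])
# Surplus duplicate rows = total rows - fully distinct rows.
dupes = con.execute(
    "SELECT (SELECT COUNT(*) FROM transactions) - "
    "(SELECT COUNT(*) FROM (SELECT DISTINCT * FROM transactions))"
).fetchone()[0]
print(dupes)  # 1 surplus duplicate row
```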

5. Missing Product Description

  • What it verifies: Ensures every transaction includes a human-readable product description.
  • DQ Dimension: Completeness
  • Severity: Warning (Product-level dashboards will show blanks).
  • Result: Found 1,454 rows (0.27%) violating this rule.
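The same pattern as the CustomerID check, applied to the Description column (treating NULL and whitespace-only values as missing is an assumption about how blanks appear after loading):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE transactions (InvoiceNo TEXT, Description TEXT)")
con.executemany("INSERT INTO transactions VALUES (?, ?)", [
    ("536365", "WHITE HANGING HEART T-LIGHT HOLDER"),  # OK
    ("536366", None),                                  # NULL description: flagged
    ("536367", "   "),                                 # whitespace-only: flagged
])
bad, pct = con.execute("""
    SELECT COUNT(*),
           ROUND(100.0 * COUNT(*) / (SELECT COUNT(*) FROM transactions), 2)
    FROM transactions
    WHERE Description IS NULL OR TRIM(Description) = ''
""").fetchone()
print(bad, pct)  # 2 flagged rows out of 3 (66.67%)
```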

📞 Contact

Andrii Kozlov - andrijkozlov96@gmail.com | https://t.me/AndrewKozz | https://www.linkedin.com/in/andrii-kozlov96
