This repository contains the design and implementation of 5 Data Quality (DQ) checks for the E-Commerce Data dataset. The goal is to ensure the raw transactional data is reliable for downstream analytics, specifically for metrics like Total Revenue, Average Order Value (AOV), and Customer Lifetime Value (CLV).
The DQ checks were written in standard SQL and tested locally using SQLite. Each query is designed to profile the data by returning the count and percentage of violating records.
This repository does not contain the raw data or the SQLite database file to adhere to Git best practices. To execute these checks:
- Download the raw CSV from Kaggle E-Commerce Data.
- Load the CSV into your preferred local database (e.g., SQLite, DuckDB, or PostgreSQL).
- Execute the queries found in
checks.sql.
- What it verifies: Ensures that negative quantities only occur for cancellations (
InvoiceNostarts with 'C'), and normal sales have positive quantities. - DQ Dimension: Consistency
- Severity: Critical (Breaks inventory and sales volume calculations).
- Result: Found 1,336 rows (0.25%) violating this rule.
- What it verifies: Ensures products are sold for a valid price (> 0).
- DQ Dimension: Validity / Accuracy
- Severity: Critical (Directly impacts financial reporting and Revenue).
- Result: Found 2,517 rows (0.46%) with zero or negative prices.
- What it verifies: Checks if the transaction is linked to a registered customer.
- DQ Dimension: Completeness
- Severity: Warning (Guest checkouts legitimately lack a CustomerID and should still count toward total revenue, but must be filtered out for CLV models).
- Result: Found 135,080 rows (24.93%) missing a Customer ID.
- What it verifies: Catches pipeline errors or front-end glitches that duplicated a transaction line entirely.
- DQ Dimension: Uniqueness
- Severity: Critical (Duplicates artificially inflate every single metric).
- Result: Found 4,882 duplicate rows based on exact matches across all core columns.
- What it verifies: Ensures every transaction includes a human-readable product description.
- DQ Dimension: Completeness
- Severity: Warning (Product-level dashboards will show blanks).
- Result: Found 1454 rows (0.27%) violating this rule.
Andrii Kozlov - andrijkozlov96@gmail.com | https://t.me/AndrewKozz | https://www.linkedin.com/in/andrii-kozlov96