GitHub - themrityunjaypathak/Dashly: Get smarter insights, right when you need them

Dashly : Live Sales Dashboard

Problem Statement

Quick Buy is a superstore operating across the United States.
Performance tracking relied heavily on manual spreadsheets and ad-hoc SQL queries.
As a result, decision-making slowed down, making it harder to identify growth opportunities.
The goal was to automate the data workflow and deliver an up-to-date sales dashboard for informed decisions.

Overview

Designed an ETL pipeline with Python and SQLAlchemy to load 50K+ sales records into a PostgreSQL database.
Simulated ~100 new transactions daily to replicate ongoing business activity and validate pipeline reliability.
Connected Power BI to PostgreSQL to deliver an auto-refreshing dashboard with no manual updates.

Workflow

ER Diagram

The ER (Entity-Relationship) diagram visually represents how different tables in the database are related.

Relationships

One Customer ➜ Many Orders
- Each customer can place multiple orders.
One Product ➜ Many Orders
- Each product can appear in multiple orders.
Orders Table ➜ Central Table
- Serves as the main transactional table, linking customers and products.

Database Schema

The database is designed to store and organize Quick Buy's orders, customers, and product data.

It ensures that all business data is centralized, consistent, and easy to query for analysis and dashboarding.

/* Customers Table */
CREATE TABLE IF NOT EXISTS customers (
  customer_id TEXT PRIMARY KEY,
  customer_name TEXT,
  segment TEXT,
  city TEXT,
  state TEXT,
  country TEXT,
  postal_code NUMERIC,
  region TEXT
);

/* Products Table */
CREATE TABLE IF NOT EXISTS products (
  product_id TEXT PRIMARY KEY,
  product_name TEXT,
  category TEXT,
  sub_category TEXT
);

/* Orders Table */
CREATE TABLE IF NOT EXISTS orders (
  order_id TEXT PRIMARY KEY,
  order_date DATE,
  customer_id TEXT,
  product_id TEXT,
  ship_mode TEXT,
  ship_date DATE,
  sales NUMERIC,
  quantity INTEGER,
  discount NUMERIC,
  profit NUMERIC,
  shipping_duration INTEGER,
  profit_margin NUMERIC,
  FOREIGN KEY (customer_id) REFERENCES customers(customer_id),
  FOREIGN KEY (product_id) REFERENCES products(product_id)
);
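The one-to-many relationships and the foreign keys above can be tried out locally. The following is a minimal sketch, not the project's own code: it uses an in-memory SQLite database as a stand-in for PostgreSQL, trims the tables to a few columns, and uses made-up IDs (`C-1`, `P-1`, `O-1`, ...) purely for illustration.

```python
import sqlite3

# SQLite stands in for PostgreSQL so the sketch runs without a server (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

# Trimmed-down versions of the three tables; column lists are shortened for brevity.
conn.executescript("""
CREATE TABLE customers (customer_id TEXT PRIMARY KEY, customer_name TEXT);
CREATE TABLE products (product_id TEXT PRIMARY KEY, product_name TEXT);
CREATE TABLE orders (
    order_id TEXT PRIMARY KEY,
    customer_id TEXT,
    product_id TEXT,
    sales NUMERIC,
    FOREIGN KEY (customer_id) REFERENCES customers(customer_id),
    FOREIGN KEY (product_id) REFERENCES products(product_id)
);
""")

# Hypothetical sample rows: one customer and one product appearing in two orders.
conn.execute("INSERT INTO customers VALUES ('C-1', 'Example Customer')")
conn.execute("INSERT INTO products VALUES ('P-1', 'Example Product')")
conn.execute("INSERT INTO orders VALUES ('O-1', 'C-1', 'P-1', 100)")
conn.execute("INSERT INTO orders VALUES ('O-2', 'C-1', 'P-1', 50)")


def try_insert_orphan(connection):
    """Return True if an order for an unknown customer is rejected."""
    try:
        connection.execute("INSERT INTO orders VALUES ('O-3', 'C-404', 'P-1', 10)")
    except sqlite3.IntegrityError:
        return True
    return False


orders_for_customer = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE customer_id = 'C-1'"
).fetchone()[0]
orphan_rejected = try_insert_orphan(conn)
print(orders_for_customer, orphan_rejected)
```

The rejected insert is the point of the foreign keys: an order cannot reference a customer or product that does not exist.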

SQL Views

SQL views are used to make data analysis easier and keep business metrics consistent for dashboards.

Instead of running complex queries every time, Power BI connects directly to these views to fetch clean data.

/* segment_wise_sales_and_profit */
/* Calculates total sales and profit for each customer segment. */
CREATE OR REPLACE VIEW segment_wise_sales_and_profit AS
SELECT 
    c.segment,
    SUM(o.sales) AS total_sales,
    SUM(o.profit) AS total_profit
FROM orders AS o
JOIN customers AS c
ON o.customer_id = c.customer_id
GROUP BY c.segment;

/* region_wise_sales_and_profit */
/* Summarizes total sales and profit across all regions. */
CREATE OR REPLACE VIEW region_wise_sales_and_profit AS
SELECT 
    c.region,
    SUM(o.sales) AS total_sales,
    SUM(o.profit) AS total_profit
FROM orders AS o
JOIN customers AS c
ON o.customer_id = c.customer_id
GROUP BY c.region;

/* month_wise_sales_and_profit */
/* Shows total sales and profit by calendar month, aggregated across all years. */
CREATE OR REPLACE VIEW month_wise_sales_and_profit AS 
SELECT
    TO_CHAR(order_date, 'Mon') AS month, 
    SUM(sales) AS total_sales,
    SUM(profit) AS total_profit
FROM orders
GROUP BY month;

/* top_customers_by_sales */
/* Lists customers with their total sales and profit to identify top performers. */
CREATE OR REPLACE VIEW top_customers_by_sales AS
SELECT
    c.customer_id, 
    c.customer_name,
    SUM(o.sales) AS total_sales,
    SUM(o.profit) AS total_profit
FROM orders AS o
JOIN customers AS c
ON o.customer_id = c.customer_id
GROUP BY c.customer_id, c.customer_name;

/* shipping_performance */
/* Analyzes sales and profit performance by shipping mode. */
CREATE OR REPLACE VIEW shipping_performance AS 
SELECT 
    ship_mode,
    SUM(sales) AS total_sales,
    SUM(profit) AS total_profit
FROM orders
GROUP BY ship_mode;

/* overall_sales_performance */
/* Provides overall business KPIs like total sales, profit, orders and customers. */
CREATE OR REPLACE VIEW overall_sales_performance AS
SELECT
    SUM(sales) AS total_sales,
    SUM(profit) AS total_profit,
    COUNT(DISTINCT order_id) AS total_orders,
    COUNT(DISTINCT customer_id) AS total_customers,
    COUNT(DISTINCT product_id) AS total_products,
    SUM(quantity) AS total_quantity_sold
FROM orders;

/* state_wise_sales_and_customer_base */
/* Displays total sales and customer count by U.S. state. */
CREATE OR REPLACE VIEW state_wise_sales_and_customer_base AS 
SELECT
    c.state,
    SUM(o.sales) AS total_sales,
    COUNT(DISTINCT c.customer_id) AS total_customers
FROM orders AS o
JOIN customers AS c
ON o.customer_id = c.customer_id
GROUP BY c.state;

/* segment_wise_monthly_sales_and_profit */
/* Tracks monthly sales and profit performance for each customer segment. */
CREATE OR REPLACE VIEW segment_wise_monthly_sales_and_profit AS
SELECT
    c.segment,
    TO_CHAR(o.order_date, 'Mon') AS month_name,
    SUM(o.sales) AS total_sales,
    SUM(o.profit) AS total_profit
FROM orders AS o
JOIN customers AS c
ON o.customer_id = c.customer_id
GROUP BY c.segment, month_name;

/* region_wise_monthly_sales */
/* Shows monthly sales trends for each region. */
CREATE OR REPLACE VIEW region_wise_monthly_sales AS
SELECT
    c.region,
    TO_CHAR(o.order_date, 'Mon') AS month_name,
    SUM(o.sales) AS total_sales
FROM orders AS o
JOIN customers AS c
ON o.customer_id = c.customer_id
GROUP BY c.region, month_name;

/* overall_customers_performance */
/* Calculates average sales, profit, orders and quantity per customer. */
CREATE OR REPLACE VIEW overall_customers_performance AS
SELECT 
    ROUND(SUM(o.sales) / COUNT(DISTINCT o.customer_id)) AS avg_sales_per_customer,
    ROUND(SUM(o.profit) / COUNT(DISTINCT o.customer_id)) AS avg_profit_per_customer,
    /* Cast to NUMERIC: bigint / bigint is integer division in PostgreSQL and would truncate before ROUND. */
    ROUND(COUNT(DISTINCT o.order_id)::NUMERIC / COUNT(DISTINCT o.customer_id)) AS avg_orders_per_customer,
    ROUND(SUM(o.quantity)::NUMERIC / COUNT(DISTINCT o.customer_id)) AS avg_quantity_per_customer
FROM orders AS o;

/* avg_discount_per_order_per_customer */
/* Computes the average discount per customer across all orders. */
CREATE OR REPLACE VIEW avg_discount_per_order_per_customer AS
SELECT
    ROUND(AVG(customer_avg), 2) AS avg_discount_per_customer
FROM (
    SELECT customer_id, AVG(discount) AS customer_avg
    FROM orders
    GROUP BY customer_id
) AS sub;

/* category_wise_monthly_sales_and_profit */
/* Tracks monthly sales and profit for each product category. */
CREATE OR REPLACE VIEW category_wise_monthly_sales_and_profit AS
SELECT
    p.category,
    TO_CHAR(o.order_date, 'Mon') AS month,
    SUM(o.sales) AS total_sales,
    SUM(o.profit) AS total_profit
FROM orders AS o
JOIN products AS p
ON o.product_id = p.product_id
GROUP BY p.category, month;

/* sub_category_wise_sales_and_profit */
/* Summarizes total sales and profit by product sub-category. */
CREATE OR REPLACE VIEW sub_category_wise_sales_and_profit AS
SELECT
    p.sub_category,
    SUM(o.sales) AS total_sales,
    SUM(o.profit) AS total_profit
FROM orders AS o
JOIN products AS p
ON o.product_id = p.product_id
GROUP BY p.sub_category;

/* category_wise_sales_profit_and_orders */
/* Shows total sales, profit and order count by product category. */
CREATE OR REPLACE VIEW category_wise_sales_profit_and_orders AS
SELECT
    p.category,
    SUM(o.sales) AS total_sales,
    SUM(o.profit) AS total_profit,
    COUNT(DISTINCT o.order_id) AS total_orders
FROM orders AS o
JOIN products AS p
ON o.product_id = p.product_id
GROUP BY p.category;

/* state_wise_most_purchased_sub_category */
/* Ranks sub-categories by quantity sold within each U.S. state; rank 1 is the most purchased. */
CREATE OR REPLACE VIEW state_wise_most_purchased_sub_category AS
SELECT
    c.state,
    p.sub_category, 
    SUM(o.quantity) AS quantity_sold,
    RANK() OVER (PARTITION BY c.state ORDER BY SUM(o.quantity) DESC) AS sub_category_rank
FROM orders AS o
JOIN products AS p
ON o.product_id = p.product_id
JOIN customers AS c
ON o.customer_id = c.customer_id
GROUP BY c.state, p.sub_category;
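The monthly views return month abbreviations produced by TO_CHAR(order_date, 'Mon'), which are text and therefore sort alphabetically (Apr, Aug, Dec, ...) rather than in calendar order. The sketch below shows one way to restore calendar order on the Python side; the helper name `sort_by_month` and the sample rows are hypothetical and not part of the repository.

```python
# English month abbreviations, matching the output of TO_CHAR(order_date, 'Mon').
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
MONTH_ORDER = {abbr: position for position, abbr in enumerate(MONTHS)}


def sort_by_month(rows, month_key="month"):
    """Return rows sorted in calendar order by their month abbreviation."""
    return sorted(rows, key=lambda row: MONTH_ORDER[row[month_key]])


# Hypothetical rows, shaped like the output of month_wise_sales_and_profit.
rows = [
    {"month": "Mar", "total_sales": 0},
    {"month": "Jan", "total_sales": 0},
    {"month": "Dec", "total_sales": 0},
    {"month": "Feb", "total_sales": 0},
]
ordered_months = [row["month"] for row in sort_by_month(rows)]
print(ordered_months)
```

In Power BI the same effect is usually achieved with a numeric month column used as the sort key.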

Setup

1. Clone the Repository

First, clone the project from GitHub to your local system.

git clone https://github.com/themrityunjaypathak/Dashly.git

2. Set Up a Virtual Environment

To avoid version conflicts and keep your project isolated, create a virtual environment.

On Windows :

python -m venv .venv

On macOS/Linux :

python3 -m venv .venv

3. Activate the Virtual Environment

After setting up the virtual environment, activate it to begin installing dependencies.

On Windows :

.\.venv\Scripts\activate

On macOS/Linux :

source .venv/bin/activate

4. Install the Project Dependencies

Now, install all the required libraries inside your virtual environment using the requirements.txt file.

pip install -r requirements.txt

Tip

It's a good idea to upgrade pip before installing dependencies to avoid compatibility issues.

pip install --upgrade pip

Note

Use the same Python version as in .github/workflows/etl_pipeline.yaml (the workflow installs Python 3.12) to avoid compatibility issues.

5. Set Up Environment Variables

This project uses a .env file to store database credentials (DB_HOST, DB_NAME, DB_USER, DB_PASS) and Kaggle API credentials (KAGGLE_USERNAME, KAGGLE_KEY).

# .env
DB_HOST=host_name
DB_NAME=database_name
DB_USER=user_name
DB_PASS=password
KAGGLE_USERNAME=kaggle_username
KAGGLE_KEY=kaggle_api_key

Important

Make sure not to commit your .env file to GitHub or any public repositories.

You can add it to .gitignore to ensure it's excluded from version control.
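A missing or misspelled variable in .env typically only surfaces later as a confusing connection error. As a rough sketch (the helper `missing_env_vars` is hypothetical and not part of the repository), the required names can be checked up front:

```python
import os

# Database variables listed in the .env example above.
REQUIRED_VARS = ("DB_HOST", "DB_NAME", "DB_USER", "DB_PASS")


def missing_env_vars(required=REQUIRED_VARS, env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Example with an explicit mapping instead of the real environment.
example_env = {"DB_HOST": "host_name", "DB_NAME": "database_name", "DB_USER": "user_name"}
missing = missing_env_vars(env=example_env)
print(missing)
```

Called without arguments after load_dotenv(), the helper checks the real process environment.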

Note

If you want to create a free database in Neon and connect it with Python, see the How To section.

6. Database Connectivity Check

Confirm that the PostgreSQL connection works before running ETL scripts.

This avoids script crashes due to invalid credentials or blocked ports.

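# Importing Libraries
import os
from dotenv import load_dotenv
from sqlalchemy import create_engine, text

# Loading Environment File
load_dotenv()

# Loading Database Credentials from Environment File
DB_HOST = os.getenv("DB_HOST")
DB_NAME = os.getenv("DB_NAME")
DB_USER = os.getenv("DB_USER")
DB_PASS = os.getenv("DB_PASS")

# Creating SQLAlchemy Engine (pool_pre_ping tests pooled connections before reuse)
engine = create_engine(f"postgresql://{DB_USER}:{DB_PASS}@{DB_HOST}/{DB_NAME}?sslmode=require&channel_binding=require", pool_pre_ping=True)

# Opening a Connection and Running a Trivial Query to Confirm Connectivity
with engine.connect() as connection:
    connection.execute(text("SELECT 1"))
    print("Database connection successful")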
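One caveat with building the URL through an f-string: a password containing characters such as `@`, `:` or `/` changes how the URL is parsed. The sketch below percent-encodes the user name and password with the standard library first. The helper name `build_db_url` and the sample password are illustrative assumptions, not part of the repository.

```python
from urllib.parse import quote_plus


def build_db_url(user, password, host, name):
    """Build a PostgreSQL URL with the user name and password percent-encoded."""
    return (
        f"postgresql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}/{name}?sslmode=require&channel_binding=require"
    )


# Hypothetical credentials; the password deliberately contains reserved characters.
url = build_db_url("user_name", "p@ss:word/1", "host_name", "database_name")
print(url)
```

The resulting string can be passed to create_engine() in place of the f-string.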

7. Run ETL Script

This script initializes the database. It :

- Cleans raw CSV data
- Creates tables (customers, orders, products)
- Loads data into the Neon PostgreSQL database

python scripts/etl.py

Note

Run this only once initially or when you want a full database refresh.

8. Create SQL Views

This script builds reusable SQL views that summarize business metrics for the Power BI dashboard.

The views simplify queries, keep metric logic consistent, and let Power BI read small aggregated result sets instead of raw tables.

python scripts/create_views.py

9. Generate New Data

Simulates daily transactions by generating new random data for testing pipeline automation.

Helps verify how dashboards respond to new data over time.

python scripts/generate_data.py

10. Export Views as CSVs

Exports SQL view results to CSV files inside the views/ folder.

This is useful for sharing datasets or validating dashboard data without connecting to the database.

python scripts/export_views.py
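The export step boils down to writing one CSV file per view. The sketch below illustrates that idea with the standard library only. It is not the contents of scripts/export_views.py; the helper `export_rows_to_csv` and the sample row are hypothetical, and a temporary directory stands in for the views/ folder.

```python
import csv
import tempfile
from pathlib import Path


def export_rows_to_csv(view_name, header, rows, out_dir):
    """Write a header and rows to <out_dir>/<view_name>.csv and return the path."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{view_name}.csv"
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)
    return path


# Hypothetical result set shaped like the shipping_performance view.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = export_rows_to_csv(
        "shipping_performance",
        ["ship_mode", "total_sales", "total_profit"],
        [("Example Mode", 100, 20)],
        Path(tmp) / "views",
    )
    exported_text = csv_path.read_text(encoding="utf-8")
print(exported_text)
```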

11. Check Logs

Check log files inside the logs/ folder :
- etl.log : Initial data loading
- create_views.log : SQL views creation
- generate_data.log : Daily data generation
Logs help you monitor pipeline performance and troubleshoot errors quickly.
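For readers who want to reproduce the logging pattern, here is a minimal sketch of a per-script file logger combined with try/except error handling. The helpers `get_file_logger` and `run_step` are assumptions made for illustration, and the log format is not taken from the repository.

```python
import logging
import tempfile
from pathlib import Path


def get_file_logger(name, log_dir):
    """Return a logger that writes to <log_dir>/<name>.log."""
    log_dir = Path(log_dir)
    log_dir.mkdir(parents=True, exist_ok=True)
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep messages out of the root logger
    if not logger.handlers:  # avoid attaching duplicate handlers on repeated calls
        handler = logging.FileHandler(log_dir / f"{name}.log", encoding="utf-8")
        handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
    return logger


def run_step(logger, step_name, func):
    """Run func(), logging success or the full traceback on failure."""
    try:
        result = func()
    except Exception:
        logger.exception("%s failed", step_name)
        raise
    logger.info("%s finished", step_name)
    return result


# A temporary directory stands in for the logs/ folder.
with tempfile.TemporaryDirectory() as tmp:
    logger = get_file_logger("etl", Path(tmp) / "logs")
    step_result = run_step(logger, "load", lambda: 42)
    for handler in list(logger.handlers):  # release the file before cleanup
        handler.close()
        logger.removeHandler(handler)
    log_text = (Path(tmp) / "logs" / "etl.log").read_text(encoding="utf-8")
print(log_text)
```

Re-raising after logging keeps the failure visible to the caller, so a scheduled run still ends with a failed status.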

ETL Pipeline

The ETL (Extract, Transform, Load) pipeline is the core part of this project.
It automatically cleans and loads sales data into a PostgreSQL database for the Power BI dashboard.
It is built with Python using SQLAlchemy and is securely configured via environment variables.

ETL Pipeline Structure

Script Name	Purpose
`etl.py`	Sets up the database schema, cleans the dataset, and loads initial data into the database.
`create_views.py`	Creates multiple SQL views that summarize and aggregate data for the Power BI dashboard.
`generate_data.py`	Generates random synthetic transaction data to simulate daily updates in the database.

How does the ETL pipeline work?

1. `etl.py`

This script handles the first step of the process by preparing the database.

What does it do?

Load Configuration
- Reads environment variables (like DB_HOST, DB_NAME) from a .env file for secure database access.
Logging Setup
- Creates a logs/etl.log file to track all ETL activity and errors.
Extract Data
- Loads raw data from a CSV file using a custom load_csv() utility function.
Transform Data
- Removes duplicates, standardizes column names, and optimizes data types.
Load Data
- Creates tables in the Neon PostgreSQL database and loads the cleaned data using the to_sql() function.
Schema Management
- Ensures relationships between tables using foreign keys and maintains data integrity.
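The transform step can be sketched as follows. This is an illustration only: it uses plain Python dictionaries rather than whatever data structures etl.py uses internally, and the function names and sample rows are hypothetical.

```python
import re


def standardize_column(name):
    """Convert a raw header such as 'Order ID' or 'Sub-Category' to snake_case."""
    return re.sub(r"[^0-9a-z]+", "_", name.strip().lower()).strip("_")


def transform(rows):
    """Standardize column names and drop exact duplicate rows, keeping first occurrences."""
    seen = set()
    cleaned = []
    for row in rows:
        normalized = {standardize_column(key): value for key, value in row.items()}
        fingerprint = tuple(sorted(normalized.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            cleaned.append(normalized)
    return cleaned


# Hypothetical raw rows; the second one duplicates the first.
raw_rows = [
    {"Order ID": "O-1", "Sub-Category": "Chairs"},
    {"Order ID": "O-1", "Sub-Category": "Chairs"},
    {"Order ID": "O-2", "Sub-Category": "Paper"},
]
cleaned_rows = transform(raw_rows)
print(cleaned_rows)
```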

2. `create_views.py`

This script builds SQL views in the PostgreSQL database to simplify analysis and reporting in Power BI.

What does it do?

Database Connection
- Connects to the database securely using environment variables.
Define SQL Views
- Creates multiple SQL views to summarize and aggregate key business insights.
Execute & Commit
- Executes each CREATE OR REPLACE VIEW statement and commits changes.
Logging Setup
- Stores execution logs in logs/create_views.log.

3. `generate_data.py`

This script keeps the database updated with new transaction data for scheduled data refresh in Power BI.

What does it do?

Generate Random Data
- Uses custom utility functions to create synthetic customer and order data.
Data Cleaning
- Removes duplicates and optimizes data types before uploading to the Neon database.
Append Unique Data
- Inserts only new records into the database, avoiding duplicates.
Logging Setup
- Saves process logs in logs/generate_data.log.
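The generate-then-append-unique idea can be illustrated with a short sketch. It is not the code of generate_data.py: the function names, the `SIM-` order ID format, and the field choices are assumptions made for the example.

```python
import random


def generate_orders(n, customer_ids, product_ids, seed=None):
    """Create n synthetic orders that reference existing customers and products."""
    rng = random.Random(seed)
    return [
        {
            "order_id": f"SIM-{rng.randrange(10**6):06d}",  # hypothetical ID format
            "customer_id": rng.choice(customer_ids),
            "product_id": rng.choice(product_ids),
            "quantity": rng.randint(1, 10),
        }
        for _ in range(n)
    ]


def keep_new(rows, existing_ids, key="order_id"):
    """Drop rows whose key already exists in the database or earlier in the batch."""
    seen = set(existing_ids)
    fresh = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            fresh.append(row)
    return fresh


# Simulate a daily batch of ~100 orders where one ID is already in the database.
batch = generate_orders(100, ["C-1", "C-2"], ["P-1", "P-2"], seed=0)
already_loaded = {batch[0]["order_id"]}
fresh = keep_new(batch, already_loaded)
print(len(batch), len(fresh))
```

Filtering before the insert keeps primary-key violations out of the load step. In PostgreSQL, an INSERT with ON CONFLICT DO NOTHING is a database-side alternative.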

Once this cycle is complete, the process repeats automatically :

generate_data.py ➜ create_views.py ➜ Power BI Refresh ➜ New Insights

This ensures that the Power BI Dashboard always displays the latest insights automatically.

Results & Insights

This section highlights the key outcomes and insights generated from the ETL pipeline and Power BI dashboards.

Pipeline Performance

This section summarizes pipeline performance metrics such as runtime, automation frequency, and reliability.

1. Data Loading Overview

Parameter	Value
Dataset Size	~50,000 sales records
Tables Used	`customers`, `orders`, `products`
Avg. Daily Inserts	~100 new records
Database	Neon PostgreSQL (cloud-hosted)

Note

This setup simulates ongoing business activity with daily updates to the orders and customers tables.

2. Automation & Scheduling

Attribute	Details
Automation Tool	GitHub Actions
Execution Frequency	Daily
Scheduled Time	10:00 AM IST
Trigger Type	`cron` (automated) and `workflow_dispatch` (manual)
Runner Environment	`ubuntu-latest` (GitHub-hosted Ubuntu runner)

Note

This setup ensures that the latest data is always available for Power BI dashboards, with no manual effort.

3. Runtime Performance

Workflow Step	Description
Set up job	Initializes GitHub Actions environment
Checkout repository	Pulls repository code into the runner
Set up Python	Installs Python environment (v3.12)
Install dependencies	Installs libraries from `requirements.txt`
Run ETL Script	Extracts, transforms, and loads data into PostgreSQL
Run Generate Data Script	Generates new synthetic customer and order data
Run Views Script	Creates / Refreshes analytical SQL views
Run Export Views Script	Exports SQL views as CSV files
Upload Exported CSV as Artifacts	Uploads exported CSVs to GitHub Actions artifacts
Commit and Push CSVs	Commits CSV files to the repository
Upload Logs as Artifacts	Uploads log files for debugging and tracking
Post Setup / Cleanup Steps	Cleans the environment post-run

Note

Total runtime : ~47 seconds per pipeline run

Scheduling : The workflow runs daily at 10:00 AM IST using a cron schedule (30 4 * * *).

GitHub Actions evaluates cron schedules in UTC, so 04:30 UTC corresponds to 10:00 AM IST (UTC+5:30).

The ETL pipeline runs within a minute, automatically refreshing dashboard data daily with no manual effort.
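The UTC-to-IST conversion behind the cron expression can be checked with a few lines of Python. The helper name `cron_utc_to_ist` is hypothetical, and the date used is arbitrary because IST has a fixed offset of UTC+5:30.

```python
from datetime import datetime, timedelta, timezone

# India Standard Time is a fixed offset of UTC+5:30 with no daylight saving.
IST = timezone(timedelta(hours=5, minutes=30), name="IST")


def cron_utc_to_ist(minute, hour):
    """Convert the minute and hour fields of a UTC cron schedule to an IST clock time."""
    utc_time = datetime(2000, 1, 1, hour, minute, tzinfo=timezone.utc)
    return utc_time.astimezone(IST).strftime("%H:%M")


# "30 4 * * *" means minute 30 of hour 4 (UTC), every day.
ist_time = cron_utc_to_ist(minute=30, hour=4)
print(ist_time)
```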

4. Error Handling and Logging

Aspect	Implementation Details
Error Tracking	Structured `try`/`except` error handling in each script
Log Files	`etl.log`, `generate_data.log`, `create_views.log`
Log Storage	Uploaded as GitHub Actions run artifacts
Security	All credentials securely stored in GitHub Secrets (`DB_USER`, `DB_PASS`, etc.)

Tip

Automated logging and secret handling remove the need for manual checks and ensure smooth workflow runs.

5. Reliability and Stability

Metric	Value	Remarks
Total Runtime	~47 seconds	Fast for a daily automated ETL pipeline
Success Rate	100%	Verified via GitHub Actions workflow panel
Avg. Records Inserted	~100 rows/day	Lightweight daily incremental updates
Resource Utilization	Low CPU and memory usage	Efficient for cloud runners

Important

The pipeline runs fully unattended, ensuring consistent daily data updates and automatic Power BI refreshes.

Dashboard Metrics

This section highlights key business insights and trends derived from the Power BI dashboard visualizations.