0805gunjan/Real_Time_Ingestion

Real-Time Delta Lake Ingestion Pipeline

This project simulates a real-time data ingestion pipeline using Delta Lake on Databricks. It continuously appends fake data (Name, Address, Email) to a Delta table at scheduled intervals, tracks Delta versions, and sends HTML email summaries after each run.

The pipeline includes:

  • Automatic fake data generation (Faker)
  • Writing to a Delta table in append mode
  • Tracking Delta version history and newly ingested rows
  • Sending HTML email summaries after every ingestion
  • Automated scheduling every 5 minutes using Databricks Jobs

Key Features

1. Real-Time Fake Data Generation

Creates synthetic records:

  • Name
  • Address
  • Email

The number of rows generated per run is configurable.
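A minimal sketch of this generation step, assuming the third-party `faker` package is installed. The function names `generate_records` and `main` are hypothetical and are not taken from the project's notebook; only the three fields and the configurable row count come from the description above.

```python
from typing import Dict, List


def generate_records(fake, row_count: int) -> List[Dict[str, str]]:
    """Build `row_count` synthetic records with Name, Address and Email.

    `fake` is any object exposing name(), address() and email(),
    for example a faker.Faker instance.
    """
    if row_count < 0:
        raise ValueError("row_count must be non-negative")
    return [
        {"Name": fake.name(), "Address": fake.address(), "Email": fake.email()}
        for _ in range(row_count)
    ]


def main(row_count: int = 5) -> None:
    # Imported lazily so the generator above stays usable without Faker.
    from faker import Faker  # third-party: pip install faker

    for record in generate_records(Faker(), row_count):
        print(record)
```

Calling `main(10)` would print ten records. In a notebook, the returned list could then be turned into a Spark DataFrame before it is written out.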

2. Delta Lake Storage

  • Appends new rows on every run
  • Adds ingestion timestamp
  • Ensures ACID transactions
  • Supports time travel and versioning
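The append step could look roughly like the sketch below. The column name `ingestion_timestamp`, the function name `append_with_timestamp`, and the table path are assumptions made for illustration; the project's notebook may use different names.

```python
def append_with_timestamp(df, table_path: str, column: str = "ingestion_timestamp"):
    """Add an ingestion timestamp column and append the rows to a Delta table.

    `df` is expected to be a PySpark DataFrame. The timestamp is computed by
    Spark's SQL function current_timestamp() when the write executes.
    """
    stamped = df.selectExpr("*", f"current_timestamp() AS {column}")
    # mode("append") adds the new rows as a new Delta commit (a new table
    # version) without rewriting the rows already stored in the table.
    stamped.write.format("delta").mode("append").save(table_path)
    return stamped
```

Because each append is committed as its own Delta version, earlier states of the table remain readable through time travel.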

3. Delta Version Tracking

On every trigger, the pipeline automatically extracts:

  • Latest Delta version
  • Version timestamp
  • Recently appended rows
  • Full Delta history entry
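One way to pull the latest version details out of the table history is sketched below. It works on history rows that have already been converted to dictionaries, so the function `latest_version_info` (a hypothetical name) can be exercised without a cluster. The `numOutputRows` metric is assumed to be present for write operations and is treated as optional.

```python
from typing import Any, Dict, Mapping, Sequence


def latest_version_info(history: Sequence[Mapping[str, Any]]) -> Dict[str, Any]:
    """Return version, timestamp, operation and row count of the newest commit."""
    if not history:
        raise ValueError("Delta history is empty")
    newest = max(history, key=lambda entry: entry["version"])
    metrics = newest.get("operationMetrics") or {}
    return {
        "version": newest["version"],
        "timestamp": newest["timestamp"],
        "operation": newest.get("operation"),
        # Delta reports operation metrics as strings, so convert when present.
        "rows_written": int(metrics["numOutputRows"]) if "numOutputRows" in metrics else None,
    }


# On a cluster, the input rows could be obtained with the Delta Lake Python API
# (`spark` and `table_path` are assumed to exist in the notebook):
#   from delta.tables import DeltaTable
#   rows = [r.asDict() for r in DeltaTable.forPath(spark, table_path).history(5).collect()]
#   info = latest_version_info(rows)
```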

4. Automated Email Notification

Each run sends an HTML email containing:

  • Number of rows ingested
  • Current ingestion timestamp
  • Delta version & version timestamp
  • A preview table of appended rows
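The summary email could be assembled with the Python standard library as in the sketch below. The subject line, HTML layout, function names, and the use of STARTTLS with a login are all assumptions; the actual notebook and SMTP provider may differ.

```python
import html
import smtplib
from email.message import EmailMessage
from typing import Mapping, Sequence


def rows_to_html_table(rows: Sequence[Mapping[str, object]]) -> str:
    """Render row dicts as a simple HTML table, escaping every value."""
    if not rows:
        return "<p>No rows to preview.</p>"
    columns = list(rows[0].keys())
    header = "".join(f"<th>{html.escape(str(c))}</th>" for c in columns)
    body = "".join(
        "<tr>" + "".join(f"<td>{html.escape(str(row.get(c, '')))}</td>" for c in columns) + "</tr>"
        for row in rows
    )
    return f"<table border='1'><tr>{header}</tr>{body}</table>"


def build_summary_email(sender, recipient, row_count, ingested_at,
                        version, version_timestamp, preview_rows) -> EmailMessage:
    """Build an HTML email carrying the four items listed above."""
    message = EmailMessage()
    message["Subject"] = f"Delta ingestion summary: version {version}"
    message["From"] = sender
    message["To"] = recipient
    body = "\n".join([
        "<html><body>",
        f"<p>Rows ingested: {row_count}</p>",
        f"<p>Ingestion timestamp: {ingested_at}</p>",
        f"<p>Delta version: {version} (committed at {version_timestamp})</p>",
        rows_to_html_table(preview_rows),
        "</body></html>",
    ])
    message.set_content(body, subtype="html")
    return message


def send_email(message: EmailMessage, host: str, port: int, username: str, password: str) -> None:
    """Send the message over SMTP with STARTTLS (assumed server setup)."""
    with smtplib.SMTP(host, port) as server:
        server.starttls()
        server.login(username, password)
        server.send_message(message)
```

Keeping message construction separate from sending makes the HTML easy to inspect without contacting a mail server. On Databricks, credentials would typically come from a secret scope rather than being hard-coded.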

5. 5-Minute Automated Scheduling

This notebook is scheduled via Databricks Jobs to run every 5 minutes, providing near real-time ingestion behavior.
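Databricks Jobs schedules are expressed as Quartz cron expressions. The sketch below builds a job settings payload in the shape used by the Jobs API, with a trigger every five minutes. The job name, task key, notebook path, and cluster ID are placeholders, not values from this repository, and the schedule can equally be set through the Jobs UI.

```python
import json

# Quartz fields: second minute hour day-of-month month day-of-week.
# "0 0/5 * * * ?" fires at second 0 of every fifth minute.
FIVE_MINUTE_CRON = "0 0/5 * * * ?"


def build_job_settings(job_name: str, notebook_path: str, cluster_id: str,
                       timezone_id: str = "UTC") -> dict:
    """Return a job settings dict that runs one notebook task every 5 minutes."""
    return {
        "name": job_name,
        "schedule": {
            "quartz_cron_expression": FIVE_MINUTE_CRON,
            "timezone_id": timezone_id,
            "pause_status": "UNPAUSED",
        },
        # Limit to one active run so a slow run does not overlap the next trigger.
        "max_concurrent_runs": 1,
        "tasks": [
            {
                "task_key": "ingest",  # hypothetical task key
                "notebook_task": {"notebook_path": notebook_path},
                "existing_cluster_id": cluster_id,
            }
        ],
    }


def settings_as_json(settings: dict) -> str:
    """Serialize the settings for submission via the Jobs API or CLI."""
    return json.dumps(settings, indent=2)
```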


Pipeline Flow

  • Generate Fake Data → Add Timestamp → Append to Delta → Track Version Changes → Convert to HTML → Send Email → Scheduled Every 5 Minutes

Technologies Used

| Component     | Technology         |
|---------------|--------------------|
| Compute       | Azure Databricks   |
| Storage       | ADLS Gen2          |
| Format        | Delta Lake         |
| Generator     | Faker              |
| Notifications | SMTP Email         |
| Scheduling    | Databricks Jobs    |
| Language      | PySpark (+ Python) |

Folder Structure

.
├── notebooks/
│   └── real_time_pipeline/ delta-table-ingestion
├── data/
│   └── fake_data_table/ fake_data_table/
├── job_scheduling_proof/
│   ├── job_scheduling_screenshots
│   └── email_summary_screenshots
└── README.md
