This project simulates a real-time data ingestion pipeline using Delta Lake on Databricks. It continuously appends fake data (Name, Address, Email) to a Delta table at scheduled intervals, tracks Delta versions, and sends HTML email summaries after each run.
The pipeline includes:
- Automatic fake data generation (Faker)
- Writing to a Delta table in append mode
- Tracking Delta version history and newly ingested rows
- Sending HTML email summaries after every ingestion
- Automated scheduling every 5 minutes using Databricks Jobs
Fake data generation creates synthetic records with a configurable row count:
- Name
- Address
- Email
Writing to the Delta table:
- Appends new rows on every run
- Adds an ingestion timestamp to each record
- Provides ACID transactions
- Supports time travel and versioning
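A minimal sketch of the timestamp-and-append step, assuming it runs in a Databricks notebook where a `spark` session exists and `table_path` points at the Delta location (both are placeholders, as are the function names):

```python
from datetime import datetime, timezone

def with_ingestion_timestamp(rows: list[dict]) -> list[dict]:
    """Stamp each record with the ingestion time (pure Python helper)."""
    now = datetime.now(timezone.utc).isoformat()
    return [{**row, "ingestion_ts": now} for row in rows]

def append_to_delta(spark, rows: list[dict], table_path: str) -> None:
    """Append stamped rows; Delta Lake makes each write an ACID transaction."""
    df = spark.createDataFrame(with_ingestion_timestamp(rows))
    df.write.format("delta").mode("append").save(table_path)
```

Append mode preserves all previously ingested rows, and every write produces a new Delta version that time travel can query later.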
On every trigger, the pipeline automatically extracts:
- Latest Delta version
- Version timestamp
- Recently appended rows
- Full Delta history entry
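The extraction above might look like this, assuming the `delta-spark` package available on Databricks runtimes; `table_path` and the function names are placeholders:

```python
def latest_delta_version(spark, table_path: str) -> dict:
    """Return the newest Delta history entry (version, timestamp, operation)."""
    from delta.tables import DeltaTable  # bundled with Databricks runtimes

    history = DeltaTable.forPath(spark, table_path).history(1)
    return history.select("version", "timestamp", "operation").first().asDict()

def new_rows_since(spark, table_path: str, previous_version: int):
    """Use time travel to isolate rows appended after `previous_version`."""
    current = spark.read.format("delta").load(table_path)
    old = (
        spark.read.format("delta")
        .option("versionAsOf", previous_version)
        .load(table_path)
    )
    return current.subtract(old)

def pick_latest(history_entries: list[dict]) -> dict:
    """Pure helper: the newest entry is the one with the highest version."""
    return max(history_entries, key=lambda e: e["version"])
```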
Each run sends an HTML email containing:
- Number of rows ingested
- Current ingestion timestamp
- Delta version & version timestamp
- A preview table of appended rows
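Assembling and sending that summary needs only the Python standard library (`email.mime` and `smtplib`); all parameter names below are illustrative:

```python
import smtplib
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

def build_summary_email(sender, recipient, row_count, version,
                        version_ts, preview_html):
    """Build the HTML summary message for one ingestion run."""
    msg = MIMEMultipart("alternative")
    msg["Subject"] = f"Delta ingestion summary - version {version}"
    msg["From"] = sender
    msg["To"] = recipient
    body = f"""
    <h3>Ingestion Summary</h3>
    <p>Rows ingested: {row_count}</p>
    <p>Delta version: {version} ({version_ts})</p>
    {preview_html}
    """
    msg.attach(MIMEText(body, "html"))
    return msg

def send_email(msg, host, port, user, password):
    """Deliver the message over an authenticated TLS SMTP connection."""
    with smtplib.SMTP(host, port) as server:
        server.starttls()
        server.login(user, password)
        server.send_message(msg)
```

The `preview_html` argument would carry the appended rows rendered as an HTML table, e.g. via `DataFrame.toPandas().to_html()`.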
This notebook is scheduled via Databricks Jobs to run every 5 minutes, providing near real-time ingestion behavior.
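The 5-minute schedule can be expressed as a Quartz cron expression in a Databricks Jobs definition. A hypothetical payload for the Jobs API (notebook path and cluster id are placeholders):

```python
# Hypothetical Databricks Jobs API payload; only the schedule block matters here.
job_config = {
    "name": "delta-ingestion-every-5-min",
    "tasks": [
        {
            "task_key": "ingest",
            "notebook_task": {"notebook_path": "/notebooks/real_time_pipeline"},
            "existing_cluster_id": "<cluster-id>",
        }
    ],
    "schedule": {
        # Quartz cron: fire at minute 0, 5, 10, ... of every hour
        "quartz_cron_expression": "0 0/5 * * * ?",
        "timezone_id": "UTC",
        "pause_status": "UNPAUSED",
    },
}
```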
Generate Fake Data → Add Timestamp → Append to Delta → Track Version Changes → Convert to HTML → Send Email → Scheduled Every 5 Minutes
| Component | Technology |
|---|---|
| Compute | Azure Databricks |
| Storage | ADLS Gen2 |
| Format | Delta Lake |
| Generator | Faker |
| Notifications | SMTP Email |
| Scheduling | Databricks Jobs |
| Language | PySpark (+ Python) |
```
.
├── notebooks/
│   └── real_time_pipeline/
│       └── delta-table-ingestion
├── data/
│   └── fake_data_table/
├── job_scheduling_proof/
│   ├── job_scheduling_screenshots
│   └── email_summary_screenshots
└── README.md
```