This project implements a complete data pipeline for analyzing road accidents in France using Databricks. The pipeline covers the entire process from data ingestion, transformation, and loading (ETL) to generating analytical reports and dashboards. It leverages Databricks' unified analytics platform to process large datasets efficiently and provide insights into road safety trends.
The project is structured as a multi-module application with separate components for data ingestion, ETL processing, and analytics, following best practices for scalable data engineering.
- Data Ingestion: Automated ingestion of road accident data from various sources including CSV files for accident characteristics, sites, vehicles, victims, and geographic references.
- ETL Pipeline: Comprehensive data transformation and cleaning processes to prepare data for analysis.
- Analytics and Reporting: Advanced analytics on accident risks, patterns, and trends with interactive dashboards.
- Databricks Integration: Full utilization of Databricks features including Delta Lake, MLflow, and Databricks Jobs.
- Modular Architecture: Organized into ingestion, ETL, and analytics modules for maintainability and scalability.
- Testing: Comprehensive test suite using pytest for ensuring data quality and pipeline reliability.
```
analysis_road_accidents_databricks/
├── analysis_road_accidents_ingestion/   # Data ingestion module
│   ├── src/
│   │   ├── accidents_ingestion/
│   │   └── references_ingestion/
│   ├── resources/                       # Databricks job configurations
│   ├── tests/
│   └── databricks.yml
├── analysis_road_accidents_etl/         # ETL processing module
│   ├── src/
│   │   ├── dataprep/                    # Data preparation transformations
│   │   └── analytics/                   # Analytics transformations
│   ├── resources/                       # Pipeline configurations
│   ├── tests/
│   └── databricks.yml
├── databricks_common_config/            # Shared configurations
├── data/                                # Raw data files
│   ├── accidents routes/                # Accident data CSVs
│   └── referentiel-geographique/        # Geographic reference data
├── reports/                             # Dashboard and report files
├── INSTALL.md                           # Installation instructions
├── Setup for Road Accident Analysis Project.ipynb  # Setup notebook
└── README.md                            # This file
```
- Databricks workspace with appropriate permissions
- Python 3.8+
- Databricks CLI
- Access to Azure Blob Storage or equivalent for data storage (if applicable)
- Git for version control
- Clone the repository:

  ```sh
  git clone <repository-url>
  cd analysis_road_accidents_databricks
  ```
- Configure Databricks:
  - Set up your Databricks workspace
  - Configure authentication (personal access token or Azure AD)
  - Update `databricks_common_config/targets.yml` with your workspace details
- Install dependencies. For each module (ingestion, etl), navigate to the directory and install:

  ```sh
  cd analysis_road_accidents_ingestion
  pip install -e .
  cd ../analysis_road_accidents_etl
  pip install -e .
  ```
- Data setup:
  - Place raw data files in the `data/` directory
  - Ensure data follows the expected schema (refer to schemas in `analysis_road_accidents_etl/src/analytics/schemas/`)
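As an illustration of the kind of header check worth running before deploying the pipeline (the authoritative schemas live in `analysis_road_accidents_etl/src/analytics/schemas/`; the column names and `;` delimiter below are assumptions based on the public French dataset, not taken from this project):

```python
import csv

def csv_matches_schema(path, expected_columns, delimiter=";"):
    """Check that a raw CSV's header contains the expected columns.

    French open-data CSVs commonly use ';' as the delimiter; adjust if
    your files differ. Returns (ok, list_of_missing_columns).
    """
    with open(path, newline="", encoding="utf-8") as f:
        header = next(csv.reader(f, delimiter=delimiter))
    missing = [c for c in expected_columns if c not in header]
    return (len(missing) == 0, missing)

# Hypothetical key columns for the accident-characteristics file
EXPECTED_CHARACTERISTICS = ["Num_Acc", "jour", "mois", "an", "dep", "com"]
```

Running this against each file under `data/` surfaces schema drift before the Databricks jobs fail on it.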
- Run the setup notebook:
  - Open `Setup for Road Accident Analysis Project.ipynb` in Databricks
  - Execute the notebook to initialize the environment and create the necessary databases and tables
For detailed installation steps, refer to `INSTALL.md`.
- Data ingestion:
  - Deploy and run the ingestion jobs using Databricks Jobs
  - Jobs are configured in `analysis_road_accidents_ingestion/resources/`
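The ingestion jobs themselves run on Databricks with Spark readers, but the core read step can be sketched with the standard library (the `;` delimiter and encoding are assumptions about the raw files, not taken from the job configurations):

```python
import csv

def read_accident_csv(path, delimiter=";", encoding="utf-8"):
    """Yield one dict per row from a raw accident CSV.

    A plain-Python stand-in for the Spark CSV reader the actual
    Databricks ingestion jobs would use.
    """
    with open(path, newline="", encoding=encoding) as f:
        for row in csv.DictReader(f, delimiter=delimiter):
            yield row
```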
- ETL processing:
  - Execute the ETL pipelines defined in `analysis_road_accidents_etl/resources/`
  - This includes data cleaning, transformation, and loading into Delta tables
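A cleaning step of the kind found in `dataprep/` can be sketched in plain Python (the real transformations run on Spark; the field names `Num_Acc`, `dep`, `jour`/`mois`/`an` mirror the public dataset and should be checked against the project's schemas):

```python
def clean_accident_row(row):
    """Illustrative cleaning pass over one raw accident record.

    - Drops rows without an accident identifier.
    - Zero-pads department codes ('1' -> '01').
    - Builds an ISO date from the jour/mois/an fields, or None if
      any of them is missing or non-numeric.
    """
    if not row.get("Num_Acc"):
        return None
    cleaned = dict(row)
    cleaned["dep"] = row.get("dep", "").strip().zfill(2)
    try:
        cleaned["date"] = "{:04d}-{:02d}-{:02d}".format(
            int(row["an"]), int(row["mois"]), int(row["jour"]))
    except (KeyError, ValueError):
        cleaned["date"] = None
    return cleaned
```

In the actual pipeline the equivalent logic would be expressed as Spark column expressions before the write into Delta tables.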
- Analytics and reporting:
  - Run analytics transformations in `analysis_road_accidents_etl/src/analytics/`
  - Access reports and dashboards in the `reports/` directory
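The shape of such an analytics transformation, sketched with stdlib Python rather than Spark (grouping by the department column `dep` is an assumption about the cleaned data):

```python
from collections import Counter

def accidents_per_department(rows):
    """Count accidents by department code.

    Illustrates the grouping-and-counting pattern the real analytics
    transformations would express as a Spark aggregation.
    """
    return Counter(row["dep"] for row in rows if row.get("dep"))
```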
- Accident Characteristics: Processing of accident metadata
- Accident Sites: Geographic and environmental factors
- Vehicles: Vehicle-related accident data
- Victims: Casualty and injury information
- Geographic Reference: Location mapping and reference data
```sh
# From each module directory
pytest tests/
```

The project uses official French road accident data, including:
- Accident characteristics (caractéristiques)
- Accident locations (lieux)
- Vehicle information (véhicules)
- Victim details (usagers)
- Geographic reference system
Data is sourced from French government databases and covers multiple years.
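In the public dataset the four accident tables share an accident identifier (`Num_Acc`) that joins them; a rough sketch of how the sources might be keyed in code (the structure below is illustrative, not taken from this project):

```python
# French dataset name -> (English meaning, shared join key)
SOURCE_TABLES = {
    "caracteristiques": ("accident characteristics", "Num_Acc"),
    "lieux": ("accident locations", "Num_Acc"),
    "vehicules": ("vehicle information", "Num_Acc"),
    "usagers": ("victim details", "Num_Acc"),
}
```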
- Fork the repository
- Create a feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Based on Udemy course: Databricks Data Engineer Associate Certification
- Data provided by French government road safety authorities
- Built using Databricks platform and Apache Spark