This project implements a complete data pipeline for analyzing road accidents in France using Databricks. The pipeline covers the entire process from data ingestion, transformation, and loading (ETL) to generating analytical reports and dashboards. It leverages Databricks' unified analytics platform to process large datasets efficiently and provide insights into road safety trends.
The project is structured as a multi-module application with separate components for data ingestion, ETL processing, and analytics, following best practices for scalable data engineering.
- Data Ingestion: Automated ingestion of road accident data from various sources including CSV files for accident characteristics, sites, vehicles, victims, and geographic references.
- ETL Pipeline: Comprehensive data transformation and cleaning processes to prepare data for analysis.
- Analytics and Reporting: Advanced analytics on accident risks, patterns, and trends with interactive dashboards.
- Databricks Integration: Full utilization of Databricks features including Delta Lake, MLflow, and Databricks Jobs.
- Modular Architecture: Organized into ingestion, ETL, and analytics modules for maintainability and scalability.
- Testing: Comprehensive test suite using pytest for ensuring data quality and pipeline reliability.
```
analysis_road_accidents_databricks/
├── analysis_road_accidents_ingestion/   # Data ingestion module
│   ├── src/
│   │   ├── accidents_ingestion/
│   │   └── references_ingestion/
│   ├── resources/                       # Databricks job configurations
│   ├── tests/
│   └── databricks.yml
├── analysis_road_accidents_etl/         # ETL processing module
│   ├── src/
│   │   ├── dataprep/                    # Data preparation transformations
│   │   └── analytics/                   # Analytics transformations
│   ├── resources/                       # Pipeline configurations
│   ├── tests/
│   └── databricks.yml
├── databricks_common_config/            # Shared configurations
├── data/                                # Raw data files
│   ├── accidents routes/                # Accident data CSVs
│   └── referentiel-geographique/        # Geographic reference data
├── reports/                             # Dashboard and report files
├── INSTALL.md                           # Installation instructions
├── Setup for Road Accident Analysis Project.ipynb  # Setup notebook
└── README.md                            # This file
```
- Databricks workspace with appropriate permissions
- Python 3.8+
- Databricks CLI
- Access to Azure Blob Storage or equivalent for data storage (if applicable)
- Git for version control
- Clone the repository:

  ```sh
  git clone <repository-url>
  cd analysis_road_accidents_databricks
  ```
- Configure Databricks:
  - Set up your Databricks workspace
  - Configure authentication (personal access token or Azure AD)
  - Update `databricks_common_config/targets.yml` with your workspace details
- Install dependencies. For each module (ingestion, etl), navigate to the directory and install:

  ```sh
  cd analysis_road_accidents_ingestion
  pip install -e .
  cd ../analysis_road_accidents_etl
  pip install -e .
  ```
- Data setup:
  - Place raw data files in the `data/` directory
  - Ensure data follows the expected schema (refer to schemas in `analysis_road_accidents_etl/src/analytics/schemas/`)
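As an illustration of the kind of header check worth running before deploying the pipeline (the authoritative schemas live in `analysis_road_accidents_etl/src/analytics/schemas/`; the column names and `;` delimiter below are assumptions based on the public French dataset, not taken from this project):

```python
import csv

def csv_matches_schema(path, expected_columns, delimiter=";"):
    """Check that a raw CSV's header contains the expected columns.

    French open-data CSVs commonly use ';' as the delimiter; adjust if
    your files differ. Returns (ok, list_of_missing_columns).
    """
    with open(path, newline="", encoding="utf-8") as f:
        header = next(csv.reader(f, delimiter=delimiter))
    missing = [c for c in expected_columns if c not in header]
    return (len(missing) == 0, missing)

# Hypothetical key columns for the accident-characteristics file
EXPECTED_CHARACTERISTICS = ["Num_Acc", "jour", "mois", "an", "dep", "com"]
```

Running this against each file under `data/` surfaces schema drift before the Databricks jobs fail on it.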
- Run the setup notebook:
  - Open `Setup for Road Accident Analysis Project.ipynb` in Databricks
  - Execute the notebook to initialize the environment and create the necessary databases and tables
For detailed installation steps, refer to `INSTALL.md`.
- Data ingestion:
  - Deploy and run the ingestion jobs using Databricks Jobs
  - Jobs are configured in `analysis_road_accidents_ingestion/resources/`
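The ingestion jobs themselves run on Databricks with Spark readers, but the core read step can be sketched with the standard library (the `;` delimiter and encoding are assumptions about the raw files, not taken from the job configurations):

```python
import csv

def read_accident_csv(path, delimiter=";", encoding="utf-8"):
    """Yield one dict per row from a raw accident CSV.

    A plain-Python stand-in for the Spark CSV reader the actual
    Databricks ingestion jobs would use.
    """
    with open(path, newline="", encoding=encoding) as f:
        for row in csv.DictReader(f, delimiter=delimiter):
            yield row
```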
- ETL processing:
  - Execute the ETL pipelines defined in `analysis_road_accidents_etl/resources/`
  - This includes data cleaning, transformation, and loading into Delta tables
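A cleaning step of the kind found in `dataprep/` can be sketched in plain Python (the real transformations run on Spark; the field names `Num_Acc`, `dep`, `jour`/`mois`/`an` mirror the public dataset and should be checked against the project's schemas):

```python
def clean_accident_row(row):
    """Illustrative cleaning pass over one raw accident record.

    - Drops rows without an accident identifier.
    - Zero-pads department codes ('1' -> '01').
    - Builds an ISO date from the jour/mois/an fields, or None if
      any of them is missing or non-numeric.
    """
    if not row.get("Num_Acc"):
        return None
    cleaned = dict(row)
    cleaned["dep"] = row.get("dep", "").strip().zfill(2)
    try:
        cleaned["date"] = "{:04d}-{:02d}-{:02d}".format(
            int(row["an"]), int(row["mois"]), int(row["jour"]))
    except (KeyError, ValueError):
        cleaned["date"] = None
    return cleaned
```

In the actual pipeline the equivalent logic would be expressed as Spark column expressions before the write into Delta tables.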
- Analytics and reporting:
  - Run analytics transformations in `analysis_road_accidents_etl/src/analytics/`
  - Access reports and dashboards in the `reports/` directory
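The shape of such an analytics transformation, sketched with stdlib Python rather than Spark (grouping by the department column `dep` is an assumption about the cleaned data):

```python
from collections import Counter

def accidents_per_department(rows):
    """Count accidents by department code.

    Illustrates the grouping-and-counting pattern the real analytics
    transformations would express as a Spark aggregation.
    """
    return Counter(row["dep"] for row in rows if row.get("dep"))
```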
- Accident Characteristics: Processing of accident metadata
- Accident Sites: Geographic and environmental factors
- Vehicles: Vehicle-related accident data
- Victims: Casualty and injury information
- Geographic Reference: Location mapping and reference data
```sh
# From each module directory
pytest tests/
```

The project uses official French road accident data, including:
- Accident characteristics (caractéristiques)
- Accident locations (lieux)
- Vehicle information (véhicules)
- Victim details (usagers)
- Geographic reference system
Data is sourced from French government databases and covers multiple years.
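In the public dataset the four accident tables share an accident identifier (`Num_Acc`) that joins them; a rough sketch of how the sources might be keyed in code (the structure below is illustrative, not taken from this project):

```python
# French dataset name -> (English meaning, shared join key)
SOURCE_TABLES = {
    "caracteristiques": ("accident characteristics", "Num_Acc"),
    "lieux": ("accident locations", "Num_Acc"),
    "vehicules": ("vehicle information", "Num_Acc"),
    "usagers": ("victim details", "Num_Acc"),
}
```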
- Fork the repository
- Create a feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Based on Udemy course: Databricks Data Engineer Associate Certification
- Data provided by French government road safety authorities
- Built using Databricks platform and Apache Spark