Architecture Review: Handling Schema Evolution in a Netflix ETL Pipeline #1
Hello community! I just released v1.0 of my Netflix Data Pipeline, and I'm looking for feedback on the architectural decisions I made.

**Project Context:**

🤔 **Specific Questions for you:**

Repo for context: https://github.com/Thiago-code-lab/data-engineering-netflix

Thanks in advance for any code reviews or architectural tips! I'm trying to adhere to clean code principles as much as possible.
Replies: 1 comment
Great project! Here are some simplified insights for your architecture:
Data Quality: Definitely move to a Dead Letter Queue (DLQ). Dropping rows makes you lose visibility into why data is failing. Storing them separately allows for later auditing and reprocessing.
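To make the DLQ idea concrete, here is a minimal sketch in plain Python: failed rows are captured together with the reason they failed instead of being dropped. The function and field names (`transform_row`, `run_etl`, `duration`) are illustrative, not taken from the repo.

```python
# Dead Letter Queue pattern for a batch ETL step: bad rows are stored
# with their failure reason for later auditing and reprocessing.

def transform_row(row):
    # Hypothetical transform: 'duration' must parse as an integer.
    return {"title": row["title"], "duration": int(row["duration"])}

def run_etl(rows):
    clean, dead_letter = [], []
    for row in rows:
        try:
            clean.append(transform_row(row))
        except (KeyError, ValueError, TypeError) as exc:
            # Keep the raw row plus the error so you retain visibility
            # into WHY the data failed, not just that it did.
            dead_letter.append({"row": row, "error": repr(exc)})
    return clean, dead_letter

rows = [
    {"title": "Dark", "duration": "60"},
    {"title": "Broken", "duration": "N/A"},  # lands in the DLQ
]
clean, dlq = run_etl(rows)
```

In a real pipeline the `dead_letter` list would be written to its own table or object-store prefix rather than kept in memory.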
Orchestration: For <10k rows, Airflow is likely overkill. Stick to your script or use Prefect/Dagster if you need better UI and retry logic without the heavy infrastructure of Airflow.
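For scale this small, the retry behavior that Prefect/Dagster provide can be approximated with a plain decorator; this is a hedged sketch (all names are illustrative) of what "retry logic without the heavy infrastructure" can look like in a standalone script:

```python
# Minimal retry-with-backoff decorator for a small ETL script.
import functools
import time

def retry(times=3, delay=0.0):
    """Retry a function up to `times` attempts with linear backoff."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            last_exc = None
            for attempt in range(1, times + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception as exc:
                    last_exc = exc
                    time.sleep(delay * attempt)  # back off between tries
            raise last_exc
        return wrapper
    return decorator

calls = []

@retry(times=3)
def flaky_extract():
    # Simulates a source that fails twice before succeeding.
    calls.append(1)
    if len(calls) < 3:
        raise ConnectionError("transient failure")
    return "ok"

result = flaky_extract()
```

Once you also need scheduling, a UI, and observability across multiple flows, that is the point where graduating to Prefect or Dagster pays off.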
Docker Networking: Ensure your DB container is not exposing ports to the public internet (remove the ports mapping in docker-compose if only the app needs access). Use a private Docker network and environment variables (secret files) for credentials.
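A hedged sketch of that layout in docker-compose (service, network, and secret names are illustrative, not from the repo): the database has no `ports:` mapping, so it is only reachable by the app over a private network, and the password is injected from a secret file rather than hard-coded.

```yaml
services:
  app:
    build: .
    environment:
      DB_PASSWORD_FILE: /run/secrets/db_password
    secrets:
      - db_password
    networks:
      - backend
  db:
    image: postgres:16
    # No "ports:" mapping -> not reachable from the host or the internet.
    environment:
      # The official postgres image reads the password from this file.
      POSTGRES_PASSWORD_FILE: /run/secrets/db_password
    secrets:
      - db_password
    networks:
      - backend
networks:
  backend:
    driver: bridge
secrets:
  db_password:
    file: ./db_password.txt
```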