Architecture Review: Handling Schema Evolution in a Netflix ETL Pipeline #1
Hello community! I just released v1.0 of my Netflix Data Pipeline, and I'm looking for feedback on the architectural decisions I made.

**Project Context:**

🤔 **Specific Questions for you:**

Repo for context: https://github.com/Thiago-code-lab/data-engineering-netflix

Thanks in advance for any code reviews or architectural tips! I'm trying to adhere to clean code principles as much as possible.
Replies: 1 comment
Great project! Here are some simplified insights for your architecture:
Data Quality: Definitely move to a Dead Letter Queue (DLQ). Dropping rows makes you lose visibility into why data is failing. Storing them separately allows for later auditing and reprocessing.
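To make the DLQ idea concrete, here is a minimal sketch in plain Python: failed rows are captured together with the reason they failed instead of being dropped. The function and field names (`transform_row`, `run_etl`, `duration`) are illustrative, not taken from the repo.

```python
# Dead Letter Queue pattern for a batch ETL step: bad rows are stored
# with their failure reason for later auditing and reprocessing.

def transform_row(row):
    # Hypothetical transform: 'duration' must parse as an integer.
    return {"title": row["title"], "duration": int(row["duration"])}

def run_etl(rows):
    clean, dead_letter = [], []
    for row in rows:
        try:
            clean.append(transform_row(row))
        except (KeyError, ValueError, TypeError) as exc:
            # Keep the raw row plus the error so you retain visibility
            # into WHY the data failed, not just that it did.
            dead_letter.append({"row": row, "error": repr(exc)})
    return clean, dead_letter

rows = [
    {"title": "Dark", "duration": "60"},
    {"title": "Broken", "duration": "N/A"},  # lands in the DLQ
]
clean, dlq = run_etl(rows)
```

In a real pipeline the `dead_letter` list would be written to its own table or object-store prefix rather than kept in memory.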
Orchestration: For <10k rows, Airflow is likely overkill. Stick to your script or use Prefect/Dagster if you need better UI and retry logic without the heavy infrastructure of Airflow.
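For scale this small, the retry behavior that Prefect/Dagster provide can be approximated with a plain decorator; this is a hedged sketch (all names are illustrative) of what "retry logic without the heavy infrastructure" can look like in a standalone script:

```python
# Minimal retry-with-backoff decorator for a small ETL script.
import functools
import time

def retry(times=3, delay=0.0):
    """Retry a function up to `times` attempts with linear backoff."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            last_exc = None
            for attempt in range(1, times + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception as exc:
                    last_exc = exc
                    time.sleep(delay * attempt)  # back off between tries
            raise last_exc
        return wrapper
    return decorator

calls = []

@retry(times=3)
def flaky_extract():
    # Simulates a source that fails twice before succeeding.
    calls.append(1)
    if len(calls) < 3:
        raise ConnectionError("transient failure")
    return "ok"

result = flaky_extract()
```

Once you also need scheduling, a UI, and observability across multiple flows, that is the point where graduating to Prefect or Dagster pays off.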
Docker Networking: Ensure your DB container is not exposing ports to the public internet (remove the ports mapping in docker-compose if only the app needs access). Use a private Docker network and environment variables (secret files) for credentials.
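A hedged sketch of that layout in docker-compose (service, network, and secret names are illustrative, not from the repo): the database has no `ports:` mapping, so it is only reachable by the app over a private network, and the password is injected from a secret file rather than hard-coded.

```yaml
services:
  app:
    build: .
    environment:
      DB_PASSWORD_FILE: /run/secrets/db_password
    secrets:
      - db_password
    networks:
      - backend
  db:
    image: postgres:16
    # No "ports:" mapping -> not reachable from the host or the internet.
    environment:
      # The official postgres image reads the password from this file.
      POSTGRES_PASSWORD_FILE: /run/secrets/db_password
    secrets:
      - db_password
    networks:
      - backend
networks:
  backend:
    driver: bridge
secrets:
  db_password:
    file: ./db_password.txt
```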