This repository provides an end-to-end simulation of 10 common Apache Spark problems and their fixes. It includes sample code demonstrating typical errors such as Task Serialization errors, Data Skew in Joins, OutOfMemory issues, etc. along with the practical fixes. The project is intended as a learning tool to help Spark developers understand and resolve common performance and configuration issues in Spark applications.
- simulations/: Contains individual PySpark scripts for each of the 10 common Spark issues. Each script includes:
- The simulation code that triggers a common Spark problem.
- The corresponding fix to resolve the issue.
- docker-compose.yml: A Docker Compose configuration to set up a simple Spark cluster (one master and two workers) for testing the simulations.
- README.md: This documentation file.
- Docker & Docker Compose: For running the Spark cluster.
- Python 3.x and PySpark: For executing the simulation scripts locally or within the Docker environment.
- [Optional] Java and Scala: Required dependencies for Spark.
-
Clone the repository:
git clone https://github.com/agungatd/Spark-issue-simulation.git cd Spark-issue-simulation -
Start the Spark cluster using Docker Compose:
docker-compose --profile spark up -d
- The Spark master UI will be available at http://localhost:9090.
-
(Optional) Start HDFS for some issue
docker-compose --profile hdfs up -d
- The Hadoop UI will be available at http://localhost:9870.
Then, pre-create the directory and set proper ownership and permissions. For instance:
docker exec namenode hdfs dfs -mkdir -p /user/spark/checkpoints docker exec namenode hdfs dfs -chown spark:supergroup /user/spark/checkpoints docker exec namenode hdfs dfs -chmod 777 /user/spark/checkpoints
Then update your Spark configuration to use this directory.
spark.sparkContext.setCheckpointDir("hdfs://<namenode>:<port>/user/spark/checkpoints")
(Replace
<namenode>:<port>with your actual namenode host and port.)Often, Spark jobs write temporary files to the user’s home directory on HDFS. Make sure that the HDFS home directory for the spark user exists and has proper permissions:
docker exec namenode hdfs dfs -mkdir -p /user/spark docker exec namenode hdfs dfs -chown spark:supergroup /user/spark
Each simulation script can be submitted to the Spark cluster. For example, to run the Task Serialization Error simulation from your local, use:
docker exec spark-master /opt/spark/bin/spark-submit --master spark://spark-master:7077 simulations/1_data_skew.pyReplace 1_data_skew.py with the appropriate simulation file name. Ensure that the Spark master URL matches your Docker Compose configuration.
The repository covers the following simulations:
-
Data Skew
Problem: Uneven data distribution across partitions, causing some executors to process significantly more data than others.
Solution: Use key salting to distribute skewed data more evenly across partitions. -
Out Of Memory Error
Problem: Occurs when too much data is collected to the driver, leading to crashes.
Solution:- Use sampling for analysis instead of collecting full datasets.
- Apply aggregations and transformations at the executor level.
-
Complex Join Issues
Problem: Unintentional cross joins cause exponential data growth.
Solution: Ensure correct join conditions and leverage broadcast joins where applicable. -
Excessive Shuffling Issues
Problem: Improper shuffle partitioning degrades performance.
Solution: Dynamically adjust shuffle partitions based on data size, ensuring each partition is 100MB-200MB. -
UDF Performance Issues
Problem: Python UDFs introduce serialization overhead, reducing performance.
Solution: Replace UDFs with native Spark SQL functions or use Pandas UDFs where needed. -
Null Values Handling
Problem: Null values can cause incorrect aggregations and filtering inefficiencies.
Solution: Usefillna(),dropna(), orcoalesce()to handle missing values efficiently. -
Job Lineage Bloating
Problem: Excessive transformations and shuffling create long DAG lineage, increasing memory usage.
Solution: Persist or checkpoint intermediate results to break lineage and improve performance. -
Spark Streaming Issues
Problem: Recomputing transformations multiple times leads to inefficiency.
Solution: Cache intermediate results, optimize microbatch intervals, and use watermarking for stateful processing. -
Broadcast Variable Misuse
Problem: Broadcasting large datasets can overwhelm memory instead of improving performance.
Solution: Only broadcast small lookup datasets and usebroadcast()efficiently. -
Too Many Small Files Leading to Performance Overhead
Problem: Excessive small files generate high task overhead and I/O bottlenecks.
Solution: Reduce partitions usingcoalesce()or write output with optimized partitioning strategies. -
[Placeholder for future issue]
Problem: [Describe the problem concisely].
Solution: [Provide a recommended solution].
...and many more to come!
Contributions are welcome! If you have improvements, additional simulations, or suggestions, please open an issue or submit a pull request.
This project is licensed under the MIT License. See the LICENSE file for more details.
For any questions or support, please contact agungatidhira@gmail.com.