Spark Issue Simulation

Overview

This repository provides an end-to-end simulation of 10 common Apache Spark problems and their fixes. It includes sample code demonstrating typical errors such as Task Serialization errors, Data Skew in Joins, OutOfMemory issues, etc. along with the practical fixes. The project is intended as a learning tool to help Spark developers understand and resolve common performance and configuration issues in Spark applications.

Repository Structure

simulations/: Contains individual PySpark scripts for each of the 10 common Spark issues. Each script includes:
- The simulation code that triggers a common Spark problem.
- The corresponding fix to resolve the issue.
docker-compose.yml: A Docker Compose configuration to set up a simple Spark cluster (one master and two workers) for testing the simulations.
README.md: This documentation file.

Requirements

Docker & Docker Compose: For running the Spark cluster.
Python 3.x and PySpark: For executing the simulation scripts locally or within the Docker environment.
[Optional] Java and Scala: Required dependencies for Spark.

Setup and Running the Docker Cluster

Clone the repository:

git clone https://github.com/agungatd/Spark-issue-simulation.git
cd Spark-issue-simulation

Start the Spark cluster using Docker Compose:
```
docker-compose --profile spark up -d
```
- The Spark master UI will be available at http://localhost:9090.
(Optional) Start HDFS for some issue
```
docker-compose --profile hdfs up -d
```
- The Hadoop UI will be available at http://localhost:9870.
Then, pre-create the directory and set proper ownership and permissions. For instance:
```
docker exec namenode hdfs dfs -mkdir -p /user/spark/checkpoints
docker exec namenode hdfs dfs -chown spark:supergroup /user/spark/checkpoints
docker exec namenode hdfs dfs -chmod 777 /user/spark/checkpoints
```
Then update your Spark configuration to use this directory.
```
spark.sparkContext.setCheckpointDir("hdfs://<namenode>:<port>/user/spark/checkpoints")
```
(Replace <namenode>:<port> with your actual namenode host and port.)

Often, Spark jobs write temporary files to the user’s home directory on HDFS. Make sure that the HDFS home directory for the spark user exists and has proper permissions:
```
docker exec namenode hdfs dfs -mkdir -p /user/spark
docker exec namenode hdfs dfs -chown spark:supergroup /user/spark
```

Running the Simulations

Each simulation script can be submitted to the Spark cluster. For example, to run the Task Serialization Error simulation from your local, use:

docker exec spark-master /opt/spark/bin/spark-submit --master spark://spark-master:7077 simulations/1_data_skew.py

Replace 1_data_skew.py with the appropriate simulation file name. Ensure that the Spark master URL matches your Docker Compose configuration.

Simulation Examples

The repository covers the following simulations:

Data Skew
Problem: Uneven data distribution across partitions, causing some executors to process significantly more data than others.
Solution: Use key salting to distribute skewed data more evenly across partitions.
Out Of Memory Error
Problem: Occurs when too much data is collected to the driver, leading to crashes.
Solution:
- Use sampling for analysis instead of collecting full datasets.
- Apply aggregations and transformations at the executor level.
Complex Join Issues
Problem: Unintentional cross joins cause exponential data growth.
Solution: Ensure correct join conditions and leverage broadcast joins where applicable.
Excessive Shuffling Issues
Problem: Improper shuffle partitioning degrades performance.
Solution: Dynamically adjust shuffle partitions based on data size, ensuring each partition is 100MB-200MB.
UDF Performance Issues
Problem: Python UDFs introduce serialization overhead, reducing performance.
Solution: Replace UDFs with native Spark SQL functions or use Pandas UDFs where needed.
Null Values Handling
Problem: Null values can cause incorrect aggregations and filtering inefficiencies.
Solution: Use fillna(), dropna(), or coalesce() to handle missing values efficiently.
Job Lineage Bloating
Problem: Excessive transformations and shuffling create long DAG lineage, increasing memory usage.
Solution: Persist or checkpoint intermediate results to break lineage and improve performance.
Spark Streaming Issues
Problem: Recomputing transformations multiple times leads to inefficiency.
Solution: Cache intermediate results, optimize microbatch intervals, and use watermarking for stateful processing.
Broadcast Variable Misuse
Problem: Broadcasting large datasets can overwhelm memory instead of improving performance.
Solution: Only broadcast small lookup datasets and use broadcast() efficiently.
Too Many Small Files Leading to Performance Overhead
Problem: Excessive small files generate high task overhead and I/O bottlenecks.
Solution: Reduce partitions using coalesce() or write output with optimized partitioning strategies.
[Placeholder for future issue]
Problem: [Describe the problem concisely].
Solution: [Provide a recommended solution].

...and many more to come!

Contributing

Contributions are welcome! If you have improvements, additional simulations, or suggestions, please open an issue or submit a pull request.

License

This project is licensed under the MIT License. See the LICENSE file for more details.

Contact

For any questions or support, please contact agungatidhira@gmail.com.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Spark Issue Simulation

Overview

Repository Structure

Requirements

Setup and Running the Docker Cluster

Running the Simulations

Simulation Examples

Contributing

License

Contact

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 26 Commits
config		config
scripts		scripts
simulations		simulations
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
docker-compose.yml		docker-compose.yml
init-datanode.sh		init-datanode.sh
start-hdfs.sh		start-hdfs.sh

Folders and files

Latest commit

History

Repository files navigation

Spark Issue Simulation

Overview

Repository Structure

Requirements

Setup and Running the Docker Cluster

Running the Simulations

Simulation Examples

Contributing

License

Contact

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages