Conway's Game of Life is a cellular automaton developed by John Conway in 1970.
It is a zero-player game: a 2D grid of cells evolves over time from its initial state, with each cell updating based on the state of its eight neighbours, following four simple rules:
- A live cell with fewer than 2 live neighbours dies (underpopulation).
- A live cell with 2 or 3 live neighbours survives.
- A live cell with more than 3 live neighbours dies (overpopulation).
- A dead cell with exactly 3 live neighbours becomes alive (reproduction).
Despite the simplicity of the rules, the computation is anything but simple: large boards require billions of neighbour lookups and updates across many generations.
This creates a workload dominated by memory access, data dependencies, and repeated computation, which makes it a classic benchmark for high-performance and parallel computing.
Each cell's update depends only on its local neighbourhood, so the computation is suitable for SIMD vectorisation, multithreading, and distributed-memory parallelism. In this project, the goal is to explore how far performance can be pushed using progressively stronger forms of optimisation.
The baseline version is a single-threaded implementation designed to establish a performance reference. Even though no parallelism is used here, the implementation incorporates optimisation features that reduce overhead and improve cache reuse. This includes:
- Padding the board to remove boundary checks.
- Row-major sequential memory access.
- Removing function call overhead by inlining functions.
- Precomputing variables.
All benchmarks were performed under a fixed and controlled configuration to ensure repeatability:
- Board shape: square grids, which simplify scaling analysis.
- Initial density: 50% of cells alive, ensuring an active starting configuration.
- Random seed: `SEED = 42`, guaranteeing reproducibility across runs.
- Generations: 10 000, enough to highlight differences between implementations.

All benchmarks were compiled with the -O0 flag for consistent measurements, and each reported time is the average over 5 runs.
Here are the execution times of the baseline serial version across different board sizes.
| Board Size | Total Cells | Time (seconds) |
|---|---|---|
| 128 x 128 | 16,384 | 2.54 |
| 256 x 256 | 65,536 | 10.07 |
| 512 x 512 | 262,144 | 41.25 |
| 1024 x 1024 | 1,048,576 | 164.67 |
| 2048 x 2048 | 4,194,304 | 655.98 |
As the table shows, the runtime rises sharply with board size since doubling each dimension quadruples the number of cell updates. These numbers give a clean baseline for later optimisations.
The first optimisation applied to the baseline was AVX2 SIMD vectorisation. It processes 32 cells simultaneously using 256-bit __m256i registers, each holding 32 x 8-bit integers. Instead of summing each cell's 8 neighbours one at a time, the program performs 8 vectorised load-and-add operations, computing neighbour counts for 32 cells in parallel. The Game of Life rules are then applied without branching, using vectorised comparison masks and bitwise logic (AND / OR) to determine each cell's fate.
| Board Size | Time (seconds) | Speedup |
|---|---|---|
| 128 x 128 | 0.22 | x11.55 |
| 256 x 256 | 0.97 | x10.38 |
| 512 x 512 | 4.43 | x9.31 |
| 1024 x 1024 | 17.57 | x9.54 |
| 2048 x 2048 | 68.26 | x9.61 |
The largest speedups occur on the smaller boards because their working sets fit in the L1 and L2 caches; as the grid grows, data spills into the slower L3 cache and eventually RAM.
Amdahl’s Law helps explain why AVX cannot scale linearly with vector width. Even though the SIMD unit processes many cells at once, part of the update loop remains serial, and every iteration still requires nine separate vector loads. The performance becomes memory-bound rather than compute-bound, and thus, the achievable speedup is limited and cannot approach the theoretical maximum.
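Concretely, Amdahl's Law gives the speedup from accelerating a fraction f of the work by a factor k:

S(k) = 1 / ((1 - f) + f / k)

With vector width k = 32, even a generous f = 0.92 (illustrative, not a measured fraction for this code) yields S ≈ 9.2, much closer to the observed ~9-11x than to the theoretical 32x.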
OpenMP was first applied directly to the serial implementation, without any SIMD support, to evaluate pure multithreading performance. This parallelised the row-update loop so that different threads processed different sections of the board.
After establishing the multithreaded baseline, the threads were combined with the AVX2 kernel, allowing each thread to process a block of rows while AVX2 handled vectorised cell updates.
The table below shows the speedup relative to the serial baseline for OpenMP and a hybrid of OpenMP and AVX at different thread counts.
| Board Size | 4 Threads | 8 Threads | 16 Threads | AVX + 4 Threads | AVX + 8 Threads | AVX + 16 Threads |
|---|---|---|---|---|---|---|
| 128 x 128 | x5.47 | x6.68 | x10.41 | x23.74 | x32.79 | x24.42 |
| 256 x 256 | x3.67 | x7.05 | x12.80 | x30.61 | x53.85 | x59.24 |
| 512 x 512 | x3.68 | x7.13 | x13.52 | x33.27 | x62.88 | x100.61 |
| 1024 x 1024 | x3.64 | x7.19 | x14.06 | x34.04 | x65.58 | x106.17 |
| 2048 x 2048 | x3.70 | x7.17 | x13.67 | x34.07 | x66.37 | x128.67 |
Observations:
- Pure OpenMP shows only moderate speedups because the serial kernel is memory-bound, so extra threads do not increase bandwidth.
- AVX + OpenMP with 4 and 8 threads plateaus once the grid exceeds cache capacity, where memory latency dominates.
- AVX + 16 threads scales better at larger sizes because the workload is large enough to fully utilise both the SIMD units and all threads.
- Large grids favour the hybrid AVX + OpenMP approach, which keeps both vector lanes and CPU cores consistently busy.
These scaling patterns align with both Amdahl's and Gustafson's laws. The limited speedup of pure OpenMP and the plateau of AVX + 4/8 threads reflect Amdahl's Law: achievable performance is capped even as more threads are added. Once the memory system is saturated, additional threads provide little to no benefit.
In contrast, the strong scaling of AVX + 16 threads on larger grids follows Gustafson's Law: as the problem size grows, the proportion of parallel work increases while the serial fraction stays fixed. This lets the hybrid implementation utilise both the SIMD lanes and all CPU cores more effectively, explaining why the largest grids achieve the highest speedups.
The first MPI version used MPI_Allgather with 2 processes, which pushed the 2048^2 runtime to 16.616 s, almost 3 times slower than the hybrid with 16 threads. Since all ranks already maintained identical board states, the gather added unnecessary communication overhead, so it was removed. This improved performance but still left MPI slower than the hybrid version, likely because splitting the grid reduced cache locality and the workload became memory-bound. Suspecting that halo exchange might be the remaining bottleneck, I replaced the collective with non-blocking sends and receives for the halo rows. However, this version ran in 5.715 s and behaved exactly like the blocking version, suggesting that the interior-row computation finished before the halo transfers completed, leaving no opportunity for overlap. Since both ranks ran on the same node with very low latency, non-blocking communication provided no measurable advantage.
The cluster also allowed up to 3 MPI processes, and this configuration was tested, yielding noticeably better performance than 2 processes.
Below is the table comparing 2‑process and 3‑process MPI configurations (each using up to 8 threads per process):
| Board Size | 2 processes + hybrid 8 | 3 processes + hybrid 8 |
|---|---|---|
| 128 x 128 | x4.83 | x4.28 |
| 256 x 256 | x16.73 | x17.36 |
| 512 x 512 | x46.61 | x50.61 |
| 1024 x 1024 | x92.36 | x120.28 |
| 2048 x 2048 | x117.31 | x182.78 |
The cluster limited each MPI process to a maximum of 8 threads, which constrained the achievable performance. For testing, each rank printed its own section of the board, and a Python script stitched the segments together to confirm that the update rules were applied correctly.
For the smaller grids, the computation per process is too small for the extra rank to matter: MPI overhead dominates, communication costs are proportionally similar, and the two configurations achieve almost identical speedups.
For the 1024 and 2048 grids, the computation becomes large enough that splitting the board across 3 processes improves cache locality much more than 2 processes do. Each rank receives a smaller subgrid, which fits better into L2/L3 cache. In addition, with three ranks the halo data each process must exchange is smaller relative to its workload, so communication overhead is no longer the dominant bottleneck. As a result, 3 processes perform much better on large boards: the memory subsystem is used more efficiently while communication overhead remains small.
Overall, small and mid-sized grids run best on the 16 thread hybrid because it avoids communication and maximises the use of cache. For large grids, splitting the work across 3 MPI processes gives the best performance since each process gets a smaller, cache friendly dataset. The best method depends entirely on grid size. In future work, I would explore offloading the computation to the GPU using CUDA. It would shift the bottleneck away from CPU cache behaviour to the GPU, where threads can update cells concurrently.
The supporting Python scripts require:
- Python (>=3.8)
- NumPy
- SciPy
- Pillow (PIL Image)
To turn the PGM frames into a video animation:

```shell
ffmpeg -framerate 20 -i frame_%05d.pgm \
  -vf "scale=iw*6:ih*6:flags=neighbor" \
  -pix_fmt yuv420p gol.mp4
```