Getting Started • Documentation • Download • Features • Architecture • Contributing
A comprehensive, hardware-agnostic GPU benchmarking suite that compares CUDA, OpenCL, and DirectCompute performance using identical workloads. Built from scratch with professional architecture, extensive documentation, and a production-ready GUI.
🚀 Quick Start | 📖 Read the Docs | 💻 View Source | 🐛 Report Issues
- Overview
- Why This Project?
- Features
- Getting Started
- The Three APIs Explained
- The Four Benchmarks
- How It Works
- Architecture
- Project Structure
- Tools & Technologies
- Usage Guide
- Understanding Output
- Performance Expectations
- Build System
- Challenges Conquered
- Future Roadmap
- Documentation
- API References
- Contributing
- License
- Author
- Acknowledgments
GPU Benchmark Suite is a professional-grade, cross-API GPU performance testing application designed to:
- Compare GPU APIs fairly - Run identical algorithms on CUDA, OpenCL, and DirectCompute
- Measure real performance - Bandwidth (GB/s), throughput (GFLOPS), execution time
- Work on any GPU - NVIDIA, AMD, Intel - hardware-agnostic design
- Visualize results - Real-time graphs, historical tracking, CSV export
- Educate developers - Comprehensive documentation, clean code, learning resource
- ✅ Three GPU APIs - CUDA, OpenCL, DirectCompute in one application
- ✅ Four Benchmark Types - Memory, compute, mixed, synchronization workloads
- ✅ Professional GUI - ImGui-based interface with real-time visualization
- ✅ Hardware Agnostic - Same exe works on NVIDIA, AMD, Intel GPUs
- ✅ Verified Results - Every benchmark verified against CPU reference
- ✅ Extensive Documentation - 10,000+ lines of documentation and comments
- ✅ Production Ready - Professional branding, icon integration, error handling
- ✅ Open Source - Learn from working code, contribute improvements
Modern computing relies heavily on GPUs for:
- Machine Learning (TensorFlow, PyTorch)
- Scientific Simulation (weather, molecular dynamics)
- Image/Video Processing (Premiere, Blender)
- Data Analytics (RAPIDS, GPU databases)
- Cryptocurrency Mining
But how do you objectively measure GPU performance across different hardware and APIs?
A unified benchmarking tool that:
- Tests the same workload on CUDA, OpenCL, and DirectCompute
- Runs on any GPU - NVIDIA, AMD, Intel
- Provides real metrics - Not synthetic scores, actual GB/s and GFLOPS
- Verifies correctness - Fast wrong answers are useless
- Presents professionally - GUI application, graphs, CSV export
CUDA (70% market share)
- Industry standard for GPU compute
- Best performance, most mature
- NVIDIA-only but dominates professional computing
OpenCL (Cross-vendor)
- Works on NVIDIA, AMD, Intel, ARM
- Open standard (Khronos Group)
- Cross-platform portability
DirectCompute (Windows native)
- Part of DirectX, always available
- Game engine integration
- Zero dependencies on Windows
→ Detailed explanation: docs/WHY_THIS_PROJECT.md
Each benchmark tests a different aspect of GPU performance:
| Benchmark | Tests | Real-World Use |
|---|---|---|
| Vector Add | Memory Bandwidth | Data preprocessing, array operations |
| Matrix Mul | Compute Throughput | Neural networks (95% of AI compute) |
| Convolution | Mixed Workload | Image processing, CNNs |
| Reduction | Synchronization | Analytics, aggregation, statistics |
→ Detailed explanation: docs/WHY_THIS_PROJECT.md#why-these-4-benchmarks
- ImGui-based interface - Fast, responsive, modern design
- Real-time progress - Live progress bar during benchmarks
- Performance graphs - Line charts showing bandwidth/GFLOPS over time
- History tracking - Stores up to 100 test results with timestamps
- Test indexing - "Test 1", "Test 2", etc. with date/time
- CSV export - Save results for analysis in Excel/Python
- CUDA - Full implementation with 4 benchmarks
- OpenCL - Cross-vendor support (NVIDIA/AMD/Intel)
- DirectCompute - Native Windows GPU compute
- Runtime detection - Automatically detects available APIs
- Graceful degradation - Uses what's available, reports what's not
- Vector Addition - Pure memory bandwidth test
- Matrix Multiplication - Compute-intensive workload
- 2D Convolution - Image processing simulation
- Parallel Reduction - Inter-thread communication test
- GPU-side timing - Uses CUDA events, OpenCL profiling, D3D11 queries
- Warmup runs - Stabilizes GPU clocks before measurement
- Multiple iterations - Averages multiple runs for accuracy
- Result verification - Compares GPU output vs CPU reference
- Real-time graphs - See performance as tests run
- Historical data - Compare current run vs previous runs
- Multiple metrics - Bandwidth (GB/s), Throughput (GFLOPS), Time (ms)
- Color-coded - Different colors for each benchmark type
- Clean architecture - Design patterns (Strategy, Factory, Singleton, RAII)
- Extensive documentation - Every function explained
- CMake build system - Cross-platform build configuration
- Unit tests - 9 test executables validate components
- Error handling - Robust error checking designed to fail gracefully rather than crash
- Download: Get GPU-Benchmark-GUI.exe from build/Release/
- Run: Double-click the executable
- Select: Choose your GPU API (CUDA/OpenCL/DirectCompute)
- Benchmark: Click "Run Benchmark" and wait ~30 seconds
- Analyze: View results in graphs and table
→ Complete setup guide: docs/GETTING_STARTED.md
GPU-Benchmark/
└── build/
└── Release/
└── GPU-Benchmark-GUI.exe ← Run this!
Or use the launch script:
scripts\launch\RUN_GUI.cmd
Requirements:
- Windows 10/11 (64-bit)
- Visual Studio 2022
- CUDA Toolkit 12.x (for NVIDIA GPUs)
- CMake 3.18+
Build steps:
# 1. Clone repository
git clone https://github.com/davesohamm/GPU-Benchmark.git
cd GPU-Benchmark
# 2. Open Developer Command Prompt for VS 2022
# 3. Download ImGui
scripts\build\DOWNLOAD_IMGUI.cmd
# 4. Build project
scripts\build\BUILD.cmd
# 5. Run GUI
scripts\launch\RUN_GUI.cmd
→ Detailed build guide: docs/build-setup/BUILD_GUIDE.md
What is it?
- NVIDIA's proprietary GPU programming platform
- Industry standard (70%+ of professional GPU compute)
- Most mature ecosystem (cuDNN, cuBLAS, Thrust, etc.)
Strengths:
- ✅ Best performance (highly optimized drivers)
- ✅ Richest library ecosystem
- ✅ Excellent documentation and tools
- ✅ Tensor Core support (AI acceleration)
Limitations:
- ❌ NVIDIA GPUs only
- ❌ Vendor lock-in
Our Implementation:
- File: src/backends/cuda/CUDABackend.cpp
- Kernels: src/backends/cuda/kernels/*.cu
- Uses: CUDA Runtime API, cudaEvents for timing
- Optimizations: Shared memory, warp shuffles, coalescing
Code Example:
__global__ void vectorAddKernel(const float* a, const float* b, float* c, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < n) {
c[idx] = a[idx] + b[idx];
}
}
What is it?
- Open standard by Khronos Group (same org as Vulkan, OpenGL)
- Cross-vendor: NVIDIA, AMD, Intel, ARM, FPGAs
- Cross-platform: Windows, Linux, macOS, Android
Strengths:
- ✅ Hardware agnostic (works on any GPU)
- ✅ No vendor lock-in
- ✅ Heterogeneous computing (CPU+GPU+FPGA)
- ✅ Runtime compilation (optimize for specific hardware)
Limitations:
- ❌ More verbose API (more boilerplate)
- ❌ Slightly lower performance than native APIs
- ❌ Varies more across vendors
Our Implementation:
- File: src/backends/opencl/OpenCLBackend.cpp
- Kernels: Embedded as strings in source code
- Uses: OpenCL 3.0 API, cl_events for profiling
- Features: Runtime compilation, platform detection
Code Example:
__kernel void vectorAdd(
__global const float* a,
__global const float* b,
__global float* c,
int n)
{
int gid = get_global_id(0);
if (gid < n) {
c[gid] = a[gid] + b[gid];
}
}
What is it?
- Microsoft's GPU compute API (part of DirectX 11/12)
- Native to Windows, always available
- Uses HLSL (High-Level Shading Language)
Strengths:
- ✅ Zero dependencies (comes with Windows)
- ✅ Direct integration with graphics pipeline
- ✅ Used in game engines (Unity, Unreal, CryEngine)
- ✅ HLSL syntax familiar to graphics programmers
Limitations:
- ❌ Windows only
- ❌ Slightly lower performance than CUDA
- ❌ Less mature compute ecosystem
Our Implementation:
- File: src/backends/directcompute/DirectComputeBackend.cpp
- Shaders: src/backends/directcompute/shaders/*.hlsl
- Uses: DirectX 11 API, ID3D11Query for timing
- Features: Structured buffers, UAVs, constant buffers
Code Example:
[numthreads(256, 1, 1)]
void CSMain(uint3 DTid : SV_DispatchThreadID)
{
uint idx = DTid.x;
if (idx < size) {
output[idx] = input1[idx] + input2[idx];
}
}
→ Detailed API comparison: docs/WHY_THIS_PROJECT.md#why-these-3-apis
What it does:
C[i] = A[i] + B[i] for i = 0 to N-1
What it tests:
- Primary: Memory bandwidth (how fast data moves)
- Secondary: Memory coalescing efficiency
Why it matters:
- Simplest GPU operation - great for learning
- Memory-bound workload (limited by DRAM speed, not compute)
- Reveals peak memory bandwidth of your GPU
Real-world applications:
- Data preprocessing in ML pipelines
- Array operations (NumPy/MATLAB equivalents)
- Financial calculations (portfolio evaluation)
Performance metrics:
- Bandwidth (GB/s): Main metric
- Efficiency: % of theoretical peak bandwidth
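The bandwidth metric above follows directly from the bytes moved: each element costs two reads and one write. A minimal sketch of the arithmetic (hypothetical helper, not the project's actual code):

// Hedged sketch: how the GB/s metric for vector add is derived.
// Each element moves 3 floats (read a, read b, write c) = 12 bytes.
double vectorAddBandwidthGBps(size_t n, double ms) {
    const double bytesMoved = 3.0 * n * sizeof(float); // 2 reads + 1 write
    return bytesMoved / (ms * 1e-3) / 1e9;             // bytes/s -> GB/s
}
// Example: 50M elements in 3.3 ms -> 0.6 GB / 0.0033 s ≈ 182 GB/s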
Expected performance (RTX 3050):
- Theoretical: 224 GB/s (GDDR6 spec)
- Achieved: ~180 GB/s (80% efficiency - good!)
→ Kernel implementation: src/backends/cuda/kernels/vector_add.cu
What it does:
C[m][n] = Σ A[m][k] * B[k][n] for k = 0 to K-1
What it tests:
- Primary: Compute throughput (GFLOPS)
- Secondary: Memory hierarchy efficiency (cache usage)
Why it matters:
- Most important operation in AI/ML (95% of deep learning)
- Compute-intensive (billions of floating-point ops)
- Showcases optimization techniques (naive → optimized)
Real-world applications:
- Deep Learning: Every neural network layer
- 3D Graphics: Transformation matrices
- Scientific Computing: Linear algebra, PDE solvers
- Signal Processing: Filter banks, FFT
Optimization levels:
- Naive (~100 GFLOPS) - Global memory only
- Tiled (~500 GFLOPS) - Shared memory optimization (sketched below)
- Optimized (~1000 GFLOPS) - Register blocking + vectorization
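To make the tiled rung concrete, here is a minimal shared-memory matmul sketch. It assumes square N x N matrices with N a multiple of TILE; the project's real kernel (src/backends/cuda/kernels/matrix_mul.cu) is the authoritative version:

#define TILE 16
// Hedged sketch of the "tiled" level: each block stages TILE x TILE
// sub-matrices of A and B in shared memory, cutting global memory
// traffic by a factor of TILE. Assumes N is a multiple of TILE.
__global__ void matMulTiled(const float* A, const float* B, float* C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < N / TILE; ++t) {
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();                       // tile fully loaded
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                       // done with this tile
    }
    C[row * N + col] = acc;
}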
Performance metrics:
- GFLOPS: Main metric (billions of FLOPs/sec)
- Efficiency: % of theoretical peak compute
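The GFLOPS number comes from the operation count: each of the M·N outputs needs K multiplies and K adds. A minimal sketch of the arithmetic (hypothetical helper):

// Hedged sketch: GFLOPS for an M x N x K matrix multiply.
double matmulGflops(size_t M, size_t N, size_t K, double ms) {
    const double flops = 2.0 * M * N * K;  // K multiplies + K adds per output
    return flops / (ms * 1e-3) / 1e9;
}
// Example: M = N = K = 2048 in 17 ms -> ~17.2 GFLOP / 0.017 s ≈ 1010 GFLOPS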
Expected performance (RTX 3050):
- Theoretical: 9.1 TFLOPS (FP32)
- Achieved: ~1-2 TFLOPS (10-20% - realistic for general matmul)
→ Kernel implementation: src/backends/cuda/kernels/matrix_mul.cu
What it does:
Output[x][y] = Σ Σ Input[x+dx][y+dy] * Kernel[dx][dy]
What it tests:
- Primary: Balanced memory + compute
- Secondary: Irregular memory access patterns
Why it matters:
- Core of Convolutional Neural Networks (CNNs)
- Common in image processing
- Tests GPU's ability to handle halo regions
Real-world applications:
- Image Processing: Blur, sharpen, edge detection
- Computer Vision: CNNs (ResNet, VGG, YOLO)
- Medical Imaging: CT/MRI reconstruction
- Video Processing: Real-time filters
Optimization techniques:
- Naive - Read from global memory each time
- Shared Memory - Load tile with halo region
- Constant Memory - Store filter kernel in constant cache
- Separable Filters - 2D conv as two 1D passes
Performance characteristics:
- Highly dependent on image size and kernel size
- Larger kernels need more memory bandwidth
- Smaller kernels are more compute-bound
→ Kernel implementation: src/backends/cuda/kernels/convolution.cu
What it does:
Sum = A[0] + A[1] + A[2] + ... + A[N-1]
What it tests:
- Primary: Inter-thread synchronization
- Secondary: Shared memory bank conflicts
Why it matters:
- Classic parallel algorithm
- Tests GPU's synchronization primitives
- Shows optimization evolution (5 implementations!)
Real-world applications:
- Analytics: Sum, mean, variance, statistics
- Machine Learning: Loss calculation, gradient aggregation
- Scientific Computing: Numerical integration
- Database: Aggregation queries (SUM, AVG, COUNT)
Optimization ladder:
- Naive (~50 GB/s) - Basic approach
- Sequential Addressing (~80 GB/s) - Avoid warp divergence
- Bank Conflict Free (~120 GB/s) - Offset access patterns
- Warp Shuffle (~180 GB/s) - Intra-warp communication
- Atomic Operations (~200 GB/s) - Final aggregation
What you learn:
- Warp divergence impact
- Shared memory bank conflicts
- Thread synchronization (__syncthreads())
- Modern warp-level primitives (__shfl_down_sync()) - see the sketch below
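For a taste of the top rungs, a minimal warp-shuffle reduction sketch (hedged: the project's full five-stage evolution lives in the kernel file linked below; this shows only the shuffle + atomic finish):

// Each warp reduces its 32 values register-to-register, with no shared
// memory and no __syncthreads(). Assumes full warps.
__inline__ __device__ float warpReduceSum(float val) {
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;                         // lane 0 ends up with the warp sum
}

__global__ void reduceSum(const float* in, float* out, int n) {
    float sum = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x)
        sum += in[i];                   // grid-stride accumulation
    sum = warpReduceSum(sum);           // intra-warp reduction
    if ((threadIdx.x & 31) == 0)        // one atomic per warp
        atomicAdd(out, sum);
}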
→ Kernel implementation: src/backends/cuda/kernels/reduction.cu
→ Detailed benchmark explanation: docs/WHY_THIS_PROJECT.md#why-these-4-benchmarks
┌─────────────────────────────────────────────────────────┐
│ 1. STARTUP │
│ ├─ Initialize DirectX 11 for GUI rendering │
│ ├─ Load ImGui framework │
│ ├─ Detect system capabilities │
│ │ ├─ Query CUDA availability │
│ │ ├─ Query OpenCL availability │
│ │ ├─ Query DirectCompute availability │
│ │ └─ Get GPU information (DXGI) │
│ └─ Display main window │
└─────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────┐
│ 2. USER INTERACTION │
│ ├─ User selects backend (CUDA/OpenCL/DirectCompute) │
│ ├─ User selects suite (Quick/Standard/Comprehensive)│
│ └─ User clicks "Run Benchmark" │
└─────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────┐
│ 3. WORKER THREAD SPAWNED │
│ └─ Keeps GUI responsive while benchmarking │
└─────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────┐
│ 4. BENCHMARK EXECUTION (for each of 4 benchmarks) │
│ ├─ Setup Phase │
│ │ ├─ Allocate host memory (CPU) │
│ │ ├─ Initialize test data │
│ │ ├─ Calculate CPU reference results │
│ │ ├─ Allocate device memory (GPU) │
│ │ └─ Copy data to GPU │
│ │ │
│ ├─ Warmup Phase (3 iterations) │
│ │ ├─ Execute kernel │
│ │ ├─ Synchronize │
│ │ └─ (Stabilizes GPU clocks) │
│ │ │
│ ├─ Measurement Phase (10 iterations) │
│ │ ├─ Start GPU timer │
│ │ ├─ Execute kernel │
│ │ ├─ Synchronize GPU │
│ │ ├─ Stop GPU timer │
│ │ └─ Record time (average of iterations) │
│ │ │
│ ├─ Verification Phase │
│ │ ├─ Copy results back from GPU │
│ │ ├─ Compare GPU output vs CPU reference │
│ │ └─ Report if results match (within epsilon) │
│ │ │
│ ├─ Metrics Calculation │
│ │ ├─ Bandwidth (GB/s) = bytes / time │
│ │ ├─ Throughput (GFLOPS) = operations / time │
│ │ └─ Efficiency (%) = achieved / theoretical │
│ │ │
│ └─ GUI Update │
│ ├─ Update progress bar │
│ ├─ Add result to history │
│ └─ Refresh graphs │
└─────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────┐
│ 5. RESULTS DISPLAY │
│ ├─ Show all 4 benchmark results │
│ ├─ Display performance graphs │
│ ├─ Update historical data │
│ └─ Enable CSV export │
└─────────────────────────────────────────────────────────┘
CUDA host flow:
// 1. Initialize
cudaSetDevice(0);
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
// 2. Allocate memory
float* d_a, *d_b, *d_c;
cudaMalloc(&d_a, size * sizeof(float));
cudaMalloc(&d_b, size * sizeof(float));
cudaMalloc(&d_c, size * sizeof(float));
// 3. Copy data
cudaMemcpy(d_a, h_a, size * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_b, h_b, size * sizeof(float), cudaMemcpyHostToDevice);
// 4. Launch kernel with timing
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);
vectorAddKernel<<<gridSize, blockSize>>>(d_a, d_b, d_c, size);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float milliseconds = 0;
cudaEventElapsedTime(&milliseconds, start, stop);
// 5. Copy result back
cudaMemcpy(h_c, d_c, size * sizeof(float), cudaMemcpyDeviceToHost);
// 6. Cleanup
cudaFree(d_a);
cudaFree(d_b);
cudaFree(d_c);

OpenCL host flow:
// 1. Platform & device selection
clGetPlatformIDs(1, &platform, NULL);
clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
// 2. Context & queue
context = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
queue = clCreateCommandQueue(context, device, CL_QUEUE_PROFILING_ENABLE, NULL);
// 3. Compile kernel
const char* source = "...kernel code...";
program = clCreateProgramWithSource(context, 1, &source, NULL, NULL);
clBuildProgram(program, 0, NULL, NULL, NULL, NULL);
kernel = clCreateKernel(program, "vectorAdd", NULL);
// 4. Create buffers
cl_mem d_a = clCreateBuffer(context, CL_MEM_READ_ONLY, size, NULL, NULL);
cl_mem d_b = clCreateBuffer(context, CL_MEM_READ_ONLY, size, NULL, NULL);
cl_mem d_c = clCreateBuffer(context, CL_MEM_WRITE_ONLY, size, NULL, NULL);
// 5. Copy data
clEnqueueWriteBuffer(queue, d_a, CL_TRUE, 0, size, h_a, 0, NULL, NULL);
clEnqueueWriteBuffer(queue, d_b, CL_TRUE, 0, size, h_b, 0, NULL, NULL);
// 6. Set arguments & execute
clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_a);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_b);
clSetKernelArg(kernel, 2, sizeof(cl_mem), &d_c);
cl_event event;
size_t globalSize = size;
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, NULL, 0, NULL, &event);
clWaitForEvents(1, &event);
// 7. Get timing
cl_ulong start, end;
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START, sizeof(start), &start, NULL);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END, sizeof(end), &end, NULL);
double milliseconds = (end - start) / 1e6;
// 8. Copy result
clEnqueueReadBuffer(queue, d_c, CL_TRUE, 0, size, h_c, 0, NULL, NULL);

DirectCompute host flow:
// 1. Create D3D11 device
D3D11CreateDevice(NULL, D3D_DRIVER_TYPE_HARDWARE, NULL, 0, NULL, 0,
D3D11_SDK_VERSION, &device, NULL, &context);
// 2. Compile shader
D3DCompile(hlslSource, strlen(hlslSource), "shader.hlsl", NULL, NULL,
"CSMain", "cs_5_0", 0, 0, &shaderBlob, NULL);
device->CreateComputeShader(shaderBlob->GetBufferPointer(),
shaderBlob->GetBufferSize(), NULL, &computeShader);
// 3. Create buffers
D3D11_BUFFER_DESC desc = { size, D3D11_USAGE_DEFAULT,
D3D11_BIND_UNORDERED_ACCESS, 0,
D3D11_RESOURCE_MISC_BUFFER_STRUCTURED, sizeof(float) };
device->CreateBuffer(&desc, NULL, &bufferA);
device->CreateBuffer(&desc, NULL, &bufferB);
device->CreateBuffer(&desc, NULL, &bufferC);
// 4. Copy data
context->UpdateSubresource(bufferA, 0, NULL, h_a, 0, 0);
context->UpdateSubresource(bufferB, 0, NULL, h_b, 0, 0);
// 5. Create UAVs
device->CreateUnorderedAccessView(bufferA, &uavDesc, &uavA);
device->CreateUnorderedAccessView(bufferB, &uavDesc, &uavB);
device->CreateUnorderedAccessView(bufferC, &uavDesc, &uavC);
// 6. Set shader & dispatch
context->CSSetShader(computeShader, NULL, 0);
ID3D11UnorderedAccessView* uavs[] = { uavA, uavB, uavC };
context->CSSetUnorderedAccessViews(0, 3, uavs, NULL);
context->Dispatch((size + 255) / 256, 1, 1);
// 7. Copy result via a staging buffer (DEFAULT-usage buffers cannot be mapped)
ID3D11Buffer* stagingC;
D3D11_BUFFER_DESC staging = desc;
staging.Usage = D3D11_USAGE_STAGING;
staging.BindFlags = 0;
staging.CPUAccessFlags = D3D11_CPU_ACCESS_READ;
device->CreateBuffer(&staging, NULL, &stagingC);
context->CopyResource(stagingC, bufferC);
D3D11_MAPPED_SUBRESOURCE mapped;
context->Map(stagingC, 0, D3D11_MAP_READ, 0, &mapped);
memcpy(h_c, mapped.pData, size);
context->Unmap(stagingC, 0);
→ Detailed internal workings: docs/INTERNAL_WORKINGS.md
┌──────────────────────────────────────────────────────────────┐
│ PRESENTATION LAYER │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ GUI Application (ImGui + DirectX 11) │ │
│ │ - User Interface │ │
│ │ - Real-time Graphs │ │
│ │ - Progress Display │ │
│ └─────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────┘
↓
┌──────────────────────────────────────────────────────────────┐
│ ORCHESTRATION LAYER │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Benchmark Runner │ │
│ │ - Coordinates execution │ │
│ │ - Manages worker thread │ │
│ │ - Aggregates results │ │
│ └─────────────────────────────────────────────────────────┘ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Device Discovery │ │
│ │ - Detects available APIs │ │
│ │ - Queries GPU information │ │
│ └─────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────┘
↓
┌──────────────────────────────────────────────────────────────┐
│ ABSTRACTION LAYER │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ IComputeBackend Interface │ │
│ │ - Initialize() / Shutdown() │ │
│ │ - AllocateMemory() / FreeMemory() │ │
│ │ - CopyHostToDevice() / CopyDeviceToHost() │ │
│ │ - ExecuteKernel() │ │
│ │ - Synchronize() / StartTimer() / GetElapsedTime() │ │
│ └─────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────┘
↓
┌──────────────────────────────────────────────────────────────┐
│ IMPLEMENTATION LAYER │
│ ┌──────────────┐ ┌───────────────┐ ┌──────────────────┐ │
│ │ CUDABackend │ │ OpenCLBackend │ │DirectComputeBack.│ │
│ │ │ │ │ │ │ │
│ │ CUDA Runtime │ │ OpenCL 3.0 │ │ DirectX 11 │ │
│ │ cudaEvents │ │ cl_events │ │ ID3D11Query │ │
│ │ .cu kernels │ │ .cl kernels │ │ .hlsl shaders │ │
│ └──────────────┘ └───────────────┘ └──────────────────┘ │
└──────────────────────────────────────────────────────────────┘
↓
┌──────────────────────────────────────────────────────────────┐
│ HARDWARE LAYER │
│ GPU Driver → GPU Hardware │
└──────────────────────────────────────────────────────────────┘
- Strategy Pattern - Different backends (CUDA/OpenCL/DirectCompute) implement the same interface (sketched below)
- Factory Pattern - Backend creation based on runtime capability
- Singleton Pattern - Logger, device discovery
- Facade Pattern - BenchmarkRunner simplifies complex operations
- RAII Pattern - Automatic resource cleanup in destructors
- Template Method - Benchmark base class defines workflow
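The backend contract itself is small. A simplified sketch of its shape, inferred from the method list above (see src/core/IComputeBackend.h for the real definition):

// Simplified sketch, not the project's exact header.
class IComputeBackend {
public:
    virtual ~IComputeBackend() = default;
    virtual bool   Initialize() = 0;
    virtual void   Shutdown() = 0;
    virtual void*  AllocateMemory(size_t bytes) = 0;
    virtual void   FreeMemory(void* ptr) = 0;
    virtual void   CopyHostToDevice(void* dst, const void* src, size_t bytes) = 0;
    virtual void   CopyDeviceToHost(void* dst, const void* src, size_t bytes) = 0;
    virtual void   ExecuteKernel(/* kernel id + launch parameters */) = 0;
    virtual void   Synchronize() = 0;
    virtual void   StartTimer() = 0;
    virtual double GetElapsedTime() = 0;   // milliseconds, GPU-side
};

The Strategy pattern is exactly this: BenchmarkRunner holds an IComputeBackend pointer and never knows which API sits underneath.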
→ Complete architecture documentation: docs/ARCHITECTURE.md
GPU-Benchmark/
│
├── 📄 README.md ← You are here!
├── 📄 CMakeLists.txt ← Build configuration
├── 📄 .gitattributes ← Git configuration
│
├── 📂 src/ ← Source Code (50+ files)
│ ├── 📂 core/ ← Core Framework
│ │ ├── IComputeBackend.h → Backend interface
│ │ ├── Timer.h/cpp → High-resolution timing
│ │ ├── Logger.h/cpp → Logging and CSV export
│ │ ├── DeviceDiscovery.h/cpp → GPU detection
│ │ ├── BenchmarkRunner.h/cpp → Orchestration
│ │ └── README.md → Core documentation
│ │
│ ├── 📂 backends/ ← GPU API Implementations
│ │ ├── 📂 cuda/ → NVIDIA CUDA
│ │ │ ├── CUDABackend.h/cpp
│ │ │ ├── README.md
│ │ │ └── kernels/
│ │ │ ├── vector_add.cu
│ │ │ ├── matrix_mul.cu
│ │ │ ├── convolution.cu
│ │ │ └── reduction.cu
│ │ ├── 📂 opencl/ → Cross-vendor OpenCL
│ │ │ ├── OpenCLBackend.h/cpp
│ │ │ ├── README.md
│ │ │ └── kernels/
│ │ │ ├── vector_add.cl
│ │ │ ├── matrix_mul.cl
│ │ │ ├── convolution.cl
│ │ │ └── reduction.cl
│ │ └── 📂 directcompute/ → Windows DirectCompute
│ │ ├── DirectComputeBackend.h/cpp
│ │ ├── README.md
│ │ └── shaders/
│ │ ├── vector_add.hlsl
│ │ ├── matrix_mul.hlsl
│ │ ├── convolution.hlsl
│ │ └── reduction.hlsl
│ │
│ ├── 📂 benchmarks/ ← Benchmark Wrapper Classes
│ │ ├── VectorAddBenchmark.h/cpp
│ │ ├── MatrixMulBenchmark.h/cpp
│ │ ├── ConvolutionBenchmark.h/cpp
│ │ └── ReductionBenchmark.h/cpp
│ │
│ ├── 📂 gui/ ← GUI Application
│ │ ├── main_gui_fixed.cpp → Main GUI code
│ │ └── app.rc → Windows resources (icon, version)
│ │
│ ├── main_working.cpp ← CLI application
│ ├── cuda_stub.cu ← CUDA linker stub
│ └── simple_benchmark.h/cpp ← Simple benchmark helpers
│
├── 📂 docs/ ← Documentation Hub (10,000+ lines)
│ ├── README.md → Documentation index
│ ├── ARCHITECTURE.md → System architecture (detailed)
│ ├── PROJECT_SUMMARY.md → Project overview
│ ├── WHY_THIS_PROJECT.md → Philosophy and motivation
│ ├── GETTING_STARTED.md → Complete setup guide
│ ├── INTERNAL_WORKINGS.md → How everything works internally
│ ├── API_REFERENCES.md → Learning resources & links
│ ├── README_ORGANIZATION.md → Repository structure guide
│ ├── REPOSITORY_STRUCTURE.md → Detailed file organization
│ ├── ORGANIZATION_COMPLETE.txt → Organization summary
│ │
│ ├── 📂 dev-progress/ → Development Milestones (23 files)
│ │ ├── COMPLETE_IMPLEMENTATION.md
│ │ ├── FEATURES_COMPLETED.md
│ │ ├── THREE_BACKENDS_COMPLETE.md
│ │ └── ...
│ │
│ ├── 📂 bug-fixes/ → Bug Fix Documentation (11 files)
│ │ ├── ALL_8_ISSUES_FIXED.md
│ │ ├── FIXES_COMPLETED_ROUND2.md
│ │ ├── CRASH_ISSUE_FIXED.md
│ │ └── ...
│ │
│ ├── 📂 build-setup/ → Build Instructions (8 files)
│ │ ├── BUILD_GUIDE.md
│ │ ├── FRESH_START_WITH_VS2022.md
│ │ ├── SETUP_IMGUI_MANUAL.md
│ │ └── ...
│ │
│ └── 📂 user-guides/ → User Documentation (8 files)
│ ├── START_HERE.md
│ ├── HOW_TO_USE_GUI.md
│ ├── QUICKSTART.md
│ └── ...
│
├── 📂 tests/ ← Testing Framework
│ ├── README.md
│ ├── 📂 unit-tests/ → Component Tests (9 files)
│ │ ├── test_cuda_backend.cu
│ │ ├── test_opencl_backend.cpp
│ │ ├── test_directcompute_backend.cpp
│ │ ├── test_matmul.cu
│ │ ├── test_convolution.cu
│ │ ├── test_reduction.cu
│ │ └── ...
│ │
│ └── 📂 test-scripts/ → Test Automation (18 scripts)
│ ├── RUN_ALL_TESTS.cmd
│ ├── TEST_COMPLETE_SUITE.cmd
│ └── ...
│
├── 📂 scripts/ ← Build & Launch Scripts
│ ├── README.md
│ ├── 📂 build/ → Build Automation (4 scripts)
│ │ ├── BUILD.cmd → Main build script
│ │ ├── REBUILD_FIXED.cmd
│ │ ├── check_setup.ps1
│ │ └── DOWNLOAD_IMGUI.cmd
│ │
│ ├── 📂 launch/ → Application Launchers (4 scripts)
│ │ ├── RUN_GUI.cmd → Launch GUI
│ │ ├── LAUNCH_GUI.cmd
│ │ └── ...
│ │
│ └── SHOW_STRUCTURE.cmd → Display repository structure
│
├── 📂 release/ ← Release Documentation
│ ├── README.md
│ ├── PRODUCTION_READY_v1.0.txt → Production status
│ ├── RELEASE_v1.0_READY.md → Release notes
│ ├── DISTRIBUTION_PACKAGE.md → Distribution guide
│ ├── ICON_FIX_COMPLETE.md → Icon integration details
│ └── VERIFY_RELEASE.cmd → Release verification script
│
├── 📂 results/ ← Benchmark Results
│ ├── README.md
│ └── *.csv → CSV exports
│
├── 📂 assets/ ← Application Assets
│ ├── icon.png → PNG icon (source)
│ └── icon.ico → ICO icon (embedded in exe)
│
├── 📂 build/ ← Build Output (generated)
│ └── Release/
│ ├── GPU-Benchmark-GUI.exe → 🎯 MAIN EXECUTABLE
│ ├── GPU-Benchmark.exe → CLI version
│ └── test_*.exe → Unit tests
│
└── 📂 external/ ← Third-Party Libraries
└── imgui/ → ImGui GUI framework
├── imgui.h/cpp
├── backends/
└── ...
| Directory | Purpose | File Count |
|---|---|---|
| src/ | Source code | 50+ files |
| docs/ | Documentation | 60+ files (10,000+ lines) |
| tests/ | Unit tests & scripts | 27 files |
| scripts/ | Build & launch automation | 12 scripts |
| build/Release/ | Compiled executables | 10+ executables |
| assets/ | Icons, images | 2 files |
| external/ | Third-party libs (ImGui) | 215 files |
→ Complete structure guide: docs/REPOSITORY_STRUCTURE.md
- C++17 - Main application language
- CUDA C++ - NVIDIA GPU kernels
- OpenCL C - Cross-vendor GPU kernels
- HLSL - DirectCompute compute shaders
- CMake - Build system configuration
- Batch/PowerShell - Automation scripts
- CUDA 12.x - NVIDIA GPU programming
- OpenCL 3.0 - Cross-platform GPU compute
- DirectX 11 - Windows GPU compute & rendering
- ImGui 1.89 - Immediate mode GUI framework
- Windows API - Window creation, file dialogs, system queries
- CMake 3.18+ - Build configuration generator
- Visual Studio 2022 - C++ compiler (MSVC)
- NVCC - NVIDIA CUDA compiler
- FXC - HLSL shader compiler
- RC.exe - Windows resource compiler
- Git - Version control
- Visual Studio 2022 - IDE
- Nsight Compute - CUDA profiler (optional)
- GPU-Z - GPU monitoring (optional)
- STL - C++ Standard Library (vector, string, chrono, thread, atomic, mutex)
- Windows SDK - Windows API headers
- DXGI - DirectX Graphics Infrastructure (GPU enumeration)
- Strategy - Backend abstraction
- Factory - Backend creation
- Singleton - Logger, device discovery
- Facade - Benchmark runner
- RAII - Automatic resource management
- Template Method - Benchmark workflow
- C++17 - Modern C++ features (structured bindings, if constexpr, std::optional)
- CUDA C++17 - CUDA with C++17 features
- OpenCL 3.0 - Latest OpenCL specification
- Shader Model 5.0 - DirectCompute compute shaders
scripts\launch\RUN_GUI.cmd
Or double-click: build\Release\GPU-Benchmark-GUI.exe
Look at the top section:
CUDA: ✅ Available (NVIDIA RTX 3050)
OpenCL: ✅ Available (v3.0)
DirectCompute: ✅ Available (DirectX 11.1)
Click radio button:
- CUDA - Best performance (NVIDIA only)
- OpenCL - Cross-vendor (works on AMD/Intel too)
- DirectCompute - Always available on Windows
- Quick (10M elements) - ~10 seconds
- Standard (50M elements) - ~30 seconds
- Comprehensive (100M elements) - ~60 seconds
Click "Run Benchmark" button
Watch:
- Progress bar fills (0% → 100%)
- Results appear in real-time
- Graphs update with each benchmark
Performance Graphs:
- VectorAdd - Memory bandwidth test
- MatrixMul - Compute performance test
- Convolution - Mixed workload test
- Reduction - Synchronization test
Each graph shows:
- Blue line - Performance over time
- Y-axis - Bandwidth (GB/s) or GFLOPS
- X-axis - Test number
- Hover tooltip - Exact values
Current Results Table: Shows latest run with:
- Bandwidth (GB/s)
- GFLOPS
- Time (ms)
- Click "Export CSV" button
- Choose save location
- Enter filename
- Click "Save"
File format:
Backend,Benchmark,Bandwidth(GB/s),GFLOPS,Time(ms),Timestamp
CUDA,VectorAdd,182.4,0.0,0.82,2026-01-09 14:30:45
...
Run 1: CUDA
- Select CUDA
- Run benchmark
- Note results
Run 2: OpenCL
- Select OpenCL
- Run benchmark
- Compare graphs (CUDA history vs OpenCL history)
Run 3: DirectCompute
- Select DirectCompute
- Run benchmark
- Export all three to CSV for analysis
- Accumulates over time - Each run adds a data point
- Indexed - "Test 1", "Test 2", etc.
- Timestamped - Hover to see date/time
- Separate per backend - CUDA history ≠ OpenCL history
- Stores 100 tests - Older tests removed automatically
1. Open CSV in Excel
2. Create PivotTable
3. Rows: Backend, Columns: Benchmark
4. Values: Average of Bandwidth
5. Insert Chart → Bar Chart
import pandas as pd
import matplotlib.pyplot as plt
# Load data
df = pd.read_csv('results.csv')
# Plot bandwidth by backend
df.groupby(['Backend', 'Benchmark'])['Bandwidth(GB/s)'].mean().unstack().plot(kind='bar')
plt.title('GPU Bandwidth Comparison')
plt.ylabel('GB/s')
plt.show()
# Statistics
print(df.groupby('Backend')['Bandwidth(GB/s)'].describe())

Problem: CUDA shows "Not Available"
- Install NVIDIA drivers
- Install CUDA Toolkit
- Restart computer
Problem: OpenCL shows "Not Available"
- Reinstall GPU drivers
- Check if GPU supports OpenCL 1.2+
Problem: Application crashes
- Update GPU drivers
- Check Windows Event Viewer
- Run as Administrator
Problem: Low performance
- Close other GPU applications
- Check GPU temperature (thermal throttling?)
- Try "Quick" suite first
→ Complete troubleshooting: docs/GETTING_STARTED.md#troubleshooting
What it is: Data transfer rate (gigabytes per second)
Formula: Bandwidth = Bytes Processed / Time
Interpretation:
- Higher = Better
- Measures memory system performance
- Limited by DRAM speed (not compute)
Typical values:
- RTX 3050: 150-200 GB/s
- RTX 3090: 800-900 GB/s
- A100: 1500-2000 GB/s
What it is: Compute throughput (billion floating-point operations per second)
Formula: GFLOPS = Operations / Time / 1e9
Interpretation:
- Higher = Better
- Measures compute performance
- Limited by ALU speed
Typical values:
- RTX 3050: 800-1200 GFLOPS (matmul)
- RTX 3090: 20,000-30,000 GFLOPS
- A100: 60,000-80,000 GFLOPS
What it is: Execution time in milliseconds
Interpretation:
- Lower = Better
- GPU-side timing (excludes host overhead)
- Measured using CUDA events / OpenCL profiling / D3D11 queries
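Of the three timing mechanisms, the D3D11 one is the least obvious, so here is a hedged sketch of timestamp queries (variable names illustrative; the disjoint query supplies the GPU clock frequency and a validity flag):

D3D11_QUERY_DESC tsDesc = { D3D11_QUERY_TIMESTAMP, 0 };
D3D11_QUERY_DESC djDesc = { D3D11_QUERY_TIMESTAMP_DISJOINT, 0 };
ID3D11Query *tsStart, *tsEnd, *disjoint;
device->CreateQuery(&tsDesc, &tsStart);
device->CreateQuery(&tsDesc, &tsEnd);
device->CreateQuery(&djDesc, &disjoint);

context->Begin(disjoint);
context->End(tsStart);                         // timestamp before
context->Dispatch(groups, 1, 1);               // the kernel being timed
context->End(tsEnd);                           // timestamp after
context->End(disjoint);

D3D11_QUERY_DATA_TIMESTAMP_DISJOINT dj;
while (context->GetData(disjoint, &dj, sizeof(dj), 0) == S_FALSE) {}  // wait
UINT64 t0 = 0, t1 = 0;
context->GetData(tsStart, &t0, sizeof(t0), 0);
context->GetData(tsEnd, &t1, sizeof(t1), 0);
double ms = dj.Disjoint ? 0.0 : double(t1 - t0) * 1000.0 / dj.Frequency;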
What it is: Percentage of theoretical peak performance
Formula: Efficiency = (Achieved / Theoretical) * 100
Interpretation:
- 80%+ = Excellent
- 60-80% = Good
- 40-60% = Acceptable
- < 40% = Room for optimization
VectorAdd, Reduction
Limited by memory bandwidth, not compute.
Key metric: Bandwidth (GB/s)
Optimization focus:
- Coalesced memory access
- Reduce memory transfers
- Maximize memory bus utilization
MatrixMul
Limited by compute units, not memory.
Key metric: GFLOPS
Optimization focus:
- Increase arithmetic intensity
- Maximize occupancy
- Use tensor cores (if available)
Convolution
Balanced between memory and compute.
Key metrics: Both bandwidth and GFLOPS
Optimization focus:
- Balance memory access with computation
- Use shared memory effectively
- Minimize halo region overhead
Performance
│
│ ▲ Compute Bound
│ ╱│
│ ╱ │
│ ╱ │ ← Peak Compute
│ ╱ │
│ ╱ │
│ ╱ │
│ ╱ │
│ ╱ │
│ ╱ Memory │
│ ╱ Bound │
│╱ │
────┼───────────┼────────────→ Arithmetic Intensity
0 │ (FLOPs/Byte)
Use case: Identify if your workload is memory-bound or compute-bound.
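Worked example with the RTX 3050 numbers quoted below: the ridge point is 9100 GFLOPS / 224 GB/s ≈ 41 FLOP/byte. VectorAdd performs 1 FLOP per 12 bytes moved (≈ 0.08 FLOP/byte), far left of the ridge, so it can only ever be memory-bound. Large matrix multiply has an arithmetic intensity of roughly N/6 FLOP/byte (2N³ FLOPs over 3N² floats of traffic), which crosses the ridge for any realistic N, making it compute-bound.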
→ Detailed analysis: docs/user-guides/RESULTS_INTERPRETATION.md
Specifications:
- Compute Capability: 8.6 (Ampere)
- CUDA Cores: 2048
- Memory: 4GB GDDR6
- Memory Bandwidth: 224 GB/s
- FP32 Performance: 9.1 TFLOPS
| Benchmark | Metric | CUDA | OpenCL | DirectCompute |
|---|---|---|---|---|
| VectorAdd | GB/s | 180-200 | 150-170 | 140-160 |
| MatrixMul | GFLOPS | 800-1200 | 700-1000 | 600-900 |
| Convolution | GB/s | 250-350 | 220-300 | 200-280 |
| Reduction | GB/s | 150-180 | 130-160 | 120-150 |
VectorAdd: 180 / 224 = 80% of peak bandwidth ✅ Excellent!
MatrixMul: 1000 / 9100 = 11% of peak compute ✅ Realistic
Convolution: 300 / 224 = 134% (data reuse in cache/shared memory pushes effective bandwidth past DRAM peak) ✅ Good!
Reduction: 180 / 224 = 80% of peak bandwidth ✅ Excellent!
- VectorAdd: ~850 GB/s
- MatrixMul: ~20,000 GFLOPS
- Memory: 24GB GDDR6X (936 GB/s)
- VectorAdd: ~450 GB/s (OpenCL/DirectCompute)
- MatrixMul: ~18,000 GFLOPS
- Memory: 16GB GDDR6 (512 GB/s)
- VectorAdd: ~400 GB/s (OpenCL/DirectCompute)
- MatrixMul: ~15,000 GFLOPS
- Memory: 16GB GDDR6 (560 GB/s)
- More mature drivers - NVIDIA optimizes CUDA heavily
- Better compiler - nvcc produces efficient code
- Hardware optimizations - GPU designed with CUDA in mind
- Warp-level primitives - __shfl_down_sync(), etc.
Typical overhead:
- OpenCL: 10-20% slower than CUDA
- DirectCompute: 15-25% slower than CUDA
File: CMakeLists.txt
Key features:
- Detects CUDA, OpenCL, DirectX automatically
- Conditionally compiles backends based on availability
- Separate targets for tests
- CUDA architecture configuration
Main targets:
- GPU-Benchmark-GUI # Main GUI application
- GPU-Benchmark # CLI version
- test_cuda_backend # Unit tests
- test_opencl_backend
- test_directcompute_backend
- test_matmul
- test_convolution
- test_reduction
CUDA Architecture:
set(CMAKE_CUDA_ARCHITECTURES 86)  # RTX 3050
Change for your GPU:
- RTX 4000: 89
- RTX 3000: 86
- RTX 2000: 75
- GTX 1000: 61
Preprocessor Definitions:
USE_CUDA # Enable CUDA backend
USE_OPENCL # Enable OpenCL backend (if found)
USE_DIRECTCOMPUTE # Enable DirectCompute (Windows only)
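A hedged sketch of how such conditional detection can look in CMake (the project's actual logic is in CMakeLists.txt; the find modules are standard CMake, the structure here is illustrative):

# Hedged sketch, not the project's exact CMakeLists.txt.
find_package(CUDAToolkit QUIET)
if (CUDAToolkit_FOUND)
    enable_language(CUDA)
    add_compile_definitions(USE_CUDA)
endif()

find_package(OpenCL QUIET)
if (OpenCL_FOUND)
    add_compile_definitions(USE_OPENCL)
endif()

if (WIN32)
    add_compile_definitions(USE_DIRECTCOMPUTE)  # ships with Windows
endif()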
Quick build:
scripts\build\BUILD.cmd
Manual build:
mkdir build
cd build
cmake -G "Visual Studio 17 2022" -A x64 ..
cmake --build . --config Release
Clean rebuild:
scripts\build\REBUILD_FIXED.cmd
build/
├── Release/
│ ├── GPU-Benchmark-GUI.exe ← Main executable
│ ├── GPU-Benchmark.exe ← CLI version
│ └── test_*.exe ← Unit tests
└── CMakeFiles/ ← Build artifacts
→ Complete build guide: docs/build-setup/BUILD_GUIDE.md
Challenge: CUDA, OpenCL, DirectCompute have completely different APIs.
Solution:
- Created IComputeBackend interface
- Each backend implements the same contract
- BenchmarkRunner doesn't know which backend it's using
Learning: Interface-based design enables extensibility.
Challenge: CPU timers don't work for asynchronous GPU execution.
Solution:
- CUDA: cudaEvent_t with cudaEventElapsedTime()
- OpenCL: cl_event with profiling queries
- DirectCompute: ID3D11Query with timestamps
Learning: Each API has its own timing mechanism.
Challenge: Naive memory access = 10x slower performance.
Solution:
- Stride-1 access patterns
- Adjacent threads access adjacent memory
- Align data structures properly
Learning: Memory access patterns matter as much as algorithm.
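A minimal illustration of the difference (hypothetical kernels, not from the project):

// Coalesced: adjacent threads read adjacent floats -> one wide memory
// transaction per warp.
__global__ void copyCoalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // stride-1 across the warp
    if (i < n) out[i] = in[i];
}

// Strided: each thread jumps by `stride` elements -> many partial
// transactions; easily several times slower on large arrays.
__global__ void copyStrided(const float* in, float* out, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}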
Challenge: OpenCL compiles kernels from strings at runtime.
Solution:
- Embed kernel source in C++ with R"(...)" literals
- Handle compilation errors gracefully
- Cache compiled kernels
Learning: Runtime compilation adds flexibility but complicates error handling.
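A hedged sketch of the embed-and-build step with the build log surfaced on failure (the error handling shown is illustrative):

#include <vector>
#include <CL/cl.h>

const char* source = R"CL(
    __kernel void vectorAdd(__global const float* a, __global const float* b,
                            __global float* c, int n) {
        int gid = get_global_id(0);
        if (gid < n) c[gid] = a[gid] + b[gid];
    }
)CL";
cl_int err = CL_SUCCESS;
cl_program program = clCreateProgramWithSource(context, 1, &source, NULL, &err);
if (clBuildProgram(program, 0, NULL, NULL, NULL, NULL) != CL_SUCCESS) {
    size_t logSize = 0;
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
                          0, NULL, &logSize);
    std::vector<char> log(logSize);
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
                          logSize, log.data(), NULL);
    // report log.data() to the user instead of crashing
}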
Challenge: GUI rendering interferes with benchmark timing.
Solution:
- Worker thread for benchmarks
- Atomic variables for progress
- Separate GPU contexts for compute and rendering
Learning: Separate compute and graphics execution streams.
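A minimal sketch of that split using std::thread and atomics (the project's actual GUI code is in src/gui/main_gui_fixed.cpp; names here are illustrative):

#include <atomic>
#include <thread>

std::atomic<float> g_progress{0.0f};
std::atomic<bool>  g_running{false};

void runAllBenchmarks() {
    g_running = true;
    for (int i = 0; i < 4; ++i) {       // four benchmarks
        // ... run benchmark i on its own GPU context ...
        g_progress = (i + 1) / 4.0f;    // safe to read from the GUI thread
    }
    g_running = false;
}

// GUI thread, once per frame:
//   if (runClicked && !g_running) std::thread(runAllBenchmarks).detach();
//   ImGui::ProgressBar(g_progress);   // reads the atomic, never blocks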
Challenge: Detect GPUs/APIs without crashing on unavailable hardware.
Solution:
- Try each API initialization, catch failures
- DXGI for vendor-neutral GPU enumeration
- Friendly error messages
Learning: Runtime detection enables hardware-agnostic deployment.
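A hedged sketch of the probe-don't-assume approach (the return codes are the real APIs'; the wrapper names are illustrative):

// CUDA: ask the runtime how many devices it sees.
bool cudaAvailable() {
    int count = 0;
    return cudaGetDeviceCount(&count) == cudaSuccess && count > 0;
}

// OpenCL: ask the ICD loader how many platforms are installed.
bool openclAvailable() {
    cl_uint numPlatforms = 0;
    return clGetPlatformIDs(0, NULL, &numPlatforms) == CL_SUCCESS
        && numPlatforms > 0;
}

// DirectCompute: attempting D3D11CreateDevice and checking the feature
// level plays the same role on Windows.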
Challenge: How to verify GPU results are correct?
Solution:
- CPU reference implementation
- Compare GPU vs CPU output
- Floating-point epsilon tolerance
Learning: Correctness verification is essential.
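A minimal sketch of such a comparison, using a relative tolerance so large and small magnitudes are judged fairly (the epsilon value is illustrative):

#include <cmath>
#include <cstddef>

bool resultsMatch(const float* gpu, const float* cpu, size_t n,
                  float eps = 1e-4f) {
    for (size_t i = 0; i < n; ++i) {
        float diff  = std::fabs(gpu[i] - cpu[i]);
        float scale = std::fmax(1.0f, std::fabs(cpu[i]));
        if (diff > eps * scale) return false;  // mismatch beyond tolerance
    }
    return true;
}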
Challenge: Same algorithm, three implementations, must match.
Solution:
- Identical algorithm logic
- Same problem sizes
- Careful verification
Learning: Fair comparison requires mathematical equivalence.
→ Detailed technical challenges: docs/INTERNAL_WORKINGS.md
- FFT (Fast Fourier Transform)
- Sorting (Radix sort, Bitonic sort)
- Sparse Matrix operations
- Histogram computation
- Scan/Prefix sum
- Multi-GPU support
- FP16/FP64 precision testing
- Tensor Core utilization (NVIDIA)
- Power consumption measurement
- Temperature monitoring
- 3D performance graphs
- Real-time GPU utilization display
- Kernel execution timeline
- Comparative analysis charts
- Export to PDF reports
- Linux support (Vulkan Compute instead of DirectCompute)
- macOS support (Metal Performance Shaders)
- Android support (OpenCL ES)
- Neural network layer benchmarks
- Convolution variants (depthwise, separable)
- Batch normalization
- Attention mechanisms
- Transformer benchmarks
Want a feature? Open an issue on GitHub!
Requested features:
- Command-line interface with arguments
- Automated report generation
- Benchmark database (compare with other users)
- Overclocking impact analysis
- Driver version comparison
This project has 10,000+ lines of documentation across multiple files.
| Document | Purpose | Lines |
|---|---|---|
| README.md | This file - main documentation | 2000+ |
| WHY_THIS_PROJECT.md | Philosophy and motivation | 600+ |
| ARCHITECTURE.md | System design deep-dive | 750+ |
| INTERNAL_WORKINGS.md | Implementation details | 980+ |
| GETTING_STARTED.md | Complete setup guide | 700+ |
| API_REFERENCES.md | Learning resources | 500+ |
- BUILD_GUIDE.md - Detailed build instructions
- FRESH_START_WITH_VS2022.md - VS2022 setup
- SETUP_IMGUI_MANUAL.md - ImGui integration
- START_HERE.md - First-time user guide
- HOW_TO_USE_GUI.md - GUI walkthrough
- RESULTS_INTERPRETATION.md - Understanding output
Every source file has:
- File header explaining purpose
- Function-level documentation
- Algorithm explanations
- Performance notes
- Interview talking points
Example from vector_add.cu:
/**
* Vector Addition Kernel - Simplest GPU Operation
*
* Purpose: Add two vectors element-wise (C = A + B)
*
* Performance Characteristics:
* - Memory-bound (limited by DRAM bandwidth, not compute)
* - Coalescing critical (adjacent threads access adjacent memory)
* - Expected: 70-85% of theoretical peak bandwidth
*
* Interview talking points:
* - This demonstrates memory coalescing
* - Shows basic CUDA thread indexing
* - Illustrates memory-bound vs compute-bound workloads
*/
__global__ void vectorAddKernel(...) { ... }
CUDA:
OpenCL:
DirectCompute:
- "Programming Massively Parallel Processors" - Kirk & Hwu
- "CUDA by Example" - Sanders & Kandrot
- "Professional CUDA C Programming" - Cheng et al.
- "Heterogeneous Computing with OpenCL 2.0" - Kaeli et al.
→ Complete resource list: docs/API_REFERENCES.md
Contributions are welcome! This project is designed to be:
- Educational - Learn from working code
- Extensible - Easy to add new features
- Professional - High code quality standards
- Fork the repository
- Create a feature branch
git checkout -b feature/amazing-feature
- Make your changes
- Test thoroughly
- Commit with clear messages
git commit -m "Add amazing feature: description"
- Push to your fork
git push origin feature/amazing-feature
- Open a Pull Request
Easy (Good first issues):
- Add more CSV export options
- Improve error messages
- Add tooltips to GUI
- Update documentation
- Fix typos
Medium:
- Add new benchmark (FFT, sorting)
- Improve visualization
- Add CLI arguments
- Performance optimizations
Hard:
- Multi-GPU support
- Vulkan Compute backend
- Metal backend (macOS)
- Profiling integration
- C++17 standard
- Clean code principles
- Comprehensive comments
- Design patterns where appropriate
- RAII for resource management
Before submitting:
- Code compiles without warnings
- All unit tests pass
- Benchmarks run successfully
- Documentation updated
- No memory leaks (checked with tools)
MIT License
Copyright (c) 2026 Soham Dave
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
What this means:
- ✅ Free to use for any purpose
- ✅ Free to modify and distribute
- ✅ Can use in commercial projects
- ✅ No warranty provided
Soham Dave
- GitHub: @davesohamm
- Project: GPU Benchmark Suite v1.0
- Date: January 2026
- System: Windows 11 | AMD Ryzen 7 4800H | NVIDIA RTX 3050 | 16GB RAM
- Development Time: 3+ months
- Code Lines: ~22,000 lines (source code)
- Documentation Lines: ~20,000 lines
- Total Lines: ~42,000 lines
- Files: 150+ files
- Commits: 5+ commits
- Languages: C++, CUDA, OpenCL, HLSL, CMake, Batch
- APIs: CUDA, OpenCL, DirectCompute, DirectX, Windows API, ImGui
"I wanted to deeply understand GPU programming, compare different APIs objectively, and create a portfolio piece that showcases professional software engineering skills. This project represents hundreds of hours of learning, coding, debugging, optimizing, and documenting."
Skills Demonstrated:
- GPU Programming (CUDA, OpenCL, DirectCompute)
- Systems Programming (Windows API, drivers, hardware)
- Performance Engineering (profiling, optimization, analysis)
- Software Architecture (design patterns, clean code)
- Professional Documentation (comprehensive guides)
- Build Systems (CMake, Visual Studio)
- GUI Development (ImGui, DirectX)
- NVIDIA - CUDA Toolkit and excellent documentation
- Khronos Group - OpenCL specification and standards
- Microsoft - DirectX SDK and Visual Studio
- ImGui - Omar Cornut for the amazing GUI framework
- NVIDIA Developer Blog - GPU programming best practices
- Mark Harris - Parallel reduction optimization paper
- David Kirk & Wen-mei Hwu - "Programming Massively Parallel Processors" book
- Stack Overflow Community - Countless helpful answers
- GPU computing revolution in AI/ML
- Need for objective multi-API comparison
- Desire to create comprehensive learning resource
Documentation:
- Read this README thoroughly
- Check docs/ folder for detailed guides
- See Troubleshooting section
Issues:
- GitHub Issues: Report bugs or request features
Questions:
- GitHub Discussions: Ask questions
- Repository: https://github.com/davesohamm/GPU-Benchmark
- Documentation: docs/
- Releases: https://github.com/davesohamm/GPU-Benchmark/releases
- Issues: https://github.com/davesohamm/GPU-Benchmark/issues
If you found this project helpful:
- ⭐ Star this repository on GitHub
- 🍴 Fork it and add your own features
- 📢 Share it with others learning GPU programming
- 💬 Open issues with feedback or questions
- 🤝 Contribute improvements and fixes
┌──────────────────────────────────────────────────────────┐
│ GPU Benchmark Suite v1.0 │
│ Production Ready Status │
├──────────────────────────────────────────────────────────┤
│ │
│ Code Quality │
│ ├─ Source Lines: ~22,000 lines │
│ ├─ Documentation Lines: ~20,000 lines │
│ ├─ Documentation Ratio: 45% (industry avg: 20-30%) │
│ ├─ Files: 150+ files │
│ └─ Comments: Extensive │
│ │
│ Features │
│ ├─ GPU APIs: 3 (CUDA, OpenCL, DirectCompute) │
│ ├─ Benchmarks: 4 (VectorAdd, MatMul, Conv, Red) │
│ ├─ Unit Tests: 9 test executables │
│ └─ GUI Application: ✅ Complete │
│ │
│ Architecture │
│ ├─ Design Patterns: 6 (Strategy, Factory, etc.) │
│ ├─ Abstraction Layers: 4 layers │
│ ├─ Threading: Main + Worker threads │
│ └─ Memory Management: RAII pattern │
│ │
│ Performance │
│ ├─ Bandwidth Achieved: 180 GB/s (80% efficiency) │
│ ├─ Compute Achieved: 1000 GFLOPS (MatMul) │
│ ├─ Timing Accuracy: GPU-side (microsecond precision) │
│ └─ Verification: 100% results verified │
│ │
│ Documentation │
│ ├─ README Files: 10+ comprehensive guides │
│ ├─ Code Comments: Every function documented │
│ ├─ Build Guides: Step-by-step instructions │
│ └─ Learning Resources: Books, papers, tutorials │
│ │
│ Production Readiness │
│ ├─ Error Handling: ✅ Robust │
│ ├─ Icon Integration: ✅ Complete │
│ ├─ Professional UI: ✅ Polished │
│ ├─ Version Info: ✅ v1.0.0 │
│ └─ Distribution Ready: ✅ Yes │
│ │
└──────────────────────────────────────────────────────────┘
Built with ❤️ by Soham Dave
Benchmark your GPU. Compare APIs. Learn GPU programming. Share your results.
Version: 1.0.0 | Released: January 2026 | Last Updated: January 9, 2026
© 2026 Soham Dave. All Rights Reserved.