minigrad

A scalar-valued automatic differentiation engine built from scratch in C++17. Implements reverse-mode autodiff over a dynamically constructed computation graph, with a neural network library on top. Header-only.

Inspired by Karpathy's micrograd, reimplemented independently in C++ over ~2 days.

What is this?

Modern deep learning frameworks like PyTorch and TensorFlow automatically compute gradients for you. This project implements that mechanism from scratch at the scalar level - every number is a node in a computation graph, and gradients flow backward through the graph via the chain rule.

This is not a production library. It's a learning project that demonstrates a complete, working autodiff in C++.

How it works

The Value class

Every scalar is wrapped in a Value object. Operations on Value objects don't just compute results - they record the computation graph.

Value a(2.0);
Value b(3.0);
Value c = a * b;  // c.node->data = 6.0
                  // c knows its inputs are a and b
                  // c knows how to backprop through multiplication
c.backward();     // dc/da = 3.0, dc/db = 2.0

Memory model

The core design problem in an autograd engine is keeping graph nodes alive for the backward pass. Expressions like:

Value loss = (pred - target) ^ 2;

create intermediate nodes that need to outlive the expression. Solved using shared_ptr<Node> - each Value is a lightweight handle to a heap-allocated Node. Copying a Value is cheap and shares the same underlying node, keeping the graph alive as long as any handle exists.

This also makes parameter updates intuitive:

// This works correctly even on copies:
for (auto& p : net.params()) {
    p -= eta * p.get_grad(); // mutates the underlying node
}

Backpropagation

backward() builds a topological ordering of the graph and traverses it in reverse, calling each node's stored _backward lambda to accumulate gradients.

void backward() {
    // build topological order
    // set output gradient to 1.0
    // traverse in reverse, call _backward() on each node
}

Features

Header-only - just include and compile
Operators - +, -, *, /, ^ (scalar and Value overloads)
Activations - tanh, sigmoid, relu
Neural network library - Neuron, Layer, NeuralNet
Loss functions - MSE (single sample and batched)
Computation graph visualization - via Graphviz
Batched training - feedforward and loss over multiple samples

Demo - XOR

XOR is the classic test for nonlinear learning. It cannot be solved by a linear model, requiring hidden layer representations.

NeuralNet net({2, 4, 1}); // 2 inputs, 4 hidden, 1 output

vector<vector<Value>> inp = {{1,1},{1,0},{0,1},{0,0}};
vector<vector<Value>> ans = {{0},{1},{1},{0}};

for (uint i = 0; i < 5000; i++) {
    auto out = net.feedforward_batch(inp);
    Value loss = net.mse_loss_batch(out, ans);
    loss.backward();
    for (auto& p : net.params())
        p -= 0.1 * p.get_grad();
    net.zero_grad();
}
// outputs: ~0.01, ~0.99, ~0.99, ~0.002
// loss : ~ 0.0003

Computation Graph

Graphviz visualization of the computation graph for a simple expression:

f = (a*b + c)^2

Generate your own:

build_dot(loss.node, "graph.dot");
system("dot -Tpng graph.dot -o graph.png");

Build

Header-only. No dependencies beyond the standard library and Graphviz for visualization.

g++ -std=c++17 main.cpp -o main && ./main

For graph visualization, install Graphviz:

# Ubuntu
sudo apt install graphviz

# macOS
brew install graphviz

What I learned

Autograd design

The interesting engineering problem is graph lifetime management. In Python, garbage collection handles this transparently. In C++, you have to be explicit. The shared_ptr design means Value copies are O(1) and the graph stays alive for backprop without manual memory management.

Activation functions and initialization are coupled

Building this, I ran into vanishing gradients, dead neurons, and initialization sensitivity firsthand - the same problems that stalled deep learning research for decades.

Sigmoid - gradients shrink by 75% per layer minimum. Needed lr=10 just to compensate. Impractical for deep nets.
ReLU - neurons initialized in the negative region output zero, receive zero gradient, never recover. With uniform [-1,1] init, half the network was dead from step one.
Tanh - derivative peaks at 1.0 near zero, and uniform [-1,1] init keeps neurons in that region at startup. Gradients flow cleanly.

This is what Glorot & Bengio formalized in their 2010 paper. I rediscovered it by accident.

Gradient flow through the graph

Writing each backward pass by hand makes the chain rule concrete. The multiplication backward:

a_node->grad += b_node->data * out_node->grad;
b_node->grad += a_node->data * out_node->grad;

is just d(a*b)/da = b written in code. Doing this for every operation makes backprop feel less like magic.

Project Structure

minigrad/
│
├── src/
│   ├── value.h      # Value class, operators, activations
│   ├── engine.h     # Neuron, Layer, NeuralNet
│   └── graph.h      # Graphviz dot file generation
│
├── tests/
│   ├── xor         
│   │   ├── xor.cpp             # XOR demonstration with neural net 
│   │   └── xor-graph.png       # Computational graph of XOR 
│   └── simple-graphs     
│       ├── simple-graph.cpp    # Simple expression computation 
│       └── simple-graph.png    # Associated computation graph 
│
├── main.cpp                    # empty example file 
└── README.md

References

Karpathy's micrograd
Glorot & Bengio, 2010 - Understanding the difficulty of training deep networks
The spelled-out intro to neural networks and backpropagation

Name		Name	Last commit message	Last commit date
Latest commit History 20 Commits
resources		resources
src		src
tests		tests
.gitignore		.gitignore
README.md		README.md
main.cpp		main.cpp

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

minigrad

What is this?

How it works

The Value class

Memory model

Backpropagation

Features

Demo - XOR

Computation Graph

Build

What I learned

Autograd design

Activation functions and initialization are coupled

Gradient flow through the graph

Project Structure

References

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

minigrad

What is this?

How it works

The Value class

Memory model

Backpropagation

Features

Demo - XOR

Computation Graph

Build

What I learned

Autograd design

Activation functions and initialization are coupled

Gradient flow through the graph

Project Structure

References

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages