A scalar-valued automatic differentiation engine built from scratch in C++17. Implements reverse-mode autodiff over a dynamically constructed computation graph, with a neural network library on top. Header-only.
Inspired by Karpathy's micrograd, reimplemented independently in C++ over ~2 days.
Modern deep learning frameworks like PyTorch and TensorFlow automatically compute gradients for you. This project implements that mechanism from scratch at the scalar level - every number is a node in a computation graph, and gradients flow backward through the graph via the chain rule.
This is not a production library. It's a learning project that demonstrates a complete, working autodiff in C++.
Every scalar is wrapped in a Value object. Operations on Value objects
don't just compute results - they record the computation graph.
Value a(2.0);
Value b(3.0);
Value c = a * b; // c.node->data = 6.0
// c knows its inputs are a and b
// c knows how to backprop through multiplication
c.backward(); // dc/da = 3.0, dc/db = 2.0The core design problem in an autograd engine is keeping graph nodes alive for the backward pass. Expressions like:
Value loss = (pred - target) ^ 2;create intermediate nodes that need to outlive the expression. Solved using
shared_ptr<Node> - each Value is a lightweight handle to a heap-allocated
Node. Copying a Value is cheap and shares the same underlying node,
keeping the graph alive as long as any handle exists.
This also makes parameter updates intuitive:
// This works correctly even on copies:
for (auto& p : net.params()) {
p -= eta * p.get_grad(); // mutates the underlying node
}backward() builds a topological ordering of the graph and traverses it in
reverse, calling each node's stored _backward lambda to accumulate gradients.
void backward() {
// build topological order
// set output gradient to 1.0
// traverse in reverse, call _backward() on each node
}- Header-only - just include and compile
- Operators -
+,-,*,/,^(scalar and Value overloads) - Activations - tanh, sigmoid, relu
- Neural network library - Neuron, Layer, NeuralNet
- Loss functions - MSE (single sample and batched)
- Computation graph visualization - via Graphviz
- Batched training - feedforward and loss over multiple samples
XOR is the classic test for nonlinear learning. It cannot be solved by a linear model, requiring hidden layer representations.
NeuralNet net({2, 4, 1}); // 2 inputs, 4 hidden, 1 output
vector<vector<Value>> inp = {{1,1},{1,0},{0,1},{0,0}};
vector<vector<Value>> ans = {{0},{1},{1},{0}};
for (uint i = 0; i < 5000; i++) {
auto out = net.feedforward_batch(inp);
Value loss = net.mse_loss_batch(out, ans);
loss.backward();
for (auto& p : net.params())
p -= 0.1 * p.get_grad();
net.zero_grad();
}
// outputs: ~0.01, ~0.99, ~0.99, ~0.002
// loss : ~ 0.0003Graphviz visualization of the computation graph for a simple expression:
f = (a*b + c)^2
Generate your own:
build_dot(loss.node, "graph.dot");
system("dot -Tpng graph.dot -o graph.png");Header-only. No dependencies beyond the standard library and Graphviz for visualization.
g++ -std=c++17 main.cpp -o main && ./mainFor graph visualization, install Graphviz:
# Ubuntu
sudo apt install graphviz
# macOS
brew install graphvizThe interesting engineering problem is graph lifetime management. In Python, garbage collection handles this transparently. In C++, you have to be explicit. The shared_ptr design means Value copies are O(1) and the graph stays alive for backprop without manual memory management.
Building this, I ran into vanishing gradients, dead neurons, and initialization sensitivity firsthand - the same problems that stalled deep learning research for decades.
- Sigmoid - gradients shrink by 75% per layer minimum. Needed lr=10 just to compensate. Impractical for deep nets.
- ReLU - neurons initialized in the negative region output zero, receive zero gradient, never recover. With uniform [-1,1] init, half the network was dead from step one.
- Tanh - derivative peaks at 1.0 near zero, and uniform [-1,1] init keeps neurons in that region at startup. Gradients flow cleanly.
This is what Glorot & Bengio formalized in their 2010 paper. I rediscovered it by accident.
Writing each backward pass by hand makes the chain rule concrete. The multiplication backward:
a_node->grad += b_node->data * out_node->grad;
b_node->grad += a_node->data * out_node->grad;is just d(a*b)/da = b written in code. Doing this for every operation
makes backprop feel less like magic.
minigrad/
│
├── src/
│ ├── value.h # Value class, operators, activations
│ ├── engine.h # Neuron, Layer, NeuralNet
│ └── graph.h # Graphviz dot file generation
│
├── tests/
│ ├── xor
│ │ ├── xor.cpp # XOR demonstration with neural net
│ │ └── xor-graph.png # Computational graph of XOR
│ └── simple-graphs
│ ├── simple-graph.cpp # Simple expression computation
│ └── simple-graph.png # Associated computation graph
│
├── main.cpp # empty example file
└── README.md
- Karpathy's micrograd
- Glorot & Bengio, 2010 - Understanding the difficulty of training deep networks
- The spelled-out intro to neural networks and backpropagation

