Data movement matters more than FLOPs. Recently accessed bytes can be cached, so ByteDMD penalizes non-local reads using the following cost model:

$$C(d) = \lceil \sqrt{d} \rceil$$

where $d$ is the byte's depth in an LRU stack (topmost byte = depth 1).
```python
from bytedmd import bytedmd

def dot(a, b):
    return sum(i1 * i2 for (i1, i2) in zip(a, b))

a = [0, 1]
b = [2, 3]

# dot product
assert dot(a, b) == 3

# ByteDMD cost of dot product
assert bytedmd(dot, (a, b)) == 14
```

Modern architectures spend more energy moving data than doing arithmetic, making FLOP counts an outdated cost metric. Bill Dally (ACM Opinion) proposed penalizing data movement based on 2D spatial distance to the processor. To avoid manual spatial mapping, Ding and Smith (Beyond Time Complexity, 2022) automated this via Data Movement Distance (DMD): a rule treating memory as an LRU stack where reading a byte at depth $d$ costs $\sqrt{d}$.
To avoid floating-point issues, we round up to the nearest integer. This rounding corresponds to the routing wire length on a 2D grid, with the LRU stack arranged in the following order.
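As a minimal sketch (the helper name `read_cost` is ours, not part of the library), the rounded cost can be computed exactly with integer arithmetic:

```python
import math

def read_cost(depth: int) -> int:
    """Cost of reading one byte at 1-indexed LRU depth: ceil(sqrt(depth)).

    Uses math.isqrt so the result is exact for arbitrarily large depths,
    avoiding the floating-point rounding mentioned above.
    """
    r = math.isqrt(depth)
    return r if r * r == depth else r + 1
```

Using `math.isqrt` instead of `math.ceil(math.sqrt(...))` keeps the result correct even for depths large enough that `float` loses integer precision.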
The original DMD treats values abstractly. ByteDMD counts accesses at byte level. This rewards algorithms that use smaller data types.
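To see why smaller types win, compare the read cost of a 1-byte versus an 8-byte scalar sitting at the top of the stack (a hand computation with a hypothetical helper, not library output):

```python
import math

def block_read_cost(start_depth: int, size: int) -> int:
    # Total cost of reading `size` contiguous bytes whose first byte
    # sits at 1-indexed depth `start_depth`: sum of ceil(sqrt(d)).
    return sum(math.ceil(math.sqrt(d)) for d in range(start_depth, start_depth + size))

print(block_read_cost(1, 1))  # 1-byte scalar on top of the stack -> 1
print(block_read_cost(1, 8))  # 8-byte scalar on top of the stack -> 19
```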
An idealized processor operates directly on a byte-level LRU stack. Computations and writes are free; only memory reads incur a cost.
- Stack State: Ordered from least recently used (bottom) to most recently used (top). Depth is measured in bytes from the top (topmost byte = depth 1). Multi-byte scalars are treated as contiguous blocks of bytes.
- Initialization: On function entry, arguments are pushed to the top in call order.
- Read Cost: Reading a byte at depth $d$ costs $\lceil\sqrt{d}\rceil$.
See Instruction Set for the complete list of supported instructions.
For an instruction with inputs $x_1, \dots, x_n$, processing proceeds in three phases:

- Price reads: Evaluate $\sum_j C(x_j)$ against the stack state before the instruction begins. Repeated inputs are charged per occurrence (e.g., `a + a` charges for reading `a` twice).
- Update LRU: Move inputs to the top of the stack sequentially in read order. (Note: because of this sequential update, `b + c` vs. `c + b` yields the same cost but different final stack states.)
- Push outputs: Allocate new output blocks and push them to the top at zero cost.
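The three phases can be made concrete with a toy simulator (a sketch under our own naming; the actual implementation wraps Python objects rather than simulating a stack explicitly):

```python
import math

class LRUStack:
    """Toy byte-level LRU stack; index 0 of `order` is the top (MRU)."""

    def __init__(self, args):
        # args: list of (name, size_in_bytes) in call order; the last
        # argument pushed ends up on top of the stack.
        self.order = list(reversed(args))

    def _find(self, name):
        """Return (list index, 1-indexed byte depth, size) of a block."""
        depth = 1
        for i, (n, size) in enumerate(self.order):
            if n == name:
                return i, depth, size
            depth += size
        raise KeyError(name)

    def cost(self, name):
        """Read cost of a block: sum of ceil(sqrt(d)) over its bytes."""
        _, start, size = self._find(name)
        total = 0
        for d in range(start, start + size):
            r = math.isqrt(d)
            total += r if r * r == d else r + 1
        return total

    def instruction(self, inputs, outputs):
        # Phase 1: price all reads against the pre-instruction state,
        # charging repeated inputs once per occurrence.
        total = sum(self.cost(x) for x in inputs)
        # Phase 2: move inputs to the top sequentially, in read order.
        for x in inputs:
            i, _, _ = self._find(x)
            self.order.insert(0, self.order.pop(i))
        # Phase 3: push output blocks at zero cost.
        for block in outputs:
            self.order.insert(0, block)
        return total
```

Pricing every input before any LRU movement is what makes the cost of `b + c` equal that of `c + b` even though the two leave different final stack states.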
Consider the following function with four scalar arguments:
```python
def my_add(a, b, c, d):
    return b + c
```

1. Initial Stack

Arguments are pushed in call order [a, b, c, d], yielding element depths from the top:

- `d`: depth 1
- `c`: depth 2
- `b`: depth 3
- `a`: depth 4
2. Read Cost
Inputs are priced simultaneously against the initial stack state. With 1-byte scalars, $C(b) + C(c) = \lceil\sqrt{3}\rceil + \lceil\sqrt{2}\rceil = 2 + 2 = 4$.
3. Update Stack
Inputs move to the top sequentially in read order (b, then c), followed by the new result being pushed:
[a, d, b, c, result]
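The read cost of this example can be double-checked in a few lines (assuming 1-byte scalars):

```python
import math

# Byte depths of each argument in the initial stack [a, b, c, d]
depths = {"a": 4, "b": 3, "c": 2, "d": 1}

# b + c reads b (depth 3) and c (depth 2) against the initial state
cost = sum(math.ceil(math.sqrt(depths[x])) for x in ("b", "c"))
print(cost)  # ceil(sqrt(3)) + ceil(sqrt(2)) = 2 + 2 = 4
```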
See the `benchmarks/` folder for the scripts behind the tables below.
| Algorithm | Operation | ByteDMD Cost |
|---|---|---|
| matvec (i-j) | y = A @ x | 194 |
| vecmat (j-i) | y = x^T @ A | 191 |
| Algorithm | Operation | ByteDMD Cost |
|---|---|---|
| matmul (i-j-k) | C = A @ B | 948 |
| matmul (i-k-j) | C = A @ B | 1016 |
| matmul (snake-j) | C = A @ B | 906 |
| matmul (2x2 tiled) | C = A @ B | 947 |
| Strassen (leaf=1) | C = A @ B | 2435 |
| Winograd | C = A @ B | 2178 |
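For reference, the two plain loop orders compared in the table can be sketched as follows (our sketches, not the measured implementations; both compute the same product, differing only in traversal order and hence data movement):

```python
def matmul_ijk(A, B):
    # i-j-k order: computes each C[i][j] fully before moving on,
    # streaming a row of A against a column of B.
    n, m, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                C[i][j] += A[i][k] * B[k][j]
    return C

def matmul_ikj(A, B):
    # i-k-j order: reuses A[i][k] across an entire row of B,
    # trading a column walk of B for row-contiguous access.
    n, m, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(n)]
    for i in range(n):
        for k in range(m):
            for j in range(p):
                C[i][j] += A[i][k] * B[k][j]
    return C
```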
Architecture: vocab=4, embd=4, heads=2, head_dim=2, 1 layer, block_size=4.
Based on Karpathy's microGPT.
| Algorithm | Operation | ByteDMD Cost |
|---|---|---|
| microGPT (1 layer, embd=4) | single token forward | 7047 |
This version implements ByteDMD by wrapping Python objects, so the "Instruction Set" of this metric corresponds to Python built-ins, documented under `docs/instruction_set.md`.

Because of Python's semantics, this implementation occasionally diverges from the README's model, and values can sometimes escape the wrapping mechanism; known failure cases are documented in `test_gotchas.py`.