Deep Learning on Raspberry Pi and Low-End Devices Made Easy
"Democratizing Deep Learning for Resource-Constrained Environments"
LowMind is a pure-NumPy deep learning framework built from scratch for Raspberry Pi, embedded systems, and any resource-constrained environment. It gives you a PyTorch-like API without the multi-GB installation — just NumPy and psutil.
pip install lowmind

- Features
- Installation
- Quick Start
- Full API Reference
- Examples
- Project Structure
- Raspberry Pi Tips
- Contributing
- License
| Category | What's included |
|---|---|
| Autograd | Reverse-mode automatic differentiation, full broadcasting, tuple-axis support |
| Layers | Linear, Conv2d, BatchNorm1d/2d, MaxPool2d, AvgPool2d, Flatten, Dropout, Embedding |
| Activations | ReLU, LeakyReLU, ELU, GELU, Sigmoid, Tanh, Softmax, LogSoftmax |
| Loss Functions | CrossEntropy, BCE, MSE, MAE, Huber, NLL |
| Optimizers | SGD (+ Nesterov), Adam, AdamW, RMSprop, AdaGrad |
| LR Schedulers | StepLR, MultiStepLR, ExponentialLR, CosineAnnealingLR, ReduceLROnPlateau, CyclicLR, LinearWarmup |
| Data | Dataset, TensorDataset, DataLoader, train_test_split |
| Metrics | accuracy, top-k accuracy, precision, recall, F1, confusion matrix, R², MSE, MAE |
| Trainer | High-level training loop with callbacks, gradient clipping, validation |
| Callbacks | EarlyStopping, ModelCheckpoint, LRSchedulerCallback, History |
| Models | MicroMLP, MicroCNN, TinyResNet |
| Monitoring | SystemMonitor, memory_trace, health_score |
| Model I/O | save/load (compressed gzip or plain pickle), state_dict, load_state_dict |
pip install lowmind

git clone https://github.com/dhaval-vedra/lowmind.git
cd lowmind
pip install -e .

sudo apt update
sudo apt install python3-pip python3-numpy python3-psutil
pip3 install lowmind

numpy>=1.19.0
psutil>=5.8.0
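
To confirm that an existing environment already meets these minimum versions, a small check along the following lines may help. This is a sketch and not part of LowMind: the helper names `meets_minimum` and `report` are hypothetical, and the parser only understands plain numeric versions such as 1.19.0.

```python
from importlib import metadata

# Minimum versions copied from the requirements listed above.
REQUIREMENTS = {"numpy": "1.19.0", "psutil": "5.8.0"}


def _numeric_parts(version):
    """Turn '1.19.0' into (1, 19, 0); stop at the first non-numeric piece."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(installed, required):
    """Return True if the installed version is at least the required one."""
    a, b = _numeric_parts(installed), _numeric_parts(required)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))  # pad so that '1.19' compares equal to '1.19.0'
    b += (0,) * (width - len(b))
    return a >= b


def report(requirements=REQUIREMENTS):
    """Map each package name to 'ok', 'too old', or 'missing'."""
    status = {}
    for name, minimum in requirements.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            status[name] = "missing"
            continue
        status[name] = "ok" if meets_minimum(installed, minimum) else "too old"
    return status


if __name__ == "__main__":
    for package, state in report().items():
        print(f"{package}: {state}")
```
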
import lowmind as lm
import numpy as np
# Build a model
model = lm.Sequential(
    lm.Linear(784, 128),
    lm.ReLU(),
    lm.Dropout(0.3),
    lm.Linear(128, 10),
)
# Create optimizer
optimizer = lm.Adam(model.parameters(), lr=1e-3)
# Prepare data
X = np.random.randn(1000, 784).astype(np.float32)
y = np.random.randint(0, 10, 1000)
loader = lm.DataLoader(lm.TensorDataset(X, y), batch_size=64, shuffle=True)
# Training loop
for epoch in range(20):
    model.train()
    for X_batch, y_batch in loader:
        optimizer.zero_grad()
        output = model(X_batch)
        loss = lm.cross_entropy_loss(output, y_batch)
        loss.backward()
        optimizer.step()
    print(f"Epoch {epoch+1} done")

lm.Tensor is the core data structure: an N-dimensional array with automatic gradient tracking.
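
To make "automatic gradient tracking" concrete, here is a minimal scalar sketch of reverse-mode differentiation in plain Python. It is not LowMind's implementation: the `Value` class and its attributes are hypothetical names chosen for illustration, and it handles only addition and multiplication on scalars, whereas lm.Tensor operates on N-dimensional arrays.

```python
class Value:
    """A scalar that remembers how it was computed so gradients can flow back."""

    def __init__(self, data, parents=()):
        self.data = float(data)
        self.grad = 0.0
        self._parents = parents
        self._backward_fn = None  # set by the operation that created this node

    @staticmethod
    def _wrap(other):
        return other if isinstance(other, Value) else Value(other)

    def __add__(self, other):
        other = Value._wrap(other)
        out = Value(self.data + other.data, (self, other))

        def backward_fn():
            self.grad += out.grad   # d(a + b)/da = 1
            other.grad += out.grad  # d(a + b)/db = 1

        out._backward_fn = backward_fn
        return out

    __radd__ = __add__

    def __mul__(self, other):
        other = Value._wrap(other)
        out = Value(self.data * other.data, (self, other))

        def backward_fn():
            self.grad += other.data * out.grad  # d(a * b)/da = b
            other.grad += self.data * out.grad  # d(a * b)/db = a

        out._backward_fn = backward_fn
        return out

    __rmul__ = __mul__

    def backward(self):
        """Apply the chain rule in reverse topological order."""
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0  # derivative of the output with respect to itself
        for node in reversed(order):
            if node._backward_fn is not None:
                node._backward_fn()


x = Value(3.0)
y = x * x + 2 * x + 1  # y = x^2 + 2x + 1
y.backward()
print(y.data, x.grad)  # 16.0 8.0
```

Each operation stores a closure that knows its local derivative. Calling backward() replays those closures from the output towards the inputs, which is the chain rule applied in reverse order.
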
import lowmind as lm
import numpy as np
# From data
t = lm.Tensor([1.0, 2.0, 3.0]) # from list
t = lm.Tensor(np.array([[1, 2], [3, 4]])) # from numpy
t = lm.Tensor(5.0, requires_grad=True) # scalar with grad
# Factory functions
lm.zeros(3, 4) # shape (3,4) filled with 0
lm.ones(2, 2) # shape (2,2) filled with 1
lm.randn(10, 10) # shape (10,10) random normal
lm.rand(5, 5) # shape (5,5) random uniform [0,1]
lm.arange(0, 10, 2) # [0, 2, 4, 6, 8]
lm.from_numpy(arr) # wrap a numpy array

a = lm.Tensor([1., 2., 3.], requires_grad=True)
b = lm.Tensor([4., 5., 6.], requires_grad=True)
c = a + b # addition
c = a - b # subtraction
c = a * b # element-wise multiply
c = a / b # element-wise divide
c = a ** 2 # power
c = a @ b # matrix multiply (for 2-D)
c = -a # negation

x = lm.Tensor([[1., 2.], [3., 4.]])
x.sum() # scalar: 10.0
x.sum(axis=0) # [4., 6.]
x.sum(axis=1, keepdims=True) # [[3.], [7.]]
x.mean() # 2.5
lm.randn(2, 3, 4, 4).mean(axis=(2, 3)) # tuple axes need a 4-D input (CNN global pooling) → shape (2, 3)
x.max(axis=1) # row-wise max
x.min() # global min

x = lm.Tensor([-2., -1., 0., 1., 2.])
x.relu() # [0, 0, 0, 1, 2]
x.sigmoid() # [0.12, 0.27, 0.5, 0.73, 0.88]
x.tanh() # [-0.96, -0.76, 0, 0.76, 0.96]
x.leaky_relu(0.01) # [-0.02, -0.01, 0, 1, 2]
x.elu(1.0) # smooth version of relu
x.gelu() # gaussian error linear
x.softmax(axis=-1) # probability distribution
x.exp() # element-wise e^x
x.log() # element-wise ln(x); ln is defined only for positive inputs
x.abs() # absolute value
x.clip(-1, 1) # clamp values

x = lm.Tensor(np.arange(24).reshape(2, 3, 4))
x.reshape(6, 4) # (6, 4)
x.flatten(start_dim=1) # (2, 12)
x.transpose((0, 2, 1)) # (2, 4, 3)
x.T # transpose (last two dims)
x.unsqueeze(axis=0) # add a size-1 dim → (1, 2, 3, 4)
x.unsqueeze(axis=0).squeeze(axis=0) # remove that size-1 dim → (2, 3, 4)
x[0] # index; gradient flows through

# Compute gradient of y = x^2 + 2x + 1 at x=3
x = lm.Tensor(3.0, requires_grad=True)
y = x**2 + 2*x + 1
y.backward()
print(x.grad) # 8.0 (dy/dx = 2x + 2 = 8)
# Multi-variable
a = lm.Tensor([1., 2.], requires_grad=True)
b = lm.Tensor([3., 4.], requires_grad=True)
loss = (a * b).sum()
loss.backward()
print(a.grad) # [3., 4.]
print(b.grad) # [1., 2.]
# Gradient clipping
lm.clip_grad_norm(model.parameters(), max_norm=1.0)

t.item() # extract Python float (for 0-d or 1-element tensors)
t.numpy() # get the underlying numpy array
t.detach() # new tensor without grad tracking
t.copy() # full copy including grad
t.shape # shape tuple
t.ndim # number of dimensions
t.size # total number of elements
t.dtype # numpy dtype (always float32)
t.zero_grad() # fill grad with zeros
repr(t) # Tensor(shape=(3,), dtype=float32, requires_grad=True)

All layers are subclasses of lm.Module. They can be used standalone or combined in lm.Sequential.
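
The shape comments in the layer examples that follow rest on the usual output-size arithmetic for sliding windows. The helper below sketches that arithmetic, assuming LowMind follows the common convention (no dilation, floor rounding, symmetric padding). The names `window_output_size` and `conv2d_output_shape` are hypothetical and are not LowMind functions.

```python
def window_output_size(size, kernel_size, stride=1, padding=0):
    """Output length along one spatial axis for a conv or pooling window."""
    if size + 2 * padding < kernel_size:
        raise ValueError("window is larger than the padded input")
    return (size + 2 * padding - kernel_size) // stride + 1


def conv2d_output_shape(input_shape, out_channels, kernel_size, stride=1, padding=0):
    """Map an (N, C, H, W) input shape to the (N, out_channels, H', W') output shape."""
    n, _, height, width = input_shape
    return (
        n,
        out_channels,
        window_output_size(height, kernel_size, stride, padding),
        window_output_size(width, kernel_size, stride, padding),
    )


# A 3x3 convolution with padding=1 and stride=1 keeps the spatial size.
print(conv2d_output_shape((8, 3, 32, 32), out_channels=32, kernel_size=3, stride=1, padding=1))
# A 2x2 pooling window with stride 2 halves each spatial dimension.
print(window_output_size(32, kernel_size=2, stride=2))
```

Running the arithmetic before building a model is a cheap way to find the input size of the first Linear layer after a Flatten.
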
layer = lm.Linear(in_features=784, out_features=256, bias=True)
# Input: (N, 784)
# Output: (N, 256)

layer = lm.Conv2d(
    in_channels=3,
    out_channels=32,
    kernel_size=3, # or (3, 3)
    stride=1, # or (1, 1)
    padding=1, # or (1, 1)
    bias=True,
)
# Input: (N, 3, H, W)
# Output: (N, 32, H, W) when padding=1, stride=1

bn1 = lm.BatchNorm1d(256) # for (N, features) inputs
bn2 = lm.BatchNorm2d(32) # for (N, C, H, W) inputs
# Normalizes to mean=0, std=1 per batch
# Has learnable gamma (scale) and beta (shift)

pool = lm.MaxPool2d(kernel_size=2, stride=2) # halves spatial dims
pool = lm.AvgPool2d(kernel_size=2)
# Input: (N, C, H, W)
# Output: (N, C, H//2, W//2)

flatten = lm.Flatten(start_dim=1)
# (N, C, H, W) → (N, C*H*W)

drop = lm.Dropout(p=0.5) # 50% dropout during training
# Automatically disabled during model.eval()

embed = lm.Embedding(num_embeddings=10000, embedding_dim=128)
indices = lm.Tensor([0, 3, 7])
out = embed(indices) # (3, 128)

class MyBlock(lm.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.fc = lm.Linear(in_features, out_features)
        self.bn = lm.BatchNorm1d(out_features)

    def forward(self, x: lm.Tensor) -> lm.Tensor:
        return self.bn(self.fc(x)).relu()

block = MyBlock(64, 32)
out = block(lm.Tensor(np.random.randn(8, 64).astype(np.float32)))
# out.shape → (8, 32)

Stack layers in order:
from collections import OrderedDict
# Positional
model = lm.Sequential(
    lm.Linear(784, 256),
    lm.ReLU(),
    lm.BatchNorm1d(256),
    lm.Dropout(0.3),
    lm.Linear(256, 10),
)
# Named (OrderedDict)
model = lm.Sequential(OrderedDict([
    ('fc1', lm.Linear(784, 256)),
    ('relu', lm.ReLU()),
    ('fc2', lm.Linear(256, 10)),
]))
print(model) # shows architecture
model.num_parameters() # total trainable parameter count

All loss functions return a scalar Tensor with requires_grad=True.
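
As an illustration of what a cross-entropy loss on raw logits computes, the sketch below uses plain Python lists and the log-sum-exp trick for numerical stability. It shows the standard formula only. Whether LowMind computes it this way internally is an assumption, and `log_softmax` and `cross_entropy` here are hypothetical stand-alone helpers, not the lm functions.

```python
import math


def log_softmax(logits):
    """Log-probabilities for one row of logits, stabilised by subtracting the max."""
    largest = max(logits)
    log_sum = largest + math.log(sum(math.exp(v - largest) for v in logits))
    return [v - log_sum for v in logits]


def cross_entropy(batch_logits, targets, reduction="mean"):
    """Negative log-probability of the target class, reduced over the batch."""
    losses = [-log_softmax(row)[target] for row, target in zip(batch_logits, targets)]
    total = sum(losses)
    if reduction == "sum":
        return total
    if reduction == "mean":
        return total / len(losses)
    raise ValueError(f"unknown reduction: {reduction!r}")


# Uniform logits over 4 classes give a loss of ln(4), whatever the target is.
print(cross_entropy([[0.0, 0.0, 0.0, 0.0]], [2]))
# Very large logits do not overflow because the row maximum is subtracted first.
print(cross_entropy([[1000.0, 0.0]], [0]))
```
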
# logits: (N, C) targets: (N,) integer class indices
loss = lm.cross_entropy_loss(logits, targets)
loss = lm.cross_entropy_loss(logits, targets, reduction='sum')

# output: probabilities [0,1] or raw logits
loss = lm.binary_cross_entropy_loss(output, targets)
loss = lm.binary_cross_entropy_loss(logits, targets, from_logits=True)

loss = lm.mse_loss(predictions, targets)
loss = lm.mse_loss(predictions, targets, reduction='sum')

loss = lm.mae_loss(predictions, targets)

# Quadratic for |error| < delta, linear otherwise
loss = lm.huber_loss(predictions, targets, delta=1.0)

log_probs = lm.LogSoftmax()(logits) # (N, C)
loss = lm.nll_loss(log_probs, targets) # targets: (N,)

All optimizers share the same interface:
optimizer = lm.Adam(model.parameters(), lr=1e-3)
# Each training step:
optimizer.zero_grad() # reset gradients
loss.backward() # compute gradients
optimizer.step() # update weights

optimizer = lm.SGD(
    model.parameters(),
    lr=0.01,
    momentum=0.9, # momentum coefficient
    weight_decay=1e-4, # L2 regularization
    nesterov=True, # use Nesterov momentum
)

optimizer = lm.Adam(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.999), # (beta1, beta2)
    eps=1e-8,
    weight_decay=0.0,
    amsgrad=False, # AMSGrad variant
)

# Adam with decoupled weight decay (preferred for regularization)
optimizer = lm.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)

optimizer = lm.RMSprop(
    model.parameters(),
    lr=1e-3,
    alpha=0.99, # smoothing factor
    momentum=0.0,
    weight_decay=0.0,
)

optimizer = lm.AdaGrad(model.parameters(), lr=0.01)

scheduler = lm.StepLR(optimizer, step_size=10, gamma=0.5)
for epoch in range(epochs):
    train(...)
    scheduler.step()

scheduler = lm.CosineAnnealingLR(optimizer, T_max=50, eta_min=1e-6)

scheduler = lm.ReduceLROnPlateau(
    optimizer, mode='min', patience=5, factor=0.5, verbose=True)
for epoch in range(epochs):
    val_loss = validate(...)
    scheduler.step(val_loss) # pass the metric

scheduler = lm.MultiStepLR(optimizer, milestones=[30, 60, 90], gamma=0.1)

scheduler = lm.ExponentialLR(optimizer, gamma=0.95)

scheduler = lm.LinearWarmupLR(optimizer, warmup_steps=1000, target_lr=1e-3)

scheduler = lm.CyclicLR(
    optimizer, base_lr=1e-4, max_lr=1e-1,
    step_size=2000, mode='triangular')
for batch in loader:
    train(...)
    scheduler.step() # step per batch, not per epoch

# Wrap numpy arrays or Tensors
ds = lm.TensorDataset(X_train, y_train)
print(len(ds)) # number of samples
X, y = ds[0] # get first sample
# Custom Dataset
class MyDataset(lm.Dataset):
    def __init__(self, X, y):
        self.X, self.y = X, y

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

loader = lm.DataLoader(
    dataset=ds,
    batch_size=64,
    shuffle=True, # shuffle before each epoch
    drop_last=False, # set True to drop an incomplete last batch
)
for X_batch, y_batch in loader:
    # X_batch and y_batch are Tensors
    pass
print(len(loader)) # number of batches

X_train, X_val, y_train, y_val = lm.train_test_split(
    X, y,
    test_size=0.2, # 20% validation
    shuffle=True,
    seed=42,
)

All metrics accept Tensors or numpy arrays.
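
The average='macro', 'micro', and 'none' options that appear in the listing below differ only in how per-class counts are combined. The following plain-Python sketch shows the standard definitions via a confusion matrix. It is an illustration, not LowMind's code: the function names are local to this example, and the choice of rows for true classes and columns for predicted classes is an assumption that may not match the layout of lm.confusion_matrix.

```python
def confusion_matrix(predicted, actual, num_classes):
    """Count (true class, predicted class) pairs; rows are true classes."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for p, a in zip(predicted, actual):
        matrix[a][p] += 1
    return matrix


def _f1(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN), defined as 0 when there is nothing to count."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def f1_score(matrix, average="macro"):
    """F1 from a confusion matrix with 'macro', 'micro', or 'none' averaging."""
    n = len(matrix)
    tp = [matrix[c][c] for c in range(n)]
    fp = [sum(matrix[r][c] for r in range(n)) - tp[c] for c in range(n)]  # column minus diagonal
    fn = [sum(matrix[c]) - tp[c] for c in range(n)]                       # row minus diagonal
    if average == "micro":
        return _f1(sum(tp), sum(fp), sum(fn))  # pool the counts, then score once
    per_class = [_f1(tp[c], fp[c], fn[c]) for c in range(n)]
    if average == "none":
        return per_class
    if average == "macro":
        return sum(per_class) / n              # score each class, then average
    raise ValueError(f"unknown average: {average!r}")


actual = [0, 0, 1, 1, 2, 2]
predicted = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(predicted, actual, num_classes=3)
print(cm)
print(f1_score(cm, "none"), f1_score(cm, "macro"), f1_score(cm, "micro"))
```

For single-label classification the micro-averaged F1 equals plain accuracy, which is why macro averaging is usually the more informative choice when classes are imbalanced.
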
# Classification
lm.accuracy(predictions, targets) # 0-1 float
lm.top_k_accuracy(logits, targets, k=5) # 0-1 float
lm.precision(logits, targets, num_classes=10) # macro by default
lm.recall(logits, targets, num_classes=10)
lm.f1_score(logits, targets, num_classes=10)
lm.confusion_matrix(logits, targets) # (C, C) numpy array
# Regression
lm.r2_score(predictions, targets) # R² coefficient
lm.mean_squared_error(predictions, targets) # MSE
lm.mean_absolute_error(predictions, targets) # MAE
# All precision/recall/f1 support average='macro', 'micro', or 'none'
per_class_f1 = lm.f1_score(logits, targets, num_classes=10, average='none')

High-level training loop: handles training, validation, logging, and callbacks automatically.
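
One of the callbacks a training loop like this typically drives is early stopping, whose patience, min_delta, and mode parameters appear in the callback listings further down. The sketch below shows the usual bookkeeping behind those three parameters. The `EarlyStopper` class is a hypothetical stand-alone illustration and is not claimed to match lm.EarlyStopping internally.

```python
class EarlyStopper:
    """Signal a stop after `patience` consecutive epochs without improvement."""

    def __init__(self, patience=10, min_delta=1e-4, mode="min"):
        if mode not in ("min", "max"):
            raise ValueError("mode must be 'min' or 'max'")
        self.patience = patience
        self.min_delta = min_delta
        self.mode = mode
        self.best = None
        self.bad_epochs = 0

    def _improved(self, value):
        if self.best is None:
            return True
        if self.mode == "min":
            return value < self.best - self.min_delta  # lower is better (loss)
        return value > self.best + self.min_delta      # higher is better (accuracy)

    def update(self, value):
        """Record one epoch's metric; return True when training should stop."""
        if self._improved(value):
            self.best = value
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Simulated validation losses: the best value arrives at epoch 3.
val_losses = [1.0, 0.8, 0.79, 0.81, 0.80, 0.795, 0.9]
stopper = EarlyStopper(patience=3, min_delta=0.0, mode="min")
stopped_at = None
for epoch, val_loss in enumerate(val_losses, start=1):
    if stopper.update(val_loss):
        stopped_at = epoch
        break
print(stopped_at, stopper.best)  # 6 0.79
```

A larger min_delta makes the stopper ignore tiny fluctuations, which matters on small validation sets where the loss is noisy.
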
trainer = lm.Trainer(
    model=model,
    optimizer=lm.Adam(model.parameters(), lr=1e-3),
    loss_fn=lm.cross_entropy_loss,
    callbacks=[
        lm.EarlyStopping(patience=10),
        lm.ModelCheckpoint('/tmp/best.lmz'),
    ],
    clip_grad=1.0, # gradient norm clipping (0 = off)
    verbose=1, # print every N epochs
)
history = trainer.fit(train_loader, val_loader, epochs=100)
# history = {'train_loss': [...], 'val_loss': [...], 'val_acc': [...]}
# Evaluate
val_loss, val_acc = trainer.evaluate(val_loader)
# Inference
predictions = trainer.predict(X_test) # numpy array of class indices

cb = lm.EarlyStopping(
    patience=10, # epochs to wait
    min_delta=1e-4, # minimum improvement
    mode='min', # 'min' for loss, 'max' for accuracy
    verbose=True,
)

cb = lm.ModelCheckpoint(
    filepath='/tmp/best_model.lmz',
    monitor='val_loss',
    mode='min',
    verbose=True,
    save_best_only=True,
)

scheduler = lm.ReduceLROnPlateau(optimizer, patience=5)
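cb = lm.LRSchedulerCallback(scheduler, monitor='val_loss')

history_cb = lm.History()
# Assumed wiring: register the callback with the Trainer so it can record each epoch
trainer = lm.Trainer(model=model, optimizer=optimizer, loss_fn=lm.cross_entropy_loss, callbacks=[history_cb])
trainer.fit(train_loader, val_loader, epochs=50)
print(history_cb.history['train_loss'])

model = lm.MicroMLP(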
    input_size=784,
    hidden_sizes=[256, 128], # list of hidden layer sizes
    output_size=10,
    dropout=0.3,
)

model = lm.MicroCNN(
    in_channels=3, # 3 = RGB, 1 = grayscale
    num_classes=10,
    input_size=32, # spatial size (HxW must be square)
    dropout=0.2,
)
# Input: (N, 3, 32, 32)
# Output: (N, 10)

model = lm.TinyResNet(
    in_channels=3,
    num_classes=10,
    input_size=32,
    base_filters=16, # reduce to 8 for very constrained devices
)

# Save weights (compressed gzip, recommended)
model.save('/path/to/model.lmz')
# Save uncompressed
model.save('/path/to/model.lm', compress=False)
# Load into a same-architecture model
model.load('/path/to/model.lmz')
# Access raw state dict
sd = model.state_dict() # {'0.weight': ndarray, '0.bias': ndarray, ...}
model.load_state_dict(sd) # restore from dict
model.load_state_dict(sd, strict=False) # ignore missing keys
# Count parameters
model.num_parameters() # total trainable params
model.summary() # print architecture table

# Configure memory limit (especially important on Raspberry Pi)
lm.configure_memory(max_mb=128) # default 256MB
# Monitor system health
monitor = lm.SystemMonitor()
monitor.print_status() # print CPU, RAM, temp stats
stats = monitor.get_stats() # dict of all stats
score = monitor.health_score() # 0-100 score
# Trace memory usage of a block
with lm.memory_trace("Forward Pass"):
    out = model(X)
# Optimize for inference (drop gradient buffers)
lm.memory_manager.optimize_for_inference()
# Get current memory info
info = lm.memory_manager.get_memory_info()
# {'allocated_mb': 12.3, 'max_mb': 256.0, 'usage_percent': 4.8, ...}

Ten complete examples are in the examples/ folder:
| File | What it demonstrates |
|---|---|
| 01_basic_tensors.py | Tensor creation, arithmetic, autograd from scratch |
| 02_linear_regression.py | Linear regression with SGD, custom training loop |
| 03_mlp_classification.py | XOR classification with Sequential, Adam, DataLoader |
| 04_mnist_like.py | Full pipeline: MicroMLP + Trainer + EarlyStopping + ModelCheckpoint |
| 05_cnn_image.py | MicroCNN for image classification, BatchNorm, MaxPool |
| 06_optimizers_comparison.py | Benchmark SGD vs Adam vs RMSprop vs AdaGrad |
| 07_custom_layer.py | Build custom attention layer, LayerNorm, transformer block |
| 08_save_load_model.py | Save/load weights, state dict, transfer learning |
| 09_lr_schedulers.py | Compare 6 LR scheduler strategies |
| 10_raspberry_pi_monitor.py | System monitoring, memory tracing, health scoring |
Run any example:
cd lowmind
python examples/01_basic_tensors.py
python examples/04_mnist_like.py

lowmind/
├── lowmind/ # Main package
│ ├── __init__.py # Public API — all exports here
│ ├── core/
│ │ ├── tensor.py # Tensor class + autograd engine
│ │ ├── memory.py # MemoryManager (LRU, GC optimization)
│ │ └── module.py # Module base class (save/load, parameter iteration)
│ ├── nn/
│ │ ├── layers.py # Linear, Conv2d, BatchNorm, Pool, Flatten, Dropout, Embedding
│ │ ├── activation.py # ReLU, LeakyReLU, ELU, GELU, Sigmoid, Tanh, Softmax
│ │ ├── loss.py # cross_entropy, bce, mse, mae, huber, nll
│ │ └── sequential.py # Sequential container
│ ├── optim/
│ │ ├── sgd.py # SGD + Nesterov momentum
│ │ ├── adam.py # Adam, AdamW, RMSprop, AdaGrad
│ │ └── scheduler.py # StepLR, CosineAnnealingLR, ReduceLROnPlateau, ...
│ ├── data/
│ │ └── dataloader.py # Dataset, TensorDataset, DataLoader, train_test_split
│ ├── utils/
│ │ ├── metrics.py # accuracy, precision, recall, f1, r2, ...
│ │ ├── trainer.py # Trainer (high-level training loop)
│ │ ├── callbacks.py # EarlyStopping, ModelCheckpoint, History, ...
│ │ └── monitor.py # SystemMonitor, memory_trace
│ └── models/
│ └── micro_cnn.py # MicroMLP, MicroCNN, TinyResNet
├── examples/ # 10 complete runnable examples
├── tests/ # pytest test suite
├── docs/ # Extended documentation
├── setup.py
├── requirements.txt
└── README.md
import lowmind as lm
# 1. Set memory limit appropriate for your Pi model
lm.configure_memory(max_mb=64) # Pi Zero / 512MB Pi
lm.configure_memory(max_mb=128) # Pi 3 (1GB)
lm.configure_memory(max_mb=256) # Pi 4 (2GB+)
# 2. Use small batch sizes
loader = lm.DataLoader(ds, batch_size=8) # Pi Zero
loader = lm.DataLoader(ds, batch_size=16) # Pi 3/4
# 3. Use Pi-optimized architectures
model = lm.MicroMLP(784, [64], 10) # smallest
model = lm.MicroCNN(in_channels=1, num_classes=10, input_size=28)
# 4. Monitor health during training
monitor = lm.SystemMonitor()
if monitor.health_score() < 40:
    print("Warning: system under stress, reduce batch size")
# 5. Free memory after training
lm.memory_manager.optimize_for_inference()
import gc; gc.collect()
# 6. Reduce model size for inference
model.save('/tmp/model.lmz', compress=True) # ~70% smaller than plain

Contributions are welcome! Areas where help is needed:
- Performance benchmarks on more Pi models
- LSTM / GRU layers
- Quantization (INT8 inference)
- Distributed training across multiple Pis
Submitting a PR:
- Fork and create a feature branch
- Run tests: pytest tests/ -v
- Add tests for new features
- Submit a PR with a clear description
pip install pytest
pytest tests/ -v

MIT License (see LICENSE)
Built with care in India by Dhaval Vedra
Empowering AI at the edge — from data centers down to $35 computers