BitForge shrinks a model export so it can fit on tiny hardware, then keeps it fast by doing three things well:
- Weight compression — packs model weights into low-bit storage.
- Context trimming — keeps the useful part of the active prompt/cache and drops the rest.
- Block pruning — removes low-value blocks first when a device budget is tight.
That third part matters when you actually want to fit and run on ESP-class boards, or make a Raspberry Pi demo feel fast: tiny hardware has no room for a giant active context or dead weight.
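
To make the first of those concrete, here is a toy numpy sketch of the arithmetic behind 4-bit packing. This shows the idea (two 4-bit values share one byte), not BitForge's actual storage format:

```python
import numpy as np

# Illustration only: pack pairs of 4-bit integers (0..15) into single bytes.
vals = np.array([3, 12, 7, 0, 15, 9], dtype=np.uint8)
packed = (vals[0::2] << 4) | vals[1::2]   # 6 values -> 3 bytes

# Unpack: high nibble first, then low nibble.
hi = packed >> 4
lo = packed & 0x0F
restored = np.empty_like(vals)
restored[0::2], restored[1::2] = hi, lo
assert np.array_equal(vals, restored)
```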
BitForge is useful when you want:
- a small, fast local demo
- a model export that fits a strict RAM/flash budget
- a way to prove “this runs on tiny gear” without pretending it is magic
- a faster prompt loop by trimming unnecessary context before inference
What BitForge does:
- compress weights with bit packing
- prune low-value blocks first when a device budget is tight
- compact prompt/context tokens so only the useful part stays active
- generate an embedded C project for ESP32, Arduino, or STM32 (a sketch follows this list)
- run a local simulator to estimate speed and memory use
- give you a sane path to a fast Raspberry Pi proof-of-concept
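
The embedded C export is not demonstrated elsewhere in this README, so here is a hedged sketch of what calling it might look like. `generate_c_project` and every argument name below are assumptions for illustration, not a confirmed BitForge API; the real entry point lives in `bitforge/generate/c_codegen.py`:

```python
# Hypothetical sketch only: the entry-point name and arguments below are
# assumptions for illustration, not a confirmed BitForge API.
from bitforge.generate import c_codegen

c_codegen.generate_c_project(        # assumed function name
    model_dir="./compressed_model",  # output of `bitforge compress`
    target="esp32-s3",               # docs list ESP32, Arduino, STM32 targets
    output_dir="./firmware",
)
```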
What it cannot do:
- It cannot make a big model truly “free” on tiny hardware.
- It cannot turn a normal 7B model into a real-time ESP32 model.
- It cannot guarantee high tokens/sec on every board.
If you want it genuinely fast, the trick is to use a small model and then reduce everything around it: context, weights, memory churn, and pointless blocks.
Where to run it:
- Raspberry Pi: good for proving the idea and getting actual speed.
- ESP32-S3: good for tiny demos and tight memory budgets.
- Arduino: only for extremely small toy examples.
Install:

```bash
pip install -e .
```

Quantize a weight tensor:

```python
import numpy as np
from bitforge import Quantizer, QuantizationConfig
weights = np.random.randn(256, 256).astype(np.float32)
q = Quantizer(QuantizationConfig(mode="adaptive"))
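# quantize_tensor packs the tensor into 4-bit storage and returns the
# scale/zero-point that bitforge/dequantize.py uses to restore values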
packed, scale, zero_point = q.quantize_tensor(weights, bits=4)
```

Keep only the useful part of a long prompt/context:

```python
from bitforge.context import ContextCompressor, ContextCompressionConfig
compressor = ContextCompressor(ContextCompressionConfig(max_tokens=128))
result = compressor.compress_tokens(list(range(500)))
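# 500 input tokens are compacted to at most max_tokens=128, keeping the useful part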
print(result.compressed_tokens)
```

Prune low-value blocks to fit a byte budget:

```python
from bitforge.prune import BlockPruner, PruningConfig
pruner = BlockPruner(PruningConfig(block_size=64, target_keep_ratio=0.15))
export = pruner.prune_to_budget({"layer": weights}, budget_bytes=200_000)
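# the lowest-value 64-weight blocks are dropped first until the export fits in 200 kB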
print(export.compression_ratio)
```

The same classes are also importable straight from the package root:

```python
from bitforge import ContextCompressor, ContextCompressionConfig, BlockPruner, PruningConfig
```

For a Raspberry Pi demo, use the following; a combined sketch comes right after the list:
- a small model
- short context windows
- 4-bit or 2-bit weights
- block pruning for the least important layers
- a strict max-token limit
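
Expressed with the classes shown above, that checklist might look like this. The specific numbers are illustrative, not tuned recommendations:

```python
from bitforge import (
    Quantizer, QuantizationConfig,
    ContextCompressor, ContextCompressionConfig,
    BlockPruner, PruningConfig,
)

# Illustrative settings for a Pi-class budget; tune the numbers to your model.
quantizer = Quantizer(QuantizationConfig(mode="adaptive"))  # pass bits=4 (or 2) to quantize_tensor
pruner = BlockPruner(PruningConfig(block_size=64, target_keep_ratio=0.15))
context = ContextCompressor(ContextCompressionConfig(max_tokens=64))  # strict max-token limit
```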
The simulator reports throughput estimates, and the pruning/context tools reduce the amount of work before the model even starts.
Try the CLI end to end:

```bash
bitforge compress gpt2 --target esp32-s3 --bits 4 --output ./compressed_model
bitforge prune ./compressed_model --budget-bytes 200000
bitforge simulate ./compressed_model --prompt "Hello"
```

Run the tests:

```bash
PYTHONPATH=. pytest -q
```

Repo layout:

- `bitforge/compress/quantize.py` - low-bit weight packing
- `bitforge/dequantize.py` - restore packed values
- `bitforge/context.py` - keep the useful part of the active context
- `bitforge/prune.py` - prune low-value blocks to fit a budget
- `bitforge/generate/c_codegen.py` - embedded C export
- `bitforge/simulator.py` - local speed/memory simulator
- `tests/` - correctness tests
BitForge is designed to be useful for real tiny-device demos, not just a flashy claim.
If you want the fastest proof, the winning strategy is:
- use a small model
- prune it hard
- trim context aggressively
- deploy to Raspberry Pi or ESP32-S3
- keep the active window tiny so tokens/sec stays high
That’s the actual path to a convincing demo on small hardware.