High-performance MinHash in Rust with Python bindings. 50x+ faster than datasketch in benchmarked deduplication workloads, with near-identical results.
Rensa (Swedish for "clean") computes MinHash signatures for similarity estimation and deduplication. If you need to find near-duplicates in large datasets, Rensa does what datasketch does, much faster.
It ships two MinHash variants:
- R-MinHash: Rensa's own variant. Very close to datasketch output, fastest option.
- C-MinHash: Based on the C-MinHash paper. Slightly different results, backed by formal variance proofs.
Thanks to mlabonne for the Colab notebook!
Benchmarked against datasketch on gretelai/synthetic_text_to_sql (100K rows, 128 permutations, threshold 0.95):
| Method | Median Time | Speedup | Accuracy vs Datasketch |
|---|---|---|---|
| Datasketch | 42.95s | — | — |
| R-MinHash | 0.83s | 52.04x | 0.9999 Jaccard |
| C-MinHash | 0.94s | 45.77x | 0.9992 Jaccard |
The accuracy column is the Jaccard similarity between each method's set of deduplicated rows and datasketch's. A score like 0.9999 means R-MinHash and datasketch agree on virtually every row, but not necessarily all rows.
R-MinHash also uses less memory. On the same 100K-row workload, one measured run dropped peak traced memory from about 778 MB (datasketch) to about 493 MB (R-MinHash), roughly 37% lower.
MinHash estimates the Jaccard similarity between sets. Apply k random hash functions to a set and keep the minimum value from each. Two sets that share many elements will agree on many of those minimums, and the fraction of matching slots estimates the Jaccard index.
Standard implementations generate each permutation as (a * hash(x) + b) mod p, where p is a large prime (typically the Mersenne prime 2^61 - 1). Mathematically clean, but modular reduction is still more expensive than a simple bit shift on modern CPUs.
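For intuition, here is a minimal pure-Python sketch of that standard construction (illustration only, not Rensa's code; it uses Python's built-in `hash` as the element hash):

```python
import random

P = (1 << 61) - 1  # Mersenne prime used in the standard construction

def minhash_signature(items, num_perm=128, seed=42):
    # One (a, b) pair per permutation defines the hash (a * h + b) mod P
    rng = random.Random(seed)
    params = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(num_perm)]
    return [min((a * hash(x) + b) % P for x in items) for a, b in params]

def estimate_jaccard(sig1, sig2):
    # Fraction of matching slots approximates the Jaccard index
    return sum(s1 == s2 for s1, s2 in zip(sig1, sig2)) / len(sig1)

s1 = minhash_signature("the quick brown fox jumps over the lazy dog".split())
s2 = minhash_signature("the quick brown fox jumps over the lazy cat".split())
print(estimate_jaccard(s1, s2))  # near the true Jaccard of 7/9 ≈ 0.78, up to sampling noise
```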
R-MinHash replaces the modular reduction with multiply-shift hashing:
```
signature[i] = min { (a[i] * hash(x) + b[i]) >> 32 } for all x in set
```
Instead of reducing mod a prime, take the upper 32 bits of the 64-bit multiply-add. This is a proven universal hash family (Dietzfelbinger et al., 1997). In R-MinHash, it is used as a practical approximation that keeps deduplication results very close to datasketch in benchmarks while making the hot path cheaper.
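As a sketch of that update rule (illustrative Python only; in Rensa the hot loop is Rust, and `element_hashes` stands for the 64-bit hashes of the input items):

```python
MASK64 = (1 << 64) - 1

def rminhash_slot(a, b, element_hashes):
    # 64-bit multiply-add, then keep the upper 32 bits instead of reducing mod a prime
    return min(((a * h + b) & MASK64) >> 32 for h in element_hashes)
```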
This choice has a useful side effect: since the output is naturally 32 bits, signatures are stored as u32 — 4 bytes per slot instead of 8. For 128 permutations, that's 512 bytes per signature. Half the memory, and twice as many signature slots fit in a cache line.
On top of the algorithm, Rensa applies several low-level optimizations.
Input elements are hashed with an FNV-1a variant that processes 8 bytes per step. MinHash needs uniform distribution, not collision resistance, so there's no reason to pay for a cryptographic hash function.
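For reference, classic byte-at-a-time 64-bit FNV-1a looks like this (Rensa's variant consumes 8 bytes per step, but the mixing idea is the same):

```python
def fnv1a_64(data: bytes) -> int:
    # XOR each byte into the state, then multiply by the FNV prime (mod 2^64)
    h = 0xCBF29CE484222325
    for byte in data:
        h ^= byte
        h = (h * 0x100000001B3) & ((1 << 64) - 1)
    return h

print(hex(fnv1a_64(b"fox")))
```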
Elements are hashed in groups of 32. Permutations are applied to each batch in chunks of 16, using a fixed-size temporary array ([u32; 16]) that is register-friendly. This gives the compiler room to optimize tight loops and keeps working data cache-local.
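A rough structural sketch of that batching (Python for illustration only; the real inner loop is Rust, and the chunk size of 16 mirrors the fixed `[u32; 16]` scratch array):

```python
CHUNK = 16
MASK64 = (1 << 64) - 1

def apply_permutations(signature, params, batch_hashes):
    # signature: current per-slot minima; params: (a, b) pairs, one per permutation
    for start in range(0, len(params), CHUNK):
        scratch = signature[start:start + CHUNK]       # small fixed-size working block
        chunk_params = params[start:start + CHUNK]
        for h in batch_hashes:                         # one batch of element hashes
            for i, (a, b) in enumerate(chunk_params):
                v = ((a * h + b) & MASK64) >> 32
                if v < scratch[i]:
                    scratch[i] = v
        signature[start:start + CHUNK] = scratch       # write the block back once
    return signature
```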
The (a, b) permutation pairs are deterministic, derived from a seed via Xoshiro256++. They are initialized at construction and reused across updates to avoid recomputing setup state on every incremental update.
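One consequence visible from Python: two instances built with the same seed produce identical signatures for identical input. A quick check using the documented `update` and `digest` methods:

```python
from rensa import RMinHash

a = RMinHash(num_perm=128, seed=42)
b = RMinHash(num_perm=128, seed=42)

a.update(["select", "id", "from", "users"])
b.update(["select", "id", "from", "users"])

# Same seed, same input -> same (a, b) pairs -> identical signatures
assert a.digest() == b.digest()
```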
The global allocator is MiMalloc, which handles the batch-allocate-then-free pattern better than the system default.
Rensa also includes C-MinHash, based on the C-MinHash paper. It uses a two-stage scheme (sigma then pi) that reduces the need for k independent permutations by deriving the k slots from a small parameter set. In this implementation, that means sigma_a/sigma_b and pi_c/pi_d, with precomputed pi terms for speed. The paper proves tighter variance bounds than standard MinHash. In practice, both variants produce similar results and R-MinHash is usually a bit faster. Use R-MinHash unless you have a specific reason not to.
```bash
pip install rensa
```

Works on Linux, macOS, and Windows. Python >= 3.8.
```python
from rensa import RMinHash

m1 = RMinHash(num_perm=128, seed=42)
m1.update("the quick brown fox jumps over the lazy dog".split())

m2 = RMinHash(num_perm=128, seed=42)
m2.update("the quick brown fox jumps over the lazy cat".split())

print(m1.jaccard(m2))  # ~0.78
```

CMinHash has the same interface. Just swap the class name.
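For example, the same quick-start snippet with C-MinHash:

```python
from rensa import CMinHash

c1 = CMinHash(num_perm=128, seed=42)
c1.update("the quick brown fox jumps over the lazy dog".split())

c2 = CMinHash(num_perm=128, seed=42)
c2.update("the quick brown fox jumps over the lazy cat".split())

print(c1.jaccard(c2))  # close to the R-MinHash estimate
```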
A full deduplication pass over a dataset, using RMinHashLSH to find candidate pairs:

```python
from datasets import load_dataset
from rensa import RMinHash, RMinHashLSH

dataset = load_dataset("gretelai/synthetic_text_to_sql")["train"]

# Build MinHash signatures
minhashes = {}
for idx, row in enumerate(dataset):
    m = RMinHash(num_perm=128, seed=42)
    m.update(row["sql"].split())
    minhashes[idx] = m

# Index into LSH
lsh = RMinHashLSH(threshold=0.8, num_perm=128, num_bands=16)
for doc_id, mh in minhashes.items():
    lsh.insert(doc_id, mh)

# Find and remove duplicates
to_remove = set()
for doc_id, mh in minhashes.items():
    if doc_id in to_remove:
        continue
    for candidate in lsh.query(mh):
        if candidate != doc_id and candidate not in to_remove:
            if mh.jaccard(minhashes[candidate]) >= 0.85:
                to_remove.add(max(doc_id, candidate))

print(f"Removed {len(to_remove)} duplicates from {len(dataset)} rows")
```

For continuous data streams, use the built-in deduplicator:
```python
from rensa import RMinHash, RMinHashDeduplicator

dedup = RMinHashDeduplicator(threshold=0.8, num_perm=128, use_lsh=True, num_bands=16)

for doc in document_stream:
    mh = RMinHash(num_perm=128, seed=42)
    mh.update(doc["text"].split())
    if not dedup.is_duplicate(doc["id"], mh):
        dedup.add(doc["id"], mh)
        # process unique document
```

RMinHash / CMinHash methods:

| Method | Description |
|---|---|
| `__init__(num_perm, seed)` | Create a MinHash with `num_perm` permutations |
| `update(items)` | Add items (list of strings, bytes, or iterables) |
| `jaccard(other)` | Estimate Jaccard similarity (requires matching `num_perm`) |
| `digest()` | Return the signature as a list of integers |
RMinHashLSH methods:

| Method | Description |
|---|---|
| `__init__(threshold, num_perm, num_bands)` | Create an LSH index; `num_bands` must divide `num_perm` evenly |
| `insert(key, minhash)` | Add a document to the index |
| `query(minhash)` | Return candidate similar document keys |
| `remove(key)` | Remove a document from the index |
Streaming deduplicator methods:

| Method | Description |
|---|---|
| `RMinHashDeduplicator(threshold, num_perm, use_lsh, num_bands=None)` | R-MinHash streaming deduplicator |
| `CMinHashDeduplicator(threshold)` | C-MinHash streaming deduplicator |
| `add(key, minhash) -> bool` | Add if unique; returns whether it was added |
| `is_duplicate(key, minhash) -> bool` | Check without adding |
| `get_duplicates(minhash) -> list[str]` | Find keys of similar stored items |
| `remove(key)` / `clear()` | Manage stored items |
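For example, a small sketch of checking a new signature against stored items before adding it, using only the methods listed above (constructor arguments mirror the streaming example):

```python
from rensa import RMinHash, RMinHashDeduplicator

dedup = RMinHashDeduplicator(threshold=0.8, num_perm=128, use_lsh=True, num_bands=16)

mh = RMinHash(num_perm=128, seed=42)
mh.update("select id, name from users where active = 1".split())

matches = dedup.get_duplicates(mh)   # keys of stored items similar to mh
if not matches:
    dedup.add("doc-1", mh)           # returns True if the item was stored
```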
To reproduce the benchmarks:

```bash
git clone https://github.com/beowolx/rensa.git && cd rensa
uv venv && uv sync --group bench --no-install-project
uv run maturin develop --release
uv run python benchmarks/simple_benchmark.py
```

See `benchmarks/` for the full suite, including the 1.8M-row wikitext benchmark.
Contributions are welcome; just open a PR or issue.
MIT

