Rensa

High-performance MinHash in Rust with Python bindings. 50x+ faster than datasketch in benchmarked deduplication workloads, with near-identical results.

What is Rensa?

Rensa (Swedish for "clean") computes MinHash signatures for similarity estimation and deduplication. If you need to find near-duplicates in large datasets, Rensa does what datasketch does, much faster.

It ships two MinHash variants:

  • R-MinHash: Rensa's own variant. Very close to datasketch output, fastest option.
  • C-MinHash: Based on the C-MinHash paper. Slightly different results, backed by formal variance proofs.

Open In Colab: thanks to mlabonne for the Colab notebook!

Performance

Benchmarked against datasketch on gretelai/synthetic_text_to_sql (100K rows, 128 permutations, threshold 0.95):

Deduplication speed (100K rows):

Method        Median time    Speedup    Accuracy vs datasketch
Datasketch    42.95 s        1x         (baseline)
R-MinHash     0.83 s         52.04x     0.9999 Jaccard
C-MinHash     0.94 s         45.77x     0.9992 Jaccard

The accuracy column is the Jaccard similarity between each method's set of deduplicated rows and datasketch's. A score like 0.9999 means R-MinHash and datasketch agree on virtually every row, but not necessarily all rows.
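Concretely, the metric is plain set Jaccard over the kept row indices. A toy illustration with made-up row sets (not benchmark data):

kept_by_datasketch = {0, 1, 2, 3, 4, 5, 6, 7, 8}
kept_by_rminhash = {0, 1, 2, 3, 4, 5, 6, 7, 9}
overlap = len(kept_by_datasketch & kept_by_rminhash)
union = len(kept_by_datasketch | kept_by_rminhash)
print(overlap / union)  # 0.8 in this toy case; the benchmark scores are ~0.999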

R-MinHash also uses less memory. On the same 100K-row workload, one measured run dropped peak traced memory from about 778 MB (datasketch) to about 493 MB (R-MinHash), roughly 37% lower.

[Chart: memory usage comparison]

How R-MinHash works

MinHash estimates the Jaccard similarity between sets. Apply k random hash functions to a set and keep the minimum value from each. Two sets that share many elements will often keep the same minimums, and the fraction of matching signature slots estimates the Jaccard index.
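A toy Python sketch of that idea (illustrative only, not Rensa's implementation; the random linear hashes stand in for a proper hash family):

import random

def toy_minhash(items, num_perm=128, seed=42):
    rng = random.Random(seed)
    # One (a, b) pair per "permutation"; keep the minimum over the set for each.
    params = [(rng.getrandbits(64) | 1, rng.getrandbits(64)) for _ in range(num_perm)]
    mask = (1 << 64) - 1
    return [min(((a * hash(x) + b) & mask) for x in items) for a, b in params]

def estimate_jaccard(sig1, sig2):
    # Fraction of slots where both signatures kept the same minimum.
    return sum(s1 == s2 for s1, s2 in zip(sig1, sig2)) / len(sig1)

s1 = toy_minhash("the quick brown fox jumps over the lazy dog".split())
s2 = toy_minhash("the quick brown fox jumps over the lazy cat".split())
print(estimate_jaccard(s1, s2))  # hovers around the true Jaccard of ~0.78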

Standard implementations generate each permutation as (a * hash(x) + b) mod p, where p is a large prime (typically the Mersenne prime 2^61 - 1). Mathematically clean, but modular reduction is still more expensive than a simple bit shift on modern CPUs.

R-MinHash replaces the modular reduction with multiply-shift hashing:

signature[i] = min { (a[i] * hash(x) + b[i]) >> 32 }  for all x in set

Instead of reducing mod a prime, take the upper 32 bits of the 64-bit multiply-add. This is a proven universal hash family (Dietzfelbinger et al., 1997). In R-MinHash, it is used as a practical approximation that keeps deduplication results very close to datasketch in benchmarks while making the hot path cheaper.

This choice has a useful side effect: since the output is naturally 32 bits, signatures are stored as u32 — 4 bytes per slot instead of 8. For 128 permutations, that's 512 bytes per signature. Half the memory, and twice as many signature slots fit in a cache line.
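A minimal Python sketch of the multiply-shift update (the real hot path is Rust; this only shows the arithmetic):

MASK64 = (1 << 64) - 1

def multiply_shift(a, b, h):
    # 64-bit multiply-add, then keep the upper 32 bits: no modular reduction.
    return ((a * h + b) & MASK64) >> 32

def slot_value(a, b, hashed_items):
    # signature[i] = min over all elements; the result always fits in a u32.
    return min(multiply_shift(a, b, h) for h in hashed_items)

# 128 slots x 4 bytes per u32 = 512 bytes per signature.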

Performance engineering

On top of the algorithm, Rensa applies several low-level optimizations.

Input elements are hashed with an FNV-1a variant that processes 8 bytes per step. MinHash needs uniform distribution, not collision resistance, so there's no reason to pay for a cryptographic hash function.
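For reference, classic byte-at-a-time FNV-1a looks like the sketch below; Rensa's variant follows the same XOR-then-multiply structure but consumes 8 bytes per step (this Python version is illustrative, not the Rust code):

FNV_OFFSET = 0xcbf29ce484222325
FNV_PRIME = 0x100000001b3
MASK64 = (1 << 64) - 1

def fnv1a_64(data: bytes) -> int:
    h = FNV_OFFSET
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME) & MASK64
    return h

print(hex(fnv1a_64(b"fox")))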

Elements are hashed in groups of 32. Permutations are applied to each batch in chunks of 16, using a fixed-size temporary array ([u32; 16]) that is register-friendly. This gives the compiler room to optimize tight loops and keeps working data cache-local.

The (a, b) permutation pairs are deterministic, derived from a seed via Xoshiro256++. They are initialized at construction and reused across updates to avoid recomputing setup state on every incremental update.

The global allocator is MiMalloc, which handles the batch-allocate-then-free pattern better than the system default.

C-MinHash

Rensa also includes C-MinHash, based on the C-MinHash paper. It uses a two-stage scheme (sigma then pi) that reduces the need for k independent permutations by deriving the k slots from a small parameter set. In this implementation, that means sigma_a/sigma_b and pi_c/pi_d, with precomputed pi terms for speed. The paper proves tighter variance bounds than standard MinHash. In practice, both variants produce similar results and R-MinHash is usually a bit faster. Use R-MinHash unless you have a specific reason not to.

Installation

pip install rensa

Works on Linux, macOS, and Windows. Python >= 3.8.

Usage

Computing similarity

from rensa import RMinHash

m1 = RMinHash(num_perm=128, seed=42)
m1.update("the quick brown fox jumps over the lazy dog".split())

m2 = RMinHash(num_perm=128, seed=42)
m2.update("the quick brown fox jumps over the lazy cat".split())

print(m1.jaccard(m2))  # ~0.78

CMinHash has the same interface. Just swap the class name.
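For example:

from rensa import CMinHash

c1 = CMinHash(num_perm=128, seed=42)
c1.update("the quick brown fox jumps over the lazy dog".split())

c2 = CMinHash(num_perm=128, seed=42)
c2.update("the quick brown fox jumps over the lazy cat".split())

print(c1.jaccard(c2))  # close to the RMinHash estimate above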

Deduplicating a dataset

from datasets import load_dataset
from rensa import RMinHash, RMinHashLSH

dataset = load_dataset("gretelai/synthetic_text_to_sql")["train"]

# Build MinHash signatures
minhashes = {}
for idx, row in enumerate(dataset):
    m = RMinHash(num_perm=128, seed=42)
    m.update(row["sql"].split())
    minhashes[idx] = m

# Index into LSH
lsh = RMinHashLSH(threshold=0.8, num_perm=128, num_bands=16)
for doc_id, mh in minhashes.items():
    lsh.insert(doc_id, mh)

# Find and remove duplicates
to_remove = set()
for doc_id, mh in minhashes.items():
    if doc_id in to_remove:
        continue
    for candidate in lsh.query(mh):
        if candidate != doc_id and candidate not in to_remove:
            if mh.jaccard(minhashes[candidate]) >= 0.85:
                to_remove.add(max(doc_id, candidate))

print(f"Removed {len(to_remove)} duplicates from {len(dataset)} rows")

Streaming deduplication

For continuous data streams, use the built-in deduplicator:

from rensa import RMinHash, RMinHashDeduplicator

dedup = RMinHashDeduplicator(threshold=0.8, num_perm=128, use_lsh=True, num_bands=16)

for doc in document_stream:
    mh = RMinHash(num_perm=128, seed=42)
    mh.update(doc["text"].split())

    if not dedup.is_duplicate(doc["id"], mh):
        dedup.add(doc["id"], mh)
        # process unique document

API

RMinHash / CMinHash

Method                       Description
__init__(num_perm, seed)     Create a MinHash with num_perm permutations
update(items)                Add items (a list of strings, bytes, or iterables)
jaccard(other)               Estimate Jaccard similarity (requires matching num_perm)
digest()                     Return the signature as a list of integers
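For example, the signature returned by digest() has one integer per permutation:

m = RMinHash(num_perm=128, seed=42)
m.update(["select", "id", "from", "users"])
print(len(m.digest()))  # 128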

RMinHashLSH

Method                                      Description
__init__(threshold, num_perm, num_bands)    Create an LSH index; num_bands must divide num_perm evenly
insert(key, minhash)                        Add a document to the index
query(minhash)                              Return candidate similar document keys
remove(key)                                 Remove a document from the index

RMinHashDeduplicator / CMinHashDeduplicator

Method                                                              Description
RMinHashDeduplicator(threshold, num_perm, use_lsh, num_bands=None)  R-MinHash streaming deduplicator
CMinHashDeduplicator(threshold)                                     C-MinHash streaming deduplicator
add(key, minhash) -> bool                                           Add if unique; returns whether it was added
is_duplicate(key, minhash) -> bool                                  Check without adding
get_duplicates(minhash) -> list[str]                                Find keys of similar stored items
remove(key) / clear()                                               Manage stored items
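A small end-to-end sketch using the methods above (the keys and text are made up):

from rensa import RMinHash, RMinHashDeduplicator

dedup = RMinHashDeduplicator(threshold=0.8, num_perm=128, use_lsh=True, num_bands=16)

mh = RMinHash(num_perm=128, seed=42)
mh.update("select id from users where active = 1".split())

print(dedup.add("doc-1", mh))           # True: stored as unique
print(dedup.is_duplicate("doc-2", mh))  # True: an identical signature is already stored
print(dedup.get_duplicates(mh))         # ["doc-1"]
dedup.remove("doc-1")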

Running Benchmarks

git clone https://github.com/beowolx/rensa.git && cd rensa
uv venv && uv sync --group bench --no-install-project
uv run maturin develop --release
uv run python benchmarks/simple_benchmark.py

See benchmarks/ for the full suite, including the 1.8M-row wikitext benchmark.

Contributing

Contributions are welcome; just open a PR or issue.

License

MIT
