Rensa

High-performance MinHash in Rust with Python bindings. 50x+ faster than datasketch in benchmarked deduplication workloads, with near-identical results.

What is Rensa?

Rensa (Swedish for "clean") computes MinHash signatures for similarity estimation and deduplication. If you need to find near-duplicates in large datasets, Rensa does what datasketch does, much faster.

It ships two MinHash variants:

  • R-MinHash: Rensa's own variant. Very close to datasketch output, fastest option.
  • C-MinHash: Based on the C-MinHash paper. Slightly different results, backed by formal variance proofs.

Open In Colab: thanks to mlabonne for the Colab notebook!

Performance

Benchmarked against datasketch on gretelai/synthetic_text_to_sql (100K rows, 128 permutations, threshold 0.95):

Deduplication speed (100K rows):

Method        Median time    Speedup    Accuracy vs datasketch
Datasketch    42.95 s        1x         (baseline)
R-MinHash     0.83 s         52.04x     0.9999 Jaccard
C-MinHash     0.94 s         45.77x     0.9992 Jaccard

The accuracy column is the Jaccard similarity between each method's set of deduplicated rows and datasketch's. A score like 0.9999 means R-MinHash and datasketch agree on virtually every row, but not necessarily all rows.
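Concretely, the metric is plain set Jaccard over the kept row indices. A toy illustration with made-up row sets (not benchmark data):

kept_by_datasketch = {0, 1, 2, 3, 4, 5, 6, 7, 8}
kept_by_rminhash = {0, 1, 2, 3, 4, 5, 6, 7, 9}
overlap = len(kept_by_datasketch & kept_by_rminhash)
union = len(kept_by_datasketch | kept_by_rminhash)
print(overlap / union)  # 0.8 in this toy case; the benchmark scores are ~0.999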

R-MinHash also uses less memory. On the same 100K-row workload, one measured run dropped peak traced memory from about 778 MB (datasketch) to about 493 MB (R-MinHash), roughly 37% lower.

[Chart: memory usage comparison]

How R-MinHash works

MinHash estimates the Jaccard similarity between sets. Apply k random hash functions to a set and keep the minimum value from each. Two sets that share many elements will often keep the same minimums, and the fraction of matching signature slots estimates the Jaccard index.
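A toy Python sketch of that idea (illustrative only, not Rensa's implementation; the random linear hashes stand in for a proper hash family):

import random

def toy_minhash(items, num_perm=128, seed=42):
    rng = random.Random(seed)
    # One (a, b) pair per "permutation"; keep the minimum over the set for each.
    params = [(rng.getrandbits(64) | 1, rng.getrandbits(64)) for _ in range(num_perm)]
    mask = (1 << 64) - 1
    return [min(((a * hash(x) + b) & mask) for x in items) for a, b in params]

def estimate_jaccard(sig1, sig2):
    # Fraction of slots where both signatures kept the same minimum.
    return sum(s1 == s2 for s1, s2 in zip(sig1, sig2)) / len(sig1)

s1 = toy_minhash("the quick brown fox jumps over the lazy dog".split())
s2 = toy_minhash("the quick brown fox jumps over the lazy cat".split())
print(estimate_jaccard(s1, s2))  # hovers around the true Jaccard of ~0.78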

Standard implementations generate each permutation as (a * hash(x) + b) mod p, where p is a large prime (typically the Mersenne prime 2^61 - 1). Mathematically clean, but modular reduction is still more expensive than a simple bit shift on modern CPUs.

R-MinHash replaces the modular reduction with multiply-shift hashing:

signature[i] = min { (a[i] * hash(x) + b[i]) >> 32 }  for all x in set

Instead of reducing mod a prime, take the upper 32 bits of the 64-bit multiply-add. This is a proven universal hash family (Dietzfelbinger et al., 1997). In R-MinHash, it is used as a practical approximation that keeps deduplication results very close to datasketch in benchmarks while making the hot path cheaper.

This choice has a useful side effect: since the output is naturally 32 bits, signatures are stored as u32 — 4 bytes per slot instead of 8. For 128 permutations, that's 512 bytes per signature. Half the memory, and twice as many signature slots fit in a cache line.
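A minimal Python sketch of the multiply-shift update (the real hot path is Rust; this only shows the arithmetic):

MASK64 = (1 << 64) - 1

def multiply_shift(a, b, h):
    # 64-bit multiply-add, then keep the upper 32 bits: no modular reduction.
    return ((a * h + b) & MASK64) >> 32

def slot_value(a, b, hashed_items):
    # signature[i] = min over all elements; the result always fits in a u32.
    return min(multiply_shift(a, b, h) for h in hashed_items)

# 128 slots x 4 bytes per u32 = 512 bytes per signature.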

Performance engineering

On top of the algorithm, Rensa applies several low-level optimizations.

Input elements are hashed with an FNV-1a variant that processes 8 bytes per step. MinHash needs uniform distribution, not collision resistance, so there's no reason to pay for a cryptographic hash function.
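For reference, classic byte-at-a-time FNV-1a looks like the sketch below; Rensa's variant follows the same XOR-then-multiply structure but consumes 8 bytes per step (this Python version is illustrative, not the Rust code):

FNV_OFFSET = 0xcbf29ce484222325
FNV_PRIME = 0x100000001b3
MASK64 = (1 << 64) - 1

def fnv1a_64(data: bytes) -> int:
    h = FNV_OFFSET
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME) & MASK64
    return h

print(hex(fnv1a_64(b"fox")))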

Elements are hashed in groups of 32. Permutations are applied to each batch in chunks of 16, using a fixed-size temporary array ([u32; 16]) that is register-friendly. This gives the compiler room to optimize tight loops and keeps working data cache-local.

The (a, b) permutation pairs are deterministic, derived from a seed via Xoshiro256++. They are initialized at construction and reused across updates to avoid recomputing setup state on every incremental update.

The global allocator is MiMalloc, which handles the batch-allocate-then-free pattern better than the system default.

C-MinHash

Rensa also includes C-MinHash, based on the C-MinHash paper. It uses a two-stage scheme (sigma then pi) that reduces the need for k independent permutations by deriving the k slots from a small parameter set. In this implementation, that means sigma_a/sigma_b and pi_c/pi_d, with precomputed pi terms for speed. The paper proves tighter variance bounds than standard MinHash. In practice, both variants produce similar results and R-MinHash is usually a bit faster. Use R-MinHash unless you have a specific reason not to.

Installation

pip install rensa

Works on Linux, macOS, and Windows. Python >= 3.8.

Usage

Computing similarity

from rensa import RMinHash

m1 = RMinHash(num_perm=128, seed=42)
m1.update("the quick brown fox jumps over the lazy dog".split())

m2 = RMinHash(num_perm=128, seed=42)
m2.update("the quick brown fox jumps over the lazy cat".split())

print(m1.jaccard(m2))  # ~0.78

CMinHash has the same interface. Just swap the class name.
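For example:

from rensa import CMinHash

c1 = CMinHash(num_perm=128, seed=42)
c1.update("the quick brown fox jumps over the lazy dog".split())

c2 = CMinHash(num_perm=128, seed=42)
c2.update("the quick brown fox jumps over the lazy cat".split())

print(c1.jaccard(c2))  # close to the RMinHash estimate above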

Deduplicating a dataset

from datasets import load_dataset
from rensa import RMinHash, RMinHashLSH

dataset = load_dataset("gretelai/synthetic_text_to_sql")["train"]

# Build MinHash signatures
minhashes = {}
for idx, row in enumerate(dataset):
    m = RMinHash(num_perm=128, seed=42)
    m.update(row["sql"].split())
    minhashes[idx] = m

# Index into LSH
lsh = RMinHashLSH(threshold=0.8, num_perm=128, num_bands=16)
for doc_id, mh in minhashes.items():
    lsh.insert(doc_id, mh)

# Find and remove duplicates
to_remove = set()
for doc_id, mh in minhashes.items():
    if doc_id in to_remove:
        continue
    for candidate in lsh.query(mh):
        if candidate != doc_id and candidate not in to_remove:
            if mh.jaccard(minhashes[candidate]) >= 0.85:
                to_remove.add(max(doc_id, candidate))

print(f"Removed {len(to_remove)} duplicates from {len(dataset)} rows")

Streaming deduplication

For continuous data streams, use the built-in deduplicator:

from rensa import RMinHash, RMinHashDeduplicator

dedup = RMinHashDeduplicator(threshold=0.8, num_perm=128, use_lsh=True, num_bands=16)

for doc in document_stream:
    mh = RMinHash(num_perm=128, seed=42)
    mh.update(doc["text"].split())

    if not dedup.is_duplicate(doc["id"], mh):
        dedup.add(doc["id"], mh)
        # process unique document

API

RMinHash / CMinHash

Method                       Description
__init__(num_perm, seed)     Create a MinHash with num_perm permutations
update(items)                Add items (a list of strings, bytes, or iterables)
jaccard(other)               Estimate Jaccard similarity (requires matching num_perm)
digest()                     Return the signature as a list of integers
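For example, the signature returned by digest() has one integer per permutation:

m = RMinHash(num_perm=128, seed=42)
m.update(["select", "id", "from", "users"])
print(len(m.digest()))  # 128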

RMinHashLSH

Method                                      Description
__init__(threshold, num_perm, num_bands)    Create an LSH index; num_bands must divide num_perm evenly
insert(key, minhash)                        Add a document to the index
query(minhash)                              Return candidate similar document keys
remove(key)                                 Remove a document from the index

RMinHashDeduplicator / CMinHashDeduplicator

Method                                                              Description
RMinHashDeduplicator(threshold, num_perm, use_lsh, num_bands=None)  R-MinHash streaming deduplicator
CMinHashDeduplicator(threshold)                                     C-MinHash streaming deduplicator
add(key, minhash) -> bool                                           Add if unique; returns whether it was added
is_duplicate(key, minhash) -> bool                                  Check without adding
get_duplicates(minhash) -> list[str]                                Find keys of similar stored items
remove(key) / clear()                                               Manage stored items
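A small end-to-end sketch using the methods above (the keys and text are made up):

from rensa import RMinHash, RMinHashDeduplicator

dedup = RMinHashDeduplicator(threshold=0.8, num_perm=128, use_lsh=True, num_bands=16)

mh = RMinHash(num_perm=128, seed=42)
mh.update("select id from users where active = 1".split())

print(dedup.add("doc-1", mh))           # True: stored as unique
print(dedup.is_duplicate("doc-2", mh))  # True: an identical signature is already stored
print(dedup.get_duplicates(mh))         # ["doc-1"]
dedup.remove("doc-1")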

Running Benchmarks

git clone https://github.com/beowolx/rensa.git && cd rensa
uv venv && uv sync --group bench --no-install-project
uv run maturin develop --release
uv run python benchmarks/simple_benchmark.py

See benchmarks/ for the full suite, including the 1.8M-row wikitext benchmark.

Contributing

Contributions are welcome; just open a PR or issue.

License

MIT
