Skip to content

Latest commit

 

History

History
42 lines (31 loc) · 1.44 KB

File metadata and controls

42 lines (31 loc) · 1.44 KB

HyperLogLog exploration

This repo explores implementations of the HyperLogLog algorithm for estimating the cardinality of a dataset.

These are basic implementations for learning purposes.

Example using text file on disk

> python3 ./py-version/hyperloglog.py ./data/complete_works_of_shakespeare.txt 16
Comparison of HyperLogLog vs Exact Counting
----------------------------------------------------------
Method         Count     Error (%)
----------------------------------------------------------
HyperLogLog    25838     0.37
Exact          25934     N/A
----------------------------------------------------------

Example using sqlite database on disk

This example uses the Wikibooks dataset (license is CC BY-SA 4.0)

Python Version

> python3 ./py-version/hll_sqlite.py ./data/wikibooks.sqlite en body_text 20

Rust Version

cargo run --release ../../data/wikibooks.sqlite en body_text 22

Comparison of HyperLogLog vs Exact Counting (SQLite)
----------------------------------------------------------
Method          Count      Time (s)   Error (%)
----------------------------------------------------------
HyperLogLog     1361262    19.68      0.01
Exact           1361142    12.87      N/A
----------------------------------------------------------