SergeySetti/data_linkage

Record Linkage Workflow

This project contains a basic record linkage workflow. Its main idea is to speed up the candidate pair generation step with a simple blocking strategy based on the first letter of the name: only records that share a first letter are compared with each other. This significantly reduces the number of comparisons needed, so candidate generation stays fast even on a large dataset.
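The blocking idea above can be sketched in a few lines. This is a minimal illustration, not the notebook's code: the function name candidate_pairs and the sample names are hypothetical, and it assumes the blocking key is the lowercased first character of the name.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(names):
    """Yield index pairs (i, j), i < j, whose names share a first letter."""
    blocks = defaultdict(list)
    for idx, name in enumerate(names):
        if name:  # empty names have no blocking key, so they are skipped
            blocks[name[0].lower()].append(idx)
    for members in blocks.values():
        # Pairs are formed only inside a block, never across blocks.
        yield from combinations(members, 2)


# Hypothetical sample data for illustration only.
names = ["Alice Smith", "alice smyth", "Bob Jones", "Bobby Jones", "Carol White"]
pairs = list(candidate_pairs(names))
all_pairs = len(names) * (len(names) - 1) // 2
print(f"{len(pairs)} candidate pairs instead of {all_pairs}")
```

The trade-off of any first-letter key is that two records for the same person are never compared if their names start with different letters, for example after a typo in the first character.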

The notebook walks through a record linkage workflow that computes candidate pairs and scores them. A script for creating synthetic data is included to support testing and development, and the requirements.txt file lists the packages needed to run the notebook.
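The scoring method is not described here, so the following is only a sketch of what scoring a candidate pair can look like. It assumes records are dicts with name and email fields and uses the standard library's difflib.SequenceMatcher as a stand-in similarity measure; score_pair and the sample records are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def score_pair(rec_a, rec_b, fields=("name", "email")):
    """Average the per-field similarities of two records (assumed equal weights)."""
    return sum(similarity(rec_a[f], rec_b[f]) for f in fields) / len(fields)


# Hypothetical records for illustration only.
a = {"name": "Alice Smith", "email": "alice@example.com"}
b = {"name": "Alice Smyth", "email": "alice@example.com"}
c = {"name": "Bob Jones", "email": "bob@example.org"}

print(score_pair(a, b), score_pair(a, c))
```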

The data itself is not included, but synthetic data can be generated using the provided script.

Synthetic Data Generation

src/syntetic.py generates a large-scale fake dataset (10M records by default) with controlled duplicate rates for benchmarking the linkage pipeline.

Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| n | 10_000_000 | Number of records to generate |
| noize | 0.003 | Noise factor controlling uniqueness. A lower value produces more unique records. Internally converted to nf = round(1 / noize), which defines the range of random suffixes appended to names and emails. |
| seed | 0 | Random seed for reproducibility (applies to both Faker and NumPy) |
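As a quick check of the nf = round(1 / noize) conversion, the arithmetic for the default and for a larger noise factor looks like this. The helper name noise_range is hypothetical; only the formula comes from the description above.

```python
def noise_range(noize):
    """Upper bound of the random suffix range, nf = round(1 / noize)."""
    return round(1 / noize)


nf_default = noise_range(0.003)  # default noize
nf_noisier = noise_range(0.01)   # larger noize, smaller suffix range

print(nf_default, nf_noisier)
```

A smaller noize gives a larger nf, so there are more distinct suffixes to draw from and fewer records end up identical.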

How it works

  1. Random suffix injection — Each name gets a random integer suffix in [1, nf], and each email gets rnd_int // 2. This reduces Faker's natural duplicate rate to a realistic level (~9.5% duplicate names, ~7.5% duplicate emails at default settings).
  2. Caching — Generated data is saved as a Parquet file (data/fake_data_people_{n}_{noize}.parquet). Subsequent calls with the same parameters skip generation entirely.
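The caching step can be sketched as a load-or-generate helper. This is an assumption-laden illustration, not the code in src/syntetic.py: cache_path, load_or_generate, and the generate callable are hypothetical names, the file name simply follows the pattern quoted above, and writing Parquet with pandas requires a Parquet engine such as pyarrow.

```python
from pathlib import Path


def cache_path(n, noize, cache_dir="data"):
    """Build the cache file path from the generation parameters."""
    return Path(cache_dir) / f"fake_data_people_{n}_{noize}.parquet"


def load_or_generate(n, noize, generate, cache_dir="data"):
    """Return the cached DataFrame if present, else generate and cache it.

    `generate` is a hypothetical callable (n, noize) -> pandas.DataFrame.
    """
    import pandas as pd  # imported lazily so cache_path works without pandas

    path = cache_path(n, noize, cache_dir)
    if path.exists():
        return pd.read_parquet(path)  # cache hit: skip generation
    df = generate(n, noize)
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_parquet(path)
    return df


default_path = cache_path(10_000_000, 0.003)
print(default_path)
```

Because n and noize are part of the file name, changing either parameter produces a new file instead of reusing stale data. The seed is not part of the quoted pattern, so in this sketch two runs that differ only in seed would share one cache file.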

Performance optimizations

  • Vectorized random generation — All random integers are produced in a single np.random.default_rng().integers() call instead of per-row random.randint().
  • Batch Faker calls — Names, emails, and addresses are generated into flat lists extended in batches of 10k, avoiding per-iteration dict creation and list.append overhead.
  • Vectorized string concatenation — np.char.add() appends suffixes to all names/emails at once instead of 10M individual Python string concatenations.
  • Single DataFrame construction — The DataFrame is built once from pre-built arrays rather than from a list of 10M dicts.
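The vectorized random generation and string concatenation described above might look roughly like the sketch below. It uses only NumPy calls named in this README (np.random.default_rng().integers() and np.char.add()); the function add_suffixes, the sample inputs, and the choice to append the suffix with no separator are assumptions for illustration.

```python
import numpy as np


def add_suffixes(names, emails, noize=0.003, seed=0):
    """Append a random integer suffix to every name and email in one pass."""
    nf = round(1 / noize)
    rng = np.random.default_rng(seed)
    # One call draws every suffix; the upper bound is exclusive, hence nf + 1.
    rnd = rng.integers(1, nf + 1, size=len(names))
    names_out = np.char.add(np.asarray(names, dtype=str), rnd.astype(str))
    # Emails use rnd // 2, as in the "How it works" description.
    emails_out = np.char.add(np.asarray(emails, dtype=str), (rnd // 2).astype(str))
    return names_out, emails_out, rnd


# Hypothetical inputs; the real script takes these from Faker.
names = ["Ann Lee", "Bo Chen", "Cy Diaz"]
emails = ["ann@example.com", "bo@example.com", "cy@example.com"]
names_out, emails_out, rnd = add_suffixes(names, emails)
print(names_out, emails_out)
```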

Usage

from src.syntetic import generate_fake_data

df = generate_fake_data()                     # defaults: 10M rows, noize=0.003
df = generate_fake_data(n=100_000, noize=0.01) # smaller test set

Or run directly:

python -m src.syntetic
