Record Linkage Workflow

This project contains a basic record linkage workflow. Its main idea is to speed up candidate pair generation with a simple blocking strategy based on the first letter of the name: only records whose names start with the same letter are compared with each other. This significantly reduces the number of comparisons and, as a result, keeps candidate generation fast even on a large dataset.
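To make the blocking idea concrete, here is a minimal sketch using only the standard library. The function name `candidate_pairs` and the sample names are hypothetical and are not taken from the notebook; the actual implementation may differ.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(names):
    """Pair up record indices whose names share the same first letter.

    Illustrative only: the notebook's own blocking code may work differently.
    """
    blocks = defaultdict(list)
    for idx, name in enumerate(names):
        if name:  # skip empty names, which have no first letter
            blocks[name[0].lower()].append(idx)
    pairs = []
    for ids in blocks.values():
        # Compare records only within the same block
        pairs.extend(combinations(ids, 2))
    return pairs

names = ["Alice Smith", "alice smyth", "Bob Jones", "Bobby Jones", "Carol White"]
pairs = candidate_pairs(names)
print(pairs)                                    # [(0, 1), (2, 3)]
print(len(names) * (len(names) - 1) // 2)       # 10 pairs without blocking
```

With five records, blocking leaves 2 candidate pairs instead of 10. The trade-off is that records whose names differ in the first letter are never compared.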

The notebook record_linkage_workflow.ipynb introduces the workflow: it computes candidate pairs and calculates scores for them. A script for creating synthetic data is included to support testing and development. The requirements.txt file lists the packages needed to run the notebook.

The data itself is not included, but synthetic data can be generated using the provided script.

Synthetic Data Generation

src/syntetic.py generates a large-scale fake dataset (10M records by default) with controlled duplicate rates for benchmarking the linkage pipeline.

Parameters

Parameter	Default	Description
`n`	`10_000_000`	Number of records to generate
`noize`	`0.003`	Noise factor — controls uniqueness. A lower value produces more unique records. Internally converted to `nf = round(1 / noize)`, which defines the range of random suffixes appended to names and emails.
`seed`	`0`	Random seed for reproducibility (applies to both Faker and NumPy)
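As a quick worked example of the `nf = round(1 / noize)` conversion described in the table, the snippet below shows how the suffix range grows as `noize` shrinks. The helper name `suffix_range` is hypothetical and used here only for illustration.

```python
def suffix_range(noize: float) -> int:
    """Upper bound nf of the random suffix range [1, nf]."""
    return round(1 / noize)

for noize in (0.1, 0.01, 0.003):
    print(noize, suffix_range(noize))
# 0.1 10
# 0.01 100
# 0.003 333
```

A larger range means two records that share the same base name are less likely to receive the same suffix, which is why a lower `noize` produces more unique records.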

How it works

Random suffix injection — Each name gets a random integer suffix in [1, nf], and each email gets rnd_int // 2. This reduces Faker's natural duplicate rate to a realistic level (~9.5% duplicate names, ~7.5% duplicate emails at default settings).
Caching — Generated data is saved as a Parquet file (data/fake_data_people_{n}_{noize}.parquet). Subsequent calls with the same parameters skip generation entirely.
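The two steps above can be sketched roughly as follows. This is a plain per-row illustration, not the script's actual code: the function names `inject_suffixes` and `cache_path` are hypothetical, and it assumes that the email reuses the same random integer as the name and that `n` and `noize` are formatted into the file name with default string formatting.

```python
import random
from pathlib import Path

def inject_suffixes(names, emails, noize=0.003, seed=0):
    """Append a random integer suffix to each name and half of it to each email."""
    nf = round(1 / noize)
    rng = random.Random(seed)
    new_names, new_emails = [], []
    for name, email in zip(names, emails):
        rnd_int = rng.randint(1, nf)  # inclusive on both ends: [1, nf]
        new_names.append(f"{name}{rnd_int}")
        # Assumption: the email suffix is derived from the same rnd_int
        new_emails.append(f"{email}{rnd_int // 2}")
    return new_names, new_emails

def cache_path(n, noize, data_dir="data"):
    """Build the Parquet cache location for a given parameter combination."""
    return Path(data_dir) / f"fake_data_people_{n}_{noize}.parquet"

names, emails = inject_suffixes(["Ann Lee"], ["ann@example.com"])
path = cache_path(100_000, 0.01)
print(path.name)      # fake_data_people_100000_0.01.parquet
print(path.exists())  # generation can be skipped when this is True
```

Because the parameters are part of the file name, changing `n` or `noize` leads to a different cache file rather than reusing stale data.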

Performance optimizations

Vectorized random generation — All random integers are produced in a single np.random.default_rng().integers() call instead of per-row random.randint().
Batch Faker calls — Names, emails, and addresses are generated into flat lists extended in batches of 10k, avoiding per-iteration dict creation and list.append overhead.
Vectorized string concatenation — np.char.add() appends suffixes to all names/emails at once instead of 10M individual Python string concatenations.
Single DataFrame construction — The DataFrame is built once from pre-built arrays rather than from a list of 10M dicts.
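The vectorized steps listed above might look roughly like this in NumPy. It is a small sketch under stated assumptions: the three hard-coded base names and emails stand in for the batched Faker output, and the variable names are illustrative rather than taken from src/syntetic.py.

```python
import numpy as np

noize = 0.003
nf = round(1 / noize)

# Placeholder values standing in for Faker-generated names and emails
base_names = np.array(["Ann Lee", "Bob Ray", "Cy Poe"])
base_emails = np.array(["ann@example.com", "bob@example.com", "cy@example.com"])

# One call produces every random integer; `high` is exclusive, so nf + 1 gives [1, nf]
rng = np.random.default_rng(0)
rnd_int = rng.integers(1, nf + 1, size=base_names.size)

# Element-wise string concatenation over whole arrays, with no Python-level loop
names = np.char.add(base_names, rnd_int.astype(str))
emails = np.char.add(base_emails, (rnd_int // 2).astype(str))

print(names.shape, emails.shape)  # (3,) (3,)
```

The resulting arrays can then be passed to a single DataFrame constructor call, which avoids building one dict per row.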

Usage

from src.syntetic import generate_fake_data

df = generate_fake_data()                     # defaults: 10M rows, noize=0.003
df = generate_fake_data(n=100_000, noize=0.01) # smaller test set

Or run directly:

python -m src.syntetic
