PyCompare — File Reconciliation Tool

PyCompare is a desktop Windows application for comparing two heterogeneous data files and producing reconciliation statistics and reports. It is fully configurable via JSON — no hardcoded domain logic — making it suitable for telecom, finance, log analysis, and any data-matching task.

Features
Quick Start
Input Format: CSV
Input Format: Fixed-Length
JSON Configuration
- File Configuration
- Match Configuration
Normalization Rules
- string
- datetime
- integer
- decimal
Matching Algorithm
Output
- Statistics
- Reports
Worked Example
Fixed-Length Example
GUI Reference
Error Handling
Performance
Build Executable
Dependencies
License

Features

Parse CSV (configurable separator) and fixed-length files
Select fields by column name (source) or column position (column_index) — only the fields needed for comparison; unmentioned columns are ignored
Normalize fields with per-field regex cleaning chains, datetime parsing, integer/decimal conversion
Match records using a hash-indexed O(n) algorithm — no nested loops
Tolerances — configurable tolerance for datetime (seconds), integer (units), and decimal values
Reports — generates reconciled.xlsx, only_file1.xlsx, only_file2.xlsx, statistics.csv with record sequence numbers
GUI — Tkinter interface with real-time logging, progress spinner, and custom color theme
100 % configurable — no domain-specific logic baked in

Quick Start

# Install dependencies
pip install -r requirements.txt

# Launch the GUI
python main.py
# or
python -m pycompare

# Build a standalone .exe
python build_exe.py
# → dist/PyCompare.exe

Input Format: CSV

Any delimiter-separated file with a header row. Supported separators:

Separator	`separator` value
Semicolon	`";"`
Comma	`","`
Pipe	`"
Tab	`"TAB"`

Example file (calls.csv):

CLI;CLD;START_TIME;DURATION
+39 347-1234567;+39 02-12345678;01/01/2024 10:00:00;120
+39 347-2345678;+39 02-23456789;01/01/2024 10:05:00;60

Fields can be selected by column name (using the source key) or by 0-based column position (using the column_index key). You only need to specify the fields you care about; extra columns are silently ignored. Files with UTF-8 BOM are handled automatically.

Input Format: Fixed-Length

A text file where every record occupies a fixed number of characters and each field is extracted by its start position and length.

Example file (cdrs.txt):

+39 347-1234567     +39 02-12345678     01/01/2024 10:00:00  120
+39 347-2345678     +39 02-23456789     01/01/2024 10:05:00   60

Field positions are defined in the JSON configuration (see below).

JSON Configuration

Two JSON configuration files are needed — one per source file — plus a match configuration file.

1. File Configuration

{
  "file_type": "csv",
  "separator": ";",
  "fields": [
    {
      "name": "caller",
      "source": "CLI",
      "type": "string",
      "cleaning_regex": [
        { "pattern": "\\D", "replacement": "" },
        { "pattern": "^39", "replacement": "" }
      ]
    },
    {
      "name": "start_time",
      "source": "START_TIME",
      "type": "datetime",
      "format": "%d/%m/%Y %H:%M:%S"
    },
    {
      "name": "duration",
      "source": "DURATION",
      "type": "integer"
    },
    {
      "name": "amount",
      "source": "AMOUNT",
      "type": "decimal"
    }
  ]
}

Top-level keys

Key	Type	Required	Description
`file_type`	`string`	yes	`"csv"` or `"fixed_length"`
`separator`	`string`	for CSV	The column delimiter (`";"`, `","`, `"
`fields`	`array`	yes	Array of field descriptors (see below)

Field descriptor

Key	Type	Required	Description
`name`	`string`	yes	Normalized field name used in matching
`source`	`string`	for CSV (unless `column_index` is given)	Column header name in the CSV file
`column_index`	`int`	for CSV (unless `source` is given)	0-based column position in the CSV file
`type`	`string`	yes	`"string"`, `"datetime"`, `"integer"`, or `"decimal"`
`cleaning_regex`	`array`	no	Ordered list of regex substitutions applied before matching
`format`	`string`	for datetime	`strptime` format string, e.g. `"%d/%m/%Y %H:%M:%S"`
`start`	`int`	for fixed-length	Zero-based start position
`length`	`int`	for fixed-length	Number of characters

You may specify only the fields you need for comparison — there is no requirement to list every column present in the file. Unmentioned columns are simply ignored.

Selecting fields by column name (source):

{
  "file_type": "csv",
  "separator": ";",
  "fields": [
    { "name": "caller", "source": "CLI", "type": "string" },
    { "name": "called", "source": "CLD", "type": "string" }
  ]
}

Selecting fields by column position (column_index, 0-based):

{
  "file_type": "csv",
  "separator": ";",
  "fields": [
    { "name": "caller", "column_index": 0, "type": "string" },
    { "name": "called", "column_index": 1, "type": "string" },
    { "name": "start_time", "column_index": 2, "type": "datetime", "format": "%d/%m/%Y %H:%M:%S" },
    { "name": "duration", "column_index": 3, "type": "integer" }
  ]
}

A field can include both source and column_index; when both are present, column_index takes precedence (useful as documentation or for fallback).

Regex rule

{ "pattern": "\\D", "replacement": "" }

Each rule applies re.sub(pattern, replacement, value) in order.

2. Match Configuration

{
  "matching_fields": [
    { "field": "caller" },
    { "field": "start_time", "tolerance_seconds": 3 },
    { "field": "duration", "tolerance_seconds": 5 },
    { "field": "amount", "tolerance": 0.01 }
  ]
}

Key	Type	Description
`field`	`string`	The normalized field name (must match `name` in both file configs)
`tolerance_seconds`	`int`	Tolerance for `datetime` or `integer` fields — `abs(a - b) ≤ value`
`tolerance`	`float`	Tolerance for `decimal` fields — `abs(a - b) ≤ value`

Fields without a tolerance are part of the composite hash key (exact match required).

Normalization Rules

string

If cleaning_regex is present, each regex is applied in order.
If no regex rules, the raw string is used as-is.

Input:    +39 347-1234567
Regex 1:  \D → ""      → "393471234567"
Regex 2:  ^39 → ""     → "3471234567"
Result:                 "3471234567"

datetime

Strips whitespace.
Parsed with datetime.strptime(value, format) using the configured format.

Input:    "01/01/2024 10:00:00"
Format:   "%d/%m/%Y %H:%M:%S"
Result:   datetime(2024, 1, 1, 10, 0, 0)

integer

Strips whitespace.
Tries int(value).
Falls back to applying cleaning_regex (if any), then retries.

Input:    " 120 "
Result:   120

decimal

Strips whitespace and replaces , with ..
Tries Decimal(value).
Falls back to applying cleaning_regex, then retries.

Input:    "1.234,56"
Result:   Decimal("1234.56")

Matching Algorithm

The algorithm avoids O(n²) nested loops by using a hash index:

Classify fields — Fields with no tolerance (or type="string") become key fields; fields with tolerance become tolerance fields.
Build composite key — For the key fields, a tuple of values is used as a dictionary key.
Index File B — A dict[tuple, list[index]] maps every distinct composite key to the record indices in File B.
Iterate File A — For each record:
- Compute its composite key and look up candidates in the index.
- For each candidate not already matched, apply the tolerance checks.
- If all tolerances pass, the pair is recorded and the candidate is marked as used.
Result — Unmatched indices are returned as only in file 1 / only in file 2.

Complexity: O(n) — a single pass over file A with hash lookups against file B.

Output

Statistics

Displayed in the GUI and written to statistics.csv:

Metric	Description
Total File 1	Number of parsed and normalized records in File 1
Total File 2	Number of parsed and normalized records in File 2
Reconciled	Pairs of records that matched
Only File 1	Records in File 1 with no match in File 2
Only File 2	Records in File 2 with no match in File 1
Match %	`(Reconciled / Total File 1) × 100`
Processing Time	Wall-clock time in seconds

Reports

All written to the output directory:

File	Content
`reconciled.xlsx`	Side-by-side view of matched records. Columns: `F1_Row` (record position in File 1, 1-based), `F2_Row` (record position in File 2, 1-based), `F1_{fields}`, `F2_{fields}`, `Diff_{numeric/datetime fields}`
`only_file1.xlsx`	Records present only in File 1. First column `Row` shows the record's 1-based position from the original input file.
`only_file2.xlsx`	Records present only in File 2. First column `Row` shows the record's 1-based position from the original input file.
`statistics.csv`	Summary metrics

Worked Example

Step 1 – The CSV files

file1.csv (5 call records):

CLI;CLD;START_TIME;DURATION
+39 347-1234567;+39 02-12345678;01/01/2024 10:00:00;120
+39 347-2345678;+39 02-23456789;01/01/2024 10:05:00;60
+39 347-3456789;+39 02-34567890;01/01/2024 10:10:00;90
+39 347-4567890;+39 02-45678901;01/01/2024 10:15:00;45
+39 347-5678901;+39 02-56789012;01/01/2024 10:20:00;30

file2.csv (4 call records, slightly different formatting):

CLI;CLD;START_TIME;DURATION
393471234567;390212345678;01/01/2024 10:00:02;118
393472345678;390223456789;01/01/2024 10:05:01;62
393473456789;390234567890;01/01/2024 10:10:00;91
999999999999;999999999999;01/01/2024 11:00:00;10

Step 2 – File JSON configurations

file1_config.json: removes all non-digits then strips the Italian prefix 39 from caller/called.

{
  "file_type": "csv",
  "separator": ";",
  "fields": [
    {
      "name": "caller",
      "source": "CLI",
      "type": "string",
      "cleaning_regex": [
        { "pattern": "\\D", "replacement": "" },
        { "pattern": "^39", "replacement": "" }
      ]
    },
    {
      "name": "called",
      "source": "CLD",
      "type": "string",
      "cleaning_regex": [
        { "pattern": "\\D", "replacement": "" },
        { "pattern": "^39", "replacement": "" }
      ]
    },
    {
      "name": "start_time",
      "source": "START_TIME",
      "type": "datetime",
      "format": "%d/%m/%Y %H:%M:%S"
    },
    {
      "name": "duration",
      "source": "DURATION",
      "type": "integer"
    }
  ]
}

file2_config.json: uses column_index (0-based) to select fields by position instead of by header name. The header row is still consumed but ignored for field lookup.

{
  "file_type": "csv",
  "separator": ";",
  "fields": [
    {
      "name": "caller",
      "column_index": 0,
      "type": "string",
      "cleaning_regex": [
        { "pattern": "^39", "replacement": "" }
      ]
    },
    {
      "name": "called",
      "column_index": 1,
      "type": "string",
      "cleaning_regex": [
        { "pattern": "^39", "replacement": "" }
      ]
    },
    {
      "name": "start_time",
      "column_index": 2,
      "type": "datetime",
      "format": "%d/%m/%Y %H:%M:%S"
    },
    {
      "name": "duration",
      "column_index": 3,
      "type": "integer"
    }
  ]
}

Step 3 – Match JSON configuration

{
  "matching_fields": [
    { "field": "caller" },
    { "field": "called" },
    { "field": "start_time", "tolerance_seconds": 3 },
    { "field": "duration", "tolerance_seconds": 5 }
  ]
}

caller and called are exact-match string fields (the composite key).
start_time allows ±3 seconds.
duration allows ±5 units.

Step 4 – Run

from pycompare.reconcile import ReconcileEngine

engine = ReconcileEngine()
stats = engine.run(
    file1_path="file1.csv",
    file1_config_path="file1_config.json",
    file2_path="file2.csv",
    file2_config_path="file2_config.json",
    match_config_path="match_config.json",
    output_dir="output",
)

Step 5 – Results

Total File 1:   5
Total File 2:   4
Reconciled:     3    ← records 1, 2, 3 matched
Only File 1:    2    ← records 4, 5 had no counterpart
Only File 2:    1    ← record 4 had no counterpart
Match %:       60.00%

What happened under the hood:

Record	File 1 (normalized)	File 2 (normalized)	Match?
1	caller=3471234567, called=0212345678, start=10:00:00, dur=120	caller=3471234567, called=0212345678, start=10:00:02, dur=118	Yes (times within 2s, durations within 2)
2	caller=3472345678, called=0223456789, start=10:05:00, dur=60	caller=3472345678, called=0223456789, start=10:05:01, dur=62	Yes (times within 1s, durations within 2)
3	caller=3473456789, called=0234567890, start=10:10:00, dur=90	caller=3473456789, called=0234567890, start=10:10:00, dur=91	Yes (exact time, durations within 1)
4	caller=3474567890, called=045678901, start=10:15:00, dur=45	(no matching key)	No
5	caller=3475678901, called=056789012, start=10:20:00, dur=30	(no matching key)	No
—	(no matching key)	caller=9999999999, called=9999999999, start=11:00:00, dur=10	No

Fixed-Length Example

Configuration — each field declares start (zero-based) and length:

{
  "file_type": "fixed_length",
  "fields": [
    { "name": "caller",     "source": "CLI",        "type": "string",   "start": 0,  "length": 20 },
    { "name": "called",     "source": "CLD",        "type": "string",   "start": 20, "length": 20 },
    { "name": "start_time", "source": "START_TIME",  "type": "datetime", "start": 40, "length": 19,
      "format": "%d/%m/%Y %H:%M:%S" },
    { "name": "duration",   "source": "DURATION",   "type": "integer",  "start": 59, "length": 5 }
  ]
}

Data file (cdrs.txt):

+39 347-1234567     +39 02-12345678     01/01/2024 10:00:00  120
+39 347-2345678     +39 02-23456789     01/01/2024 10:05:00   60

Each line is exactly 64 characters. The fields are sliced as:

line[0:20] → caller
line[20:40] → called
line[40:59] → start_time
line[59:64] → duration

GUI Reference

Input Configuration — Fill in the six fields using the Browse buttons (or type paths manually):
- File 1, Config File 1 JSON, File 2, Config File 2 JSON, Matching Config JSON
Output Directory — Defaults to ./output; change via the Output Dir button.
Compare — Starts the pipeline. A modal progress popup with a spinner and logo appears.
Results — After completion the seven metrics update automatically.
Log — Scrollable text area showing every pipeline step with timestamps.
Close — If a comparison is running, you are prompted to confirm exit.

Error Handling

All errors are surfaced as descriptive message boxes and logged:

Scenario	Behavior
File not found	Error popup listing missing files
Invalid JSON	Error popup with the parse error
Missing required keys	Validation error with details
Regex compilation failure	ValueError with pattern and error
Datetime parse failure	Warning recorded, record skipped (after 100 errors the process aborts)
Column not found in CSV	Error popup listing missing columns
Export error	Exception logged, error popup shown

Performance

Tested on 200 000 records per file (190 000 matches):

Stage	Throughput
CSV parsing (csv.DictReader)	~150 000 rec/s
Normalization	~30 000 rec/s
Hash-indexed matching	~130 000 rec/s
Excel report generation	depends on output size

The algorithm is O(n) and scales linearly — 1 000 000 records per file is supported.

Build Executable

python build_exe.py

Output: dist/PyCompare.exe — a single-file, no-console Windows application built with PyInstaller.

Dependencies

Python ≥ 3.12
pandas ≥ 2.0.0
openpyxl ≥ 3.1.0
Pillow ≥ 10.0.0
PyInstaller ≥ 6.0.0 (for builds)

License

This project is licensed under the GNU GENERAL PUBLIC LICENSE 3.0 License - see the LICENSE file for details

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
assets/logo		assets/logo
icon		icon
pycompare		pycompare
.gitignore		.gitignore
LICENSE		LICENSE
PyCompare.spec		PyCompare.spec
README.md		README.md
build_exe.py		build_exe.py
build_exe.spec		build_exe.spec
main.py		main.py
test_e2e.py		test_e2e.py

Folders and files

Latest commit

History

Repository files navigation

PyCompare — File Reconciliation Tool

Table of Contents

Features

Quick Start

Input Format: CSV

Input Format: Fixed-Length

JSON Configuration

1. File Configuration

Top-level keys

Field descriptor

Regex rule

2. Match Configuration

Normalization Rules

string

datetime

integer

decimal

Matching Algorithm

Output

Statistics

Reports

Worked Example

Step 1 – The CSV files

Step 2 – File JSON configurations

Step 3 – Match JSON configuration

Step 4 – Run

Step 5 – Results

Fixed-Length Example

GUI Reference

Error Handling

Performance

Build Executable

Dependencies

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages