PyCompare is a desktop Windows application for comparing two heterogeneous data files and producing reconciliation statistics and reports. It is fully configurable via JSON — no hardcoded domain logic — making it suitable for telecom, finance, log analysis, and any data-matching task.
- Features
- Quick Start
- Input Format: CSV
- Input Format: Fixed-Length
- JSON Configuration
- Normalization Rules
- Matching Algorithm
- Output
- Worked Example
- Fixed-Length Example
- GUI Reference
- Error Handling
- Performance
- Build Executable
- Dependencies
- License
- Parse CSV (configurable separator) and fixed-length files
- Select fields by column name (
source) or column position (column_index) — only the fields needed for comparison; unmentioned columns are ignored - Normalize fields with per-field regex cleaning chains, datetime parsing, integer/decimal conversion
- Match records using a hash-indexed O(n) algorithm — no nested loops
- Tolerances — configurable tolerance for datetime (seconds), integer (units), and decimal values
- Reports — generates
reconciled.xlsx,only_file1.xlsx,only_file2.xlsx,statistics.csvwith record sequence numbers - GUI — Tkinter interface with real-time logging, progress spinner, and custom color theme
- 100 % configurable — no domain-specific logic baked in
# Install dependencies
pip install -r requirements.txt
# Launch the GUI
python main.py
# or
python -m pycompare
# Build a standalone .exe
python build_exe.py
# → dist/PyCompare.exeAny delimiter-separated file with a header row. Supported separators:
| Separator | separator value |
|---|---|
| Semicolon | ";" |
| Comma | "," |
| Pipe | `" |
| Tab | "TAB" |
Example file (calls.csv):
CLI;CLD;START_TIME;DURATION
+39 347-1234567;+39 02-12345678;01/01/2024 10:00:00;120
+39 347-2345678;+39 02-23456789;01/01/2024 10:05:00;60Fields can be selected by column name (using the source key) or by 0-based column position
(using the column_index key). You only need to specify the fields you care about; extra columns
are silently ignored. Files with UTF-8 BOM are handled automatically.
A text file where every record occupies a fixed number of characters and each field is extracted by its start position and length.
Example file (cdrs.txt):
+39 347-1234567 +39 02-12345678 01/01/2024 10:00:00 120
+39 347-2345678 +39 02-23456789 01/01/2024 10:05:00 60
Field positions are defined in the JSON configuration (see below).
Two JSON configuration files are needed — one per source file — plus a match configuration file.
{
"file_type": "csv",
"separator": ";",
"fields": [
{
"name": "caller",
"source": "CLI",
"type": "string",
"cleaning_regex": [
{ "pattern": "\\D", "replacement": "" },
{ "pattern": "^39", "replacement": "" }
]
},
{
"name": "start_time",
"source": "START_TIME",
"type": "datetime",
"format": "%d/%m/%Y %H:%M:%S"
},
{
"name": "duration",
"source": "DURATION",
"type": "integer"
},
{
"name": "amount",
"source": "AMOUNT",
"type": "decimal"
}
]
}| Key | Type | Required | Description |
|---|---|---|---|
file_type |
string |
yes | "csv" or "fixed_length" |
separator |
string |
for CSV | The column delimiter (";", ",", `" |
fields |
array |
yes | Array of field descriptors (see below) |
| Key | Type | Required | Description |
|---|---|---|---|
name |
string |
yes | Normalized field name used in matching |
source |
string |
for CSV (unless column_index is given) |
Column header name in the CSV file |
column_index |
int |
for CSV (unless source is given) |
0-based column position in the CSV file |
type |
string |
yes | "string", "datetime", "integer", or "decimal" |
cleaning_regex |
array |
no | Ordered list of regex substitutions applied before matching |
format |
string |
for datetime | strptime format string, e.g. "%d/%m/%Y %H:%M:%S" |
start |
int |
for fixed-length | Zero-based start position |
length |
int |
for fixed-length | Number of characters |
You may specify only the fields you need for comparison — there is no requirement to list every column present in the file. Unmentioned columns are simply ignored.
Selecting fields by column name (source):
{
"file_type": "csv",
"separator": ";",
"fields": [
{ "name": "caller", "source": "CLI", "type": "string" },
{ "name": "called", "source": "CLD", "type": "string" }
]
}Selecting fields by column position (column_index, 0-based):
{
"file_type": "csv",
"separator": ";",
"fields": [
{ "name": "caller", "column_index": 0, "type": "string" },
{ "name": "called", "column_index": 1, "type": "string" },
{ "name": "start_time", "column_index": 2, "type": "datetime", "format": "%d/%m/%Y %H:%M:%S" },
{ "name": "duration", "column_index": 3, "type": "integer" }
]
}A field can include both source and column_index; when both are present, column_index
takes precedence (useful as documentation or for fallback).
{ "pattern": "\\D", "replacement": "" }Each rule applies re.sub(pattern, replacement, value) in order.
{
"matching_fields": [
{ "field": "caller" },
{ "field": "start_time", "tolerance_seconds": 3 },
{ "field": "duration", "tolerance_seconds": 5 },
{ "field": "amount", "tolerance": 0.01 }
]
}| Key | Type | Description |
|---|---|---|
field |
string |
The normalized field name (must match name in both file configs) |
tolerance_seconds |
int |
Tolerance for datetime or integer fields — abs(a - b) ≤ value |
tolerance |
float |
Tolerance for decimal fields — abs(a - b) ≤ value |
Fields without a tolerance are part of the composite hash key (exact match required).
- If
cleaning_regexis present, each regex is applied in order. - If no regex rules, the raw string is used as-is.
Input: +39 347-1234567
Regex 1: \D → "" → "393471234567"
Regex 2: ^39 → "" → "3471234567"
Result: "3471234567"
- Strips whitespace.
- Parsed with
datetime.strptime(value, format)using the configuredformat.
Input: "01/01/2024 10:00:00"
Format: "%d/%m/%Y %H:%M:%S"
Result: datetime(2024, 1, 1, 10, 0, 0)
- Strips whitespace.
- Tries
int(value). - Falls back to applying
cleaning_regex(if any), then retries.
Input: " 120 "
Result: 120
- Strips whitespace and replaces
,with.. - Tries
Decimal(value). - Falls back to applying
cleaning_regex, then retries.
Input: "1.234,56"
Result: Decimal("1234.56")
The algorithm avoids O(n²) nested loops by using a hash index:
- Classify fields — Fields with no tolerance (or
type="string") become key fields; fields with tolerance become tolerance fields. - Build composite key — For the key fields, a tuple of values is used as a dictionary key.
- Index File B — A
dict[tuple, list[index]]maps every distinct composite key to the record indices in File B. - Iterate File A — For each record:
- Compute its composite key and look up candidates in the index.
- For each candidate not already matched, apply the tolerance checks.
- If all tolerances pass, the pair is recorded and the candidate is marked as used.
- Result — Unmatched indices are returned as only in file 1 / only in file 2.
Complexity: O(n) — a single pass over file A with hash lookups against file B.
Displayed in the GUI and written to statistics.csv:
| Metric | Description |
|---|---|
| Total File 1 | Number of parsed and normalized records in File 1 |
| Total File 2 | Number of parsed and normalized records in File 2 |
| Reconciled | Pairs of records that matched |
| Only File 1 | Records in File 1 with no match in File 2 |
| Only File 2 | Records in File 2 with no match in File 1 |
| Match % | (Reconciled / Total File 1) × 100 |
| Processing Time | Wall-clock time in seconds |
All written to the output directory:
| File | Content |
|---|---|
reconciled.xlsx |
Side-by-side view of matched records. Columns: F1_Row (record position in File 1, 1-based), F2_Row (record position in File 2, 1-based), F1_{fields}, F2_{fields}, Diff_{numeric/datetime fields} |
only_file1.xlsx |
Records present only in File 1. First column Row shows the record's 1-based position from the original input file. |
only_file2.xlsx |
Records present only in File 2. First column Row shows the record's 1-based position from the original input file. |
statistics.csv |
Summary metrics |
file1.csv (5 call records):
CLI;CLD;START_TIME;DURATION
+39 347-1234567;+39 02-12345678;01/01/2024 10:00:00;120
+39 347-2345678;+39 02-23456789;01/01/2024 10:05:00;60
+39 347-3456789;+39 02-34567890;01/01/2024 10:10:00;90
+39 347-4567890;+39 02-45678901;01/01/2024 10:15:00;45
+39 347-5678901;+39 02-56789012;01/01/2024 10:20:00;30file2.csv (4 call records, slightly different formatting):
CLI;CLD;START_TIME;DURATION
393471234567;390212345678;01/01/2024 10:00:02;118
393472345678;390223456789;01/01/2024 10:05:01;62
393473456789;390234567890;01/01/2024 10:10:00;91
999999999999;999999999999;01/01/2024 11:00:00;10file1_config.json: removes all non-digits then strips the Italian prefix 39 from caller/called.
{
"file_type": "csv",
"separator": ";",
"fields": [
{
"name": "caller",
"source": "CLI",
"type": "string",
"cleaning_regex": [
{ "pattern": "\\D", "replacement": "" },
{ "pattern": "^39", "replacement": "" }
]
},
{
"name": "called",
"source": "CLD",
"type": "string",
"cleaning_regex": [
{ "pattern": "\\D", "replacement": "" },
{ "pattern": "^39", "replacement": "" }
]
},
{
"name": "start_time",
"source": "START_TIME",
"type": "datetime",
"format": "%d/%m/%Y %H:%M:%S"
},
{
"name": "duration",
"source": "DURATION",
"type": "integer"
}
]
}file2_config.json: uses column_index (0-based) to select fields by position instead of by header name. The header row is still consumed but ignored for field lookup.
{
"file_type": "csv",
"separator": ";",
"fields": [
{
"name": "caller",
"column_index": 0,
"type": "string",
"cleaning_regex": [
{ "pattern": "^39", "replacement": "" }
]
},
{
"name": "called",
"column_index": 1,
"type": "string",
"cleaning_regex": [
{ "pattern": "^39", "replacement": "" }
]
},
{
"name": "start_time",
"column_index": 2,
"type": "datetime",
"format": "%d/%m/%Y %H:%M:%S"
},
{
"name": "duration",
"column_index": 3,
"type": "integer"
}
]
}{
"matching_fields": [
{ "field": "caller" },
{ "field": "called" },
{ "field": "start_time", "tolerance_seconds": 3 },
{ "field": "duration", "tolerance_seconds": 5 }
]
}caller and called are exact-match string fields (the composite key).
start_time allows ±3 seconds.
duration allows ±5 units.
from pycompare.reconcile import ReconcileEngine
engine = ReconcileEngine()
stats = engine.run(
file1_path="file1.csv",
file1_config_path="file1_config.json",
file2_path="file2.csv",
file2_config_path="file2_config.json",
match_config_path="match_config.json",
output_dir="output",
)Total File 1: 5
Total File 2: 4
Reconciled: 3 ← records 1, 2, 3 matched
Only File 1: 2 ← records 4, 5 had no counterpart
Only File 2: 1 ← record 4 had no counterpart
Match %: 60.00%
What happened under the hood:
| Record | File 1 (normalized) | File 2 (normalized) | Match? |
|---|---|---|---|
| 1 | caller=3471234567, called=0212345678, start=10:00:00, dur=120 | caller=3471234567, called=0212345678, start=10:00:02, dur=118 | Yes (times within 2s, durations within 2) |
| 2 | caller=3472345678, called=0223456789, start=10:05:00, dur=60 | caller=3472345678, called=0223456789, start=10:05:01, dur=62 | Yes (times within 1s, durations within 2) |
| 3 | caller=3473456789, called=0234567890, start=10:10:00, dur=90 | caller=3473456789, called=0234567890, start=10:10:00, dur=91 | Yes (exact time, durations within 1) |
| 4 | caller=3474567890, called=045678901, start=10:15:00, dur=45 | (no matching key) | No |
| 5 | caller=3475678901, called=056789012, start=10:20:00, dur=30 | (no matching key) | No |
| — | (no matching key) | caller=9999999999, called=9999999999, start=11:00:00, dur=10 | No |
Configuration — each field declares start (zero-based) and length:
{
"file_type": "fixed_length",
"fields": [
{ "name": "caller", "source": "CLI", "type": "string", "start": 0, "length": 20 },
{ "name": "called", "source": "CLD", "type": "string", "start": 20, "length": 20 },
{ "name": "start_time", "source": "START_TIME", "type": "datetime", "start": 40, "length": 19,
"format": "%d/%m/%Y %H:%M:%S" },
{ "name": "duration", "source": "DURATION", "type": "integer", "start": 59, "length": 5 }
]
}Data file (cdrs.txt):
+39 347-1234567 +39 02-12345678 01/01/2024 10:00:00 120
+39 347-2345678 +39 02-23456789 01/01/2024 10:05:00 60
Each line is exactly 64 characters. The fields are sliced as:
line[0:20]→ callerline[20:40]→ calledline[40:59]→ start_timeline[59:64]→ duration
- Input Configuration — Fill in the six fields using the Browse buttons (or type paths manually):
- File 1, Config File 1 JSON, File 2, Config File 2 JSON, Matching Config JSON
- Output Directory — Defaults to
./output; change via the Output Dir button. - Compare — Starts the pipeline. A modal progress popup with a spinner and logo appears.
- Results — After completion the seven metrics update automatically.
- Log — Scrollable text area showing every pipeline step with timestamps.
- Close — If a comparison is running, you are prompted to confirm exit.
All errors are surfaced as descriptive message boxes and logged:
| Scenario | Behavior |
|---|---|
| File not found | Error popup listing missing files |
| Invalid JSON | Error popup with the parse error |
| Missing required keys | Validation error with details |
| Regex compilation failure | ValueError with pattern and error |
| Datetime parse failure | Warning recorded, record skipped (after 100 errors the process aborts) |
| Column not found in CSV | Error popup listing missing columns |
| Export error | Exception logged, error popup shown |
Tested on 200 000 records per file (190 000 matches):
| Stage | Throughput |
|---|---|
| CSV parsing (csv.DictReader) | ~150 000 rec/s |
| Normalization | ~30 000 rec/s |
| Hash-indexed matching | ~130 000 rec/s |
| Excel report generation | depends on output size |
The algorithm is O(n) and scales linearly — 1 000 000 records per file is supported.
python build_exe.pyOutput: dist/PyCompare.exe — a single-file, no-console Windows application built with PyInstaller.
- Python ≥ 3.12
- pandas ≥ 2.0.0
- openpyxl ≥ 3.1.0
- Pillow ≥ 10.0.0
- PyInstaller ≥ 6.0.0 (for builds)
This project is licensed under the GNU GENERAL PUBLIC LICENSE 3.0 License - see the LICENSE file for details
