Reference: CLI and Python API

Command-line interface

python3 trigram_compare.py <file_a> <file_b> [options]

Positional arguments

Argument	Description
`file_a`	Path to the first binary file
`file_b`	Path to the second binary file

Options

Option	Default	Description
`--json`	off	Print a JSON object to stdout instead of the formatted report
`--hotspots N`	10	Maximum number of hotspot rows to show in the table
`--coverage` / `--no-coverage`	on	Show or hide the coverage map section
`--window SIZE`	256	Window size in bytes used for both hotspot grid cells and the coverage map base unit. Coverage window is `4 × SIZE`.
`--threshold FLOAT`	0.25	Minimum trigram density (`count / window`) for a cell to become a hotspot. Coverage threshold is `0.6 × FLOAT`.
`--no-color`	off	Disable ANSI escape codes. Also disabled automatically when stdout is not a TTY.
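
The two derived values in the table can be sketched as follows. This is a minimal illustration of the arithmetic stated above, not the tool's own code; the function name `derive_coverage_params` is hypothetical.

```python
def derive_coverage_params(window: int = 256, threshold: float = 0.25):
    """Return (coverage_window, coverage_threshold) for the given CLI values.

    Hypothetical helper: it only restates the documented rules
    "coverage window is 4 x SIZE" and "coverage threshold is 0.6 x FLOAT".
    """
    coverage_window = 4 * window
    coverage_threshold = 0.6 * threshold
    return coverage_window, coverage_threshold


# With the CLI defaults this gives a 1024-byte coverage window and a
# coverage threshold of about 0.15.
default_window, default_threshold = derive_coverage_params()
```

With the defaults, the results line up with the `coverage_window=1024` and `coverage_min_density=0.15` defaults of `compare()` in the Python API.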

Exit codes

Code	Meaning
0	Analysis completed (any verdict)
1	One or both input files not found
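
A script that drives the CLI might look like the sketch below. It uses only the flags and exit codes documented above. The helper names `build_command` and `run_compare` are hypothetical, and the script path is assumed to be `trigram_compare.py` in the current directory.

```python
import json
import subprocess
import sys


def build_command(file_a, file_b, window=256, threshold=0.25,
                  script="trigram_compare.py"):
    """Build the argv list for a JSON-mode run (hypothetical helper)."""
    return [
        sys.executable, script, str(file_a), str(file_b),
        "--json",
        "--window", str(window),
        "--threshold", str(threshold),
    ]


def run_compare(file_a, file_b, **kwargs):
    """Run the CLI and return the parsed JSON report."""
    proc = subprocess.run(build_command(file_a, file_b, **kwargs),
                          capture_output=True, text=True)
    if proc.returncode == 1:
        # Documented meaning of exit code 1.
        raise FileNotFoundError("one or both input files not found")
    proc.check_returncode()  # any other non-zero code is unexpected
    return json.loads(proc.stdout)
```

Because exit code 0 covers every verdict, a caller has to read the `verdict` field of the JSON output to tell similar files from dissimilar ones.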

JSON output schema

{
  "files": {
    "a": { "path": "string", "size": int, "total_trigrams": int, "unique_trigrams": int },
    "b": { "path": "string", "size": int, "total_trigrams": int, "unique_trigrams": int }
  },
  "metrics": {
    "jaccard": float,
    "cosine": float,
    "containment_a_in_b": float,
    "containment_b_in_a": float,
    "shared_trigrams": int,
    "unique_to_a": int,
    "unique_to_b": int
  },
  "verdict": "string",
  "hotspots": [
    { "offset_a": int, "offset_b": int, "length": int, "trigram_count": int, "density": float }
  ],
  "coverage_segments": [
    { "start_a": int, "end_a": int, "start_b": int, "end_b": int, "density": float }
  ],
  "elapsed_seconds": float
}

Float values are rounded to 6 decimal places, except the per-hotspot `density`, which is rounded to 4.
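
As a rough illustration of consuming this schema, the sketch below parses a report and extracts a few fields. Every value in `SAMPLE` is made up for the example and is not real tool output; `summarize` is a hypothetical helper.

```python
import json

# Illustrative document shaped like the schema above; all values are invented.
SAMPLE = """
{
  "files": {
    "a": {"path": "a.bin", "size": 1024, "total_trigrams": 1022, "unique_trigrams": 800},
    "b": {"path": "b.bin", "size": 2048, "total_trigrams": 2046, "unique_trigrams": 1200}
  },
  "metrics": {
    "jaccard": 0.333333, "cosine": 0.41,
    "containment_a_in_b": 0.625, "containment_b_in_a": 0.416667,
    "shared_trigrams": 500, "unique_to_a": 300, "unique_to_b": 700
  },
  "verdict": "SHARED CODE REGION DETECTED",
  "hotspots": [
    {"offset_a": 256, "offset_b": 1024, "length": 256, "trigram_count": 96, "density": 0.375}
  ],
  "coverage_segments": [
    {"start_a": 0, "end_a": 1024, "start_b": 1024, "end_b": 2048, "density": 0.2}
  ],
  "elapsed_seconds": 0.0123
}
"""


def summarize(doc):
    """Pick out the verdict, Jaccard score, and strongest hotspot density."""
    hotspots = doc["hotspots"]
    top = hotspots[0] if hotspots else None  # the list may be empty
    return {
        "verdict": doc["verdict"],
        "jaccard": doc["metrics"]["jaccard"],
        "top_hotspot_density": top["density"] if top else None,
    }


summary = summarize(json.loads(SAMPLE))
```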

Python API (`trigram_index` module)

`TrigramIndex`

TrigramIndex(path: str | Path)

Methods

build() -> TrigramIndex
Reads the file via mmap and builds the internal trigram index. The index must be built before it is used; compare() does this automatically for any index that has not been built yet. Returns self for chaining: TrigramIndex(path).build().

compare(other, hotspot_window=256, hotspot_min_density=0.25, coverage_window=1024, coverage_min_density=0.15) -> SimilarityReport
Compares this index to other. Calls build() on either index if not yet built.

offsets(trigram: bytes | int) -> list[int]
Returns the sorted list of byte offsets where trigram occurs in the file. Accepts a 3-byte bytes object or a pre-packed int.

keys() -> set[int]
Returns the set of all unique trigrams (as packed ints) in the file.

Properties

Property	Type	Description
`path`	`Path`	Resolved file path
`size`	`int`	File size in bytes (set during `build()`)
`total_trigrams`	`int`	`max(0, size - 2)` — total number of trigrams including duplicates
`unique_trigrams`	`int`	Number of distinct trigrams
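
The idea behind `offsets()`, `keys()`, and the two trigram counts can be sketched in plain Python. This is not the module's implementation: it works on an in-memory `bytes` object rather than an mmap, the helper names are hypothetical, and the big-endian packing of three bytes into an int is an assumption, since the packing scheme is not documented here.

```python
from __future__ import annotations

from collections import defaultdict


def pack(trigram: bytes) -> int:
    """Pack a 3-byte trigram into an int (big-endian is an assumption)."""
    if len(trigram) != 3:
        raise ValueError("a trigram must be exactly 3 bytes")
    return int.from_bytes(trigram, "big")


def build_index(data: bytes) -> dict[int, list[int]]:
    """Map each packed trigram to the sorted offsets where it occurs."""
    index = defaultdict(list)
    # A file of `size` bytes contains max(0, size - 2) trigrams.
    for i in range(max(0, len(data) - 2)):
        index[pack(data[i:i + 3])].append(i)  # i ascends, so lists stay sorted
    return dict(index)


def offsets(index: dict[int, list[int]], trigram: bytes | int) -> list[int]:
    """Look up a trigram given as 3 bytes or as a pre-packed int."""
    key = trigram if isinstance(trigram, int) else pack(bytes(trigram))
    return index.get(key, [])


index = build_index(b"abcabc")
total_trigrams = sum(len(v) for v in index.values())  # duplicates included
unique_trigrams = len(index)                           # distinct trigrams
```

For `b"abcabc"` the trigrams are `abc`, `bca`, `cab`, `abc`, so the sketch finds four trigrams in total, three of them distinct, with `abc` at offsets 0 and 3.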

`SimilarityReport`

Dataclass returned by TrigramIndex.compare().

Field	Type	Description
`path_a`, `path_b`	`str`	Stringified file paths
`size_a`, `size_b`	`int`	File sizes in bytes
`total_trigrams_a`, `total_trigrams_b`	`int`	Total trigram counts, including duplicates
`unique_trigrams_a`, `unique_trigrams_b`	`int`	Unique trigram counts
`shared_trigrams`	`int`	`|keys_a ∩ keys_b|`
`jaccard`	`float`	Set-based Jaccard similarity
`cosine`	`float`	Frequency-weighted cosine similarity
`containment_a_in_b`	`float`	`shared / unique_a`
`containment_b_in_a`	`float`	`shared / unique_b`
`hotspots`	`list[Hotspot]`	Sorted by `trigram_count` descending, capped at 50
`coverage_segments`	`list[CoverageSegment]`	Sorted by `density` descending, capped at 20
`verdict`	`str` (property)	Classification string (see below)
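
The metric fields can be illustrated with a small sketch. It follows the formulas in the table (`shared / unique_a` and so on) and the textbook definitions of Jaccard and cosine similarity; the function names are hypothetical, and the module's handling of empty inputs is not documented, so returning 0.0 there is an assumption.

```python
import math
from collections import Counter


def trigram_counts(data: bytes) -> Counter:
    """Count every 3-byte window in data."""
    return Counter(data[i:i + 3] for i in range(max(0, len(data) - 2)))


def similarity(a: Counter, b: Counter) -> dict:
    """Compute set-based and frequency-weighted metrics for two count maps."""
    shared = a.keys() & b.keys()
    union = len(a) + len(b) - len(shared)
    # Cosine uses occurrence counts, so repeated trigrams weigh more.
    dot = sum(a[k] * b[k] for k in shared)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return {
        "shared_trigrams": len(shared),
        "jaccard": len(shared) / union if union else 0.0,
        "cosine": dot / norm if norm else 0.0,
        "containment_a_in_b": len(shared) / len(a) if a else 0.0,
        "containment_b_in_a": len(shared) / len(b) if b else 0.0,
    }


metrics = similarity(trigram_counts(b"abcabc"), trigram_counts(b"abcd"))
```

Here the two inputs share only `abc`, out of four distinct trigrams overall, so the Jaccard score is 0.25 while the containment of the second input in the first is 0.5. Containment stays high when a small file is embedded in a large one, which Jaccard alone would hide.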

Verdict thresholds (evaluated in order):

Condition	Verdict
`jaccard >= 0.85`	`NEAR-IDENTICAL`
`jaccard >= 0.50`	`HIGHLY SIMILAR`
`containment_a_in_b >= 0.70` or `containment_b_in_a >= 0.70`	`EMBEDDED CONTENT LIKELY`
`hotspots[0].trigram_count >= 64`	`SHARED CODE REGION DETECTED`
`jaccard >= 0.15`	`MODERATE SIMILARITY`
`jaccard >= 0.05`	`LOW SIMILARITY`
otherwise	`DISSIMILAR`
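
Read as code, the table is a first-match-wins chain. The sketch below mirrors the documented order and thresholds; the function name `classify` and its signature are hypothetical, and treating an empty hotspot list as a top count of 0 is an assumption.

```python
def classify(jaccard: float,
             containment_a_in_b: float,
             containment_b_in_a: float,
             top_hotspot_count: int = 0) -> str:
    """Return the verdict string; conditions are checked top to bottom."""
    if jaccard >= 0.85:
        return "NEAR-IDENTICAL"
    if jaccard >= 0.50:
        return "HIGHLY SIMILAR"
    if containment_a_in_b >= 0.70 or containment_b_in_a >= 0.70:
        return "EMBEDDED CONTENT LIKELY"
    # top_hotspot_count stands in for hotspots[0].trigram_count.
    if top_hotspot_count >= 64:
        return "SHARED CODE REGION DETECTED"
    if jaccard >= 0.15:
        return "MODERATE SIMILARITY"
    if jaccard >= 0.05:
        return "LOW SIMILARITY"
    return "DISSIMILAR"
```

Order matters: a pair with a Jaccard score of 0.10 and a containment of 0.80 is reported as `EMBEDDED CONTENT LIKELY`, not `LOW SIMILARITY`, because the containment check comes first.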

`Hotspot`

Field	Type	Description
`offset_a`	`int`	Start of the matching window in file A
`offset_b`	`int`	Start of the matching window in file B
`length`	`int`	Window size in bytes (equals `hotspot_window`)
`trigram_count`	`int`	Number of shared trigrams in this cell

Density is `trigram_count / length`; it appears as the `density` field of each hotspot in the JSON output.
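
A stand-in for the record shows how density relates to the hotspot threshold. The field names follow the table above; exposing density as a property is an assumption made for this sketch, not a statement about the real dataclass.

```python
from dataclasses import dataclass


@dataclass
class HotspotSketch:
    """Illustrative stand-in for Hotspot, not the module's class."""
    offset_a: int
    offset_b: int
    length: int
    trigram_count: int

    @property
    def density(self) -> float:
        # Guard against a zero-length window (assumed behaviour).
        return self.trigram_count / self.length if self.length else 0.0


# 96 shared trigrams in a 256-byte window give a density of 0.375,
# which clears the default hotspot threshold of 0.25.
cell = HotspotSketch(offset_a=256, offset_b=1024, length=256, trigram_count=96)
is_hotspot = cell.density >= 0.25
```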

`CoverageSegment`

Field	Type	Description
`start_a`, `end_a`	`int`	Byte range in file A
`start_b`, `end_b`	`int`	Corresponding byte range in file B (median-anchored approximation)
`density`	`float`	Peak density within the merged segment
`size_a`	`int` (property)	`end_a - start_a`
`size_b`	`int` (property)	`end_b - start_b`
