TreeEditDistanceNode for Code Similarity Analysis #107

mgrange1998 · 2026-02-12T18:33:12Z

Summary:
Add the tree edit distance analysis node to PrivacyGuard, completing the code memorization measurement pipeline.

This diff introduces:

- `TreeEditDistanceNode`: A new `BaseAnalysisNode` that computes normalized tree edit distance similarity between AST pairs produced by `PyTreeSitterAttack` (from Diff 1). Uses the Zhang-Shasha algorithm via `zss.simple_distance()` with normalization `max(1 - distance / max(n1, n2), 0)` to produce a 0-1 similarity score. Supports per-language grouping when a `language` column is present.
- `TreeEditDistanceNodeOutput`: A `BaseAnalysisOutput` dataclass with fields for `num_samples`, `num_both_parsed`, `per_sample_similarity`, `avg_similarity`, and optional `avg_similarity_by_language`.
- Adds both targets to the `analysis_library` umbrella.

Differential Revision: D93109088
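
To make the normalization above concrete, here is a minimal, dependency-free sketch. It is not the code in this PR: the PR calls `zss.simple_distance()`, whereas this sketch swaps in a small memoized forest-edit-distance recursion with unit costs so it runs without `zss` or tree-sitter. The tuple tree representation and the names `size`, `forest_distance`, and `similarity` are hypothetical, and `n1`, `n2` are assumed to be the node counts of the two trees.

```python
from functools import lru_cache

# Hypothetical tree representation: (label, (child, child, ...)).
# A forest is a tuple of such trees. Tuples keep everything hashable.


def size(tree):
    """Number of nodes in a tree."""
    _label, children = tree
    return 1 + sum(size(child) for child in children)


@lru_cache(maxsize=None)
def forest_distance(f1, f2):
    """Ordered forest edit distance with unit insert/delete/relabel costs."""
    if not f1:
        return sum(size(t) for t in f2)  # insert everything in f2
    if not f2:
        return sum(size(t) for t in f1)  # delete everything in f1
    (label1, kids1), (label2, kids2) = f1[-1], f2[-1]
    return min(
        # delete the rightmost root of f1, promoting its children
        forest_distance(f1[:-1] + kids1, f2) + 1,
        # insert the rightmost root of f2
        forest_distance(f1, f2[:-1] + kids2) + 1,
        # match the two rightmost roots, relabeling if they differ
        forest_distance(f1[:-1], f2[:-1])
        + forest_distance(kids1, kids2)
        + (label1 != label2),
    )


def similarity(t1, t2):
    """Normalization from the summary: max(1 - distance / max(n1, n2), 0)."""
    distance = forest_distance((t1,), (t2,))
    return max(1 - distance / max(size(t1), size(t2)), 0)


if __name__ == "__main__":
    a = ("f", (("a", ()), ("b", ())))
    b = ("f", (("a", ()), ("c", ())))
    print(similarity(a, a))  # identical trees
    print(similarity(a, b))  # one relabel out of three nodes
```

The clamp to 0 matters because the edit distance can exceed the larger tree's node count when the two shapes differ (for example, a three-node chain against a root with two leaves, with no labels in common), which would otherwise yield a negative score.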

meta-codesync · 2026-02-12T18:33:28Z

@mgrange1998 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D93109088.

Summary: Add code similarity analysis infrastructure to PrivacyGuard for measuring code memorization via AST structural comparison. See https://arxiv.org/html/2404.08817v1

This diff introduces:

- `PyTreeSitterAttack`: A new attack node that parses target and model-generated code into Abstract Syntax Trees (ASTs) using tree-sitter, then converts them into zss (Zhang-Shasha) Node trees for downstream tree edit distance analysis. Supports Python and C++ via a language registry with explicit imports.
- **Partial AST support**: Instead of rejecting malformed code entirely, `parse_code` now leverages tree-sitter's error recovery to produce partial ASTs by filtering out ERROR and MISSING nodes. This allows downstream similarity analysis to still detect code memorization even when model-generated code contains syntax errors. Each record is tagged with a `parse_status` of `"success"` or `"partial"` so downstream consumers can distinguish clean parses from filtered ones.
- `CodeSimilarityAnalysisInput`: A new `BaseAnalysisInput` subclass that stores the generation DataFrame with AST columns (`target_ast`, `generated_ast`, `target_parse_status`, `generated_parse_status`), following the existing `TextInclusionAnalysisInput` pattern.
- Pins tree-sitter to v0.25.0 in PACKAGE files for the newer Language/Parser API.

Differential Revision: D93109033

Summary: Add the tree edit distance analysis node to PrivacyGuard, completing the code memorization measurement pipeline. See https://arxiv.org/html/2404.08817v1

This diff introduces:

- `TreeEditDistanceNode`: A new `BaseAnalysisNode` that computes normalized tree edit distance similarity between AST pairs produced by `PyTreeSitterAttack` (from Diff 1). Uses the Zhang-Shasha algorithm via `zss.simple_distance()` with normalization `max(1 - distance / max(n1, n2), 0)` to produce a 0-1 similarity score. Supports per-language grouping when a `language` column is present.
- `TreeEditDistanceNodeOutput`: A `BaseAnalysisOutput` dataclass with fields for `num_samples`, `num_both_parsed`, `per_sample_similarity`, `avg_similarity`, and optional `avg_similarity_by_language`.
- Updated to work with the partial AST parsing from Diff 1: since `PyTreeSitterAttack` now always produces an AST (full or partial), the analysis node computes similarity for all rows unconditionally. Consumers can use the `parse_status` columns from the input to distinguish full vs partial parse results.
- Adds both targets to the `analysis_library` umbrella.

Differential Revision: D93109088
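
As a rough illustration of the per-language grouping described in the summary above, the sketch below aggregates per-row similarity scores into an overall average and an optional per-language average. It is an assumption-laden stand-in, not the node's implementation: the function name `summarize`, the use of plain dicts instead of a DataFrame, and the returned dict (rather than the actual `TreeEditDistanceNodeOutput` dataclass) are all hypothetical. `num_both_parsed` is left out because the summary does not define how it is counted.

```python
from collections import defaultdict
from statistics import mean


def summarize(rows):
    """Aggregate similarity scores; `rows` are dicts with a 'similarity'
    key and an optional 'language' key (hypothetical row shape)."""
    rows = list(rows)
    scores = [row["similarity"] for row in rows]

    # Group scores by language only for rows that carry a language value.
    by_language = defaultdict(list)
    for row in rows:
        if "language" in row:
            by_language[row["language"]].append(row["similarity"])

    return {
        "num_samples": len(scores),
        "per_sample_similarity": scores,
        "avg_similarity": mean(scores) if scores else 0.0,
        # None when no row has a language, mirroring the "optional" field.
        "avg_similarity_by_language": (
            {lang: mean(vals) for lang, vals in by_language.items()} or None
        ),
    }


if __name__ == "__main__":
    example = [
        {"similarity": 1.0, "language": "python"},
        {"similarity": 0.5, "language": "python"},
        {"similarity": 0.0, "language": "cpp"},
    ]
    print(summarize(example))
```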

meta-cla bot added the CLA Signed label Feb 12, 2026

meta-codesync bot added fb-exported meta-exported labels Feb 12, 2026

mgrange1998 added 2 commits February 12, 2026 13:37

mgrange1998 force-pushed the export-D93109088 branch from fdcaf92 to 7163f3b Compare February 12, 2026 21:37
