@mgrange1998
Contributor

Summary:
Add the tree edit distance analysis node to PrivacyGuard, completing the code memorization measurement pipeline.

This diff introduces:

  • TreeEditDistanceNode: A new BaseAnalysisNode that computes a normalized tree edit distance similarity between the AST pairs produced by PyTreeSitterAttack (from Diff 1). Uses the Zhang-Shasha algorithm via zss.simple_distance(), normalized as max(1 - distance / max(n1, n2), 0), to produce a similarity score between 0 and 1. Supports per-language grouping when a language column is present.
  • TreeEditDistanceNodeOutput: A BaseAnalysisOutput dataclass with fields for num_samples, num_both_parsed, per_sample_similarity, avg_similarity, and optional avg_similarity_by_language.
  • Adds both targets to the analysis_library umbrella.
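
The normalization step can be illustrated in isolation. The following is a minimal sketch, not the code from this diff: the function name `normalized_similarity` is hypothetical, `n1` and `n2` are assumed to be the node counts of the two trees, and the handling of two empty trees is an assumption. In the real node the raw `distance` would come from `zss.simple_distance()`.

```python
def normalized_similarity(distance: float, n1: int, n2: int) -> float:
    """Map a raw tree edit distance to a similarity score between 0 and 1.

    Hypothetical helper: n1 and n2 are assumed to be the node counts of
    the two trees being compared.
    """
    largest = max(n1, n2)
    if largest == 0:
        # Assumption: two empty trees are treated as identical.
        return 1.0
    # Clamp at 0 so a distance larger than the bigger tree never
    # produces a negative score.
    return max(1.0 - distance / largest, 0.0)
```

A distance of 0 yields a similarity of 1, and any distance at or above the size of the larger tree yields 0.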

Differential Revision: D93109088

@meta-cla meta-cla bot added the CLA Signed label Feb 12, 2026
@meta-codesync

meta-codesync bot commented Feb 12, 2026

@mgrange1998 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D93109088.

Summary:

Add code similarity analysis infrastructure to PrivacyGuard for measuring code memorization via AST structural comparison. See https://arxiv.org/html/2404.08817v1

This diff introduces:
- `PyTreeSitterAttack`: A new attack node that parses target and model-generated code into Abstract Syntax Trees (ASTs) using tree-sitter, then converts them into zss (Zhang-Shasha) Node trees for downstream tree edit distance analysis. Supports Python and C++ via a language registry with explicit imports.
- **Partial AST support**: Instead of rejecting malformed code entirely, `parse_code` now leverages tree-sitter's error recovery to produce partial ASTs by filtering out ERROR and MISSING nodes. This allows downstream similarity analysis to still detect code memorization even when model-generated code contains syntax errors. Each record is tagged with a `parse_status` of `"success"` or `"partial"` so downstream consumers can distinguish clean parses from filtered ones.
- `CodeSimilarityAnalysisInput`: A new `BaseAnalysisInput` subclass that stores the generation DataFrame with AST columns (`target_ast`, `generated_ast`, `target_parse_status`, `generated_parse_status`), following the existing `TextInclusionAnalysisInput` pattern.
- Pins tree-sitter to v0.25.0 in PACKAGE files for the newer Language/Parser API.
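
The partial-AST idea can be sketched without tree-sitter. This is an illustration under stated assumptions, not the `parse_code` implementation: `SyntaxNode`, `filter_tree`, and `tag_parse` are hypothetical names, the node shape is only loosely modeled on a parsed syntax node, and dropping the whole subtree under an ERROR or MISSING node is an assumed policy (the diff may handle such subtrees differently).

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SyntaxNode:
    """Hypothetical stand-in for a parsed syntax node."""
    type: str
    children: List["SyntaxNode"] = field(default_factory=list)
    is_missing: bool = False


def filter_tree(node: SyntaxNode) -> Tuple[SyntaxNode, bool]:
    """Return a copy of the tree without ERROR/MISSING nodes.

    The second element is True if anything was removed.
    Assumption: a flagged node is dropped together with its subtree.
    """
    kept: List[SyntaxNode] = []
    removed = False
    for child in node.children:
        if child.type == "ERROR" or child.is_missing:
            removed = True
            continue
        filtered_child, child_removed = filter_tree(child)
        kept.append(filtered_child)
        removed = removed or child_removed
    return SyntaxNode(node.type, kept, node.is_missing), removed


def tag_parse(root: SyntaxNode) -> Tuple[SyntaxNode, str]:
    """Pair the filtered tree with a parse status of 'success' or 'partial'."""
    filtered, removed = filter_tree(root)
    return filtered, "partial" if removed else "success"
```

A tree with no flagged nodes comes back unchanged with status `"success"`; any removal switches the status to `"partial"`.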

Differential Revision: D93109033

Summary:
Add the tree edit distance analysis node to PrivacyGuard, completing the code memorization measurement pipeline. See https://arxiv.org/html/2404.08817v1

This diff introduces:
- `TreeEditDistanceNode`: A new `BaseAnalysisNode` that computes a normalized tree edit distance similarity between the AST pairs produced by `PyTreeSitterAttack` (from Diff 1). Uses the Zhang-Shasha algorithm via `zss.simple_distance()`, normalized as `max(1 - distance / max(n1, n2), 0)`, to produce a similarity score between 0 and 1. Supports per-language grouping when a `language` column is present.
- `TreeEditDistanceNodeOutput`: A `BaseAnalysisOutput` dataclass with fields for `num_samples`, `num_both_parsed`, `per_sample_similarity`, `avg_similarity`, and optional `avg_similarity_by_language`.
- Updated to work with the partial AST parsing from Diff 1: since `PyTreeSitterAttack` now always produces an AST (full or partial), the analysis node computes similarity for all rows unconditionally. Consumers can use the `parse_status` columns from the input to distinguish full vs partial parse results.
- Adds both targets to the `analysis_library` umbrella.
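
The aggregation into overall and per-language averages might look roughly like the sketch below. It is not the `TreeEditDistanceNodeOutput` class itself: `SimilaritySummary` and `summarize` are hypothetical names, only a subset of the listed fields is modeled (`num_both_parsed` is left out because its exact semantics are not stated here), and plain lists stand in for the DataFrame columns.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence


@dataclass
class SimilaritySummary:
    """Hypothetical, reduced stand-in for the analysis output."""
    num_samples: int
    per_sample_similarity: List[float]
    avg_similarity: float
    avg_similarity_by_language: Optional[Dict[str, float]] = None


def summarize(
    similarities: Sequence[float],
    languages: Optional[Sequence[str]] = None,
) -> SimilaritySummary:
    """Average per-sample scores overall and, if given, per language."""
    n = len(similarities)
    # Assumption: an empty input averages to 0.0.
    avg = sum(similarities) / n if n else 0.0

    by_language = None
    if languages is not None:
        # Group scores by language label, then average each group.
        buckets: Dict[str, List[float]] = defaultdict(list)
        for lang, score in zip(languages, similarities):
            buckets[lang].append(score)
        by_language = {
            lang: sum(scores) / len(scores) for lang, scores in buckets.items()
        }

    return SimilaritySummary(n, list(similarities), avg, by_language)
```

When no language labels are supplied, `avg_similarity_by_language` stays `None`, mirroring the optional field described above.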

Differential Revision: D93109088

Labels

CLA Signed, fb-exported, meta-exported
