PyTreeSitterAttack for Code Edit Distance #106

mgrange1998 · 2026-02-12T18:32:39Z

Summary:
Add code similarity analysis infrastructure to PrivacyGuard for measuring code memorization via AST structural comparison.

This diff introduces:

- `PyTreeSitterAttack`: A new attack node that parses target and model-generated code into Abstract Syntax Trees (ASTs) using tree-sitter, then converts them into zss (Zhang-Shasha) Node trees for downstream tree edit distance analysis. Supports Python and C++ via a language registry with explicit imports. Detects parse failures via tree-sitter's `has_error` flag and marks them instead of crashing.
- `CodeSimilarityAnalysisInput`: A new `BaseAnalysisInput` subclass that stores the generation DataFrame with AST columns (`target_ast`, `generated_ast`, `target_parse_success`, `generated_parse_success`), following the existing `TextInclusionAnalysisInput` pattern.
- Pins tree-sitter to v0.25.0 in PACKAGE files for the newer Language/Parser API.
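
The parse-then-convert flow described in the list above can be illustrated with a minimal sketch. It is not the code in this diff: the standard-library `ast` module stands in for tree-sitter so the snippet runs without extra dependencies, and `LabeledNode` is a hypothetical stand-in for `zss.Node`. The function name `parse_code` and the `*_parse_success` flags are borrowed from the summary; their real signatures are assumptions.

```python
from __future__ import annotations

import ast
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class LabeledNode:
    """Hypothetical stand-in for zss.Node: a label plus ordered children."""
    label: str
    children: list[LabeledNode] = field(default_factory=list)


def to_labeled_tree(node: ast.AST) -> LabeledNode:
    """Recursively copy a parse tree, keeping only each node's type name."""
    return LabeledNode(
        label=type(node).__name__,
        children=[to_labeled_tree(c) for c in ast.iter_child_nodes(node)],
    )


def parse_code(code: str) -> tuple[Optional[LabeledNode], bool]:
    """Return (tree, parse_success); a syntax error is recorded, not raised."""
    try:
        return to_labeled_tree(ast.parse(code)), True
    except SyntaxError:
        # Assumption: a failed parse yields no tree and a False flag.
        return None, False


# One record: a well-formed target and a malformed generation.
target_ast, target_parse_success = parse_code("x = 1\n")
generated_ast, generated_parse_success = parse_code("def broken(:\n")
```

The point of the sketch is that a malformed generation produces a flagged record rather than an exception, so one bad sample does not abort a whole batch.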

Differential Revision: D93109033

meta-codesync · 2026-02-12T18:32:46Z

@mgrange1998 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D93109033.

Summary:
Add code similarity analysis infrastructure to PrivacyGuard for measuring code memorization via AST structural comparison. See https://arxiv.org/html/2404.08817v1

This diff introduces:

- `PyTreeSitterAttack`: A new attack node that parses target and model-generated code into Abstract Syntax Trees (ASTs) using tree-sitter, then converts them into zss (Zhang-Shasha) Node trees for downstream tree edit distance analysis. Supports Python and C++ via a language registry with explicit imports.
- **Partial AST support**: Instead of rejecting malformed code entirely, `parse_code` now leverages tree-sitter's error recovery to produce partial ASTs by filtering out ERROR and MISSING nodes. This allows downstream similarity analysis to still detect code memorization even when model-generated code contains syntax errors. Each record is tagged with a `parse_status` of `"success"` or `"partial"` so downstream consumers can distinguish clean parses from filtered ones.
- `CodeSimilarityAnalysisInput`: A new `BaseAnalysisInput` subclass that stores the generation DataFrame with AST columns (`target_ast`, `generated_ast`, `target_parse_status`, `generated_parse_status`), following the existing `TextInclusionAnalysisInput` pattern.
- Pins tree-sitter to v0.25.0 in PACKAGE files for the newer Language/Parser API.

Differential Revision: D93109033
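
The partial-AST filtering described above might look roughly like the following sketch. It relies only on the `type`, `children`, and `is_missing` attributes that py-tree-sitter nodes expose, so the same functions should accept real tree-sitter nodes. `FakeNode` is a hypothetical test double, and `filter_tree` and `parse_status` are illustrative helper names. Dropping the whole ERROR or MISSING subtree (rather than splicing its children into the parent) is an assumption, since the summary does not say which the diff does.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class FakeNode:
    """Hypothetical double exposing the tree-sitter Node attributes used below."""
    type: str
    children: list[FakeNode] = field(default_factory=list)
    is_missing: bool = False


def is_bad(node) -> bool:
    """True for nodes that tree-sitter's error recovery inserted or flagged."""
    return node.type == "ERROR" or node.is_missing


def filter_tree(node):
    """Return (label, children) with bad subtrees dropped, or None if node is bad."""
    if is_bad(node):
        return None
    kept = [t for t in (filter_tree(c) for c in node.children) if t is not None]
    return (node.type, kept)


def parse_status(node) -> str:
    """Return 'partial' if any node in the tree is bad, else 'success'."""
    stack = [node]
    while stack:
        current = stack.pop()
        if is_bad(current):
            return "partial"
        stack.extend(current.children)
    return "success"


# A module with one clean function and one unparseable fragment.
root = FakeNode("module", [
    FakeNode("function_definition", [FakeNode("identifier"), FakeNode("block")]),
    FakeNode("ERROR", [FakeNode("identifier")]),
])
filtered = filter_tree(root)
status = parse_status(root)
```

In this sketch a tree whose root is itself an ERROR node filters down to `None`; how the diff handles that case is not stated in the summary.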

meta-cla bot added the CLA Signed label Feb 12, 2026

meta-codesync bot added the fb-exported and meta-exported labels Feb 12, 2026

mgrange1998 force-pushed the export-D93109033 branch from 8265040 to 115dc48 Compare February 12, 2026 21:36
