@mgrange1998
Contributor

Summary:
Add code similarity analysis infrastructure to PrivacyGuard for measuring code memorization via AST structural comparison.

This diff introduces:

- `PyTreeSitterAttack`: A new attack node that parses target and model-generated code into Abstract Syntax Trees (ASTs) using tree-sitter, then converts them into zss (Zhang-Shasha) Node trees for downstream tree edit distance analysis. Supports Python and C++ via a language registry with explicit imports. Detects parse failures via tree-sitter's `has_error` flag and gracefully marks them rather than crashing.
- `CodeSimilarityAnalysisInput`: A new `BaseAnalysisInput` subclass that stores the generation DataFrame with AST columns (`target_ast`, `generated_ast`, `target_parse_success`, `generated_parse_success`), following the existing `TextInclusionAnalysisInput` pattern.
- Pins tree-sitter to v0.25.0 in PACKAGE files for the newer Language/Parser API.
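
For readers unfamiliar with the parse-then-convert flow, here is a minimal sketch of the idea. It is not the code in this diff: Python's standard-library `ast` module stands in for tree-sitter, the `TreeNode` dataclass is a hypothetical stand-in for `zss.Node`, and `to_labeled_tree` and `parse_or_mark` are illustrative names. A `SyntaxError` plays the role that tree-sitter's `has_error` flag plays in the real attack node.

```python
from __future__ import annotations

import ast
from dataclasses import dataclass, field


@dataclass
class TreeNode:
    """Hypothetical stand-in for zss.Node: a label plus ordered children."""
    label: str
    children: list[TreeNode] = field(default_factory=list)


def to_labeled_tree(node: ast.AST) -> TreeNode:
    """Recursively copy a parse tree, labeling each node by its type name."""
    return TreeNode(
        label=type(node).__name__,
        children=[to_labeled_tree(child) for child in ast.iter_child_nodes(node)],
    )


def parse_or_mark(code: str) -> tuple[TreeNode | None, bool]:
    """Return (tree, parse_success); a failed parse is marked, not raised."""
    try:
        return to_labeled_tree(ast.parse(code)), True
    except SyntaxError:
        # Assumption: stdlib `ast` has no error recovery, so a failure
        # yields no tree at all and only the success flag is recorded.
        return None, False


# Usage: one well-formed snippet and one malformed snippet.
target_ast, target_parse_success = parse_or_mark("x = 1")
generated_ast, generated_parse_success = parse_or_mark("def f(:")
print(target_parse_success, generated_parse_success)
```

Labeling nodes by type name only, as done here, compares structure and ignores identifiers and literals; whether the real conversion keeps more detail is not stated in this summary.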

Differential Revision: D93109033

meta-cla bot added the CLA Signed label Feb 12, 2026
meta-codesync bot commented Feb 12, 2026

@mgrange1998 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D93109033.

mgrange1998 added a commit to mgrange1998/PrivacyGuard-1 that referenced this pull request Feb 12, 2026
Summary:
Add code similarity analysis infrastructure to PrivacyGuard for measuring code memorization via AST structural comparison. See https://arxiv.org/html/2404.08817v1

This diff introduces:
- `PyTreeSitterAttack`: A new attack node that parses target and model-generated code into Abstract Syntax Trees (ASTs) using tree-sitter, then converts them into zss (Zhang-Shasha) Node trees for downstream tree edit distance analysis. Supports Python and C++ via a language registry with explicit imports.
- **Partial AST support**: Instead of rejecting malformed code entirely, `parse_code` now leverages tree-sitter's error recovery to produce partial ASTs by filtering out ERROR and MISSING nodes. This allows downstream similarity analysis to still detect code memorization even when model-generated code contains syntax errors. Each record is tagged with a `parse_status` of `"success"` or `"partial"` so downstream consumers can distinguish clean parses from filtered ones.
- `CodeSimilarityAnalysisInput`: A new `BaseAnalysisInput` subclass that stores the generation DataFrame with AST columns (`target_ast`, `generated_ast`, `target_parse_status`, `generated_parse_status`), following the existing `TextInclusionAnalysisInput` pattern.
- Pins tree-sitter to v0.25.0 in PACKAGE files for the newer Language/Parser API.
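
The partial-AST filtering can be pictured with the following sketch. It is a simplified illustration, not the implementation in this diff: `FakeNode` is a hypothetical stub that mimics the `type`, `children`, and `is_missing` attributes of a tree-sitter node, and the functions `filter_tree`, `contains_error`, and `parse_status` are illustrative names. It also assumes that a filtered node is dropped together with its whole subtree, which the summary does not specify.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class FakeNode:
    """Hypothetical stub mimicking the attributes of a tree-sitter node."""
    type: str
    children: list[FakeNode] = field(default_factory=list)
    is_missing: bool = False


def is_bad(node: FakeNode) -> bool:
    """True for nodes produced by error recovery (ERROR or MISSING)."""
    return node.type == "ERROR" or node.is_missing


def filter_tree(node: FakeNode):
    """Copy the tree as (label, children) tuples, skipping bad subtrees.

    Returns None when the node itself is an ERROR or MISSING node.
    Assumption: the entire subtree under a bad node is discarded.
    """
    if is_bad(node):
        return None
    kept = [t for t in (filter_tree(c) for c in node.children) if t is not None]
    return (node.type, kept)


def contains_error(node: FakeNode) -> bool:
    """True if any node in the tree is an ERROR or MISSING node."""
    return is_bad(node) or any(contains_error(c) for c in node.children)


def parse_status(root: FakeNode) -> str:
    """Tag a parse as "success" (clean) or "partial" (something was filtered)."""
    return "partial" if contains_error(root) else "success"


# Usage: a hand-built tree resembling a function with a syntax error.
root = FakeNode("module", [
    FakeNode("function_definition", [
        FakeNode("identifier"),
        FakeNode("ERROR", [FakeNode("identifier")]),
        FakeNode("block"),
    ]),
    FakeNode(")", is_missing=True),
])
print(parse_status(root), filter_tree(root))
```

Keeping the status separate from the filtered tree lets a downstream consumer decide whether to weight or exclude partial parses when computing similarity.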

Differential Revision: D93109033
mgrange1998 added a commit to mgrange1998/PrivacyGuard-1 that referenced this pull request Feb 12, 2026

Labels

CLA Signed, fb-exported, meta-exported
