Common questions about V2's design, scope, behavior, and limitations.
V2 decides whether a candidate code change should be applied to a Python repository — before any modification is attempted. It produces a SAFE / REVIEW / BLOCK decision for each candidate function-pair, backed by deterministic semantic and behavioral analysis.
No. AI code review tools (which post comments on PRs after code is written) use LLMs to read diffs and generate natural-language feedback. V2 is a pre-modification gate that runs before patches reach the codebase, and uses deterministic embeddings — not LLMs — in the decision path.
No. SAST tools scan for known vulnerability patterns from established taxonomies (CWE, OWASP). V2 analyzes semantic and behavioral similarity between function pairs to decide if they're safe to merge or modify together. Different problem, different output.
Three reasons:
- Explainability — every V2 decision can be traced to specific engine signals (semantic score, behavioral tags, fusion logic).
- Reproducibility — same input produces same output, every time.
- Cost — V2 has no per-call API cost.
LLM scaffolding exists in V2's codebase (ai/ai_interface.py,
ai/llm_adapter.py) but is disabled. LLM integration is V3 scope.
Python only. V2's parser is built on Python's standard ast module.
Multi-language support is V3 scope.
Python 3.11 or newer. Some V2 modules use match-case statements (introduced in Python 3.10) and modern type-hint syntax, and 3.11 is the minimum version V2 supports.
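
A wrapper script can check the interpreter version before invoking V2. The following is a minimal sketch; `meets_minimum` and `require_minimum` are hypothetical helper names and are not part of V2.

```python
import sys

MIN_VERSION = (3, 11)  # minimum version stated in this FAQ


def meets_minimum(version_info=sys.version_info, minimum=MIN_VERSION):
    """Return True if the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)


def require_minimum(version_info=sys.version_info, minimum=MIN_VERSION):
    """Raise RuntimeError with a readable message on older interpreters."""
    if not meets_minimum(version_info, minimum):
        found = ".".join(str(part) for part in version_info[:2])
        needed = ".".join(str(part) for part in minimum)
        raise RuntimeError(f"Python {needed}+ is required, found {found}")
```

Calling `require_minimum()` at the top of a wrapper fails early with a clear message instead of a syntax error deep inside an import.
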
No. V2 has no LLM API dependencies.
Approximately 90 MB for sentence-transformers/all-MiniLM-L6-v2.
The model is downloaded once to your local cache and reused for every subsequent run.
The current release is configured for all-MiniLM-L6-v2. Swapping the
embedding model is an open research question — different models would
produce different semantic scores, which would require recalibration of
the fusion thresholds. This is documented as future calibration work.
No. V2 runs entirely locally. The embedding model is downloaded once from Hugging Face's model hub, then runs offline. No data is transmitted externally during evaluation.
Depends on repository size:
| Repository size | Approximate runtime |
|---|---|
| <50 files | 5–15 seconds |
| 50–200 files | 15–60 seconds |
| 200–1000 files | 1–3 minutes |
| 1000+ files | 3–10 minutes |
The decision pipeline is bounded by the pair_cap parameter. Larger
caps produce more analyses but take longer.
Three locations:
- V2 orchestrator runs (V2's own codebase): tests/output/v2/v2_orchestrator_report.json
- External repository evaluations: tests/output/v2/v2_1_repo_evaluation/<repo_name>_report.json
- Per-test reports: tests/output/v2/<test_category>_reports/<test_id>_report.json
python -m tests.intelligence.fusion_tests.tc_v2_047_repo_evaluation \
/path/to/your/repo \
50

Replace 50 with your desired pair cap. Use 25 for fast smoke testing and 100 for thorough evaluation.
Yes — V2 is a Python CLI that returns JSON. Integration with CI/CD is straightforward, but CI-specific tooling (GitHub Actions integrations, GitLab CI hooks, etc.) is not yet packaged. Pull requests welcome.
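
Until CI tooling is packaged, a small gate script can turn a report into a pass/fail exit code. The sketch below is an illustration only. It assumes a hypothetical report layout: a top-level `decisions` list whose items each carry a `decision` string. The real report schema is not described in this FAQ, so adjust the field names to match your reports.

```python
import json
from collections import Counter
from pathlib import Path


def summarize(report):
    """Count decisions by label.

    ASSUMPTION: `report["decisions"]` is a list of dicts with a "decision" key.
    This layout is hypothetical and may differ from V2's actual JSON output.
    """
    items = report.get("decisions", [])
    return Counter(item.get("decision", "UNKNOWN") for item in items)


def exit_code(counts, fail_on=("BLOCK",)):
    """Return 1 if any label in `fail_on` occurs at least once, else 0."""
    return 1 if any(counts.get(label, 0) for label in fail_on) else 0


def main(report_path):
    """Load a report file, print the counts, and return a CI exit code."""
    report = json.loads(Path(report_path).read_text(encoding="utf-8"))
    counts = summarize(report)
    print(dict(counts))
    return exit_code(counts)
```

A CI step could run the evaluation command, then call `sys.exit(main(path_to_report))`. Passing `fail_on=("BLOCK", "REVIEW")` to `exit_code` would make the gate stricter.
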
For each candidate function pair V2 analyzes:
- SAFE — Functions are similar in low-risk ways. Suggested governance action: AUTO_APPLY.
- REVIEW — Functions look similar but human judgment is needed. Suggested governance action: BATCH_APPROVAL or INDIVIDUAL_APPROVAL.
- BLOCK — Functions are too different to safely merge, OR they have opposing behaviors (e.g., one backs up while the other deletes). Suggested governance action: FREEZE_PATCH or escalated review.
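
The mapping from decision to suggested governance action can be written down as a lookup table. This is a sketch of the mapping as listed above, not V2's internal representation. `ESCALATED_REVIEW` is a placeholder label for "escalated review", which the list does not name as an identifier.

```python
from enum import Enum


class Decision(Enum):
    SAFE = "SAFE"
    REVIEW = "REVIEW"
    BLOCK = "BLOCK"


# Suggested actions per decision, taken from the list above.
# ASSUMPTION: "ESCALATED_REVIEW" is a made-up label for "escalated review".
SUGGESTED_ACTIONS = {
    Decision.SAFE: ("AUTO_APPLY",),
    Decision.REVIEW: ("BATCH_APPROVAL", "INDIVIDUAL_APPROVAL"),
    Decision.BLOCK: ("FREEZE_PATCH", "ESCALATED_REVIEW"),
}


def suggested_actions(label):
    """Return the tuple of suggested actions for a decision label string."""
    return SUGGESTED_ACTIONS[Decision(label)]
```

An unknown label raises `ValueError` from the `Decision(label)` lookup, so a typo cannot silently fall through to a permissive action.
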
In a 14-pair audit of V2 BLOCK and opposing-behavior decisions:
- 60% of sampled BLOCK decisions (Groups A+B, 10 pairs) were classified as governance-significant distinctions worth surfacing (real semantic or behavioral differences).
- 30% were family-pattern distinctions — technically correct but no merge intent in the first place (e.g., decorator factories, encode_* families).
- 10% were embedding-model limitations — functions are closely related but the model under-scored similarity.
All 4 opposing-behavior detections (Group C, exhaustive)
were classified GENUINE — all carry FREEZE_PATCH governance action.
The 40% that aren't governance-significant are not false positives.
They are correct BLOCK decisions on pairs that nobody was going to merge
anyway. See V2_BLOCK_PRECISION_AUDIT.md for full reasoning.
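
The percentages above follow directly from the pair counts. As a quick check, the counts below are derived from the stated shares of the 10 sampled pairs (6, 3, and 1), and the dictionary keys are shorthand labels chosen for this sketch.

```python
# Counts implied by 60% / 30% / 10% of the 10 sampled BLOCK pairs (Groups A+B).
SAMPLED = {
    "governance_significant": 6,
    "family_pattern": 3,
    "embedding_limitation": 1,
}
OPPOSING_GENUINE = 4  # Group C, exhaustive: all 4 classified GENUINE


def shares(counts):
    """Convert raw counts to whole-number percentages of their total."""
    total = sum(counts.values())
    return {label: round(100 * n / total) for label, n in counts.items()}


sampled_shares = shares(SAMPLED)
audit_size = sum(SAMPLED.values()) + OPPOSING_GENUINE
not_governance_significant = 100 - sampled_shares["governance_significant"]
```

Here `audit_size` is 14 and `not_governance_significant` is 40, matching the figures quoted in this answer.
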
Two reasons:
- Conservative default — when fusion signals conflict, V2 defaults to BLOCK to surface the pair for human review rather than risk a wrong AUTO_APPLY.
- Token-overlap pair extraction — V2's pair selection currently uses shared naming tokens, which can pull in pairs that share a token but are functionally unrelated. This is documented as future calibration work (use V1's duplicate-detection output as candidate source).
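
To see why token-overlap selection can pull in unrelated pairs, consider the following sketch. It is an assumption-laden illustration, not V2's implementation: `name_tokens` and `candidate_pairs` are hypothetical names, only snake_case names are tokenized, and the cap is applied in sorted order.

```python
from itertools import combinations


def name_tokens(name):
    """Split a snake_case function name into a set of lowercase tokens."""
    return {token for token in name.lower().split("_") if token}


def candidate_pairs(function_names, pair_cap):
    """Pair functions that share at least one naming token, up to `pair_cap`.

    ASSUMPTION: a simplified stand-in for V2's candidate selection.
    """
    pairs = []
    for a, b in combinations(sorted(function_names), 2):
        if len(pairs) >= pair_cap:
            break
        if name_tokens(a) & name_tokens(b):
            pairs.append((a, b))
    return pairs


names = ["backup_file", "delete_file", "encode_json", "encode_xml", "render"]
pairs = candidate_pairs(names, pair_cap=50)
```

With these names, `backup_file` and `delete_file` are paired because they share the token `file`, even though nobody would merge them. That is the kind of pair a conservative gate then reports as BLOCK.
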
Common reasons:
- The functions don't share enough naming tokens → V2's candidate selection didn't pair them. Increase pair_cap or use a different candidate-extraction strategy.
- The semantic difference is subtle and the embedding model rates them as similar → known embedding-model limit; see audit document.
- The functions are in test files → V2 deliberately filters test-to-test pairs.
- The functions are in backup or archive files → V2 deliberately skips .bak, _old, etc.
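
The last two filters can be pictured as a path check applied before a pair is analyzed. This is a rough sketch under stated assumptions: it covers only the two backup markers named above (`.bak` and `_old`), and the test-file heuristic (a `tests` directory or a `test_` filename prefix) is a guess at a common convention, not V2's documented rule.

```python
from pathlib import PurePosixPath


def is_backup_path(path):
    """True for names ending in .bak or stems ending in _old (the two markers named above)."""
    p = PurePosixPath(path)
    return p.name.endswith(".bak") or p.stem.endswith("_old")


def is_test_path(path):
    """ASSUMPTION: treat files under a tests/ directory or named test_* as test files."""
    p = PurePosixPath(path)
    return "tests" in p.parts or p.name.startswith("test_")


def keep_pair(path_a, path_b):
    """Drop pairs touching backup files, and pairs where both sides are test files."""
    if is_backup_path(path_a) or is_backup_path(path_b):
        return False
    if is_test_path(path_a) and is_test_path(path_b):
        return False
    return True
```

Note that a test file paired with a non-test file is kept in this sketch, since only test-to-test pairs are described as filtered.
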
- V2 cannot generate patches autonomously (V3 scope).
- V2 has no interactive human-in-the-loop UI (V3 scope).
- V2 cannot govern non-Python code (V3 scope).
- V2 cannot replace LLM-based PR review for design or stylistic concerns.
- V2 cannot guarantee zero false positives.
A 14-pair audit (10 sampled + 4 exhaustive opposing-behavior detections) was the deliberate scope chosen for V2's initial release — balancing audit credibility against author time investment for a single-reviewer audit. A larger audit (50–100 pairs) with multi-reviewer inter-rater agreement is documented as future calibration work.
Yes. Three of ten classifications were independently re-verified by the
author against local repository source code (3/3 confirmed). The other
seven rely on AI-assisted code reading only. This methodology is fully
documented in V2_BLOCK_PRECISION_AUDIT.md.
To span a range of codebase sizes (9 to 4,426 files) and function-family patterns (decorator-heavy click, encoder-heavy httpx, transaction-heavy Django, model-heavy transformers), and to include both small targets (Flask tutorial) and large production codebases (Django, transformers).
The repositories are not a random sample of all Python projects — they skew toward framework/library code. Application code may show different patterns. This is documented as a caveat in the audit and README.
V1 was the first release: a rule-based safe-merge reasoning system that identified duplicate functions and applied conservative merge governance. V2 adds three new analysis stages on top of V1:
- Semantic similarity — embedding-based scoring of function pairs
- Behavioral signatures — AST-based classification of what functions do (FILE_WRITE, DELETE_OPERATION, BACKUP_OPERATION, etc.)
- Multi-signal fusion — combines semantic + behavioral + risk scores into a single SAFE/REVIEW/BLOCK decision with conservative defaults.
V1 remains in V2 as the duplicate-detection ground truth, exposed via
ai/v1_adapter.py.
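
To make the behavioral-signature stage more concrete, here is a minimal sketch of AST-based tagging using Python's standard `ast` module. The tag names come from the list above, but the call-name rules in `CALL_TAGS`, the helper names, and the opposing-pair table are assumptions made for illustration. V2's actual rules are not shown in this FAQ.

```python
import ast

# ASSUMPTION: illustrative call-name rules, not V2's real rule set.
CALL_TAGS = {
    "remove": "DELETE_OPERATION",
    "unlink": "DELETE_OPERATION",
    "rmtree": "DELETE_OPERATION",
    "copy2": "BACKUP_OPERATION",
    "copyfile": "BACKUP_OPERATION",
    "write": "FILE_WRITE",
    "write_text": "FILE_WRITE",
}

# Tag pairs treated as opposing (backing up versus deleting, as in the BLOCK example).
OPPOSING = {frozenset({"BACKUP_OPERATION", "DELETE_OPERATION"})}


def called_name(node):
    """Return the trailing name of a call target, e.g. 'remove' for os.remove(...)."""
    func = node.func
    if isinstance(func, ast.Attribute):
        return func.attr
    if isinstance(func, ast.Name):
        return func.id
    return None


def behavior_tags(source):
    """Map each function name in `source` to the set of tags its calls trigger."""
    tags = {}
    for fn in ast.walk(ast.parse(source)):
        if isinstance(fn, (ast.FunctionDef, ast.AsyncFunctionDef)):
            found = set()
            for node in ast.walk(fn):
                if isinstance(node, ast.Call):
                    tag = CALL_TAGS.get(called_name(node))
                    if tag:
                        found.add(tag)
            tags[fn.name] = found
    return tags


def is_opposing(tags_a, tags_b):
    """True if any tag from one function opposes any tag from the other."""
    return any(frozenset({a, b}) in OPPOSING for a in tags_a for b in tags_b)


SAMPLE = '''
import os, shutil

def backup_config(path):
    shutil.copy2(path, path + ".bak")

def purge_config(path):
    os.remove(path)
'''

tags = behavior_tags(SAMPLE)
```

In this sample, `backup_config` is tagged `BACKUP_OPERATION` and `purge_config` is tagged `DELETE_OPERATION`, so `is_opposing` returns `True` for the pair. Matching on call names alone is crude (any method called `write` is tagged), which is one reason a real engine needs more context than this sketch uses.
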
ai/decision_orchestrator.py (class DecisionOrchestrator) is the layer
that wires the four engines together: semantic, behavioral, fusion, risk.
It exposes two methods:
- analyze_function_pair(file_path, function_a, function_b): production entry point. Reads code, invokes all four engines, returns a decision.
- analyze_signals(...): testing entry point. Accepts pre-computed signals for unit-testing fusion logic in isolation.
Deterministic governance gates are:
- Auditable — every decision can be explained from the engine signals.
- Reproducible — same input produces same output across runs.
- Cost-bounded — no per-call API charges.
- Independent of LLM availability — works in air-gapped environments.
V3 will integrate LLM capabilities for patch generation and interactive review, but the governance gate itself will remain deterministic.
Yes — V2 is GPLv3 open source. Pull requests, issue reports, and feedback are welcome via the GitHub repository.
Documented in the README under "What's Next — V3 Scope" and in the
Limitations section:
- Behavioral engine family-pattern calibration (V2.3 — suppressing decorator/naming-family noise at candidate selection stage)
- Multi-reviewer precision audit expansion (50–100 pairs)
- Interactive HITL reviewer UI
- CI/CD integration packaging
- Multi-language support beyond Python
Yes. Per GPLv3's terms, contributions become part of the GPLv3-licensed project. If you need a different license for your use case, the project offers commercial licensing — see the README for details.
GNU General Public License v3.0 (GPLv3). Full license text in
LICENSE.
GPLv3 permits commercial use. However, any derivative work you distribute must also be released under GPLv3, with its source code made available. If you need to integrate V2 into proprietary commercial software without open-sourcing your modifications, a commercial license is available; contact the author.
V2 is designed for governance use cases including regulated environments. It runs entirely locally, has no external API calls, produces auditable deterministic decisions, and maintains a complete audit trail. However, V2 is research prototype software released under GPLv3 as-is, without warranty. Compliance certification (SOC 2, HIPAA, FedRAMP, etc.) is the deploying organization's responsibility.
No. V2's audit findings (e.g., classifications of function pairs in click, DRF, Django, transformers, etc.) describe V2's behavior on these codebases at the evaluation timestamp. They are not judgments on the quality, design, or correctness of the upstream code. The Attribution and Scope section in both the README and the audit document makes this explicit.
This FAQ is provided for educational and open-source collaboration purposes. It does not constitute legal advice, formal benchmark claims, or commercial product comparisons. All software is released under GPLv3 as-is, without warranty of any kind.