docs: add paper replay state audit report #95
Open
ProfRandom92 wants to merge 2 commits into `main` from `docs/inspect-paper-replay-state-9464230836256924293`
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,57 @@ | ||
| # Paper Replay State Audit | ||
|
|
||
| ## Existing Paper Replay Files | ||
|
|
||
| The following files constitute the current paper replay infrastructure: | ||
|
|
||
| - **Tests & Runners:** | ||
| - `tests/test_paper_replay_bench.py`: Implements a benchmark using `KVTCV7Engine`. | ||
| - `tests/utils/paper_replay_runner.py`: Runner for the committed benchmark artifact. **Does not use KVTCV7Engine.** | ||
| - `tests/test_paper_replay_metrics.py`: Validates the schema and determinism of the runner output. | ||
| - **Fixtures:** | ||
| - `tests/fixtures/papers/prefixguard_excerpt.txt` | ||
| - `tests/fixtures/papers/fate_excerpt.txt` | ||
| - `tests/fixtures/papers/self_consolidating_excerpt.txt` | ||
| - **Artifacts:** | ||
| - `artifacts/paper_replay_results.json`: The source of truth for current benchmark metrics. | ||
| - **Documentation:** | ||
| - `docs/paper_replay_benchmark.md`: Overview of the methodology. | ||
| - `docs/benchmarks/paper_replay.md`: Detailed methodology. | ||
|
|
||
| ## Current Validation Logic | ||
|
|
||
| The current benchmark (`paper_replay_runner.py`) validates: | ||
| - **Extraction:** Deterministic parsing of `TITLE:` and `SECTION:` headers into a structured `OperationalState`. | ||
| - **Compaction:** Reduction of text fields to bounded keyword lists and entity sets. | ||
| - **Survival:** Calculation of keyword overlap (`normalized_keyword_overlap`) and entity retention rates. | ||
| - **Consistency:** A derived score (`replay_consistency`) based on field survival thresholds. | ||
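
The survival and consistency steps above can be illustrated with a minimal sketch. The runner's actual definitions are not shown in this audit, so everything here is an assumption: the stopword list, the keyword limit, the 0.5 threshold, and the function names `sketch_keyword_overlap` and `sketch_replay_consistency` are hypothetical stand-ins, not the code behind `normalized_keyword_overlap` or `replay_consistency`.

```python
import re

# Hypothetical stopword list; the real runner's filtering rules are unknown.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on", "with"}


def extract_keywords(text: str, limit: int = 8) -> list[str]:
    """Return up to `limit` distinct lowercase keywords in first-seen order."""
    seen: list[str] = []
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        if token not in STOPWORDS and token not in seen:
            seen.append(token)
    return seen[:limit]


def sketch_keyword_overlap(original: str, compacted: list[str]) -> float:
    """Fraction of the original text's keywords that survive compaction."""
    source = set(extract_keywords(original, limit=10_000))
    if not source:
        return 1.0  # nothing to lose, so treat as fully preserved
    return len(source & set(compacted)) / len(source)


def sketch_replay_consistency(field_scores: dict[str, float],
                              threshold: float = 0.5) -> float:
    """Share of fields whose survival score meets the (assumed) threshold."""
    if not field_scores:
        return 0.0
    passed = sum(1 for score in field_scores.values() if score >= threshold)
    return passed / len(field_scores)
```

Because both functions are pure and order-stable, repeated runs over the same fixture give identical scores, which is the property the audit attributes to the current runner.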
|
|
||
| ## Engine vs. Substring Checks | ||
|
|
||
| - **KVTCV7Engine Usage:** The engine is exercised in `tests/test_paper_replay_bench.py` but is **not** used by the main runner that produces `artifacts/paper_replay_results.json`. | ||
| - **Substring/Keyword focus:** The main runner relies on keyword extraction and set overlap rather than the V7 engine's sliding window or compression signals. | ||
|
|
||
| ## Fixture Nature | ||
|
|
||
| Current fixtures are **curated excerpts**. They are not raw PDFs or full-text scrapes. They are pre-formatted with specific headers (`SECTION: problem`, etc.) to facilitate deterministic extraction. | ||
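
To show why pre-formatted headers make extraction deterministic, here is a minimal line-based parser sketch. Only the `TITLE:` and `SECTION:` header markers come from the audit; the `ParsedExcerpt` type, the `parse_excerpt` function, and the assumption that body lines follow their section header are hypothetical and do not reproduce the runner's `OperationalState`.

```python
from dataclasses import dataclass, field


@dataclass
class ParsedExcerpt:
    """Hypothetical container; not the runner's OperationalState."""
    title: str = ""
    sections: dict[str, str] = field(default_factory=dict)


def parse_excerpt(text: str) -> ParsedExcerpt:
    """Parse TITLE:/SECTION: headers; body lines attach to the latest section."""
    parsed = ParsedExcerpt()
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("TITLE:"):
            parsed.title = line[len("TITLE:"):].strip()
            current = None
        elif line.startswith("SECTION:"):
            current = line[len("SECTION:"):].strip().lower()
            parsed.sections[current] = ""
        elif current is not None and line:
            # Join wrapped body lines with single spaces.
            existing = parsed.sections[current]
            parsed.sections[current] = f"{existing} {line}".strip()
    return parsed
```

With no heuristics beyond prefix matching, the same fixture text always yields the same structure, which is what curated excerpts buy over raw PDFs.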
|
|
||
| ## Existing Validation Commands | ||
|
|
||
| - `python -m tests.utils.paper_replay_runner`: Regenerates the JSON artifact. | ||
| - `pytest tests/test_paper_replay_bench.py`: Tests V7 engine integration with paper text. | ||
| - `npm run layout`: Verifies the existence of the artifact. | ||
| - `npm run check`: Runs all repository checks including layout and tests. | ||
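
The schema and determinism checks that `tests/test_paper_replay_metrics.py` is described as performing could look roughly like the sketch below. The helper names and the required keys used in the usage note are hypothetical; the actual artifact schema is not reproduced in this audit.

```python
import json
from typing import Callable


def is_deterministic(produce: Callable[[], dict], runs: int = 3) -> bool:
    """Call `produce` several times and compare canonical JSON encodings."""
    outputs = {json.dumps(produce(), sort_keys=True) for _ in range(runs)}
    return len(outputs) == 1


def missing_keys(result: dict, required: set[str]) -> set[str]:
    """Return the required top-level keys absent from `result`."""
    return required - set(result)
```

A test might assert `is_deterministic(run_benchmark)` and `not missing_keys(result, {"papers", "summary"})`, where `run_benchmark` and both key names are placeholders for whatever the runner actually exposes.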
|
|
||
| ## Gaps & Risks | ||
|
|
||
| 1. **Bifurcation:** The "official" benchmark results (`paper_replay_results.json`) do not actually measure the `KVTCV7Engine`. They measure a separate keyword-compaction heuristic. | ||
| 2. **Logic Duplication:** Extraction logic differs slightly between the test and the runner (e.g., `test_paper_replay_bench.py` uses a simpler line-based parser than the runner's utility). | ||
| 3. **Weak Validation:** Keyword overlap is deterministic, but it is only a weak proxy for operational-state preservation and does not exercise the V7 engine's intended use cases. | ||
| 4. **Duplicate Fixture References:** Both the test and the runner hardcode paper specs and fixture paths. | ||
|
|
||
| ## Recommendation for Paper Replay Benchmark v1 | ||
|
|
||
| 1. **Converge on KVTCV7Engine:** Update the runner to use the V7 engine for the compaction step. | ||
| 2. **Unified Extraction:** Move the extraction logic into a shared utility in `tests/utils/paper_utils.py` (or similar) to avoid duplication. | ||
| 3. **Artifact Alignment:** Ensure `artifacts/paper_replay_results.json` reflects the performance of the actual engine. | ||
| 4. **Test Consolidation:** Merge the metric validation and the bench tests into a consistent suite that guards the V7-backed runner. | ||
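
One way to approach the convergence recommended above is to make the runner's compaction step pluggable, so the keyword heuristic and an engine-backed implementation share one code path. This is a sketch under stated assumptions: `Compactor`, `keyword_compactor`, and `run_replay` are hypothetical names, and no `KVTCV7Engine` call is shown because its API does not appear in this audit.

```python
import re
from typing import Callable

# A compactor maps section text to a bounded list of retained terms.
Compactor = Callable[[str], list[str]]


def keyword_compactor(text: str, limit: int = 5) -> list[str]:
    """Stand-in heuristic: distinct words longer than 3 letters, first-seen order."""
    kept: list[str] = []
    for word in re.findall(r"[a-z]+", text.lower()):
        if len(word) > 3 and word not in kept:
            kept.append(word)
    return kept[:limit]


def run_replay(sections: dict[str, str], compact: Compactor) -> dict[str, list[str]]:
    """Apply one compactor to every section, in sorted order for stable output."""
    return {name: compact(body) for name, body in sorted(sections.items())}
```

An engine-backed compactor would be a thin adapter with the same signature, which would let the metric tests and the bench tests guard a single runner regardless of which backend is selected.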
The audit identifies two documentation files with overlapping purposes: `docs/paper_replay_benchmark.md` and `docs/benchmarks/paper_replay.md`. It would be beneficial to add a recommendation to consolidate these into a single source of truth to avoid documentation drift and fragmentation within the repository.