From d4ecc869696f08fa35ebc40c8d55d0f67292c4f0 Mon Sep 17 00:00:00 2001
From: ProfRandom92 <159939812+ProfRandom92@users.noreply.github.com>
Date: Sat, 16 May 2026 18:47:11 +0000
Subject: [PATCH 1/2] docs: add paper replay state audit report

- Inspect existing paper replay infrastructure (tests, runner, fixtures, artifacts).
- Add `docs/paper_replay_state_audit.md` documenting findings.
- Link audit report in `README.md`.
- Identify bifurcation between `paper_replay_runner.py` and `KVTCV7Engine`.
- Provide recommendations for Paper Replay Benchmark v1 alignment.
---
 README.md                        |  1 +
 docs/paper_replay_state_audit.md | 57 ++++++++++++++++++++++++++++++++
 2 files changed, 58 insertions(+)
 create mode 100644 docs/paper_replay_state_audit.md

diff --git a/README.md b/README.md
index 9e6ebcc..7b8010a 100644
--- a/README.md
+++ b/README.md
@@ -102,6 +102,7 @@ Comptextv7 turns noisy context into compact operational state, then validates wh
 ## Benchmark family

 ### Paper Replay Benchmark
+- **State Audit:** [`docs/paper_replay_state_audit.md`](docs/paper_replay_state_audit.md).
 - **Validates:** whether dense technical paper summaries preserve entities, metrics, limitations, and section structure after deterministic replay compression.
 - **Artifact:** [`artifacts/paper_replay_results.json`](artifacts/paper_replay_results.json).

diff --git a/docs/paper_replay_state_audit.md b/docs/paper_replay_state_audit.md
new file mode 100644
index 0000000..fc58ae0
--- /dev/null
+++ b/docs/paper_replay_state_audit.md
@@ -0,0 +1,57 @@
+# Paper Replay State Audit
+
+## Existing Paper Replay Files
+
+The following files constitute the current paper replay infrastructure:
+
+- **Tests & Runners:**
+  - `tests/test_paper_replay_bench.py`: Implements a benchmark using `KVTCV7Engine`.
+  - `tests/utils/paper_replay_runner.py`: Runner for the committed benchmark artifact. **Does not use KVTCV7Engine.**
+  - `tests/test_paper_replay_metrics.py`: Validates the schema and determinism of the runner output.
+- **Fixtures:**
+  - `tests/fixtures/papers/prefixguard_excerpt.txt`
+  - `tests/fixtures/papers/fate_excerpt.txt`
+  - `tests/fixtures/papers/self_consolidating_excerpt.txt`
+- **Artifacts:**
+  - `artifacts/paper_replay_results.json`: The source of truth for current benchmark metrics.
+- **Documentation:**
+  - `docs/paper_replay_benchmark.md`: Overview of the methodology.
+  - `docs/benchmarks/paper_replay.md`: Detailed methodology.
+
+## Current Validation Logic
+
+The current benchmark (`paper_replay_runner.py`) validates:
+- **Extraction:** Deterministic parsing of `TITLE:` and `SECTION:` headers into a structured `OperationalState`.
+- **Compaction:** Reduction of text fields to bounded keyword lists and entity sets.
+- **Survival:** Calculation of keyword overlap (`normalized_keyword_overlap`) and entity retention rates.
+- **Consistency:** A derived score (`replay_consistency`) based on field survival thresholds.
+
+## Engine vs. Substring Checks
+
+- **KVTCV7Engine Usage:** The engine is exercised in `tests/test_paper_replay_bench.py` but is **not** used by the main runner that produces `artifacts/paper_replay_results.json`.
+- **Substring/Keyword focus:** The main runner relies on keyword extraction and set overlap rather than the V7 engine's sliding window or compression signals.
+
+## Fixture Nature
+
+Current fixtures are **curated excerpts**. They are not raw PDFs or full-text scrapes. They are pre-formatted with specific headers (`SECTION: problem`, etc.) to facilitate deterministic extraction.
+
+## Existing Validation Commands
+
+- `python -m tests.utils.paper_replay_runner`: Regenerates the JSON artifact.
+- `pytest tests/test_paper_replay_bench.py`: Tests V7 engine integration with paper text.
+- `npm run layout`: Verifies the existence of the artifact.
+- `npm run check`: Runs all repository checks including layout and tests.
+
+## Gaps & Risks
+
+1. **Bifurcation:** The "official" benchmark results (`paper_replay_results.json`) do not actually measure the `KVTCV7Engine`. They measure a separate keyword-compaction heuristic.
+2. **Logic Duplication:** Extraction logic is slightly different between the test and the runner (e.g., `test_paper_replay_bench.py` uses a simpler line-based parser compared to the runner's utility).
+3. **Weak Validation:** While deterministic, keyword-overlap is a "weak" proxy for operational-state preservation compared to the V7 engine's intended use cases.
+4. **Duplicate Fixture References:** Both the test and the runner hardcode paper specs and fixture paths.
+
+## Recommendation for Paper Replay Benchmark v1
+
+1. **Converge on KVTCV7Engine:** Update the runner to use the V7 engine for the compaction step.
+2. **Unified Extraction:** Extract the extraction logic into a shared utility in `src/validation/paper.py` (or similar) to avoid duplication.
+3. **Artifact Alignment:** Ensure `artifacts/paper_replay_results.json` reflects the performance of the actual engine.
+4. **Test Consolidation:** Merge the metric validation and the bench tests into a consistent suite that guards the V7-backed runner.

From d21c5120203718c482059a042ea30d6bfe9beb5b Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Alexander=20K=C3=B6lnberger?= <159939812+ProfRandom92@users.noreply.github.com>
Date: Sat, 16 May 2026 12:18:48 -0700
Subject: [PATCH 2/2] Update docs/paper_replay_state_audit.md

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
---
 docs/paper_replay_state_audit.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/paper_replay_state_audit.md b/docs/paper_replay_state_audit.md
index fc58ae0..64e9015 100644
--- a/docs/paper_replay_state_audit.md
+++ b/docs/paper_replay_state_audit.md
@@ -52,6 +52,6 @@ Current fixtures are **curated excerpts**. They are not raw PDFs or full-text sc
 ## Recommendation for Paper Replay Benchmark v1

 1. **Converge on KVTCV7Engine:** Update the runner to use the V7 engine for the compaction step.
-2. **Unified Extraction:** Extract the extraction logic into a shared utility in `src/validation/paper.py` (or similar) to avoid duplication.
+2. **Unified Extraction:** Extract the extraction logic into a shared utility in `tests/utils/paper_utils.py` (or similar) to avoid duplication.
 3. **Artifact Alignment:** Ensure `artifacts/paper_replay_results.json` reflects the performance of the actual engine.
 4. **Test Consolidation:** Merge the metric validation and the bench tests into a consistent suite that guards the V7-backed runner.
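
Review note: the shared utility proposed in the "Unified Extraction" recommendation could look roughly like the sketch below. It is only an illustration under stated assumptions, not the repository's code: the function names (`parse_paper_excerpt`, `keywords`, `keyword_overlap`) are hypothetical, the fixture layout is assumed to be `TITLE: <text>` and `SECTION: <name>` header lines followed by body lines, and the overlap formula is a plain set ratio that may differ from the runner's `normalized_keyword_overlap`.

```python
import re
from typing import Dict, Optional, Set

# Assumed fixture layout: "TITLE: <text>" and "SECTION: <name>" header lines.
HEADER_RE = re.compile(r"^(TITLE|SECTION):\s*(.*)$")


def parse_paper_excerpt(text: str) -> Dict[str, str]:
    """Parse an excerpt into {'title': ..., '<section name>': body text}.

    Hypothetical helper; the real runner builds an OperationalState instead.
    """
    state: Dict[str, str] = {}
    current: Optional[str] = None
    for raw in text.splitlines():
        line = raw.strip()
        match = HEADER_RE.match(line)
        if match:
            kind, value = match.groups()
            if kind == "TITLE":
                state["title"] = value.strip()
                current = None
            else:
                # Section names are lower-cased so lookups are deterministic.
                current = value.strip().lower()
                state[current] = ""
        elif current is not None and line:
            # Body lines are appended to the most recent section.
            state[current] = (state[current] + " " + line).strip()
    return state


def keywords(text: str) -> Set[str]:
    """Lower-cased alphanumeric tokens longer than three characters."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if len(w) > 3}


def keyword_overlap(original: str, replayed: str) -> float:
    """Fraction of the original's keywords that survive in the replayed text."""
    source = keywords(original)
    if not source:
        return 1.0  # nothing to lose, so treat as fully preserved
    return len(source & keywords(replayed)) / len(source)


sample = (
    "TITLE: Example Paper\n"
    "SECTION: problem\n"
    "Long contexts inflate memory.\n"
    "SECTION: method\n"
    "Compress the cache.\n"
)
state = parse_paper_excerpt(sample)
score = keyword_overlap(state["problem"], "contexts inflate memory")
```

If both `tests/test_paper_replay_bench.py` and the runner imported one parser like this, the extraction differences noted under "Logic Duplication" would have a single place to be resolved.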