From 969a7383d4d1fb6b8e7768ca00a6e2649122a7b6 Mon Sep 17 00:00:00 2001 From: krystian Date: Fri, 22 May 2026 22:14:46 -0400 Subject: [PATCH] docs: snapshot beta 6 baseline planning --- README.md | 7 ++++++ docs/diagrams/PIPELINE.md | 24 +++++++++++++++++++ docs/governance/DECISIONS.md | 24 +++++++++++++++++++ docs/governance/SESSION_HANDOFF.md | 26 +++++++++++++++++---- docs/research/BETA_6_FAIL_PRESSURE_PULSE.md | 22 ++++++++++++++++- docs/research/README.md | 15 +++++++++--- 6 files changed, 110 insertions(+), 8 deletions(-) diff --git a/README.md b/README.md index 2b9e13d..f9f0483 100644 --- a/README.md +++ b/README.md @@ -21,6 +21,7 @@ Current research stage: - `fail-pressure pulse` - first fixed-prompt pulse: `FAIL` - result: `1` anchor / `13` counted seams / `0` excluded +- status: snapshot for clean-baseline planning Most recently closed beta: @@ -34,6 +35,12 @@ collect comparable evidence, and assign one binary verdict. Its active method is `eval-pulse`: one fixed-prompt pulse receives rows as evidence, and the pulse earns one `PASS` or `FAIL` verdict. +The next research slice should not treat the first Beta `6.0` correction as a +clean comparable baseline. Earlier hard-coded prompt scaffolds were removed, +but their history may still contaminate the interpretation of the current +logic. The next baseline should start from the proper cleaned config and be +compared against this Beta `6.0` snapshot as a separate line. + ## What This Repo Demonstrates - constrained one-node generation through a fixed prompt surface diff --git a/docs/diagrams/PIPELINE.md b/docs/diagrams/PIPELINE.md index 422e028..0343396 100644 --- a/docs/diagrams/PIPELINE.md +++ b/docs/diagrams/PIPELINE.md @@ -77,6 +77,26 @@ flowchart LR I --> J["Rerun one fixed-prompt pulse"] ``` +## Clean-Baseline Comparison Plan + +```mermaid +flowchart LR + A["Beta 6.0 snapshot"] + B["failed why pulse<br/>ids 4850-4863"] + C["grammar-led correction<br/>not clean-baseline proof"] + D["config-history contamination risk"] + + E["Clean baseline candidate"] + F["proper cleaned config"] + G["same fixed-prompt pulse method"] + H["compare verdicts and seam families"] + I["decide reset inside Beta 6.0<br/>or new beta boundary"] + + A --> B --> C --> D + E --> F --> G --> H --> I + D -. "reference line" .-> H +``` + ## Closed Row-Level Gate Stack ```mermaid @@ -124,3 +144,7 @@ The active `Beta 6.0` gate is different: - rows become evidence inside that pulse - pulse evidence is `anchor`, `counted_seam`, or `excluded_noise` - the pulse receives one `PASS / FAIL` verdict + +The clean-baseline comparison is not another eval gate yet. It is a planning +boundary that keeps the current Beta `6.0` snapshot separate from the next +proper-config baseline candidate. diff --git a/docs/governance/DECISIONS.md b/docs/governance/DECISIONS.md index 5098445..fe03a07 100644 --- a/docs/governance/DECISIONS.md +++ b/docs/governance/DECISIONS.md @@ -1065,3 +1065,27 @@ If a decision crosses layers, say so plainly instead of flattening the method in with repeated soft abstraction. The smallest correction is to make the sentence grammar carry more of the shape while still preserving the fixed prompt surface and non-concrete oracle contract. 
+ +## D-057: Snapshot Beta 6.0 for clean-baseline planning + +- Date: `2026-05-22` +- Category: `eval_quality` +- Tags: `beta_6`, `baseline_reset`, `config_contamination`, `comparison_diagram` +- Provenance: `human-led method decision with repo formalization` +- Decision: + - snapshot the current Beta `6.0` correction before running another live pulse + - treat the failed pulse and first correction as diagnostic evidence, not a + clean comparable baseline + - explicitly account for the risk that earlier hard-coded phrase scaffolds + contaminated the logic surface even after those scaffolds were removed + - make the next research slice define a clean baseline from the proper config + before spending more live API calls + - compare the Beta `6.0` snapshot against the clean baseline with a + diagram and the same fixed-prompt pulse method + - keep the rate-limit / prepaid-credit pause in force until that baseline + question is settled +- Why: A prompt cleanup can remove the obvious phrase bank without making the + evidence line clean. If the prior config history shaped the failure surface, + then another small correction would be hard to interpret. The next useful + move is a new clean baseline candidate that can be compared against the + Beta `6.0` snapshot rather than folded into it. 
diff --git a/docs/governance/SESSION_HANDOFF.md b/docs/governance/SESSION_HANDOFF.md index 481c36e..0df3390 100644 --- a/docs/governance/SESSION_HANDOFF.md +++ b/docs/governance/SESSION_HANDOFF.md @@ -67,6 +67,7 @@ Current active method: - status: - active method - first valid fixed-prompt pulse failed + - snapshot for clean-baseline planning - live eval work paused on rate-limit / prepaid-credit boundary - invalid false starts discarded @@ -147,10 +148,18 @@ Useful current reads: - prefer one clear subject and finite verb - keep imagery secondary to the sentence claim - vary sentence openings across samples +- baseline reset note: + - do not treat that first correction as a clean comparable baseline yet + - prior hard-coded prompt scaffolds may have contaminated the current logic + line even after removal + - the next research slice should define a clean baseline from the proper + config before spending more live API calls + - compare the Beta `6.0` snapshot against that clean baseline as a + separate line, preferably with a diagram - Stop condition for the next session: - do not start another live pulse until rate limits and prepaid credits are confirmed healthy - - use the existing failed pulse as the planning surface first + - use the existing failed pulse and first correction as planning surfaces only - `where` is fully stable in the current surface: - `84 pass / 0 fail` - `what` is close behind: @@ -166,16 +175,25 @@ Choose one lane at a time: - keep the user loop separate from operator commands - research: - keep `Beta 5.1` frozen as the most recently closed row-level beta - - treat `Beta 6.0` as the active pulse-level method + - treat `Beta 6.0` as the active pulse-level method, but snapshot the current + evidence line before rerunning it - preserve the explicit comparison boundary: - row-level `5.1` - pulse-level `6.0` - - run one fixed-prompt pulse for `15` minutes + - define the clean baseline candidate from the proper config before running + another 
live pulse + - diagram the comparison: + - Beta `6.0` snapshot + - clean baseline candidate + - shared fixed-prompt pulse method + - once the baseline question is settled, run one fixed-prompt pulse for `15` + minutes - keep each fixed prompt in its own pulse - use the one-sample-per-minute pulse default unless the method changes - label rows as pulse evidence only - treat the first valid pulse verdict as `FAIL` - - validate the first grammar-led correction before any live rerun + - do not validate the first grammar-led correction as if it were already a + clean baseline - do not start another live pulse until the rate-limit / prepaid-credit boundary is cleared - docs: diff --git a/docs/research/BETA_6_FAIL_PRESSURE_PULSE.md b/docs/research/BETA_6_FAIL_PRESSURE_PULSE.md index a482c25..1188eda 100644 --- a/docs/research/BETA_6_FAIL_PRESSURE_PULSE.md +++ b/docs/research/BETA_6_FAIL_PRESSURE_PULSE.md @@ -2,7 +2,8 @@ ## Status -Active method, first valid pulse failed. +Active method; first valid pulse failed; current line is a snapshot for +clean-baseline planning. `Research Beta 6.0` uses a fixed-prompt pulse as the binary unit: @@ -116,6 +117,25 @@ That correction targets the repeated soft-drift family while keeping the fixed-prompt pulse method unchanged. The next live pulse should wait until the rate-limit / prepaid-credit boundary is healthy. +## Clean-Baseline Reset Question + +The first correction is not a clean comparable baseline yet. + +Earlier hard-coded prompt scaffolds were removed from the runtime surface, but +the current Beta `6.0` evidence may still be shaped by that prior config +history. Treat the failed pulse and the grammar-led correction as diagnostic +surfaces. Do not fold the next run into the same line until the baseline +question is settled. 
+ +Next research slice: + +- define a clean baseline candidate from the proper cleaned config +- keep the fixed-prompt pulse method unchanged +- compare the Beta `6.0` snapshot against the clean baseline line +- use a diagram to make the comparison boundary explicit +- only then decide whether the clean baseline is a new beta boundary or a + reset inside Beta `6.0` + ## Relationship To Beta 5.1 `Research Beta 5.1` remains the closed row-level baseline: diff --git a/docs/research/README.md b/docs/research/README.md index 2409e18..3c7ec96 100644 --- a/docs/research/README.md +++ b/docs/research/README.md @@ -13,6 +13,7 @@ Current research stage: - `Research Beta 6.0` - `fail-pressure pulse` +- snapshot for clean-baseline planning Most recently closed beta: @@ -60,6 +61,8 @@ Current finding: - `0` excluded - live reruns are paused until rate limits and prepaid credits are healthy again +- the current Beta `6.0` line should be treated as diagnostic rather than clean + comparison evidence until a proper-config baseline is defined Current active method: @@ -68,7 +71,8 @@ Current active method: - one fixed-prompt pulse is judged at a time - label rows as `anchor`, `counted_seam`, or `excluded_noise` - first pulse verdict: `FAIL` -- next work is planning from the failed pulse, not another live run yet +- next work is defining a clean baseline candidate from the proper config, not + another live run yet - keep row-level `5.1` as the comparison surface, not the active method ## Beta Map @@ -80,7 +84,7 @@ Current active method: | `Research Beta 3.0` | Is a coherent line in-lane? | Prompt relevance separated lane control from sentence quality. | | `Research Beta 4.1` | Can coherent drift still be valuable? | Coherent absurdity became a small selective class. | | `Research Beta 5.1` | When does a fail family stay active evidence versus earn eviction? | `retain / evict` stays active, with the instruction surface tightened to preserve shape-first lane control. 
| -| `Research Beta 6.0` | Can Probaboracle hold shape across a bounded fixed-prompt pulse? | The fixed-prompt pulse becomes the binary unit. | +| `Research Beta 6.0` | Can Probaboracle hold shape across a bounded fixed-prompt pulse? | The fixed-prompt pulse becomes the binary unit, but the first line is held as a snapshot until a clean proper-config baseline is defined. | Active pulse method: @@ -139,8 +143,13 @@ flowchart LR Plans are useful, but they are not evidence. They do not become active method until the repo earns them. -Parked lanes: +Planning lanes: +- clean baseline reset: + - compare the Beta `6.0` snapshot against a proper-config + baseline + - keep the comparison diagram explicit before deciding whether this becomes a + new beta boundary or a reset inside Beta `6.0` - provider portability: - keep OpenAI-native behaviour stable if the runtime surface later widens - leave room for an Azure-compatible path if it becomes necessary
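
Reviewer note, not part of the patch: the comparison this commit plans (Beta `6.0` snapshot versus a clean baseline candidate, judged by "verdicts and seam families") could be sketched roughly as below. This is a minimal illustration under stated assumptions, not the repo's implementation. The names `Row`, `PulseLine`, `label_counts`, `seam_families`, and `compare_lines` are hypothetical, as is the idea that each row carries a `family` tag. The verdict is treated as a recorded input because the patch does not state how a verdict is derived from the counts. All rows in the usage lines are toy placeholders, not real pulse evidence.

```python
from collections import Counter
from dataclasses import dataclass

# Evidence labels named in the patch text.
LABELS = ("anchor", "counted_seam", "excluded_noise")


@dataclass(frozen=True)
class Row:
    """One row of pulse evidence (hypothetical shape)."""
    row_id: int
    label: str
    family: str = ""  # hypothetical seam-family tag; empty when not a seam

    def __post_init__(self):
        if self.label not in LABELS:
            raise ValueError(f"unknown evidence label: {self.label!r}")


@dataclass
class PulseLine:
    """One evidence line, e.g. the snapshot or a clean baseline candidate."""
    name: str
    verdict: str  # recorded "PASS" or "FAIL"; not derived in this sketch
    rows: list


def label_counts(line):
    """Count rows per evidence label, reporting zero for absent labels."""
    counts = Counter(row.label for row in line.rows)
    return {label: counts.get(label, 0) for label in LABELS}


def seam_families(line):
    """Count counted-seam rows per family tag."""
    return Counter(row.family for row in line.rows if row.label == "counted_seam")


def compare_lines(reference, candidate):
    """Put two lines side by side without merging their evidence."""
    ref_fams = set(seam_families(reference))
    cand_fams = set(seam_families(candidate))
    return {
        "verdicts": {reference.name: reference.verdict,
                     candidate.name: candidate.verdict},
        "label_counts": {reference.name: label_counts(reference),
                         candidate.name: label_counts(candidate)},
        "shared_families": sorted(ref_fams & cand_fams),
        "only_in_reference": sorted(ref_fams - cand_fams),
        "only_in_candidate": sorted(cand_fams - ref_fams),
    }


# Toy placeholder data, invented purely to exercise the functions.
snapshot = PulseLine("beta_6_snapshot", "FAIL", [
    Row(1, "anchor"),
    Row(2, "counted_seam", "family_a"),
    Row(3, "counted_seam", "family_b"),
])
candidate = PulseLine("clean_baseline", "PASS", [
    Row(10, "anchor"),
    Row(11, "counted_seam", "family_b"),
    Row(12, "excluded_noise"),
])
report = compare_lines(snapshot, candidate)
```

Keeping the two lines as separate `PulseLine` objects mirrors the patch's intent that the clean baseline is compared against the snapshot "as a separate line" and never folded into it.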