7 changes: 7 additions & 0 deletions README.md
@@ -21,6 +21,7 @@ Current research stage:
- `fail-pressure pulse`
- first fixed-prompt pulse: `FAIL`
- result: `1` anchor / `13` counted seams / `0` excluded
- status: snapshot for clean-baseline planning

Most recently closed beta:

@@ -34,6 +35,12 @@ collect comparable evidence, and assign one binary verdict. Its active method
is `eval-pulse`: one fixed-prompt pulse receives rows as evidence, and the
pulse earns one `PASS` or `FAIL` verdict.

The next research slice should not treat the first Beta `6.0` correction as a
clean comparable baseline. Earlier hard-coded prompt scaffolds were removed,
but their history may still contaminate the interpretation of the current
logic. The next baseline should start from the proper cleaned config and be
compared against this Beta `6.0` snapshot as a separate line.

## What This Repo Demonstrates

- constrained one-node generation through a fixed prompt surface
24 changes: 24 additions & 0 deletions docs/diagrams/PIPELINE.md
@@ -77,6 +77,26 @@ flowchart LR
I --> J["Rerun one fixed-prompt pulse"]
```

## Clean-Baseline Comparison Plan

```mermaid
flowchart LR
A["Beta 6.0 snapshot"]
B["failed why pulse<br/>ids 4850-4863"]
C["grammar-led correction<br/>not clean-baseline proof"]
D["config-history contamination risk"]

E["Clean baseline candidate"]
F["proper cleaned config"]
G["same fixed-prompt pulse method"]
H["compare verdicts and seam families"]
I["decide reset inside Beta 6.0<br/>or new beta boundary"]

A --> B --> C --> D
E --> F --> G --> H --> I
D -. "reference line" .-> H
```

## Closed Row-Level Gate Stack

```mermaid
@@ -124,3 +144,7 @@ The active `Beta 6.0` gate is different:
- rows become evidence inside that pulse
- pulse evidence is `anchor`, `counted_seam`, or `excluded_noise`
- the pulse receives one `PASS / FAIL` verdict

The clean-baseline comparison is not another eval gate yet. It is a planning
boundary that keeps the current Beta `6.0` snapshot separate from the next
proper-config baseline candidate.
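
A minimal Python sketch of how rows could be tallied as pulse evidence. The three labels come from the text above; the `Row` and `tally` names are hypothetical, and the rows assume that ids `4850-4863` are the rows of the failed pulse with counts of `1` anchor / `13` counted seams / `0` excluded. Which row is the anchor is not stated, so its placement here is arbitrary. The rule that turns a tally into `PASS` or `FAIL` is not shown, because the docs do not spell it out.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Evidence(Enum):
    ANCHOR = "anchor"
    COUNTED_SEAM = "counted_seam"
    EXCLUDED_NOISE = "excluded_noise"


@dataclass(frozen=True)
class Row:
    row_id: int
    label: Evidence


def tally(rows):
    """Count rows per evidence label, keeping labels that have zero rows."""
    counts = Counter(row.label for row in rows)
    return {label: counts.get(label, 0) for label in Evidence}


# Assumption: the anchor is placed on the first id purely for illustration.
rows = [Row(4850, Evidence.ANCHOR)] + [
    Row(row_id, Evidence.COUNTED_SEAM) for row_id in range(4851, 4864)
]
summary = tally(rows)
```

Keeping zero-count labels in the summary makes an empty `excluded_noise` bucket visible rather than silently absent.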
24 changes: 24 additions & 0 deletions docs/governance/DECISIONS.md
@@ -1065,3 +1065,27 @@ If a decision crosses layers, say so plainly instead of flattening the method in
with repeated soft abstraction. The smallest correction is to make the
sentence grammar carry more of the shape while still preserving the fixed
prompt surface and non-concrete oracle contract.

## D-057: Snapshot Beta 6.0 for clean-baseline planning

- Date: `2026-05-22`
- Category: `eval_quality`
- Tags: `beta_6`, `baseline_reset`, `config_contamination`, `comparison_diagram`
- Provenance: `human-led method decision with repo formalization`
- Decision:
- snapshot the current Beta `6.0` correction before running another live pulse
- treat the failed pulse and first correction as diagnostic evidence, not a
clean comparable baseline
- explicitly account for the risk that earlier hard-coded phrase scaffolds
contaminated the logic surface even after those scaffolds were removed
- make the next research slice define a clean baseline from the proper config
before spending more live API calls
- compare the Beta `6.0` snapshot against the clean baseline with a
diagram and the same fixed-prompt pulse method
- keep the rate-limit / prepaid-credit pause in force until that baseline
question is settled
- Why: A prompt cleanup can remove the obvious phrase bank without making the
evidence line clean. If the prior config history shaped the failure surface,
then another small correction would be hard to interpret. The next useful
move is a new clean baseline candidate that can be compared against the
Beta `6.0` snapshot rather than folded into it.
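
The conditions in D-057 can be read as a gate on the next live pulse. The following is a rough Python sketch of that reading, not repo code: `PulseReadiness`, its field names, and `live_pulse_blockers` are all hypothetical, and the example state is assumed for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PulseReadiness:
    """Hypothetical checklist; the fields paraphrase the D-057 conditions."""
    snapshot_recorded: bool
    clean_baseline_defined: bool
    rate_limit_healthy: bool
    prepaid_credit_healthy: bool


def live_pulse_blockers(state: PulseReadiness) -> list[str]:
    """Return every unmet condition that should hold back a live pulse."""
    checks = [
        (state.snapshot_recorded, "Beta 6.0 snapshot not recorded"),
        (state.clean_baseline_defined, "clean baseline not defined"),
        (state.rate_limit_healthy, "rate limit not confirmed healthy"),
        (state.prepaid_credit_healthy, "prepaid credit not confirmed healthy"),
    ]
    return [reason for ok, reason in checks if not ok]


# Assumed example state: snapshot taken, everything else still open.
current = PulseReadiness(
    snapshot_recorded=True,
    clean_baseline_defined=False,
    rate_limit_healthy=False,
    prepaid_credit_healthy=False,
)
blockers = live_pulse_blockers(current)
```

Returning the list of reasons instead of a single boolean keeps the stop condition readable in a handoff note: an empty list means the pulse may run.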
26 changes: 22 additions & 4 deletions docs/governance/SESSION_HANDOFF.md
@@ -67,6 +67,7 @@ Current active method:
- status:
- active method
- first valid fixed-prompt pulse failed
- snapshot for clean-baseline planning
- live eval work paused on rate-limit / prepaid-credit boundary
- invalid false starts discarded

@@ -147,10 +148,18 @@ Useful current reads:
- prefer one clear subject and finite verb
- keep imagery secondary to the sentence claim
- vary sentence openings across samples
- baseline reset note:
- do not treat that first correction as a clean comparable baseline yet
- prior hard-coded prompt scaffolds may have contaminated the current logic
line even after removal
- the next research slice should define a clean baseline from the proper
config before spending more live API calls
- compare the Beta `6.0` snapshot against that clean baseline as a
separate line, preferably with a diagram
- Stop condition for the next session:
- do not start another live pulse until rate limits and prepaid credits are
confirmed healthy
- use the existing failed pulse as the planning surface first
- use the existing failed pulse and first correction as planning surfaces only
- `where` is fully stable in the current surface:
- `84 pass / 0 fail`
- `what` is close behind:
@@ -166,16 +175,25 @@ Choose one lane at a time:
- keep the user loop separate from operator commands
- research:
- keep `Beta 5.1` frozen as the most recently closed row-level beta
- treat `Beta 6.0` as the active pulse-level method
- treat `Beta 6.0` as the active pulse-level method, but snapshot the current
evidence line before rerunning it
- preserve the explicit comparison boundary:
- row-level `5.1`
- pulse-level `6.0`
- run one fixed-prompt pulse for `15` minutes
- define the clean baseline candidate from the proper config before running
another live pulse
- diagram the comparison:
- Beta `6.0` snapshot
- clean baseline candidate
- shared fixed-prompt pulse method
- once the baseline question is settled, run one fixed-prompt pulse for `15`
minutes
- keep each fixed prompt in its own pulse
- use the one-sample-per-minute pulse default unless the method changes
- label rows as pulse evidence only
- treat the first valid pulse verdict as `FAIL`
- validate the first grammar-led correction before any live rerun
- do not validate the first grammar-led correction as if it were already a
clean baseline
- do not start another live pulse until the rate-limit / prepaid-credit boundary
is cleared
- docs:
22 changes: 21 additions & 1 deletion docs/research/BETA_6_FAIL_PRESSURE_PULSE.md
@@ -2,7 +2,8 @@

## Status

Active method, first valid pulse failed.
Active method; first valid pulse failed; current line is a snapshot for
clean-baseline planning.

`Research Beta 6.0` uses a fixed-prompt pulse as the binary unit:

@@ -116,6 +117,25 @@ That correction targets the repeated soft-drift family while keeping the
fixed-prompt pulse method unchanged. The next live pulse should wait until the
rate-limit / prepaid-credit boundary is healthy.

## Clean-Baseline Reset Question

The first correction is not a clean comparable baseline yet.

Earlier hard-coded prompt scaffolds were removed from the runtime surface, but
the current Beta `6.0` evidence may still be shaped by that prior config
history. Treat the failed pulse and the grammar-led correction as diagnostic
surfaces. Do not fold the next run into the same line until the baseline
question is settled.

Next research slice:

- define a clean baseline candidate from the proper cleaned config
- keep the fixed-prompt pulse method unchanged
- compare the Beta `6.0` snapshot against the clean baseline line
- use a diagram to make the comparison boundary explicit
- only then decide whether the clean baseline is a new beta boundary or a
reset inside Beta `6.0`
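
One way to picture the comparison step is a small Python sketch that holds the two lines as separate records and reports where they differ. Everything named here is an assumption: `PulseLine` and `compare_lines` are hypothetical, `family_b` and `family_c` are placeholder seam families, and the candidate's `PASS` verdict is invented for the example. Only the snapshot's `FAIL` verdict and the soft-drift family come from this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PulseLine:
    name: str
    verdict: str              # "PASS" or "FAIL"
    seam_families: frozenset  # names of seam families seen in the pulse


def compare_lines(reference: PulseLine, candidate: PulseLine) -> dict:
    """Compare two evidence lines without merging them into one."""
    return {
        "verdict_differs": reference.verdict != candidate.verdict,
        "shared_families": sorted(reference.seam_families & candidate.seam_families),
        "only_in_reference": sorted(reference.seam_families - candidate.seam_families),
        "only_in_candidate": sorted(candidate.seam_families - reference.seam_families),
    }


# Placeholder data: only the snapshot verdict and "soft_drift" are from the docs.
snapshot = PulseLine("beta_6_0_snapshot", "FAIL", frozenset({"soft_drift", "family_b"}))
candidate = PulseLine("clean_baseline_candidate", "PASS", frozenset({"family_b", "family_c"}))
report = compare_lines(snapshot, candidate)
```

Both records are frozen and the function only reads them, which mirrors the requirement that the snapshot stays a reference line and is never folded into the new baseline.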

## Relationship To Beta 5.1

`Research Beta 5.1` remains the closed row-level baseline:
15 changes: 12 additions & 3 deletions docs/research/README.md
@@ -13,6 +13,7 @@ Current research stage:

- `Research Beta 6.0`
- `fail-pressure pulse`
- snapshot for clean-baseline planning

Most recently closed beta:

@@ -60,6 +61,8 @@ Current finding:
- `0` excluded
- live reruns are paused until rate limits and prepaid credits are healthy
again
- the current Beta `6.0` line should be treated as diagnostic rather than clean
comparison evidence until a proper-config baseline is defined

Current active method:

@@ -68,7 +71,8 @@ Current active method:
- one fixed-prompt pulse is judged at a time
- label rows as `anchor`, `counted_seam`, or `excluded_noise`
- first pulse verdict: `FAIL`
- next work is planning from the failed pulse, not another live run yet
- next work is defining a clean baseline candidate from the proper config, not
another live run yet
- keep row-level `5.1` as the comparison surface, not the active method

## Beta Map
@@ -80,7 +84,7 @@ Current active method:
| `Research Beta 3.0` | Is a coherent line in-lane? | Prompt relevance separated lane control from sentence quality. |
| `Research Beta 4.1` | Can coherent drift still be valuable? | Coherent absurdity became a small selective class. |
| `Research Beta 5.1` | When does a fail family stay active evidence versus earn eviction? | `retain / evict` stays active, with the instruction surface tightened to preserve shape-first lane control. |
| `Research Beta 6.0` | Can Probaboracle hold shape across a bounded fixed-prompt pulse? | The fixed-prompt pulse becomes the binary unit. |
| `Research Beta 6.0` | Can Probaboracle hold shape across a bounded fixed-prompt pulse? | The fixed-prompt pulse becomes the binary unit, but the first line is held as a snapshot until a clean proper-config baseline is defined. |

Active pulse method:

@@ -139,8 +143,13 @@ flowchart LR

Plans are useful, but they are not evidence. They do not become active method until the repo earns them.

Parked lanes:
Planning lanes:

- clean baseline reset:
- compare the Beta `6.0` snapshot against a proper-config
baseline
- keep the comparison diagram explicit before deciding whether this becomes a
new beta boundary or a reset inside Beta `6.0`
- provider portability:
- keep OpenAI-native behaviour stable if the runtime surface later widens
- leave room for an Azure-compatible path if it becomes necessary