Deterministic operational replay validation for long-horizon AI agents.
Comptextv7 is a deterministic operational replay-validation and state-survivability prototype: it tests whether compact, replay-safe operational state preserves fixture-defined evidence, constraints, blockers, dependencies, recovery paths, and tool-order signals across compression, reconstruction, iterative replay degradation, and CI-audited summaries — without LLM judges, embeddings, vector databases, graph stores, or external APIs.
See docs/research_positioning.md for conservative project positioning and scope boundaries.
Comptextv7 measures whether compressed agent/workflow state can still be replayed as usable operational state — and shows exactly what breaks when compression becomes too aggressive. It extracts fixture-defined evidence, constraints, blockers, dependencies, recovery paths, and tool-order signals; compacts them into replay-safe state; reconstructs them; and validates deterministic survival with committed artifacts rather than LLM judges, embeddings, vector databases, graph stores, or external APIs.
- Extracts operational state from checked-in paper and agent/workflow fixtures.
- Compacts that state into deterministic replay payloads.
- Replays the compacted state into reconstructed operational state.
- Validates deterministic survival of required evidence, constraints, blockers, dependencies, recovery paths, and tool-order signals.
- Labels replay failure modes when operational state is lost or detached.
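The steps listed above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the field names, the `extract_state`, `compact_state`, `replay_state`, and `validate` helpers, the truncation-based compaction, and the sample fixture are hypothetical and are not taken from the Comptextv7 codebase.

```python
# Hypothetical field names, loosely mirroring the signal families named above.
REQUIRED_FIELDS = ("evidence", "constraints", "blockers", "dependencies",
                   "recovery_paths", "tool_order")


def extract_state(fixture: dict) -> dict:
    """Keep only the operational fields; everything else is treated as noise."""
    return {name: list(fixture.get(name, [])) for name in REQUIRED_FIELDS}


def compact_state(state: dict, max_items: int) -> dict:
    """Deterministic compaction: keep the first max_items entries per field."""
    return {name: items[:max_items] for name, items in state.items()}


def replay_state(payload: dict) -> dict:
    """Reconstruct operational state from the compact payload."""
    return {name: list(payload.get(name, [])) for name in REQUIRED_FIELDS}


def validate(original: dict, replayed: dict) -> dict:
    """Score exact per-item survival and list the fields that lost entries."""
    total = sum(len(items) for items in original.values())
    survived = 0
    lost_fields = []
    for name in REQUIRED_FIELDS:
        kept = [item for item in original[name] if item in replayed[name]]
        survived += len(kept)
        if len(kept) < len(original[name]):
            lost_fields.append(name)
    consistency = survived / total if total else 1.0
    return {"replay_consistency": round(consistency, 6), "lost_fields": lost_fields}


# Invented fixture for illustration only.
fixture = {
    "evidence": ["log line 12", "test report"],
    "constraints": ["no schema change"],
    "blockers": ["missing credentials"],
    "dependencies": ["build", "deploy"],
    "recovery_paths": ["rollback to previous release"],
    "tool_order": ["lint", "test", "deploy"],
    "chatter": ["sounds good, thanks!"],
}

state = extract_state(fixture)
loose = validate(state, replay_state(compact_state(state, max_items=3)))
tight = validate(state, replay_state(compact_state(state, max_items=1)))
print(loose)
print(tight)
```

With the looser budget every item survives. With the tighter budget the sketch reports a lower score and names the fields that lost entries, which is the kind of output a deterministic validator can produce without judging prose quality.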
Summaries can sound fluent while losing blockers, constraints, evidence, dependencies, or recovery paths. Comptextv7 treats that as a measurable replay problem: if compressed state cannot be replayed into usable operational state, the validator records what failed instead of relying on subjective prose quality.
| Signal | Current fixture-bound result |
|---|---|
| Agent trace replay consistency | 1.000000 |
| Paper replay consistency | 0.791667 |
| CONSERVATIVE replay consistency | 0.895833 |
| BALANCED replay consistency | 0.250000 |
| AGGRESSIVE replay consistency | 0.125000 |
BALANCED currently emits these replay failure labels in the comparative degradation fixture summary: EVIDENCE_LOSS and CONSTRAINT_DRIFT.
Interpretation: the profile comparison shows monotonic degradation under increasing compression pressure. That is useful because the benchmark responds to operational-state loss; these results are internal, fixture-bound observations, not external benchmark, production-readiness, or solved-memory claims.
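The monotonicity claim can be checked mechanically. The sketch below uses the three profile values reported above; the helper names `is_monotonic_degradation` and `stepwise_drops` are hypothetical and are not part of the repository.

```python
# Replay consistency per profile, copied from the fixture-bound results above.
PROFILE_CONSISTENCY = {
    "CONSERVATIVE": 0.895833,
    "BALANCED": 0.250000,
    "AGGRESSIVE": 0.125000,
}

# Profiles ordered by increasing compression pressure.
PROFILE_ORDER = ("CONSERVATIVE", "BALANCED", "AGGRESSIVE")


def is_monotonic_degradation(scores: dict, order: tuple) -> bool:
    """True when consistency never increases as compression pressure grows."""
    values = [scores[name] for name in order]
    return all(a >= b for a, b in zip(values, values[1:]))


def stepwise_drops(scores: dict, order: tuple) -> list:
    """Consistency lost at each step from one profile to the next."""
    values = [scores[name] for name in order]
    return [round(a - b, 6) for a, b in zip(values, values[1:])]


print(is_monotonic_degradation(PROFILE_CONSISTENCY, PROFILE_ORDER))
print(stepwise_drops(PROFILE_CONSISTENCY, PROFILE_ORDER))
```

A check like this only confirms ordering for the current fixtures; it says nothing about behavior on other inputs.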
Comptextv7 is a deterministic operational replay-validation and state-survivability prototype. It is complementary to learned context-compression research, RAG evaluation, vector-memory systems, serving-layer cache optimization, and durable workflow infrastructure, but it is not a workflow orchestrator, learned compressor, vector memory system, RAG replacement, KV-cache compressor, autonomous agent framework, production telemetry system, clinical-grade system, or universal AI-memory solution.
For the concise research positioning brief, scope boundaries, and benchmark interpretation, see Research Positioning, Iterative Replay Degradation, the Benchmark Explanation, the committed iterative replay degradation summary, and scripts/validate_replay_artifact_drift.py. The visual Monaco walkthrough lives in the external ProfRandom92/comptext-v7-monaco-showcase repository.
| Evidence | Current result |
|---|---|
| Paper replay fixtures | 3 dense technical papers |
| Agent trace fixtures | 3 multi-step workflows |
| Paper avg compression | 1.347063 |
| Agent avg compression | 1.773954 |
| Paper replay consistency | 0.791667 |
| Agent replay consistency | 1.000000 |
| Agent operational drift | 0.000000 |
| Evaluation mode | deterministic, no LLM judging |
| Artifact format | committed JSON + CI upload |
Sources: artifacts/paper_replay_results.json and artifacts/agent_trace_replay_results.json.
- Paper replay is lossy under dense technical prose. The current paper fixtures include entities, limitations, sections, and metrics that are harder to preserve after compaction.
- Agent trace replay is currently near-lossless because traces are structured. The checked-in traces expose explicit tasks, blockers, dependencies, tool order, and recovery actions.
- 1.000000 replay consistency does not mean solved memory. It means exact preservation under the current structured trace fixtures and current deterministic validator.
- Operational drift is field loss, not subjective quality. A non-zero drift rate would mean replay lost required operational fields.
- Iterative replay degradation is a bounded prototype. Repeated compact/replay cycles emit deterministic JSON and Markdown artifacts for reviewing drift curves, collapse points, and failure labels. A small fixture-bound comparison mode contrasts CONSERVATIVE, BALANCED, and AGGRESSIVE compression profiles with deterministic per-profile aggregates, and an additive sensitivity-analysis surface varies bounded replay/compression parameters without external services.
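To make "drift is field loss" concrete, here is one possible way to compute a drift rate. The formula is an assumption, since the validator's exact definition is not given here: it counts a required field as lost when it is missing or changed after replay. The function name and sample data are hypothetical.

```python
def operational_drift_rate(required: dict, replayed: dict) -> float:
    """Assumed definition: share of required fields missing or altered after replay."""
    if not required:
        return 0.0
    lost = [name for name, value in required.items() if replayed.get(name) != value]
    return round(len(lost) / len(required), 6)


# Invented example state, not taken from a checked-in fixture.
required_state = {
    "active_task": "migrate database",
    "blocker": "missing credentials",
    "tool_order": ["backup", "migrate", "verify"],
    "recovery_action": "restore from backup",
}

exact_replay = dict(required_state)
lossy_replay = {k: v for k, v in required_state.items() if k != "blocker"}

print(operational_drift_rate(required_state, exact_replay))
print(operational_drift_rate(required_state, lossy_replay))
```

Under this assumed definition a rate of zero means every required field came back unchanged, and any non-zero value points to specific fields rather than to a judgment about prose.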
- Not chat-history storage.
- Not vector memory.
- Not model-judged summarization.
- Not autonomous agent orchestration.
- Deterministic operational-state replay validation.
flowchart LR
A[Raw Context / Agent Trace]
--> B[Operational State Extraction]
B --> C[Compact Replay State]
C --> D[Replay Reconstruction]
D --> E[Deterministic Validation]
E --> F[CI Artifact]
Comptextv7 turns noisy context into compact operational state, then validates whether replay reconstructs the fields needed to continue work.
- Validates: whether dense technical paper summaries preserve entities, metrics, limitations, and section structure after deterministic replay compression.
- Artifact: artifacts/paper_replay_results.json.
- Method: docs/benchmarks/paper_replay.md.
- Current avg compression: 1.347063.
- Current replay consistency: 0.791667.
- Validates: whether multi-step agent workflows preserve active tasks, constraints, dependencies, tool sequences, unresolved blockers, deployment requirements, and recovery actions.
- Artifact: artifacts/agent_trace_replay_results.json.
- Method: docs/benchmarks/agent_trace_replay.md.
- Current avg compression: 1.773954.
- Current replay consistency: 1.000000.
- Operational drift: 0.000000.
- Interpretation: current setup is near-lossless because the fixtures are structured; this is a useful baseline, not a universal memory claim.
- Validates: Deterministic multi-family operational admissibility benchmark with manifest-driven fixture selection, exact scoring, reproducible JSON artifacts, and progression-regression checks.
- Method: docs/benchmarks/multi_family_admissibility_benchmark.md.
- Validates: how checked-in paper and agent-trace fixtures degrade across bounded repeated compact/replay cycles.
- Method: docs/iterative_replay_degradation.md.
- Profile comparison: additive prototype mode compares CONSERVATIVE, BALANCED, and AGGRESSIVE compression profiles using fixture-bound aggregates only: collapse rate, replay consistency, operational drift, evidence survival, and deterministic failure labels.
- Sensitivity analysis: additive JSON/Markdown surface varies bounded max_context_units, max_families, max_bursts, replay_window_seconds, replay_cycles, and compression_budget_scale values for fixture-bound replay degradation review.
- Current internal baseline: see the fixture-bound comparative replay degradation results.
- Interpretation: profile comparison rows are deterministic replay-validation observations for the current fixtures, not general memory, production, or clinical-grade claims.
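A bounded sensitivity sweep of this kind can be enumerated deterministically. In the sketch below the parameter names come from the list above, but the candidate values and the `parameter_grid` helper are invented for illustration and do not reflect the bounds used by the repository's scripts.

```python
from itertools import product

# Parameter names are from the text above; the candidate values are made up.
GRID = {
    "max_context_units": [8, 16],
    "replay_cycles": [1, 5],
    "compression_budget_scale": [0.5, 1.0],
}


def parameter_grid(grid: dict) -> list:
    """Enumerate every combination in a stable order (sorted parameter names)."""
    names = sorted(grid)
    return [dict(zip(names, combo)) for combo in product(*(grid[n] for n in names))]


rows = parameter_grid(GRID)
print(len(rows))
for row in rows[:2]:
    print(row)
```

Sorting the parameter names before taking the product keeps the row order identical across runs, which matters when the resulting artifact is diffed in CI.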
This suite is a separate long-horizon stress surface under reports/replay_continuity/.
It remains useful context, but the focused README narrative is the deterministic operational replay benchmark family above.
| System | Iteration 25 | Iteration 50 | Iteration 100 | Iteration 250 |
|---|---|---|---|---|
| Naive | 0.039 | 0.039 | 0.043 | 0.039 |
| Baseline | 0.294 | 0.294 | 0.294 | 0.294 |
| Adaptive | 0.679 | 0.476 | 0.302 | 0.302 |
| Comptextv7 | 1.000 | 0.995 | 0.824 | 0.572 |
The committed 250-iteration report records Comptextv7 mean final continuity at 0.571783, rounded to 0.572 here.
Detail fidelity still degrades: hidden truth survival is 0.570173, and evaluator agreement divergence is 0.421743.
| System | Approx collapse point |
|---|---|
| Naive | ~1 iteration |
| Baseline | ~10 iterations |
| Adaptive | ~45 iterations |
| Comptextv7 | censored at ~250 iterations in this suite |
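The idea of a censored collapse point can be sketched as follows. The continuity samples are copied from the continuity table earlier in this section, but the 0.5 threshold and the `first_collapse` helper are assumptions: the suite's actual collapse criterion is not stated here, and four coarse samples cannot reproduce the approximate collapse points listed above.

```python
from typing import Optional, Sequence, Tuple

# Sampled continuity values from the table: (iteration, continuity).
CURVES = {
    "Naive": [(25, 0.039), (50, 0.039), (100, 0.043), (250, 0.039)],
    "Baseline": [(25, 0.294), (50, 0.294), (100, 0.294), (250, 0.294)],
    "Adaptive": [(25, 0.679), (50, 0.476), (100, 0.302), (250, 0.302)],
    "Comptextv7": [(25, 1.000), (50, 0.995), (100, 0.824), (250, 0.572)],
}

# Hypothetical cutoff chosen only for this illustration.
COLLAPSE_THRESHOLD = 0.5


def first_collapse(curve: Sequence[Tuple[int, float]], threshold: float) -> Optional[int]:
    """Return the first sampled iteration below threshold, or None if censored."""
    for iteration, continuity in curve:
        if continuity < threshold:
            return iteration
    return None


for system, curve in CURVES.items():
    point = first_collapse(curve, COLLAPSE_THRESHOLD)
    if point is None:
        print(f"{system}: censored (no collapse within the sampled range)")
    else:
        print(f"{system}: collapsed at or before iteration {point}")
```

Returning `None` instead of the last iteration keeps "did not collapse within the run" distinct from "collapsed at the final step", which is what "censored" means in the table.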
- replay_degradation_curves.svg
- continuity_half_life_chart.svg
- semantic_drift_graph.svg
- replay_collapse_curves.svg
- evaluator_agreement_divergence.svg
- hidden_constraint_survival_curves.svg
- no LLM judging;
- no embeddings;
- no vector DBs;
- no external APIs;
- artifact-backed JSON + CI checks;
- deterministic hashing foundation (docs/deterministic_hashing.md);
- audit-friendly and CI reproducible.
The system relies on the following deterministic foundations:
- ReferenceIndex and EventLogArtifactAdapter: track context references and deterministically fingerprint event payloads (docs/reference_index_event_fingerprints.md).
- ReplayArtifactWriter v1-alpha.1: generates deterministic, standalone JSON artifacts for verifiable snapshots (docs/replay_artifact_writer.md).
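Deterministic fingerprinting of an event payload is commonly done by hashing a canonical serialization. The sketch below shows that general technique with the standard library only. It is not the implementation behind ReferenceIndex or EventLogArtifactAdapter, and the `fingerprint` function and sample payloads are hypothetical.

```python
import hashlib
import json


def fingerprint(payload: dict) -> str:
    """SHA-256 over canonical JSON: sorted keys, fixed separators, ASCII output."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"), ensure_ascii=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Invented payloads: same content, different key order.
event_a = {"tool": "deploy", "step": 3, "blockers": ["missing credentials"]}
event_b = {"blockers": ["missing credentials"], "step": 3, "tool": "deploy"}
event_c = {"tool": "deploy", "step": 4, "blockers": ["missing credentials"]}

print(fingerprint(event_a) == fingerprint(event_b))
print(fingerprint(event_a) == fingerprint(event_c))
```

Key order does not affect the digest because the keys are sorted before hashing, while any change to a value produces a different fingerprint.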
- Benchmark metrics are fixture-bound baselines; they do not measure correctness on real-world workloads.
- Fixtures are curated and checked in.
- Structured agent traces currently replay near-losslessly.
- This is not solved AI memory.
- This is not production telemetry.
- This is not an autonomous agent framework.
- Evaluator divergence remains material in the long-horizon stress suite.
- Iterative degradation remains a bounded fixture prototype; its artifact and summary are review aids, not universal memory claims.
Next: continue tightening deterministic replay review surfaces. Keep repeated compact/replay artifacts cheap, deterministic, additive-compatible, and easy to inspect in CI and pull requests.
Use this short flow when reviewing replay-system changes:
- Regenerate or inspect deterministic replay artifacts only from checked-in fixtures.
- Compare stable metric fields (replay_consistency, evidence survival rates, operational_drift_rate) and taxonomy fields (failure_labels, failure_mode_counts) rather than prose interpretations.
- For iterative degradation and sensitivity review, run python scripts/generate_iterative_replay_degradation_artifacts.py and inspect both the JSON artifact and Markdown summary.
- Treat additive artifact fields as forward-compatible when existing deterministic fields remain stable.
- Keep claims fixture-bound: no LLM judging, embeddings, external APIs, production-readiness claims, or solved-memory claims.
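The comparison step in the review flow above could look like the sketch below. The field names are the ones mentioned in the flow; the flat dictionary layout, the `diff_stable_fields` helper, and the sample values are assumptions for illustration and may not match the real artifact schema.

```python
# Field names taken from the review flow; the flat layout is an assumption.
STABLE_FIELDS = (
    "replay_consistency",
    "operational_drift_rate",
    "failure_labels",
    "failure_mode_counts",
)


def diff_stable_fields(baseline: dict, candidate: dict) -> dict:
    """Return {field: (old, new)} for stable fields that changed; ignore extras."""
    changes = {}
    for field in STABLE_FIELDS:
        old, new = baseline.get(field), candidate.get(field)
        if old != new:
            changes[field] = (old, new)
    return changes


# Invented artifact summaries for illustration.
baseline = {
    "replay_consistency": 1.0,
    "operational_drift_rate": 0.0,
    "failure_labels": [],
    "failure_mode_counts": {},
}
additive_only = dict(baseline, sensitivity_rows=8)   # new field, stable fields intact
regressed = dict(baseline, replay_consistency=0.75, failure_labels=["EVIDENCE_LOSS"])

print(diff_stable_fields(baseline, additive_only))
print(diff_stable_fields(baseline, regressed))
```

An empty result for the additive-only candidate reflects the forward-compatibility rule: new fields are tolerated as long as the existing deterministic fields do not move.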
The main Comptextv7 repository is the source of truth for deterministic replay-validation evidence: artifacts, benchmarks, failure labels, degradation summaries, and conservative research positioning. The visual Monaco walkthrough now lives separately in the external showcase repository.
| Surface | Link |
|---|---|
| CI Artifact Narrative | docs/ci_artifact_narrative.md |
| Benchmark explanation | docs/BENCHMARK_EXPLANATION.md |
| Replay failure taxonomy | docs/operational_replay_failure_taxonomy.md |
| Iterative replay degradation artifact and CI summary | docs/iterative_replay_degradation.md |
| Comparative replay degradation artifact and CI summary | docs/iterative_replay_degradation.md#comparative-replay-degradation-results |
| Replay sensitivity-analysis artifact and CI summary | docs/iterative_replay_degradation.md#replay-sensitivity-analysis-surface |
| Replay report | reports/replay_continuity/validation_report.md |
| API surface | docs/API_SURFACE.md |
| Surface | Link |
|---|---|
| Monaco showcase repository | ProfRandom92/comptext-v7-monaco-showcase |
| Legacy demo walkthrough note | docs/DEMO_WALKTHROUGH.md |
| Legacy showcase readiness note | docs/SHOWCASE_READINESS.md |
Comptextv7/
├── artifacts/ # committed deterministic replay benchmark JSON
├── benchmarks/ # deterministic compression, replay, and audit runners
├── contracts/ # machine-readable validation and handoff contracts
├── dashboard/ # backend plus React operations console
├── docs/ # benchmark, artifact, research, and legacy showcase notes
├── reports/replay_continuity/ # adversarial continuity metrics and SVG charts
├── scripts/ # validation, reporting, and artifact tooling
├── showcase/app/ # legacy in-repo Vite app; Monaco UI lives in external repo
├── src/ # KVTC engine, audit, and semantic validation modules
├── tests/ # Python regression and replay validation tests
└── README.md
Do not commit:
- proprietary customer data;
- secrets, API keys, tokens, cookies, or credentials;
- raw production logs;
- unsanitized replay fixtures;
- private deployment credentials or environment dumps.
Comptextv7 is a deterministic, synthetic-only research prototype for operational replay persistence and reviewable diagnostic infrastructure.
Comptextv7 is biased toward artifact-backed review rather than local machine trust.
| Workflow | Role |
|---|---|
| ci.yml | Runs deterministic replay, tests, telemetry, and validation gates. |
| agent-checks.yml | Runs repository/report/contract checks plus dashboard validation. |
| validation_runner.yml | Publishes compact cloud validation result artifacts. |
Install the Python test dependency set:
python -m pip install -e '.[test]'
Regenerate deterministic replay artifacts:
python tests/utils/paper_replay_runner.py
python tests/utils/agent_trace_replay_runner.py
python benchmarks/run_replay_continuity.py --iterations 250 --output-dir reports/replay_continuity
python scripts/generate_iterative_replay_degradation_artifacts.py
Use the validation commands in docs/validation.md. The root package.json is a wrapper for reviewer convenience. App dependencies remain in dashboard/app and the legacy in-repo showcase/app; the current Monaco showcase UI is maintained in ProfRandom92/comptext-v7-monaco-showcase.
Root wrapper checks:
npm run layout
npm run typecheck
npm run validate
npm run build
npm test
npm run check
Dashboard app checks:
cd dashboard/app
npm run typecheck
npm run build
Showcase app checks:
cd showcase/app
npm run typecheck
npm run validate
npm run build
Python checks from the repository root:
pytest -q
pytest tests/test_core_foundation_ts.py -q
pytest tests/test_paper_replay_bench.py tests/test_agent_trace_replay.py tests/test_replay_continuity.py -q
Additional repository validation helpers remain available when their surfaces are touched:
python scripts/validate.py replay
python scripts/validate.py token
python scripts/validate.py forensic
python scripts/validate_contracts.py
python scripts/validate_api_exports.py