Comptextv7

Deterministic operational replay validation for long-horizon AI agents.

Comptextv7 is a deterministic operational replay-validation and state-survivability prototype. It tests whether compact, replay-safe operational state preserves fixture-defined evidence, constraints, blockers, dependencies, recovery paths, and tool-order signals across compression, reconstruction, iterative replay degradation, and CI-audited summaries. It does this without LLM judges, embeddings, vector databases, graph stores, or external APIs.

See docs/research_positioning.md for conservative project positioning and scope boundaries.

External Monaco showcase repo · Benchmark explanation · Iterative replay degradation · Replay report

Thesis

Comptextv7 measures whether compressed agent/workflow state can still be replayed as usable operational state, and it shows exactly what breaks when compression becomes too aggressive. It extracts fixture-defined evidence, constraints, blockers, dependencies, recovery paths, and tool-order signals; compacts them into replay-safe state; reconstructs them; and validates deterministic survival with committed artifacts rather than LLM judges, embeddings, vector databases, graph stores, or external APIs.

What Comptextv7 does

Extracts operational state from checked-in paper and agent/workflow fixtures.
Compacts that state into deterministic replay payloads.
Replays the compacted state into reconstructed operational state.
Validates deterministic survival of required evidence, constraints, blockers, dependencies, recovery paths, and tool-order signals.
Labels replay failure modes when operational state is lost or detached.
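The five steps above can be sketched as a small extract, compact, replay, validate loop. This is a minimal illustration under stated assumptions, not the repository's implementation: the function names, the `max_items` knob, and the sample fixture are all hypothetical, and only the six field families come from this README.

```python
# Field families named in this README; everything else is a hypothetical illustration.
REQUIRED_FAMILIES = (
    "evidence", "constraints", "blockers",
    "dependencies", "recovery_paths", "tool_order",
)

def extract(fixture: dict) -> dict:
    """Keep only the operational families and drop unrelated keys."""
    return {fam: list(fixture.get(fam, [])) for fam in REQUIRED_FAMILIES}

def compact(state: dict, max_items: int) -> dict:
    """Toy compaction: keep at most max_items per family, in original order."""
    return {fam: items[:max_items] for fam, items in state.items()}

def replay(payload: dict) -> dict:
    """Toy reconstruction: rebuild a full-shaped state from the payload."""
    return {fam: list(payload.get(fam, [])) for fam in REQUIRED_FAMILIES}

def validate(original: dict, replayed: dict) -> dict:
    """Exact comparison; return the items lost per family (empty dict if none)."""
    lost = {}
    for fam in REQUIRED_FAMILIES:
        missing = [item for item in original[fam] if item not in replayed[fam]]
        if missing:
            lost[fam] = missing
    return lost

fixture = {
    "evidence": ["log line 42", "test report"],
    "constraints": ["no downtime"],
    "blockers": ["missing credentials"],
    "dependencies": ["db migration"],
    "recovery_paths": ["roll back release"],
    "tool_order": ["lint", "test", "deploy"],
    "chatter": ["thanks!"],  # not operational state, so extract() ignores it
}

state = extract(fixture)
lossless = validate(state, replay(compact(state, max_items=3)))
lossy = validate(state, replay(compact(state, max_items=1)))
print(lossless)  # {}
print(lossy)     # {'evidence': ['test report'], 'tool_order': ['test', 'deploy']}
```

The point of the sketch is that the validator reports which fields were lost by exact comparison, so the same input always yields the same verdict.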

Why this matters

Summaries can sound fluent while losing blockers, constraints, evidence, dependencies, or recovery paths. Comptextv7 treats that as a measurable replay problem: if compressed state cannot be replayed into usable operational state, the validator records what failed instead of relying on subjective prose quality.

Current signal

Signal	Current fixture-bound result
Agent trace replay consistency	`1.000000`
Paper replay consistency	`0.791667`
`CONSERVATIVE` replay consistency	`0.895833`
`BALANCED` replay consistency	`0.250000`
`AGGRESSIVE` replay consistency	`0.125000`

BALANCED currently emits these replay failure labels in the comparative degradation fixture summary: EVIDENCE_LOSS and CONSTRAINT_DRIFT.

Interpretation: the profile comparison shows monotonic degradation under increasing compression pressure. That is useful because the benchmark responds to operational-state loss; these results are internal, fixture-bound observations, not external benchmark, production-readiness, or solved-memory claims.
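The monotonic-degradation reading can be checked mechanically from the three profile values in the table above. The snippet below is a small sketch: the values are copied from the table, while the ordering of profiles by compression pressure and the helper name are assumptions made for illustration.

```python
# Replay consistency per profile, copied from the "Current signal" table.
profile_consistency = {
    "CONSERVATIVE": 0.895833,
    "BALANCED": 0.250000,
    "AGGRESSIVE": 0.125000,
}

# Assumed ordering from least to most compression pressure.
pressure_order = ["CONSERVATIVE", "BALANCED", "AGGRESSIVE"]

def is_strictly_decreasing(values):
    """True when every value is lower than the one before it."""
    return all(a > b for a, b in zip(values, values[1:]))

values = [profile_consistency[name] for name in pressure_order]
monotonic = is_strictly_decreasing(values)
drops = [round(a - b, 6) for a, b in zip(values, values[1:])]

print(monotonic)  # True
print(drops)      # per-step loss in replay consistency
```

Most of the loss occurs between CONSERVATIVE and BALANCED, which matches the failure labels reported for BALANCED above.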

Positioning boundaries

Comptextv7 is a deterministic operational replay-validation and state-survivability prototype. It is complementary to learned context-compression research, RAG evaluation, vector-memory systems, serving-layer cache optimization, and durable workflow infrastructure, but it is not a workflow orchestrator, learned compressor, vector memory system, RAG replacement, KV-cache compressor, autonomous agent framework, production telemetry system, clinical-grade system, or universal AI-memory solution.

For the concise research positioning brief, scope boundaries, and benchmark interpretation, see Research Positioning, Iterative Replay Degradation, the Benchmark Explanation, the committed iterative replay degradation summary, and scripts/validate_replay_artifact_drift.py. The visual Monaco walkthrough lives in the external ProfRandom92/comptext-v7-monaco-showcase repository.

Proof at a glance

Evidence	Current result
Paper replay fixtures	3 dense technical papers
Agent trace fixtures	3 multi-step workflows
Paper avg compression	`1.347063`
Agent avg compression	`1.773954`
Paper replay consistency	`0.791667`
Agent replay consistency	`1.000000`
Agent operational drift	`0.000000`
Evaluation mode	deterministic, no LLM judging
Artifact format	committed JSON + CI upload

Sources: artifacts/paper_replay_results.json and artifacts/agent_trace_replay_results.json.

How to read these values

Paper replay is lossy under dense technical prose. The current paper fixtures include entities, limitations, sections, and metrics that are harder to preserve after compaction.
Agent trace replay is currently near-lossless because traces are structured. The checked-in traces expose explicit tasks, blockers, dependencies, tool order, and recovery actions.
1.000000 replay consistency does not mean solved memory. It means exact preservation under the current structured trace fixtures and current deterministic validator.
Operational drift is field loss, not subjective quality. A non-zero drift rate would mean replay lost required operational fields.
Iterative replay degradation is a bounded prototype. Repeated compact/replay cycles emit deterministic JSON and Markdown artifacts for reviewing drift curves, collapse points, and failure labels. A small fixture-bound comparison mode contrasts CONSERVATIVE, BALANCED, and AGGRESSIVE compression profiles with deterministic per-profile aggregates, and an additive sensitivity-analysis surface varies bounded replay/compression parameters without external services.
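To make "drift is field loss" concrete, here is one way such a rate could be computed. It is a sketch under an assumed definition (the fraction of required fields that are missing or changed after replay); the function name, field names, and formula are hypothetical and may differ from the validator in this repository.

```python
def operational_drift_rate(required: dict, replayed: dict) -> float:
    """Fraction of required fields that are missing or altered after replay (assumed definition)."""
    if not required:
        return 0.0
    lost = [
        name for name, expected in required.items()
        if name not in replayed or replayed[name] != expected
    ]
    return len(lost) / len(required)

# Hypothetical required operational fields for one trace.
required = {
    "active_task": "rotate API keys",
    "blocker": "missing approval",
    "dependency": "vault access",
    "recovery_action": "restore previous key",
}

intact = dict(required)
degraded = {k: v for k, v in required.items() if k != "blocker"}

print(f"{operational_drift_rate(required, intact):.6f}")    # 0.000000
print(f"{operational_drift_rate(required, degraded):.6f}")  # 0.250000
```

Under this definition a rate of 0.000000 says nothing about prose quality; it only says every required field came back unchanged.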

What makes this different

Not chat-history storage.
Not vector memory.
Not model-judged summarization.
Not autonomous agent orchestration.
Deterministic operational-state replay validation.

Architecture

flowchart LR
    A[Raw Context / Agent Trace] --> B[Operational State Extraction]
    B --> C[Compact Replay State]
    C --> D[Replay Reconstruction]
    D --> E[Deterministic Validation]
    E --> F[CI Artifact]

Comptextv7 turns noisy context into compact operational state, then validates whether replay reconstructs the fields needed to continue work.

Benchmark family

Paper Replay Benchmark

Validates: whether dense technical paper summaries preserve entities, metrics, limitations, and section structure after deterministic replay compression.
Artifact: artifacts/paper_replay_results.json.
Method: docs/benchmarks/paper_replay.md.
Current avg compression: 1.347063.
Current replay consistency: 0.791667.

Agent Trace Replay Benchmark

Validates: whether multi-step agent workflows preserve active tasks, constraints, dependencies, tool sequences, unresolved blockers, deployment requirements, and recovery actions.
Artifact: artifacts/agent_trace_replay_results.json.
Method: docs/benchmarks/agent_trace_replay.md.
Current avg compression: 1.773954.
Current replay consistency: 1.000000.
Operational drift: 0.000000.
Interpretation: current setup is near-lossless because the fixtures are structured; this is a useful baseline, not a universal memory claim.

Multi-Family Operational Admissibility Benchmark

Validates: operational admissibility across multiple fixture families, using manifest-driven fixture selection, exact scoring, reproducible JSON artifacts, and progression-regression checks.
Method: docs/benchmarks/multi_family_admissibility_benchmark.md.

Iterative Replay Degradation Prototype

Validates: how checked-in paper and agent-trace fixtures degrade across bounded repeated compact/replay cycles.
Method: docs/iterative_replay_degradation.md.
Profile comparison: additive prototype mode compares CONSERVATIVE, BALANCED, and AGGRESSIVE compression profiles using fixture-bound aggregates only: collapse rate, replay consistency, operational drift, evidence survival, and deterministic failure labels.
Sensitivity analysis: additive JSON/Markdown surface varies bounded max_context_units, max_families, max_bursts, replay_window_seconds, replay_cycles, and compression_budget_scale values for fixture-bound replay degradation review.
Current internal baseline: see the fixture-bound comparative replay degradation results.
Interpretation: profile comparison rows are deterministic replay-validation observations for the current fixtures, not general memory, production, or clinical-grade claims.
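A bounded sensitivity sweep of this kind can be pictured as a deterministic parameter grid. The sketch below uses the six parameter names listed above; the values are placeholders chosen for illustration and are not the repository's settings, and `expand` is a hypothetical helper.

```python
from itertools import product

# Parameter names come from this README; the values are illustrative placeholders.
grid = {
    "max_context_units": [8, 16],
    "max_families": [3, 6],
    "max_bursts": [2],
    "replay_window_seconds": [60],
    "replay_cycles": [1, 5],
    "compression_budget_scale": [0.5, 1.0],
}

def expand(grid: dict) -> list:
    """Expand the grid into configurations in a stable, reproducible order."""
    names = sorted(grid)  # sorting the names fixes the iteration order
    combos = product(*(grid[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]

configs = expand(grid)
print(len(configs))  # 16
print(configs[0])
```

Keeping every list small and the ordering fixed is what keeps the sweep bounded and the resulting artifact byte-stable between runs.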

Complementary adversarial replay stress suite

This suite is a separate long-horizon stress surface under reports/replay_continuity/. It provides useful context, but this README focuses on the deterministic operational replay benchmark family above.

System	Iteration 25	Iteration 50	Iteration 100	Iteration 250
Naive	0.039	0.039	0.043	0.039
Baseline	0.294	0.294	0.294	0.294
Adaptive	0.679	0.476	0.302	0.302
Comptextv7	1.000	0.995	0.824	0.572

The committed 250-iteration report records Comptextv7 mean final continuity at 0.571783, rounded to 0.572 here. Detail fidelity still degrades: hidden truth survival is 0.570173, and evaluator agreement divergence is 0.421743.

System	Approx collapse point
Naive	~1 iteration
Baseline	~10 iterations
Adaptive	~45 iterations
Comptextv7	censored at ~250 iterations in this suite
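"Censored" here means that no collapse was observed within the 250 iterations that were run. The sketch below shows how the two tables relate, using the continuity values from the first table and an assumed collapse threshold of 0.5. That threshold is an illustration only and is not taken from the committed report.

```python
# Continuity values copied from the table above: iteration -> score.
continuity = {
    "Naive": {25: 0.039, 50: 0.039, 100: 0.043, 250: 0.039},
    "Baseline": {25: 0.294, 50: 0.294, 100: 0.294, 250: 0.294},
    "Adaptive": {25: 0.679, 50: 0.476, 100: 0.302, 250: 0.302},
    "Comptextv7": {25: 1.000, 50: 0.995, 100: 0.824, 250: 0.572},
}

COLLAPSE_THRESHOLD = 0.5  # assumption for illustration; not from the report

def first_checkpoint_below(series: dict, threshold: float):
    """Return the first sampled iteration scoring below threshold, or None if censored."""
    for iteration in sorted(series):
        if series[iteration] < threshold:
            return iteration
    return None

collapse = {
    name: first_checkpoint_below(series, COLLAPSE_THRESHOLD)
    for name, series in continuity.items()
}
print(collapse)
# {'Naive': 25, 'Baseline': 25, 'Adaptive': 50, 'Comptextv7': None}
```

Because only four checkpoints are sampled, each returned iteration is an upper bound on the collapse point, which is consistent with the approximate values in the table. A result of `None` corresponds to the censored row.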

Integrity model

no LLM judging;
no embeddings;
no vector DBs;
no external APIs;
artifact-backed JSON + CI checks;
deterministic hashing foundation (docs/deterministic_hashing.md);
audit-friendly and CI reproducible.

Foundational Components

The system relies on the following deterministic foundations:

ReferenceIndex and EventLogArtifactAdapter: track context references and deterministically fingerprint event payloads (docs/reference_index_event_fingerprints.md).
ReplayArtifactWriter v1-alpha.1: generates deterministic, standalone JSON artifacts for verifiable snapshots (docs/replay_artifact_writer.md).
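As a rough picture of what deterministic fingerprinting of an event payload can look like, the sketch below hashes a canonical JSON encoding with SHA-256. This is an assumption for illustration; the actual scheme is described in docs/deterministic_hashing.md and may differ, and the `fingerprint` helper is hypothetical.

```python
import hashlib
import json

def fingerprint(payload: dict) -> str:
    """SHA-256 over a canonical JSON encoding (sorted keys, fixed separators)."""
    canonical = json.dumps(
        payload,
        sort_keys=True,          # key order must not affect the hash
        separators=(",", ":"),   # no incidental whitespace
        ensure_ascii=True,       # stable byte encoding for non-ASCII text
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

a = {"event": "deploy", "step": 3, "tools": ["lint", "test"]}
b = {"tools": ["lint", "test"], "step": 3, "event": "deploy"}  # same content, different key order
c = {"event": "deploy", "step": 3, "tools": ["test", "lint"]}  # tool order changed

print(fingerprint(a) == fingerprint(b))  # True
print(fingerprint(a) == fingerprint(c))  # False
```

Key order does not change the fingerprint, but list order does, which is the behavior needed if tool order is part of the operational state.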

Limitations

Benchmark metrics are fixture-bound baselines; they do not establish correctness on real-world workloads.
Fixtures are curated and checked in.
Structured agent traces currently replay near-losslessly.
This is not solved AI memory.
This is not production telemetry.
This is not an autonomous agent framework.
Evaluator divergence remains material in the long-horizon stress suite.
Iterative degradation remains a bounded fixture prototype; its artifact and summary are review aids, not universal memory claims.

Next technical milestone

Continue tightening deterministic replay review surfaces. Keep repeated compact/replay artifacts cheap, deterministic, additive-compatible, and easy to inspect in CI and pull requests.

Validated deterministic replay review flow

Use this short flow when reviewing replay-system changes:

Regenerate or inspect deterministic replay artifacts only from checked-in fixtures.
Compare stable metric fields (replay_consistency, evidence survival rates, operational_drift_rate) and taxonomy fields (failure_labels, failure_mode_counts) rather than prose interpretations.
For iterative degradation and sensitivity review, run python scripts/generate_iterative_replay_degradation_artifacts.py and inspect both the JSON artifact and Markdown summary.
Treat additive artifact fields as forward-compatible when existing deterministic fields remain stable.
Keep claims fixture-bound: no LLM judging, embeddings, external APIs, production-readiness claims, or solved-memory claims.
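The comparison step in this flow can be sketched as a small diff over stable fields. The field names below are the ones listed above; the flat artifact shape, the `review_artifact_change` helper, and the `sensitivity_summary` field are hypothetical and only illustrate the "additive fields are forward-compatible" rule.

```python
# Stable metric and taxonomy fields named in the review flow above.
STABLE_FIELDS = (
    "replay_consistency",
    "operational_drift_rate",
    "failure_labels",
    "failure_mode_counts",
)

def review_artifact_change(old: dict, new: dict) -> dict:
    """Compare two artifact dicts; new fields are fine, changed or removed ones are flagged."""
    changed = {
        name: (old.get(name), new.get(name))
        for name in STABLE_FIELDS
        if old.get(name) != new.get(name)
    }
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    return {
        "changed": changed,
        "added": added,
        "removed": removed,
        "compatible": not changed and not removed,
    }

old = {
    "replay_consistency": 1.0,
    "operational_drift_rate": 0.0,
    "failure_labels": [],
    "failure_mode_counts": {},
}
additive = {**old, "sensitivity_summary": {"rows": 16}}  # hypothetical new field
regressed = {**old, "replay_consistency": 0.75, "failure_labels": ["EVIDENCE_LOSS"]}

print(review_artifact_change(old, additive)["compatible"])   # True
print(review_artifact_change(old, regressed)["compatible"])  # False
```

A reviewer would then look only at the `changed` and `removed` entries, rather than at prose interpretations of the run.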

Review surfaces

The main Comptextv7 repository is the source of truth for deterministic replay-validation evidence: artifacts, benchmarks, failure labels, degradation summaries, and conservative research positioning. The visual Monaco walkthrough now lives separately in the external showcase repository.

Main repo technical evidence

Surface	Link
CI Artifact Narrative	`docs/ci_artifact_narrative.md`
Benchmark explanation	`docs/BENCHMARK_EXPLANATION.md`
Replay failure taxonomy	`docs/operational_replay_failure_taxonomy.md`
Iterative replay degradation artifact and CI summary	`docs/iterative_replay_degradation.md`
Comparative replay degradation artifact and CI summary	`docs/iterative_replay_degradation.md#comparative-replay-degradation-results`
Replay sensitivity-analysis artifact and CI summary	`docs/iterative_replay_degradation.md#replay-sensitivity-analysis-surface`
Replay report	`reports/replay_continuity/validation_report.md`
API surface	`docs/API_SURFACE.md`

External Monaco showcase UI

Surface	Link
Monaco showcase repository	`ProfRandom92/comptext-v7-monaco-showcase`
Legacy demo walkthrough note	`docs/DEMO_WALKTHROUGH.md`
Legacy showcase readiness note	`docs/SHOWCASE_READINESS.md`

Repository map

Comptextv7/
├── artifacts/                  # committed deterministic replay benchmark JSON
├── benchmarks/                 # deterministic compression, replay, and audit runners
├── contracts/                  # machine-readable validation and handoff contracts
├── dashboard/                  # backend plus React operations console
├── docs/                       # benchmark, artifact, research, and legacy showcase notes
├── reports/replay_continuity/  # adversarial continuity metrics and SVG charts
├── scripts/                    # validation, reporting, and artifact tooling
├── showcase/app/               # legacy in-repo Vite app; Monaco UI lives in external repo
├── src/                        # KVTC engine, audit, and semantic validation modules
├── tests/                      # Python regression and replay validation tests
└── README.md

Safety boundaries

Do not commit:

proprietary customer data;
secrets, API keys, tokens, cookies, or credentials;
raw production logs;
unsanitized replay fixtures;
private deployment credentials or environment dumps.

Comptextv7 is a deterministic, synthetic-only research prototype for operational replay persistence and reviewable diagnostic infrastructure.

Cloud-first validation

Comptextv7 is biased toward artifact-backed review rather than local machine trust.

Workflow	Role
`ci.yml`	Runs deterministic replay, tests, telemetry, and validation gates.
`agent-checks.yml`	Runs repository/report/contract checks plus dashboard validation.
`validation_runner.yml`	Publishes compact cloud validation result artifacts.

Reproducibility

Install the Python test dependency set:

python -m pip install -e '.[test]'

Regenerate deterministic replay artifacts:

python tests/utils/paper_replay_runner.py
python tests/utils/agent_trace_replay_runner.py
python benchmarks/run_replay_continuity.py --iterations 250 --output-dir reports/replay_continuity
python scripts/generate_iterative_replay_degradation_artifacts.py

Use the validation commands in docs/validation.md. The root package.json is a wrapper for reviewer convenience. App dependencies remain in dashboard/app and the legacy in-repo showcase/app; the current Monaco showcase UI is maintained in ProfRandom92/comptext-v7-monaco-showcase.

Root wrapper checks:

npm run layout
npm run typecheck
npm run validate
npm run build
npm test
npm run check

Dashboard app checks:

cd dashboard/app
npm run typecheck
npm run build

Showcase app checks:

cd showcase/app
npm run typecheck
npm run validate
npm run build

Python checks from the repository root:

pytest -q
pytest tests/test_core_foundation_ts.py -q
pytest tests/test_paper_replay_bench.py tests/test_agent_trace_replay.py tests/test_replay_continuity.py -q

Additional repository validation helpers remain available when their surfaces are touched:

python scripts/validate.py replay
python scripts/validate.py token
python scripts/validate.py forensic
python scripts/validate_contracts.py
python scripts/validate_api_exports.py
