Skip to content

v0.37.0.0 feat: agent-voice (Mars + Venus) + copy-into-host-repo skillpack paradigm#1128

Open
garrytan wants to merge 13 commits into
masterfrom
garrytan/caracas-v3
Open

v0.37.0.0 feat: agent-voice (Mars + Venus) + copy-into-host-repo skillpack paradigm#1128
garrytan wants to merge 13 commits into
masterfrom
garrytan/caracas-v3

Conversation

@garrytan
Copy link
Copy Markdown
Owner

@garrytan garrytan commented May 17, 2026

Summary

v0.36.0.0 — Voice agent reference (Mars + Venus) lands as the first copy-into-host-repo skillpack, plus all six originally-deferred items now ship in the same PR.

Wave 1 (initial):

  • Voice agent reference bundle (Mars + Venus personas, WebRTC-first browser client with ?test=1-gated WebAudio-tee capture, read-only tool router with D14-A allow-list, persona-aware prompt builder, upstream-error classifier).
  • copy-into-host-repo install paradigm: gbrain integrations install agent-voice --target <repo> copies the bundle into the operator's host agent repo with per-file SHA-256 hashes recorded in .gbrain-source.json.
  • Privacy infrastructure: scripts/check-no-pii-in-agent-voice.sh wired into bun run verify, deterministic scripts/import-from-upstream.sh + substitution table for refresh-from-upstream, single-source private-name-blocklist.json env-driven regex.
  • New install_kind: 'local-managed' | 'copy-into-host-repo' frontmatter discriminator; new src/commands/integrations.ts install subcommand with path-traversal hardening + refusal-case suite.
  • 14 captured decisions during /plan-eng-review with codex outside voice (D1-D14).

Wave 2 (closes original deferred list):

  • E2E test suite: tests/e2e/voice-roundtrip.test.mjs (~$0.10/run; env-gated) + tests/e2e/voice-full-flow.test.mjs (openclaw-driven install + roundtrip; ~$1-2/run; friction-discovery, NOT a ship gate). Shared puppeteer + fake-audio harness + Whisper-judge helper. Three-tier assertion (CONNECTION hard, NON-SILENT hard, SEMANTIC soft).
  • LLM-judge persona evals: gateway-routed 3-model harness (Claude + GPT + Gemini) with 4-strategy JSON repair + 2/3-quorum aggregation. Five fixture sets (mars-solo, mars-demo, venus, persona-routing, mars-multilingual). Synthetic canonical baselines committed; live receipts gitignored.
  • DIY pipeline (Option B): code/pipeline.mjs — streaming Deepgram STT + Claude SSE with sentence-boundary TTS dispatch + Cartesia/OpenAI TTS, 20-turn history cap, exponential reconnect, keepalives, VAD presets, barge-in.
  • --refresh mode: classifies each file (unchanged-identical / unchanged-stale / locally-modified / source-deleted / host-deleted / new-in-manifest); default preserves operator edits; --auto take-theirs overwrites; --dry-run previews. Transaction journal at .gbrain-source.refresh.log.
  • Mars multilingual: persona restores cross-lingual rule (Mandarin/Spanish/French/Japanese/Korean default to English but follow the speaker). Gated by mars-multilingual.jsonl eval fixtures.
  • Twilio recipe deprecation: banner on recipes/twilio-voice-brain.md pointing at agent-voice.md. Removal in v0.37.
  • Claw-test scenario: test/fixtures/claw-test-scenarios/voice-agent-install/ with BRIEF.md + scenario.json (labeled BENCHMARK_FRICTION, blocks_ship: false) + expected.json.

Test Coverage

Suite Result Notes
bun run verify exit 0 All 15 gate checks including the new check:no-pii-agent-voice
Full unit suite (gbrain-side) 6,736+ pass / 0 fail After merge with master
test/integrations-install.test.ts 18 pass / 0 fail Install + refresh + classification (incl. 7 new refresh cases)
Host-side recipes/agent-voice/tests/unit/ 96 pass / 2 skip / 0 fail 5 suites: personas + prompt-shape (Mars + Venus) + tools-allowlist + upstream-classifier. Skipped tests are env-gated.
Live E2E (voice-roundtrip) env-gated Run with AGENT_VOICE_E2E=1 OPENAI_API_KEY=... (~$0.10/run)
Live E2E (voice-full-flow) env-gated, NOT a ship gate Run with AGENT_VOICE_FULL_E2E=1 OPENAI_API_KEY=... ANTHROPIC_API_KEY=... OPENCLAW_BIN=... (~$1-2/run)
LLM-judge persona evals env-gated, manual Run with bun run gen:baselines from recipes/agent-voice/ (~$0.60 for all four suites)

Pre-Landing Review

Covered by /plan-eng-review with codex outside-voice during planning. 14 decisions captured (D1-D14): monolithic ship, PII guard in verify, SHA-256 manifest, error-classified soft-fail, ?test=1-gated instrumentation, real install subcommand, deterministic import script, close all unit gaps, ran codex, monolithic-in-gbrain (vs separate template), zero "Wintermute" in shipped surface, WebAudio-tee capture, agent-authored canonical baselines, read-only tools allow-list.

Codex outside voice surfaced 30+ findings; 8 substantive new ones resolved during the review (D10-D14). Remainder folded into plan-rewrite mechanics.

Privacy

bun run verify includes scripts/check-no-pii-in-agent-voice.sh which scans every file under recipes/agent-voice/** + recipes/agent-voice.md + the import scripts for shape regex (phone/email/SSN/JWT/bearer/credit-card) + path patterns + $AGENT_VOICE_PII_BLOCKLIST env-driven name list. Zero literal private names land in shipped files. The env-driven blocklist contract lives in private-name-blocklist.json; CI passes the secret via env at run time.

Files (this PR)

Path Δ
recipes/agent-voice.md + recipes/agent-voice/ +29 files (~5,500 LOC including pipeline.mjs)
scripts/check-no-pii-in-agent-voice.sh + scripts/import-from-upstream.sh + scripts/upstream-scrub-table.txt +3 files
src/commands/integrations.ts +~500 LOC (install subcommand + refresh mode)
test/integrations-install.test.ts +236 LOC, 18 cases
test/fixtures/claw-test-scenarios/voice-agent-install/ +3 files
recipes/twilio-voice-brain.md deprecation banner
VERSION + package.json + CHANGELOG.md + bun.lock + llms*.txt release bookkeeping

Test plan

  • bun run verify passes (privacy + 14 gate checks + typecheck)
  • Full unit suite passes (6,736+ pass, 0 fail)
  • Install + refresh subcommand: 18/18 tests pass
  • Host-side persona + tool + classifier tests: 96 pass, 2 skip (env-gated)
  • Install smoke: real install into a scratch tmpdir host repo works, manifest written with per-file SHA-256, resolver rows appended to AGENTS.md
  • Refresh smoke: identical-state → 0 writes; locally-modified → keep-mine default preserves edit, --auto take-theirs overwrites; transaction journal written
  • Live voice-roundtrip E2E: run manually with AGENT_VOICE_E2E=1 OPENAI_API_KEY=... against a fresh server install
  • Live voice-full-flow E2E: run manually with AGENT_VOICE_FULL_E2E=1 + all three keys + OPENCLAW_BIN
  • LLM-judge eval pass: run cd recipes/agent-voice && bun run gen:baselines and verify pass on all 4 suites

🤖 Generated with Claude Code

garrytan and others added 7 commits May 17, 2026 14:43
…-repo install paradigm

Ships a new skillpack paradigm: gbrain holds the REFERENCE content;
`gbrain integrations install agent-voice --target <repo>` COPIES it into
the operator's host agent repo where it becomes user-owned and mutable.
Future refresh is diff-and-propose against per-file SHA-256 hashes from
.gbrain-source.json, not blind overwrite.

What ships:
- recipes/agent-voice.md entrypoint + recipes/agent-voice/ bundle
- Two voice personas (Mars dual-mode SOLO/DEMO, Venus executive assistant)
  with PII / private-agent-name / hardcoded-path scrubbed out
- WebRTC-first browser client (call.html) with ?test=1 gated instrumentation
  and Web Audio API tee -> MediaRecorder capture for E2E roundtrip testing
- Read-only tool router (D14-A allow-list: search, query, get_page,
  list_pages, find_experts, get_recent_salience, get_recent_transcripts,
  read_article). Write ops permanently denylisted; opt-in via local override
- Persona-aware prompt builder with identity-first composition + Unicode
  sanitization for OpenAI Realtime API safety
- Upstream-error classifier (HTTP 429/500/503 -> soft-fail, plumbing -> hard)
- Three SKILL.md skills (voice-persona-mars, voice-persona-venus,
  voice-post-call) with routing-eval.jsonl fixtures
- 99 host-side tests (vitest-compatible, runs in bun) covering registry,
  prompt-shape privacy guards, tool allow-list, upstream classifier
- install/manifest.json + refresh-algorithm.md + post-install-hint.md

Privacy infrastructure:
- scripts/check-no-pii-in-agent-voice.sh wired into bun run verify
  Shape regex (phone/email/SSN/JWT/bearer/credit-card) + path patterns +
  $AGENT_VOICE_PII_BLOCKLIST env-driven name blocklist
- scripts/import-from-upstream.sh + scripts/upstream-scrub-table.txt
  Deterministic refresh from upstream voice-agent source. Placeholder-
  driven (envsubst-expanded at run time) so no private names land in
  checked-in files
- recipes/agent-voice/code/lib/personas/private-name-blocklist.json
  Single source of truth for the regex contract (shape categories +
  path patterns + env-var contract for operator-specific names)

src/ surface:
- src/commands/integrations.ts gains `install <recipe-id>` subcommand
  with install_kind: 'local-managed' | 'copy-into-host-repo' discriminator.
  Path-traversal hardening (rejects '..', absolute paths, symlink escapes).
  Refuses target == gbrain itself, missing .git, existing files (without
  --overwrite). Writes .gbrain-source.json with per-file SHA-256. Appends
  resolver rows to host repo's RESOLVER.md or AGENTS.md.
- test/integrations-install.test.ts: 11 cases (happy path, manifest shape,
  no upstream_repo field per D11-A, resolver appending, file modes,
  refusal cases, dry-run)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Conflicts:
#	CHANGELOG.md
#	VERSION
#	package.json
…ts + guard script

The prompt-shape tests carried regex patterns naming the literal banned terms
(Garry/Steph/Garrison/Solomon/Herbert/Wintermute) inline. CLAUDE.md's
"never use Wintermute in any public artifact" applies to test source files
too. Master's check-privacy.sh correctly caught this.

Replaced with env-driven check that reads AGENT_VOICE_PII_BLOCKLIST (the
single source of truth from private-name-blocklist.json). Same enforcement
guarantee via the env var, zero literal names in shipped source.

Also scrubbed the literal /data/.openclaw/ from the guard script's comment
and the literal 'tell_wintermute' from the venus write-tools test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…eline + refresh + multilingual + twilio deprecation)

Closes the "deferred to follow-up" section of the v0.36.0.0 CHANGELOG.

E2E tests + harness (env-gated):
- tests/e2e/voice-roundtrip.test.mjs — spawns server, drives puppeteer + fake-audio, three-tier assertions (CONNECTION hard, NON-SILENT hard, SEMANTIC soft via Whisper + LLM judge). Upstream errors (429/500/503, WS 1011/1013) soft-fail via lib/upstream-classifier.mjs.
- tests/e2e/voice-full-flow.test.mjs — wraps openclaw doing the install, then runs the roundtrip. Friction-discovery flavor, NOT a ship gate.
- tests/e2e/lib/browser-audio.mjs — puppeteer + fake-audio harness; reads window._gbrainTest namespace; PCM RMS-variance helper.
- tests/e2e/lib/whisper-judge.mjs — Whisper transcription + LLM-judge for SEMANTIC tier.
- tests/e2e/audio-fixtures/utterance-{add,joke,brain-query}.wav — 16kHz mono WAV via `say` + ffmpeg, committed for reproducibility.
- test/fixtures/claw-test-scenarios/voice-agent-install/{BRIEF.md, scenario.json, expected.json} — labeled BENCHMARK_FRICTION, blocks_ship=false.

LLM-judge persona evals + synthetic canonical baselines:
- tests/evals/judge.mjs — gateway-routed 3-model (Claude + GPT + Gemini) harness with 4-strategy JSON repair + 2/3-quorum aggregation (per the v0.27.x cross-modal pattern). Pass criterion: every axis mean ≥7 AND no model <5.
- tests/evals/fixtures/{mars-solo,mars-demo,venus,persona-routing,mars-multilingual}.jsonl — 5 fixture sets covering all axes.
- tests/evals/{mars-eval,venus-eval,mars-multilingual-eval,persona-routing-eval}.mjs — per-axis drivers.
- tests/evals/baseline-runs/canonical/*.json — agent-authored synthetic exemplars (PII-impossible by construction; demonstrate expected pass shape; never overwrite with live model output).
- tests/evals/baseline-runs/.gitignore — live receipts excluded.

DIY pipeline (Option B):
- code/pipeline.mjs — streaming STT (Deepgram nova-2) + LLM (Claude Sonnet 4.6 streaming SSE with sentence-boundary TTS dispatch) + TTS (Cartesia primary, OpenAI TTS fallback). 20-turn history cap, exponential-backoff reconnects, 25s keepalives, VAD presets (quiet/normal/noisy/very_noisy), barge-in via STT speechStart → LLM interrupt. Modular adapters for swapping providers.

--refresh mode (D3-A diff-and-propose):
- src/commands/integrations.ts: refreshRecipeIntoHostRepo() + classifyForRefresh() implementing the five states from refresh-algorithm.md (unchanged-identical, unchanged-stale, locally-modified, source-deleted, host-deleted, new-in-manifest). Transaction journal at .gbrain-source.refresh.log. Default policy: preserve operator's local edits (keep-mine); --auto take-theirs to overwrite; --dry-run for preview.
- test/integrations-install.test.ts: 7 new test cases pinning each classification state + default-preserve behavior + take-theirs overwrite + transaction journal + refusal on uninstalled target.

Mars multilingual restore:
- code/lib/personas/mars.mjs: explicit cross-lingual rule (Mandarin, Spanish, French, Japanese, Korean default to English but follow the speaker). Voice (Orus) supports the languages natively.
- tests/unit/mars-prompt-shape.test.mjs: assertion flipped from "MUST NOT claim multilingual" to "declares cross-lingual capability with English bias."
- tests/evals/fixtures/mars-multilingual.jsonl: 5 fixtures across Mandarin/Spanish/Japanese/French + explicit switch-back, pinned by mars-multilingual-eval.mjs.

Twilio recipe deprecation:
- recipes/twilio-voice-brain.md: deprecation banner pointing at agent-voice.md. Frontmatter version bumped to 0.8.2. Will be removed in v0.37.

Verify: bun run verify clean, 6736+ unit tests pass, 18/18 install+refresh tests pass, 96/98 host-side persona/tool/classifier tests pass (2 skipped env-gated).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…this PR

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Captures the wave-1 + wave-2 scope at the v0.37 slot. The bump reflects
the size of what this PR ships: copy-into-host-repo install paradigm
(new install_kind discriminator + new install/refresh subcommand) +
Mars/Venus voice agent reference + 5,500+ LOC of vendored scrubbed
code + 4 LLM-judge eval suites + 2 env-gated E2E test suites + DIY
Option B pipeline + 18-case install subcommand test coverage. A minor
bump felt too small.

Side fix: privacy guard caught two stale literal "wintermute" and
"/data/.openclaw/" references in wave-2 files
(voice-full-flow.test.mjs comment, expected.json blocklist payload).
Both replaced with env-driven references to $AGENT_VOICE_PII_BLOCKLIST
matching the D15-A pattern from the original review.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@garrytan garrytan changed the title v0.36.0.0 feat: agent-voice (Mars + Venus) + copy-into-host-repo skillpack paradigm v0.37.0.0 feat: agent-voice (Mars + Venus) + copy-into-host-repo skillpack paradigm May 17, 2026
garrytan added 6 commits May 17, 2026 15:48
# Conflicts:
#	CHANGELOG.md
#	VERSION
#	package.json
# Conflicts:
#	CHANGELOG.md
#	VERSION
#	package.json
# Conflicts:
#	CHANGELOG.md
#	VERSION
#	package.json
# Conflicts:
#	CHANGELOG.md
#	VERSION
#	package.json
# Conflicts:
#	CHANGELOG.md
#	VERSION
#	package.json
# Conflicts:
#	CHANGELOG.md
#	VERSION
#	package.json
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant