feat(meet_agent): live note-taking agent for Google Meet (listen + speak) #1355
senamakel merged 11 commits into tinyhumansai:main
Conversation
New `openhuman.meet_agent_*` RPC surface so the Tauri shell can stream
inbound PCM from the open Meet window into core, run VAD-segmented
STT → LLM → TTS, and pull synthesized PCM back out. Sits next to
`meet/` (which only validates URL + mints request_id) — that domain
is single-shot and pure-validation; the live agentic loop needs
buffers, VAD, and a transcript log, which would bloat the validation
surface if jammed in.
This commit is scaffolding only:
- types/ops/session/rpc/schemas wired into the controller registry
- brain.rs ships stub STT (length-proportional placeholder), stub
LLM ("I'm listening."), and stub TTS (200ms 440Hz tone) so the
end-to-end audio path is exercisable without an LLM/TTS bill
- 23 unit tests covering VAD hangover, session registry lifecycle,
RPC round-trip including a turn fired by simulated VAD silence
PR3 swaps the stubs for `voice::cloud_transcribe` (STT, ElevenLabs
under the hood) and `voice::reply_speech` (TTS), and routes LLM
through the existing `agent` runtime as a "meet" channel.
Refs the multi-slice plan for the meet-agent listen+speak loop.
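For reference, a minimal sketch of the VAD-hangover segmentation the unit tests exercise is below. Everything in it (the `VadSegmenter` name, the mean-amplitude energy measure, the default of 6 hangover frames) is an illustrative assumption, not the shipped brain/session code:

```rust
/// Minimal VAD segmenter sketch: an utterance stays open while frames are "loud"
/// and closes only after `hangover_frames` consecutive silent frames.
struct VadSegmenter {
    in_utterance: bool,
    silent_frames: u32,
    hangover_frames: u32,  // e.g. ~6 silent frames closes the turn
    energy_threshold: f32, // mean absolute sample amplitude treated as "speech"
}

impl VadSegmenter {
    fn new(hangover_frames: u32, energy_threshold: f32) -> Self {
        Self { in_utterance: false, silent_frames: 0, hangover_frames, energy_threshold }
    }

    /// Feed one PCM16 frame; returns true when the utterance just closed,
    /// i.e. the caller should fire an STT -> LLM -> TTS turn.
    fn push_frame(&mut self, frame: &[i16]) -> bool {
        let energy = frame.iter().map(|s| (*s as f32).abs()).sum::<f32>()
            / frame.len().max(1) as f32;
        let loud = energy >= self.energy_threshold;

        if loud {
            self.in_utterance = true;
            self.silent_frames = 0;
            return false;
        }
        if self.in_utterance {
            self.silent_frames += 1;
            if self.silent_frames >= self.hangover_frames {
                self.in_utterance = false;
                self.silent_frames = 0;
                return true; // hangover elapsed: close the utterance
            }
        }
        false
    }
}
```

The real frame duration and energy threshold are not specified in this PR; only the pattern matters here (close the turn after N consecutive silent frames rather than on the first one).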
Pulls in the new `crate::audio` module that exposes per-browser CEF audio-handler callbacks via a URL-prefix registry. Required by the upcoming `meet_audio` shell module to capture the embedded Meet window's audio output without OS-level taps. Submodule branch: feat/openhuman-audio-handler @ 3c321beac (needs to be pushed to https://github.com/tinyhumansai/tauri-cef before this PR can merge).
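As a rough illustration of what a URL-prefix keyed audio-handler registry can look like (the types and names below are hypothetical; the vendored `crate::audio` API is not reproduced here):

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// Hypothetical callback type: raw PCM frames delivered from a browser's audio output.
type AudioCallback = Arc<dyn Fn(&[f32]) + Send + Sync>;

/// Sketch of a URL-prefix keyed registry: the CEF audio handler looks up the
/// longest matching prefix for the browser's current URL and, if a handler is
/// registered, forwards captured frames to that callback.
#[derive(Default, Clone)]
struct AudioHandlerRegistry {
    inner: Arc<Mutex<HashMap<String, AudioCallback>>>,
}

impl AudioHandlerRegistry {
    fn register(&self, url_prefix: &str, cb: AudioCallback) {
        self.inner.lock().unwrap().insert(url_prefix.to_string(), cb);
    }

    fn unregister(&self, url_prefix: &str) {
        self.inner.lock().unwrap().remove(url_prefix);
    }

    /// Longest-prefix match so e.g. "https://meet.google.com/" wins over "https://".
    fn lookup(&self, url: &str) -> Option<AudioCallback> {
        let map = self.inner.lock().unwrap();
        map.iter()
            .filter(|(prefix, _)| url.starts_with(prefix.as_str()))
            .max_by_key(|(prefix, _)| prefix.len())
            .map(|(_, cb)| cb.clone())
    }
}
```

Keying by URL prefix is what lets non-Meet windows stay untouched: a window whose URL matches no registered prefix simply gets no audio handler.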
New shell module taps the embedded Meet webview's audio output via the runtime's `audio::register_audio_handler` URL-prefix registry, runs an inline float32-planar → 16 kHz mono PCM16LE resampler, and pushes ~100 ms chunks to `openhuman.meet_agent_push_listen_pcm` over JSON-RPC. Speak path is scaffolded as a poll-and-discard loop today — the real sink lands with the Chromium fake-audio `pipe://` patch in the next slice.

Lifecycle is wired to `meet_call_open_window`:
- After window build, `meet_audio::start` opens a core session, registers the audio handler keyed by the call's normalised URL, and launches the speak pump
- On window destroy, `meet_audio::stop` releases the registration (silencing capture immediately), shuts the pump down, and asks core for the closeout summary (listened/spoken seconds, turn count)

The resampler is a stateful linear interpolator with phase + last-sample carry across buffer boundaries (no audible tick at every CEF buffer flush). A bounded mpsc channel (32 chunks) provides backpressure from the CEF audio thread to the async forwarder — it drops the oldest chunk when full rather than blocking the renderer.

Tests cover passthrough, 48k→16k decimation, stereo-to-mono averaging, clamping, and the zero-rate guard. The CEF callback path is exercised end-to-end manually in the smoke test (slice 7). No new system permissions: audio is read straight from the renderer via CEF, never the OS mic / speakers.
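A sketch of the phase-carrying linear resampler idea described above, assuming mono f32 input and PCM16 output. The actual `meet_audio` resampler (planar input, stereo averaging, chunk sizing) is not reproduced here:

```rust
/// Sketch of a stateful linear-interpolation resampler (e.g. 48 kHz -> 16 kHz).
/// Phase and the last input sample are carried across calls, so the seams
/// between successive CEF buffers do not produce discontinuities.
struct LinearResampler {
    step: f64,        // input samples advanced per output sample, e.g. 48000/16000 = 3.0
    phase: f64,       // fractional read position; 0.0 is the carried last sample
    last_sample: f32, // final input sample of the previous chunk
}

impl LinearResampler {
    fn new(in_rate: u32, out_rate: u32) -> Self {
        assert!(in_rate > 0 && out_rate > 0, "zero-rate guard");
        Self { step: in_rate as f64 / out_rate as f64, phase: 0.0, last_sample: 0.0 }
    }

    /// Feed one mono f32 chunk, get PCM16 output samples.
    fn feed(&mut self, input: &[f32]) -> Vec<i16> {
        // Position k in this call's coordinates means: k == 0 -> last_sample,
        // k >= 1 -> input[k - 1].
        let mut out = Vec::new();
        while self.phase < input.len() as f64 {
            let idx = self.phase as usize; // floor, since phase is non-negative
            let frac = (self.phase - idx as f64) as f32;
            let a = if idx == 0 { self.last_sample } else { input[idx - 1] };
            let b = input[idx];
            let sample = (a + (b - a) * frac).clamp(-1.0, 1.0);
            out.push((sample * i16::MAX as f32) as i16);
            self.phase += self.step;
        }
        // Re-anchor the coordinate system on this chunk's final sample.
        self.phase -= input.len() as f64;
        if let Some(&last) = input.last() {
            self.last_sample = last;
        }
        out
    }
}
```

The carry is the whole point: because `phase` and `last_sample` survive across `feed` calls, a 48 kHz stream chopped into arbitrary CEF buffer sizes resamples to the same 16 kHz output as if it had arrived in one piece.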
Adds:
- `json_rpc_meet_agent_session_lifecycle` E2E test that walks the full start_session → push (loud + silent) → poll_speech → stop_session flow over a real local JSON-RPC server. Pins behavior the shell relies on: VAD doesn't fire while still hearing speech; closes the utterance after ~6 silent frames; the brain stub enqueues outbound PCM fast enough for a 1 s polling budget; stop_session returns sane listened_seconds + turn_count counters; stopping a non-existent session is a JSON-RPC error rather than a silent no-op.
- `meet_agent.live_loop` capability catalog entry covering listen + speak with the right privacy facets (Derived, leaves device, Google Meet + ElevenLabs destinations).

Stays network-free: STT/TTS are stubs in PR1, so the test exercises the full RPC plumbing without any backend / model calls.
Original plan was a from-source Chromium patch to make `--use-file-for-fake-audio-capture` accept a `pipe://` URL backed by Rust. Discovered we don't maintain a CEF source build pipeline — `cef-dll-sys` downloads pre-built binaries from a release URL. Forking chromiumembedded/cef and wiring up build infra is its own multi-day project, not a slice in this PR. Pivoted to:
- `audio_bridge.js`: tiny Web Audio bridge that builds a 16 kHz MediaStreamAudioDestinationNode, monkey-patches `navigator.mediaDevices.getUserMedia` to serve audio requests from that destination (delegating video to the original so Chromium's fake-camera Y4M still renders the mascot), and exposes `window.__openhumanFeedPcm(b64)` for the shell to push PCM into.
- `inject.rs`: attaches CDP to the Meet target, sends `Page.addScriptToEvaluateOnNewDocument` + `Page.reload` so the bridge applies before Meet's first `getUserMedia` call. Probes `__openhumanAudioBridgeInfo()` to confirm liveness before handing off to the pump.
- `speak_pump.rs`: rewritten to feed each poll_speech chunk into the bridge via `Runtime.evaluate` on a single long-lived CDP session. Bails after 30 consecutive failures (page navigated away).
- `mod.rs`: install_audio_bridge runs in start(); on failure, the pump is replaced with a no-op so the session still tracks listen counters cleanly.

JS-injection note: CLAUDE.md prohibits new JS injection in embedded provider webviews (the `acct_*` family). The Meet call window is a distinct top-level surface for a single audio-bridging purpose, and the public CefAudioHandler API only covers listen — speak has no comparable public hook short of a Chromium rebuild. The user explicitly authorized this injection for the speak path; the no-JS rule for `acct_*` webviews is unchanged.

No system permissions, no admin install, mascot webcam preserved.
Swaps the brain stubs for real adapters:
- STT — wraps the drained PCM16LE buffer in a minimal RIFF/WAVE container (new `wav.rs` module + 3 tests) and posts via `voice::cloud_transcribe` (backend Whisper).
- LLM — direct chat-completions call through BackendOAuthClient with a "live meeting agent" system prompt and the transcript as the user message. `max_tokens: 120` keeps replies conversational; `temperature: 0.4` keeps them on-topic. The system prompt authorises the model to return an empty string when the latest utterance doesn't need a response.
- TTS — `voice::reply_speech` with `output_format = "pcm_16000"` so ElevenLabs (via the hosted backend) returns bytes the shell-side audio bridge can play directly with no transcoding.

Each stage falls back to a deterministic stub when the backend session is missing — this keeps the existing unit tests + the JSON-RPC E2E network-free, and means a smoke run without sign-in still produces audible output instead of silently breaking. Real transport / 5xx failures are recorded as Note events on the session transcript so they're visible in the live captions overlay rather than silently papered over.

Tests: extends meet_agent::brain unit tests + meet_agent::wav.
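For context, wrapping raw PCM16LE in a minimal RIFF/WAVE container looks roughly like this. The function name is illustrative and not necessarily the `wav.rs` API:

```rust
/// Minimal RIFF/WAVE wrapper for mono 16-bit little-endian PCM:
/// a canonical 44-byte header followed by the raw sample bytes.
fn wrap_pcm16_in_wav(pcm: &[u8], sample_rate_hz: u32) -> Vec<u8> {
    let num_channels: u16 = 1;
    let bits_per_sample: u16 = 16;
    let block_align: u16 = num_channels * bits_per_sample / 8; // 2 bytes per frame
    let byte_rate: u32 = sample_rate_hz * block_align as u32;  // bytes per second
    let data_len = pcm.len() as u32;

    let mut wav = Vec::with_capacity(44 + pcm.len());
    wav.extend_from_slice(b"RIFF");
    wav.extend_from_slice(&(36 + data_len).to_le_bytes()); // remaining header + data
    wav.extend_from_slice(b"WAVE");

    wav.extend_from_slice(b"fmt ");
    wav.extend_from_slice(&16u32.to_le_bytes()); // fmt chunk size for plain PCM
    wav.extend_from_slice(&1u16.to_le_bytes());  // audio format 1 = PCM
    wav.extend_from_slice(&num_channels.to_le_bytes());
    wav.extend_from_slice(&sample_rate_hz.to_le_bytes());
    wav.extend_from_slice(&byte_rate.to_le_bytes());
    wav.extend_from_slice(&block_align.to_le_bytes());
    wav.extend_from_slice(&bits_per_sample.to_le_bytes());

    wav.extend_from_slice(b"data");
    wav.extend_from_slice(&data_len.to_le_bytes());
    wav.extend_from_slice(pcm);
    wav
}
```

A container this small is enough for Whisper-style endpoints that only need the sample rate, channel count, and bit depth to interpret the payload.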
Two-laptop runbook covering the full live agent loop on a real Meet call: how to verify the listen path (CEF audio handler → STT transcripts in logs), the speak path (Web Audio bridge alive, agent's voice on the host's speaker, mascot webcam preserved), and the absence of any system permission prompt or driver install. Includes a small failure-mode table mapping symptoms to fixes for the most likely first-time issues.
Replaces CEF audio handler / Whisper STT with a DOM observer over Meet's built-in live captions. CEF's cef_audio_handler_t is queried lazily (only when audio output starts), so a solo agent in a lobby or any pre-admit window never engages the pipeline. Captions handle that case for free — Meet's STT is already running, speaker-attributed, and pre-segmented.

Page side (captions_bridge.js):
- Auto-clicks "Turn on captions" (up to ~30 attempts over a minute)
- MutationObserver + 250ms safety poll over the captions region (selected by aria-label="Captions" — stable across class-name churn)
- Per-speaker dedupe so growing captions don't queue duplicates
- Drain API: window.__openhumanDrainCaptions(), info introspection via __openhumanCaptionsBridgeInfo()

Shell side (caption_listener.rs):
- Polls drain on a fresh CDP attach (separate from the speak pump's attach so they run concurrently without serialising)
- Forwards each line to the openhuman.meet_agent_push_caption RPC
- Exits after 30 consecutive errors (page navigated away)

Core side (meet_agent):
- New types::PushCaptionRequest + RPC schema
- session::note_caption: wake-word state machine (case-insensitive match on "hey openhuman" / "hey open human" — tolerates Meet STT splitting the brand). Any text after the wake phrase + subsequent captions until the brain takes the prompt becomes the LLM input.
- brain::run_caption_turn: short delay (1.5s) so multi-fragment utterances assemble, then drain prompt → LLM → TTS → enqueue outbound. Skips STT entirely — captions are already text.

Listen path now works pre-admit and without other participants speaking. Speak path unchanged — same Web Audio bridge. Old CEF-audio path (listen_capture.rs) kept in tree as the inactive _legacy_listen field on MeetAudioSession, so re-enabling it later is a single wire change.
Live-call testing surfaced three regressions in the caption-driven
loop. Each is fixed here:
1. Wake word re-fires while the same utterance is still on screen.
Meet keeps a finalised caption visible for ~5–8s after speaking
ends. Our per-text dedupe in captions_bridge.js suppresses
identical pushes but a single character growth re-queues the
line — and once the brain drains the prompt and clears
wake_active, that next push hits the wake-word match again.
Result: 5–10 cascading turns per single dictation, prompt
buffers ballooning past 9k chars, runaway TTS rate-limit cascade.
Fix: 8s cooldown after take_pending_prompt, keyed off the page-side
ts_ms (same clock as future caption pushes). During cooldown,
captions still record to the transcript log but skip wake-word
matching. Both the wake_active gate and the cooldown gate must
clear before a new utterance can fire again.
2. Punctuation breaks the wake match. Meet's STT inserts a comma
between greeting and brand ("hey, openhuman"), so the literal
substring search misses. Normalize: lowercase + non-alphanumeric
to space + collapse whitespace, then substring against
"hey openhuman" / "hey open human". Also handles "Hey OpenHuman.",
"hey open-human", multi-space variants, etc.
3. Reasoning model leaks chain-of-thought into TTS. agentic-v1
emits its internal monologue without <think> delimiters, so
stripping doesn't help — and the resulting 250+ char replies
were both unintelligible as a verbal ack and the actual visible
"thinking" text the user saw narrated through Meet.
Fix: skip the LLM in the hot path entirely. The note itself is
already stored verbatim as a Heard event on the session
transcript (the user's "remember to email Bob" lives there
for post-meeting actioning). The verbal ack only needs to
confirm capture, so we hardcode a small rotation
["Got it.", "Noted.", "Adding that.", "On it.", "Captured."]
selected by hashing the prompt bytes — short, deterministic,
no model latency, no rate-limit pressure, no CoT leak.
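A compact sketch of how the three fixes can fit together (all names, the gate structure, and the FNV-1a hash are illustrative assumptions; this is not the shipped session.rs/brain.rs code):

```rust
const WAKE_COOLDOWN_MS: u64 = 8_000;
const ACKS: [&str; 5] = ["Got it.", "Noted.", "Adding that.", "On it.", "Captured."];

/// Fix 2: lowercase, map non-alphanumerics to spaces, collapse whitespace,
/// then substring-match both spellings of the brand.
fn contains_wake_phrase(caption: &str) -> bool {
    let mapped: String = caption
        .to_lowercase()
        .chars()
        .map(|c| if c.is_alphanumeric() { c } else { ' ' })
        .collect();
    let normalized = mapped.split_whitespace().collect::<Vec<_>>().join(" ");
    normalized.contains("hey openhuman") || normalized.contains("hey open human")
}

/// Fix 1: a caption may only open a new turn when no prompt is being assembled
/// AND the 8s cooldown (keyed off the page-side ts_ms clock) has elapsed.
struct WakeGate {
    wake_active: bool,      // a prompt is currently being assembled
    cooldown_until_ms: u64, // page-side timestamp before which wake matching is skipped
}

impl WakeGate {
    fn may_fire(&self, caption_ts_ms: u64) -> bool {
        !self.wake_active && caption_ts_ms >= self.cooldown_until_ms
    }

    /// Called when the brain drains the pending prompt (take_pending_prompt).
    fn on_prompt_taken(&mut self, drained_at_ts_ms: u64) {
        self.wake_active = false;
        self.cooldown_until_ms = drained_at_ts_ms + WAKE_COOLDOWN_MS;
    }
}

/// Fix 3: deterministic verbal ack chosen by hashing the prompt bytes
/// (FNV-1a here purely for illustration): no model call in the hot path.
fn pick_ack(prompt: &str) -> &'static str {
    let hash = prompt
        .bytes()
        .fold(14695981039346656037u64, |h, b| (h ^ b as u64).wrapping_mul(1099511628211));
    ACKS[(hash % ACKS.len() as u64) as usize]
}
```

For example, "Hey, OpenHuman remember to email Bob" normalizes to "hey openhuman remember to email bob", only opens a turn when both gates in may_fire pass, and gets one of the five canned acks with no model latency.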
Tests:
- note_caption_handles_punctuated_wake — "Hey, OpenHuman ..."
- note_caption_handles_split_brand — "hey open-human ..."
- note_caption_does_not_double_fire_on_growing_caption
Existing meet_agent tests still pass (28→31 total).
Future work: post-meeting summarisation runs an LLM offline against
the full transcript log to surface the captured action-item list.
That path can take its time and use whichever model behaves best
for instruction-following without the latency / CoT constraints
the in-call ack has.
📝 Walkthrough

This PR implements a complete live agent audio loop for Google Meet. The system captures webview audio via CEF, processes it (STT→LLM→TTS), and injects synthesized speech back via a patched Web Audio API, while also listening to captions for wake-word triggers. The implementation spans Tauri shell-side audio pipelines and backend RPC-driven brain logic.

Changes

Meet Agent Live Audio Loop
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
🚥 Pre-merge checks: ✅ 5 passed
Actionable comments posted: 7
🧹 Nitpick comments (7)
app/src-tauri/vendor/tauri-cef (1)
1-1: ⚖️ Poor tradeoff: Document the custom fork rationale and upstream plan.

This submodule points to a custom fork (tinyhumansai/tauri-cef) rather than the upstream tauri-cef repository. Using forks of low-level dependencies like CEF (Chromium Embedded Framework) introduces supply chain risk, as the fork may miss upstream security patches or introduce undisclosed changes. Please ensure:
- The rationale for forking (audio handler functionality) is documented in the repository (e.g., README, ADR, or inline comments in relevant integration code)
- There's a plan to either upstream these changes or periodically sync security patches from upstream
- The specific changes in the fork vs. upstream are tracked and reviewable
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@app/src-tauri/vendor/tauri-cef` at line 1, Document that the project uses a custom fork named "tinyhumansai/tauri-cef" (the tauri-cef submodule) because of added audio handler functionality; add a short rationale and summary of the fork (what was changed, why) into the repo README or an ADR, add an UPSTREAM_SYNC or MAINTENANCE plan that states whether changes will be upstreamed or how/when security patches will be pulled from upstream, and ensure the exact diffs between upstream tauri-cef and the fork are tracked (e.g., a CHANGELOG or a git-diff snapshot and a pointer to the submodule commit) so reviewers can inspect the audio handler changes and follow the syncing/upstreaming plan.tests/json_rpc_e2e.rs (1)
3931-4094: 🏗️ Heavy lift: Add one E2E for the `push_caption` / wake-word flow as well.

This test only covers the legacy PCM/VAD lifecycle. The shipped Meet listen path in this PR is caption-driven, so a regression in wake-word assembly/cooldown could still pass CI. A focused `push_caption` → `poll_speech` JSON-RPC test would lock down the path the shell actually uses.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/json_rpc_e2e.rs` around lines 3931 - 4094, Add a new tokio::test that mirrors json_rpc_meet_agent_session_lifecycle but exercises the caption/wake-word flow: call openhuman.meet_agent_start_session (same setup), then send a wake-word caption via openhuman.meet_agent_push_caption (use the same request_id pattern), poll openhuman.meet_agent_poll_speech until non-empty pcm_base64 is returned, then call openhuman.meet_agent_stop_session and assert listened_seconds > 0 and turn_count == 1; ensure you also test that stopping a non-existent session errors (reuse the bogus stop check). Locate the new test near json_rpc_meet_agent_session_lifecycle and reuse helpers post_json_rpc, assert_no_jsonrpc_error, assert_jsonrpc_error and the same B64/EnvVar setup so it runs under the same ephemeral RPC server harness.app/src-tauri/src/meet_audio/speak_pump.rs (1)
57-57: 💤 Low value: Redundant `mut` rebinding.

`cdp` is already owned and mutable; the rebinding adds noise.

Suggested fix

-    let mut cdp = cdp;
     let mut feed_errors: u32 = 0;
+    let mut cdp = cdp;

Actually, since `cdp` is passed by value and used mutably, you can just remove the rebinding entirely and use `cdp` directly (it's already mutable by ownership):

-    let mut cdp = cdp;

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@app/src-tauri/src/meet_audio/speak_pump.rs` at line 57, Remove the redundant rebinding "let mut cdp = cdp;" in speak_pump.rs and use the existing owned mutable variable cdp directly; locate the usage in the function or block where cdp is passed by value (search for the symbol cdp and the rebinding) and delete that line so subsequent mutable operations reference the original cdp binding.app/src-tauri/src/meet_audio/mod.rs (1)
246-249: ⚡ Quick win: Creating a new HTTP client per RPC call is inefficient.

`reqwest::Client` maintains a connection pool and is designed to be reused. Creating a new client for every `rpc_call` bypasses connection reuse and adds overhead. Consider using a `OnceCell<reqwest::Client>` or passing a shared client.

Suggested fix

+use std::sync::OnceLock;
+
+static RPC_CLIENT: OnceLock<reqwest::Client> = OnceLock::new();
+
+fn get_rpc_client() -> Result<&'static reqwest::Client, String> {
+    Ok(RPC_CLIENT.get_or_init(|| {
+        reqwest::Client::builder()
+            .timeout(std::time::Duration::from_secs(10))
+            .build()
+            .expect("failed to build HTTP client")
+    }))
+}
+
 pub(crate) async fn rpc_call(
     method: &str,
     params: serde_json::Value,
 ) -> Result<serde_json::Value, String> {
     // ...
     let url = crate::core_rpc::core_rpc_url_value();
-    let client = reqwest::Client::builder()
-        .timeout(std::time::Duration::from_secs(10))
-        .build()
-        .map_err(|e| format!("http client: {e}"))?;
+    let client = get_rpc_client()?;
     let req = crate::core_rpc::apply_auth(client.post(&url))
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@app/src-tauri/src/meet_audio/mod.rs` around lines 246 - 249, The code currently builds a new reqwest::Client inside the rpc call (the block that does Client::builder().timeout(...).build()), which prevents connection reuse; change this to use a shared client instead—either introduce a static OnceCell<reqwest::Client> (e.g. a global CLIENT initialized once and .get_or_init(...) used where the builder is now) or modify the caller signature to accept &reqwest::Client and remove the per-call build; update all references that call the rpc routine to use the shared client and remove the map_err(build) path so errors only occur on initial client creation.src/openhuman/meet_agent/rpc.rs (1)
64-70: ⚖️ Poor tradeoff: Spawned brain turn panics are silently swallowed.

If `brain::run_turn` panics, the `tokio::spawn` will abort that task but the error won't surface anywhere. Consider using `spawn` with a catch-unwind wrapper or at minimum logging task completion/failure.

Suggested improvement

 let request_id = req.request_id.clone();
 tokio::spawn(async move {
-    if let Err(err) = brain::run_turn(&request_id).await {
-        log::warn!("{LOG_PREFIX} brain turn failed request_id={request_id} err={err}");
-    }
+    match std::panic::AssertUnwindSafe(brain::run_turn(&request_id))
+        .catch_unwind()
+        .await
+    {
+        Ok(Ok(())) => {}
+        Ok(Err(err)) => {
+            log::warn!("{LOG_PREFIX} brain turn failed request_id={request_id} err={err}");
+        }
+        Err(_) => {
+            log::error!("{LOG_PREFIX} brain turn panicked request_id={request_id}");
+        }
+    }
 });
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/openhuman/meet_agent/rpc.rs` around lines 64 - 70, The spawned task using tokio::spawn for brain::run_turn(&request_id) can panic and the panic will be silently aborted; wrap the run_turn call in a panic-safe wrapper (use std::panic::AssertUnwindSafe + tokio::spawn(async move { let result = tokio::spawn or futures::FutureExt::catch_unwind on the async block }).await or futures::FutureExt::catch_unwind) and log both panic and Err cases so failures surface: call brain::run_turn(&request_id).await inside a catch_unwind, then log panics with the request_id and log the Err variant as you already do, ensuring the tokio::spawn body handles and records both panic and normal error outcomes.app/src-tauri/src/meet_audio/captions_bridge.js (2)
150-161: ⚡ Quick win: Redundant duplicate condition.

Line 153 checks the same condition twice: `lbl.indexOf("turn on captions") === 0` and `/^turn on captions/.test(lbl)` are equivalent. The regex test is unnecessary.

Suggested fix

- if (lbl.indexOf("turn on captions") === 0 || /^turn on captions/.test(lbl)) {
+ if (lbl.indexOf("turn on captions") === 0) {
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@app/src-tauri/src/meet_audio/captions_bridge.js` around lines 150 - 161, The loop in captions_bridge.js has a redundant condition checking the same thing twice; replace the combined test (lbl.indexOf("turn on captions") === 0 || /^turn on captions/.test(lbl)) with a single clear check such as lbl.startsWith("turn on captions") (or keep lbl.indexOf(...) === 0) to remove the duplicate regex test, leaving the click/enableAttempts/return behavior unchanged for the function that iterates over buttons and uses enableAttempts/ENABLE_ATTEMPT_BUDGET.
40-45: 💤 Low value: Unbounded speaker tracking map may grow over long calls.

`lastBySpeaker` accumulates entries for every distinct speaker name encountered and is never pruned. In a long meeting with many participants or speaker-name churn, this could grow indefinitely. Consider periodically pruning entries older than a few seconds, or using a bounded LRU-style map.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@app/src-tauri/src/meet_audio/captions_bridge.js` around lines 40 - 45, lastBySpeaker currently grows without bounds; update the captions handling to track a timestamp per speaker in lastBySpeaker and prune stale entries (e.g., remove keys older than N seconds) or replace lastBySpeaker with a small bounded LRU map implementation so the map never exceeds a max size. Specifically, modify the code paths that update lastBySpeaker (the caption enqueue/emit logic) to record Date.now() alongside the fingerprint and run a lightweight prune step (or LRU eviction) before inserting new speakers so old/rare speakers are removed automatically.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@app/src-tauri/src/meet_audio/audio_bridge.js`:
- Around line 33-171: Add stable "[openhuman-audio-bridge]" console logs at key
entry/exit and branch points: log on initial install when setting
window.__openhumanAudioBridgeInstalled, inside ensureContext() indicating
creating vs reusing AudioContext and its resulting state, in
navigator.mediaDevices.getUserMedia override to log whether an audio-only
request was intercepted vs audio+video (include constraints), and after splicing
tracks when combining streams; also log in window.__openhumanFeedPcm when a feed
is received (size/duration) and keep the existing catch log but include the same
tag; finally log when the legacy navigator.getUserMedia alias is patched and
when navigator.mediaDevices is absent so installers can see why interception
didn’t occur.
- Around line 137-155: The current code returns the shared dest.stream
(ourStream) which is reused across getUserMedia calls and can be permanently
stopped; instead, create a fresh MediaStream for each request by cloning the
destination tracks via track.clone() and adding those clones to a new
MediaStream; when constraints.video is false return
Promise.resolve(newMediaStreamWithClonedAudio), and when combining with
origGum({ video: ... }) clone the destination audio tracks and add the clones to
the realStream (use realStream.addTrack(clone)) rather than moving/returning the
singleton dest.stream or its original tracks.
In `@app/src-tauri/src/meet_audio/caption_listener.rs`:
- Around line 109-122: The loop in caption_listener.rs currently swallows
failures from super::rpc_call("openhuman.meet_agent_push_caption") by logging
and returning Ok(()), which prevents MAX_CONSECUTIVE_ERRORS from ever
incrementing; instead propagate the failure to the caller: replace the
debug-only handling in the for loop (the block around super::rpc_call / res) so
that a failed rpc_call returns Err(...) from the enclosing function (or use the
? operator) with a clear context message referencing request_id and the rpc
error; ensure the function signature supports returning that error type so the
outer task can back off/terminate.
In `@app/src-tauri/src/meet_audio/listen_capture.rs`:
- Around line 117-129: The current code splits each incoming pcm_bytes but does
not accumulate undersized CEF packets nor evict oldest entries when the channel
is full; change the logic around resampler.lock()/feed_and_drain, FLUSH_SAMPLES
and tx.try_send to use a persistent pending buffer (e.g., a Vec<u8> or VecDeque
stored outside the per-callback scope) that appends successive pcm_bytes until
its length >= FLUSH_SAMPLES * 2, then emits fixed-size chunks; when attempting
to forward a chunk with tx.try_send (the code currently at the try_send/ log
block referencing request_id) and it returns Err (channel full), implement an
overwrite- oldest policy by dropping from the pending buffer (pop_front) or
otherwise evicting the oldest buffered chunk and retrying send so the newest
audio is pushed, and apply the same persistent-buffer+evict-oldest change to the
other similar block around lines 219-266.
In `@docs/MEET_AGENT_SMOKE.md`:
- Around line 30-56: The runbook currently validates the old pre-wake-word/STT
flow and will mislead testers: update the examples and checks to reflect the new
caption-driven, wake-word gated path by (1) changing Step 4 sample prompts to
include the wake phrase "hey openhuman" (or variations) and (2) updating the
"Listen path" checks to avoid expecting the legacy
push_listen_pcm/handle_push_listen_pcm and STT logs and instead mention caption
events and wake-word gating logs (referencing the existing log lines like
`[meet-agent] turn done` and any caption-related log entries); ensure the doc
explicitly states that absence of `cef stream start` or `forward channel push`
may be expected unless wake-word is spoken and that testers should look for
caption/wake-word related log entries instead.
In `@src/openhuman/meet_agent/brain.rs`:
- Around line 43-45: The code currently hard-codes SAMPLE_RATE_HZ = 16_000 (and
uses MIN_TURN_SAMPLES derived from it) but the public API accepts variable
sample_rate_hz; update the module to either enforce 16 kHz at the session
boundary or use the session's actual sample rate throughout: in start_session
(validate sample_rate_hz and return an error if it is not 16_000) OR change all
uses of SAMPLE_RATE_HZ and MIN_TURN_SAMPLES (and any WAV packing, duration
calculation, and turn-floor sizing logic referenced in functions that produce
WAVs and compute timings around lines ~246-260 and ~356-359) to accept and use a
per-session sample_rate_hz from the session struct/params (pass it into helpers
and recompute derived sample counts accordingly). Ensure all places that write
WAV headers, compute durations, or derive sample counts use the session-level
sample_rate_hz instead of the constant.
In `@src/openhuman/meet_agent/schemas.rs`:
- Around line 298-312: Rename the five delegator functions currently named
wrap_start_session, wrap_push_listen_pcm, wrap_push_caption, wrap_poll_speech,
and wrap_stop_session to the repo-standard names handle_start_session,
handle_push_listen_pcm, handle_push_caption, handle_poll_speech, and
handle_stop_session respectively; keep the signature (fn NAME(p: Map<String,
Value>) -> ControllerFuture) and body delegating to super::rpc::handle_* (e.g.,
Box::pin(async move { super::rpc::handle_start_session(p).await })) unchanged,
and update any registry entries (e.g., all_registered_controllers or schemas
lists) that referenced the old wrap_* symbols to the new handle_* symbols so the
domain follows the schema-module contract.
---
Nitpick comments:
In `@app/src-tauri/src/meet_audio/captions_bridge.js`:
- Around line 150-161: The loop in captions_bridge.js has a redundant condition
checking the same thing twice; replace the combined test (lbl.indexOf("turn on
captions") === 0 || /^turn on captions/.test(lbl)) with a single clear check
such as lbl.startsWith("turn on captions") (or keep lbl.indexOf(...) === 0) to
remove the duplicate regex test, leaving the click/enableAttempts/return
behavior unchanged for the function that iterates over buttons and uses
enableAttempts/ENABLE_ATTEMPT_BUDGET.
- Around line 40-45: lastBySpeaker currently grows without bounds; update the
captions handling to track a timestamp per speaker in lastBySpeaker and prune
stale entries (e.g., remove keys older than N seconds) or replace lastBySpeaker
with a small bounded LRU map implementation so the map never exceeds a max size.
Specifically, modify the code paths that update lastBySpeaker (the caption
enqueue/emit logic) to record Date.now() alongside the fingerprint and run a
lightweight prune step (or LRU eviction) before inserting new speakers so
old/rare speakers are removed automatically.
In `@app/src-tauri/src/meet_audio/mod.rs`:
- Around line 246-249: The code currently builds a new reqwest::Client inside
the rpc call (the block that does Client::builder().timeout(...).build()), which
prevents connection reuse; change this to use a shared client instead—either
introduce a static OnceCell<reqwest::Client> (e.g. a global CLIENT initialized
once and .get_or_init(...) used where the builder is now) or modify the caller
signature to accept &reqwest::Client and remove the per-call build; update all
references that call the rpc routine to use the shared client and remove the
map_err(build) path so errors only occur on initial client creation.
In `@app/src-tauri/src/meet_audio/speak_pump.rs`:
- Line 57: Remove the redundant rebinding "let mut cdp = cdp;" in speak_pump.rs
and use the existing owned mutable variable cdp directly; locate the usage in
the function or block where cdp is passed by value (search for the symbol cdp
and the rebinding) and delete that line so subsequent mutable operations
reference the original cdp binding.
In `@app/src-tauri/vendor/tauri-cef`:
- Line 1: Document that the project uses a custom fork named
"tinyhumansai/tauri-cef" (the tauri-cef submodule) because of added audio
handler functionality; add a short rationale and summary of the fork (what was
changed, why) into the repo README or an ADR, add an UPSTREAM_SYNC or
MAINTENANCE plan that states whether changes will be upstreamed or how/when
security patches will be pulled from upstream, and ensure the exact diffs
between upstream tauri-cef and the fork are tracked (e.g., a CHANGELOG or a
git-diff snapshot and a pointer to the submodule commit) so reviewers can
inspect the audio handler changes and follow the syncing/upstreaming plan.
In `@src/openhuman/meet_agent/rpc.rs`:
- Around line 64-70: The spawned task using tokio::spawn for
brain::run_turn(&request_id) can panic and the panic will be silently aborted;
wrap the run_turn call in a panic-safe wrapper (use std::panic::AssertUnwindSafe
+ tokio::spawn(async move { let result = tokio::spawn or
futures::FutureExt::catch_unwind on the async block }).await or
futures::FutureExt::catch_unwind) and log both panic and Err cases so failures
surface: call brain::run_turn(&request_id).await inside a catch_unwind, then log
panics with the request_id and log the Err variant as you already do, ensuring
the tokio::spawn body handles and records both panic and normal error outcomes.
In `@tests/json_rpc_e2e.rs`:
- Around line 3931-4094: Add a new tokio::test that mirrors
json_rpc_meet_agent_session_lifecycle but exercises the caption/wake-word flow:
call openhuman.meet_agent_start_session (same setup), then send a wake-word
caption via openhuman.meet_agent_push_caption (use the same request_id pattern),
poll openhuman.meet_agent_poll_speech until non-empty pcm_base64 is returned,
then call openhuman.meet_agent_stop_session and assert listened_seconds > 0 and
turn_count == 1; ensure you also test that stopping a non-existent session
errors (reuse the bogus stop check). Locate the new test near
json_rpc_meet_agent_session_lifecycle and reuse helpers post_json_rpc,
assert_no_jsonrpc_error, assert_jsonrpc_error and the same B64/EnvVar setup so
it runs under the same ephemeral RPC server harness.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 9eeb6c72-1f31-431b-885d-fc92db1f6324
📒 Files selected for processing (24)
- app/src-tauri/src/lib.rs
- app/src-tauri/src/meet_audio/audio_bridge.js
- app/src-tauri/src/meet_audio/caption_listener.rs
- app/src-tauri/src/meet_audio/captions_bridge.js
- app/src-tauri/src/meet_audio/inject.rs
- app/src-tauri/src/meet_audio/listen_capture.rs
- app/src-tauri/src/meet_audio/mod.rs
- app/src-tauri/src/meet_audio/speak_pump.rs
- app/src-tauri/src/meet_call/mod.rs
- app/src-tauri/vendor/tauri-cef
- docs/MEET_AGENT_SMOKE.md
- src/core/all.rs
- src/openhuman/about_app/catalog.rs
- src/openhuman/about_app/catalog_tests.rs
- src/openhuman/meet_agent/brain.rs
- src/openhuman/meet_agent/mod.rs
- src/openhuman/meet_agent/ops.rs
- src/openhuman/meet_agent/rpc.rs
- src/openhuman/meet_agent/schemas.rs
- src/openhuman/meet_agent/session.rs
- src/openhuman/meet_agent/types.rs
- src/openhuman/meet_agent/wav.rs
- src/openhuman/mod.rs
- tests/json_rpc_e2e.rs
- audio_bridge.js: clone destination tracks per getUserMedia call so Meet's track.stop() can't permanently kill the bridge; add stable [openhuman-audio-bridge] logs for install / context creation / interception branches / sampled feed cadence.
- caption_listener.rs: bubble push_caption RPC failures up so MAX_CONSECUTIVE_ERRORS can trip; previously a broken core session silently dropped captions forever.
- meet_agent::ops: lock validate_sample_rate to 16 kHz exactly (REQUIRED_SAMPLE_RATE) since brain.rs hard-codes the rate throughout (WAV header, MIN_TURN_SAMPLES, listened_seconds). brain.rs now sources the constant from ops so any future loosening of the boundary breaks the math at compile time.
- meet_agent/schemas.rs: rename wrap_* delegators to handle_* per the per-domain schemas.rs convention noted in CLAUDE.md.
- docs/MEET_AGENT_SMOKE.md: rewrite Step 4 + Listen path checks for the caption-driven flow (wake-word phrases, captions drained / wake word fired / caption turn done log lines, __openhumanCaptionsBridgeInfo introspection); call out that cef stream start / push_listen_pcm logs are NOT expected on the active path.

Dismissed (replied in thread): listen_capture.rs chunking / backpressure suggestion — that module is now the inactive _legacy_listen field; live listen path is captions-driven. Will revisit if/when we re-enable CEF audio.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/openhuman/meet_agent/schemas.rs`:
- Around line 152-153: The schema description for push_caption in
src/openhuman/meet_agent/schemas.rs is stale—update the description string for
the push_caption endpoint/field to remove the claim that the wake-word dispatch
triggers an “LLM/TTS turn” and instead state that the wake-word gate triggers a
deterministic/canned hot-path response (or otherwise reflect current non-LLM
behavior); locate the push_caption description literal and replace the text
accordingly so API docs and operator expectations match runtime behavior.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: fdaa81ae-767f-4a1d-8f6d-55b1af967699
📒 Files selected for processing (6)
- app/src-tauri/src/meet_audio/audio_bridge.js
- app/src-tauri/src/meet_audio/caption_listener.rs
- docs/MEET_AGENT_SMOKE.md
- src/openhuman/meet_agent/brain.rs
- src/openhuman/meet_agent/ops.rs
- src/openhuman/meet_agent/schemas.rs
✅ Files skipped from review due to trivial changes (3)
- app/src-tauri/src/meet_audio/caption_listener.rs
- docs/MEET_AGENT_SMOKE.md
- app/src-tauri/src/meet_audio/audio_bridge.js
| description: "Push a caption line scraped from Meet's live captions DOM. The wake-word \ | ||
| gate (\"hey openhuman\") triggers an LLM/TTS turn when fired.", |
Schema description is stale about LLM usage.
The push_caption description says wake-word dispatch triggers an “LLM/TTS turn”, but this PR’s behavior is deterministic/canned in the hot path. Updating this text will prevent misleading API docs and operator expectations.
✏️ Suggested text update
- description: "Push a caption line scraped from Meet's live captions DOM. The wake-word \
- gate (\"hey openhuman\") triggers an LLM/TTS turn when fired.",
+ description: "Push a caption line scraped from Meet's live captions DOM. The wake-word \
+ gate (\"hey openhuman\") triggers a reply/TTS turn when fired.",📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| description: "Push a caption line scraped from Meet's live captions DOM. The wake-word \ | |
| gate (\"hey openhuman\") triggers an LLM/TTS turn when fired.", | |
| description: "Push a caption line scraped from Meet's live captions DOM. The wake-word \ | |
| gate (\"hey openhuman\") triggers a reply/TTS turn when fired.", |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/openhuman/meet_agent/schemas.rs` around lines 152 - 153, The schema
description for push_caption in src/openhuman/meet_agent/schemas.rs is
stale—update the description string for the push_caption endpoint/field to
remove the claim that the wake-word dispatch triggers an “LLM/TTS turn” and
instead state that the wake-word gate triggers a deterministic/canned hot-path
response (or otherwise reflect current non-LLM behavior); locate the
push_caption description literal and replace the text accordingly so API docs
and operator expectations match runtime behavior.
Summary
- `src/openhuman/meet_agent/` (5 RPC methods, schemas, session registry, wake-word state machine) wired into the controller registry.
- `app/src-tauri/src/meet_audio/` (CDP-injected Web Audio bridge for the speak path, captions DOM observer for the listen path, auto-CC, lifecycle wired into `meet_call_open_window`).
- `audio` module in vendored `tauri-runtime-cef` exposing per-browser CEF audio handlers via a URL-prefix registry (initially used; later superseded by the captions path but kept for a future opt-in).

Problem
The existing feat(meet) PR (#1350) lets the agent join a Meet call as an anonymous guest with the mascot as a virtual webcam, but the agent has no way to listen to the call or speak back. The user wants the agent to capture action items / notes during a meeting ("hey openhuman, remember to email Bob about the launch") so they can be surfaced post-meeting, and to acknowledge briefly in-call so the user knows it caught the dictation.

Constraints from the design discussion:
- The agent's speech has to enter the call through `getUserMedia` such that other participants hear it.

Solution
Listen — Meet's built-in captions (after a short detour through CEF audio capture, see below):
- `app/src-tauri/src/meet_audio/captions_bridge.js` runs at document-start in the embedded Meet page (installed via CDP `Page.addScriptToEvaluateOnNewDocument` + `Page.reload`). It auto-clicks "Turn on captions", attaches a `MutationObserver` (with a 250 ms safety poll) over the `aria-label="Captions"` region, and queues new lines.
- `caption_listener.rs` polls `window.__openhumanDrainCaptions()` every 500 ms and forwards lines to `openhuman.meet_agent_push_caption`.
- `meet_agent::session::note_caption` normalizes punctuation ("hey, openhuman" → "hey openhuman"), strips the wake phrase, buffers continuation captions for 1.5 s, then dispatches a brain turn. Includes an 8 s post-turn cooldown so Meet's lingering finalised caption (visible for 5–8 s) doesn't re-fire the wake word on every dedupe-then-grow cycle.

Speak — CDP-injected Web Audio bridge:
- `audio_bridge.js` builds a 16 kHz `MediaStreamAudioDestinationNode` and monkey-patches `navigator.mediaDevices.getUserMedia` so audio requests get our destination stream (delegating video to the original so Chromium's fake-camera Y4M still renders the mascot). Exposes `window.__openhumanFeedPcm(b64)` for the shell to push PCM into.
- `speak_pump.rs` polls `meet_agent_poll_speech` every 100 ms and feeds each chunk via `Runtime.evaluate` on a long-lived CDP session.
- `voice::reply_speech` with `output_format=pcm_16000` so ElevenLabs (via the hosted backend) returns bytes the bridge can play with no transcoding.

Brain — canned acks, no LLM in the hot path:
- The dictated note is stored verbatim as a `Heard` event on the session transcript (post-meeting summarisation can run an LLM offline against it).
- The in-call verbal ack is a small canned rotation selected deterministically rather than an LLM reply. (Tried `agentic-v1` / `summarization-v1`; agentic emitted CoT into TTS, summarization returned empty for short prompts.)

Why captions, not CEF audio?
- Earlier slices used CEF's `cef_audio_handler_t` for listen + a CDP-injected bridge for speak.
- Chromium queries `get_audio_handler` lazily (only when audio output starts), so a solo agent in a lobby or pre-admit window never engages the pipeline. Captions handle that case for free — Meet's STT is already running, speaker-attributed, and pre-segmented.
- The CEF-audio listen path (`tauri-runtime-cef::audio` + `meet_audio::listen_capture`) is kept in the tree as an inactive `_legacy_listen` field so re-enabling it later is a single wire change.

Submission Checklist
- Testing follows `docs/TESTING-STRATEGY.md`: the `meet_audio` module is mostly CDP plumbing whose meaningful behaviour requires a real browser. Unit tests cover the resampler (irrelevant to the captions path now), the WAV header builder, the session wake-word state machine (punctuation, double-fire, cooldown), and the JSON-RPC E2E for the start/push/poll/stop lifecycle. The page-side JS (`audio_bridge.js`, `captions_bridge.js`) is exercised by the smoke runbook (`docs/MEET_AGENT_SMOKE.md`).
- The new `meet_agent` tests are not yet present in `docs/TEST-COVERAGE-MATRIX.md`; leaving as a follow-up since the matrix scope predates this PR's domain.
- Manual smoke coverage (per `docs/TESTING-STRATEGY.md`) lives in `docs/MEET_AGENT_SMOKE.md`; folding it into `docs/RELEASE-MANUAL-SMOKE.md` is a follow-up once the feature graduates from beta.

Impact
- Runtime behaviour changes only for `meet_call_open_window`; non-meet windows are unaffected because the audio handler registry is keyed by URL prefix and the bridges only install in the meet-call CEF target.
- No JS injection is added to embedded provider webviews (the `acct_*` family). The Meet-call window is a distinct top-level surface for a single audio-bridging purpose. The user explicitly authorized this injection for the speak + captions paths; the no-JS rule for `acct_*` webviews is unchanged.
- Submodule bumps to `tinyhumansai/tauri-cef@feat/openhuman-audio-handler` to pick up the new `audio` module.
- Commits passed the pre-commit hooks (d152cddc) and were pushed normally — no `--no-verify` used.

Related
Builds on #1350 (feat(meet): join Google Meet calls with mascot virtual camera).
Submodule branch: tinyhumansai/tauri-cef#feat/openhuman-audio-handler — must merge first.
Closes:
Follow-up PR(s)/TODOs:
- Coverage-matrix rows for `meet_audio` in `docs/TEST-COVERAGE-MATRIX.md`.

AI Authored PR Metadata (required for Codex/Linear PRs)
Linear Issue
Commit & Branch
Validation Run
Validation Blocked
Behavior Changes
Parity Contract
`meet_call_open_window` still navigates the dedicated CEF window to the Meet URL with isolated profile and runs `meet_scanner` for join automation. The new `meet_audio::start/stop` calls are additive and best-effort (failures are logged and don't block window lifecycle). `MeetAudioSession` lifecycle stays uniform; the captions bridge's auto-CC click attempts cap at 30 (~60 s) so a user who deliberately disables CC is respected.

Summary by CodeRabbit
New Features
Documentation
Tests