Use UIAgent #1
Open
markbackman wants to merge 22 commits into
App root calls useA11ySnapshot so the client streams accessibility
snapshots to the server on mount, DOM mutations, focus, scrollend,
resize, and visibility change. Grid component adds role=grid +
aria-colcount so the agent can resolve position references ("top
right", "the first one") directly from [cols=N] in <ui_state>.
Server: on_task_request injects the latest snapshot just-in-time,
so the LLM always reasons over the current screen. Removed the
stale _inject_ui_update calls from _emit_* and _respond, which had
been re-injecting prose descriptions into the UI agent's own
context one tick before the client re-rendered. UI agent prompt
updated to describe the Playwright-MCP tree format and the
[offscreen] / [cols=N] semantics.
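The positional arithmetic that `[cols=N]` makes possible can be sketched as follows. This is a minimal illustration, assuming a row-major grid; `resolve_position` and the phrase table are hypothetical and not part of the SDK or this app (in the app the LLM does this reasoning itself from `<ui_state>`).

```python
def resolve_position(phrase: str, item_count: int, cols: int) -> int:
    """Map a positional phrase to a 0-based index in a row-major grid.

    Hypothetical helper: `cols` plays the role of the [cols=N] hint.
    """
    if item_count <= 0 or cols <= 0:
        raise ValueError("grid needs at least one item and one column")

    # Index of the first cell in the last (possibly partial) row.
    last_row_start = ((item_count - 1) // cols) * cols

    positions = {
        "the first one": 0,
        "top left": 0,
        # The first row may be shorter than `cols` when there are few items.
        "top right": min(cols, item_count) - 1,
        "bottom left": last_row_start,
        "the last one": item_count - 1,
    }
    if phrase not in positions:
        raise ValueError(f"unsupported phrase: {phrase!r}")
    return positions[phrase]
```

Without the column count, "top right" is ambiguous: the same ten items resolve to index 3 in a four-column grid and index 4 in a five-column one.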
UIAgent base now auto-injects <ui_state> on task request, so the music-player override no longer calls inject_ui_state() manually. System prompt's UI-context section replaced with the SDK-exported UI_STATE_PROMPT_GUIDE constant, so format updates flow through with the pipecat-subagents version. Deletes the spike-only helpers: _debug_snapshot, _inject_ui_update, and the four _describe_*_screen methods (home/artist/detail/trending) plus the _describe_grid helper. Tool return values for navigate_to_artist and select_item now use short strings that the voice agent paraphrases for confirmation; positional context lives in <ui_state> alone. Removes log_snapshots=True from the UIAgent constructor call (flag no longer exists on the base class).
When the voice agent dispatches ``answer_about_music("what's the best
song on this album?", about="Starboy")``, the inner LLM call now
sees which album the question is about. Previously the ``about``
field was only used for toast rendering and never reached the
prompt, so the inner LLM asked the user to clarify even when the
UI agent had already resolved "this album" correctly.
- ``descriptions.answer_question`` takes ``about`` and optional
``about_tracks`` kwargs. The prompt templates gain a
``{focus_section}`` slot that renders a "Focus item" line plus,
for albums, the tracklist. The inner LLM uses this to resolve
deixis and reason about track-level questions.
- ``_answer_question`` resolves the referenced album via the new
``_resolve_about_tracks`` helper. It prefers the already-cached
tracks on the artist's album record (populated by ``_emit_detail``)
and falls back to fetching from the catalog on demand.
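The cache-first lookup described for ``_resolve_about_tracks`` might look roughly like the sketch below. The record shape (``id``, ``title``, ``tracks`` keys), the ``fetch_tracks`` callable, the case-insensitive title match, and the ``None`` return for an unknown album are all assumptions made for illustration; the real helper may differ.

```python
import asyncio
from typing import Awaitable, Callable, Optional


async def resolve_about_tracks(
    about: str,
    artist_albums: list[dict],
    fetch_tracks: Callable[[int], Awaitable[list[str]]],
) -> Optional[list[str]]:
    """Return the tracklist for the album named by `about`, or None.

    Hypothetical stand-in for the helper described above.
    """
    wanted = about.casefold()
    for album in artist_albums:
        if album["title"].casefold() != wanted:
            continue
        # Prefer tracks already cached on the album record.
        cached = album.get("tracks")
        if cached:
            return cached
        # Otherwise fall back to an on-demand catalog fetch.
        return await fetch_tracks(album["id"])
    # Assumption: an unmatched title yields no tracklist.
    return None
```

With a cached record the fetch callable is never awaited, so a question about an album the user has already opened costs no catalog round trip.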
Exercises the two SDK capabilities that weren't yet showcased:
the LLM can now scroll offscreen items into view and visually
point at elements by ref.
Server (``server/ui_agent.py``)
- ``UIAgent`` inherits ``ScrollToToolMixin`` and ``HighlightToolMixin``
alongside the base ``UIAgent``. MRO picks up both tools so the
LLM sees ``scroll_to(ref)`` and ``highlight(ref)``.
- System prompt documents when to call each: ``scroll_to`` for
``[offscreen]`` elements, ``highlight`` for "point at / which one
is X" turns.
- New ``_seed_demo_favorites()`` populates the Favorites grid at
init with Radiohead's "In Rainbows" and Bad Bunny's "DeBÍ TiRAR
MáS FOToS" so "scroll to my favorites" lands on content instead
of the empty-state placeholder.
Client
- ``useServerMessages`` replaces the hand-rolled
``[data-scroll-target=...]`` handler with
``useStandardScrollToHandler({ block: "center", container: () =>
document.querySelector(".main") })``. The standard handler
resolves ref first (what the LLM sends) and scrolls inside the
overflow container so the sticky header is cleared.
- Adds ``useStandardHighlightHandler({ className: "ui-highlight",
defaultDurationMs: 2000, scrollIntoViewFirst: true })``. Offscreen
targets auto-scroll into view before flashing.
- ``.ui-highlight`` CSS: 3px gold ring with a glow + a subtle 3%
scale pulse, fades over 2s. Bright enough to read on a shared
screen.
Demo script (6 turns, each hits a distinct capability):
1. "Show me Radiohead" — nav (existing)
2. "Which album is OK Computer?" — highlight
3. "Play the last song on this album" — scroll_to → play
4. "Go back" — nav (existing)
5. "Which one is The Weeknd?" — highlight on Trending
6. "Scroll to my favorites" — scroll_to (seeded content lands)
Drops the per-app task-tracking boilerplate now that the SDK's UIAgent records the in-flight task and exposes respond_to_task(). _respond is now a thin wrapper that sets the music-player's `description` field and delegates to respond_to_task. The local scroll_to/highlight overrides are gone; the SDK mixin tools complete the task silently and the voice agent has a third branch that emits no TTS when the response is empty. Also collapses answer_about_catalog/answer_about_music into a single answer(text, about=None) tool that writes the spoken reply inline, grounded by <ui_state> and the model's training knowledge. Removes the now-unused answer_question helper and prompts from descriptions.py.
…highlight The SDK action mixins (ScrollToToolMixin, HighlightToolMixin) are now pure chainable side effects: they dispatch a UI command and leave the task open so the LLM can chain another tool in the same turn. This app's prompt is "exactly one tool per turn", so the chainable shape doesn't fit. Drop the mixin imports and define scroll_to / highlight locally as silent terminators: send_command(...) + respond_to_task() + result_callback(None). Behavior is unchanged from before the SDK refactor; the visual change on the client is still the user-facing feedback and the voice agent stays quiet for that turn. README updated to reflect the local definition.
The SDK now exposes scroll_to(ref) and highlight(ref) as plain instance methods on UIAgent (wrapping send_command + the standard payload dataclasses). The local @tool overrides here can delegate to them via super(), dropping the direct send_command + dataclass construction. Drop the unused Highlight import (no longer referenced in this file). ScrollTo stays — it's still used elsewhere. The "one tool per turn" design unchanged: scroll_to and highlight remain @tool methods that call respond_to_task() to silently terminate. ReplyToolMixin is not composed here because this app's tool surface is many distinct named tools, not the canonical answer + visuals bundle.
Music browsing is multi-turn by nature. The user says "show me Nirvana" → "play their best album" → "skip that one and try the next", and each follow-up references something from the prior exchange. With keep_history=False the LLM saw only the current <ui_state> per turn and couldn't ground these references.

Switch the UI agent to keep_history=True so it accumulates conversation history. To bound growth over long sessions, enable auto context summarization on the assistant aggregator (LLMAssistantAggregatorParams.enable_auto_context_summarization=True). Thresholds: 8000 tokens or 20 unsummarized messages trigger a summary; the summary targets 6000 tokens and keeps the last 4 messages verbatim so the most recent dialog stays intact for the model.

The aggregator handles summarization internally: the LLM service that already runs on the agent generates the summary inline when the request frame fires. No pipeline override or extra processor is needed; the right hook was the assistant aggregator's params, which LLMContextAgent already accepts as a constructor kwarg.

Adds an on_summary_applied event handler in on_ready() that logs when a summary is applied, with before/after message counts. Useful for understanding session dynamics in long traces.
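The trigger rule and the keep-last-4 split can be sketched in plain Python. This only illustrates the thresholds quoted above: the class and field names are hypothetical (they are not Pipecat's parameter names), and treating the thresholds as inclusive (`>=`) is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SummaryThresholds:
    # Values quoted in the commit message; the field names are hypothetical.
    max_context_tokens: int = 8000
    max_unsummarized_messages: int = 20
    keep_last_messages: int = 4


def should_summarize(
    token_count: int,
    unsummarized_messages: int,
    t: SummaryThresholds = SummaryThresholds(),
) -> bool:
    """Either threshold alone is enough to trigger a summary."""
    return (
        token_count >= t.max_context_tokens
        or unsummarized_messages >= t.max_unsummarized_messages
    )


def split_for_summary(
    messages: list,
    t: SummaryThresholds = SummaryThresholds(),
) -> tuple[list, list]:
    """Split history into (messages to summarize, messages kept verbatim)."""
    keep = t.keep_last_messages
    if len(messages) <= keep:
        return [], list(messages)
    return list(messages[:-keep]), list(messages[-keep:])
```

A chatty session with short turns hits the message threshold first, while a session with large <ui_state> snapshots hits the token threshold first; checking both covers either shape.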
…y screen

User says "find me music like Radiohead" and the UI agent fans out to three worker recommenders in parallel. Each worker pulls candidate artists from a different angle, fetches their top tracks through the existing CatalogAgent, and streams tracks back as they arrive. The UI agent translates each streamed track into an add_track UI command, which the client renders into a new Discovery screen.

Server-side pieces:
- discovery_workers.py: three BaseAgent subclasses (SimilarArtistRecommender, GenreRecommender, ChartRecommender), each overriding find_candidate_artists() to pull candidates from a different catalog action (related_artists, get_trending(genre), get_trending(None)). The base streams tracks via send_task_update with kind="track" so they arrive incrementally.
- ui_agent.py:
  - new start_discovery(seed_artist) @tool that resolves the seed, pushes a Discovery NavFrame, and fires start_user_task_group(...) against the three worker names. Fire-and-forget, so the voice agent unblocks while workers run.
  - on_task_update interception: when a registered discovery group streams a track update, emit a custom add_track UI command carrying the track payload + source name. Other update kinds (free-form progress text) flow through unchanged via the standard task lifecycle forwarding.
  - new "discovery" Screen value + Discovery NavFrame fields (seed id + name) + _emit_discovery() that pushes the screen + dispatch in _emit_for_top so reconnects re-emit correctly.
  - new @on_ui_event("track_click") + _handle_discovery_track_click that resolves the artist, finds the song in its catalog record, and plays via _do_play. Re-clicking the active track stops playback (parity with the existing play_track handler).
  - prompt update: documents start_discovery so the LLM knows when to reach for it.
- bot.py: registers the three workers alongside voice + ui in on_client_ready, so they're available as task targets when start_user_task_group dispatches.

Workers don't talk to Deezer directly. Every catalog lookup goes through the long-lived CatalogAgent, the same data layer the rest of the app uses, so caching + rate-limiting stay centralized.

This adds a real demonstration of start_user_task_group + the four ui.task envelopes + cancellation in a production-shaped app on top of real Deezer data: the SDK feature with the biggest gap in music-player's prior coverage.

Client-side pieces (Discovery screen + add_track command handler + track_click event emit) follow in the next commit.
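The fan-out-and-stream shape can be sketched with plain asyncio, leaving the SDK out entirely. Everything here is a stand-in: `worker`, `run_discovery`, the queue, and the command dictionaries are hypothetical and only mirror the roles of the recommender workers, `send_task_update`, and the `on_task_update` interception described above.

```python
import asyncio


async def worker(name: str, tracks: list[dict], updates: asyncio.Queue) -> None:
    """Stand-in for a recommender worker: stream one update per track."""
    for track in tracks:
        await asyncio.sleep(0)  # yield, as a real catalog fetch would
        await updates.put({"kind": "track", "source": name, "track": track})


async def run_discovery(candidates: dict[str, list[dict]]) -> list[dict]:
    """Fan out to all workers and translate track updates into UI commands."""
    updates: asyncio.Queue = asyncio.Queue()
    commands: list[dict] = []

    async def forward() -> None:
        # Stand-in for the update interception: only "track" updates
        # become add_track commands; other kinds are ignored in this sketch.
        while True:
            update = await updates.get()
            if update["kind"] == "track":
                commands.append(
                    {"command": "add_track", "source": update["source"], **update["track"]}
                )
            updates.task_done()

    forwarder = asyncio.create_task(forward())
    # Run every worker concurrently; updates interleave as they arrive.
    await asyncio.gather(*(worker(n, t, updates) for n, t in candidates.items()))
    await updates.join()  # wait until every streamed update was forwarded
    forwarder.cancel()
    await asyncio.gather(forwarder, return_exceptions=True)
    return commands
```

Because each update is forwarded as soon as it lands on the queue, the consumer sees tracks incrementally rather than one batch per worker, which is the property the Discovery grid relies on.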
Companion to the server-side discoveries flow. The client now renders a Discovery screen, accumulates streamed tracks, exposes a Cancel button driven by useUITasks, and emits track_click events on card press.

Type changes (types.ts):
- New "discovery" Screen variant with seedArtist + backEnabled.
- New DiscoveryTrack interface mirroring the server's add_track payload (with a "source" field naming the worker that surfaced it: similar_artist / genre / chart).
- New "track_click" ClickEvent variant for discovery card clicks.

State (hooks/useServerMessages.ts):
- discoveryTracks: DiscoveryTrack[] accumulated from add_track custom commands. Deduped by track id (workers occasionally surface the same track from different angles). Cleared on a new screen=discovery push so each session starts empty.
- "discovery" branch of the screen command handler stores the seed artist + clears discoveryTracks for the new session.
- New AddTrackPayload type + useUICommandHandler<AddTrackPayload> registration for the add_track command.

Screen component (screens/Discovery.tsx):
- Reads in-flight task group state via useUITasks() (the React UITasksProvider subscribes to the SDK's ui.task envelopes). Picks the most recent group whose label starts with "Discoveries:" as the active session. Per-worker rows show running/completed/cancelled status + the latest update text. Cancel button calls cancelTask(taskId).
- Track grid fills from props as add_track commands arrive. Cards show cover, title, artist, and a per-source label. Click → fires the new track_click event.

App.tsx:
- Wraps the existing UIAgentProvider in UITasksProvider so useUITasks works.
- Renders Discovery screen for screen.kind === "discovery".
- Threads discoveryTracks down as a prop.
- Includes "discovery" in the back-enabled and screenIdentity switches so the existing scroll-reset and back-button behaviors cover it too.

Styles (index.css): self-contained .discovery-* classes for the header, in-flight panel, worker rows, track grid, and per-source accent colors.
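The accumulate, dedupe, and reset behavior of discoveryTracks is sketched below in Python (for consistency with the server-side sketches; the real state lives in a React hook). `DiscoverySession` and its method names are hypothetical, and keeping the first copy of a duplicated track is an assumption.

```python
class DiscoverySession:
    """Accumulates streamed tracks, deduped by track id."""

    def __init__(self) -> None:
        # dict preserves insertion order, so tracks render in arrival order.
        self._tracks: dict[int, dict] = {}

    def start(self) -> None:
        """Called on a new screen=discovery push: each session starts empty."""
        self._tracks.clear()

    def add_track(self, track: dict) -> bool:
        """Handle one add_track payload; return False for a duplicate id."""
        if track["id"] in self._tracks:
            return False  # another worker already surfaced this track
        self._tracks[track["id"]] = track
        return True

    @property
    def tracks(self) -> list[dict]:
        return list(self._tracks.values())
```

Keying on the track id rather than on (id, source) means a track keeps the label of whichever worker surfaced it first.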
- Workers now all answer "what's like X": drop chart recommender, add two-hop (related-of-related), fix genre worker by reading genre_id off the artist's releases (Deezer's top-tracks endpoint strips the album subobject, so my earlier path returned None for everyone).
- Cap concurrent Deezer requests with a module-level semaphore so fan-out + warm-up can't trip the IP rate limit; raise per-task catalog timeout to 15s to absorb queueing during cold cascades.
- Drop the heavy home warm-up. Build artist records lazily on click so user-initiated work owns the rate-limit budget. Cold home click is now ~1-3s; cold discovery is ~3-5s instead of ~12s.
- start_discovery pushes a placeholder Discovery screen and speaks the ack first, resolves the seed second, fires workers third. The user gets visible+audible feedback within 2-3s instead of 13s of silence on a cold seed.
- Tighten the UI agent prompt so a named seed routes to start_discovery; previously "show me artists like Radiohead" fell through to navigate_to_artist.
- Grid goes to 8 columns; pull 3 tracks per artist for fuller fills; rename pill labels to "Genre peers" and "More like these".
Architecture diagram, features list, and reference patterns all ignored discovery; "things to try" was missing the new prompt.

Adds:
- Two-pattern framing in the opening (voice/UI separation + parallel fan-out).
- Discovery workers branch in the architecture diagram, plus a bullet describing how on_task_update interception turns streamed task updates into add_track UI commands.
- Discoveries feature entry plus a multi-turn-context entry covering keep_history + auto context summarization.
- "Show me artists like Radiohead" in things to try.
- Two new reference patterns: parallel fan-out with streaming results, and ack-first ordering for slow tools.

Also replaces all em dashes with colons / commas to match the rest of the project's writing style.
Regression: with the user manually navigated to an album page, asking "tell me about this album" or "when did Nevermind come out?" made the voice agent answer from its own training knowledge, bypassing handle_request and the UI agent's screen-grounded answer path entirely. The Absolute routing rule already covered "any question about what's on screen", but that bullet wasn't strong enough against prompts that sounded like general trivia. Adds an explicit music-domain bullet with concrete trivia-style examples plus a "do not answer music questions from training knowledge" line, and biases the "when not to call the tool" section toward delegating on uncertainty.
The reference-patterns section described the SDK's action tools as 'chainable side effects,' which referred to the older SDK design that we replaced. The SDK now ships ReplyToolMixin as a single bundled reply() tool with a required answer argument. Updates the text to describe what the SDK actually exposes today and why this app's one-domain-tool-per-turn shape diverges from it. The inline comment in ui_agent.py around the local tool definitions was already accurate; this aligns the README with it.
Two tiny fixes for cross-project consistency with the SDK:
- 'Voice / UI separation-of-concerns' (with spaces) → 'Voice/UI ...' to match the existing 'Voice/UI split via task dispatch' bullet in the same file and the SDK's 'voice/UI delegation pattern'.
- 'UIAgent (LLM, not bridged)' → 'UIAgent (LLM, non-bridged)' to match the SDK demos README's 'a non-bridged UIAgent' canonical phrasing.
The UI Agent Protocol wire format moved into pipecat.processors.frameworks.rtvi.models with pipecat-ai 1.2.0; subagents no longer re-exports the payload models. Update the import to match the new canonical location.
Subagents moved the BusUI* carrier classes from bus.messages to agents.ui.ui_messages so they live next to the UIAgent that uses them. Update the import to the new path.
Pipecat 1.2.0 renamed the UI Agent Protocol type strings from dot-form (ui.event, ui.task, etc.) to kebab-case (ui-event, ui-task) to match the rest of the RTVI protocol. Music-player uses the constants by name so no functional change; just brings the inline doc references up to date.
Temporary [tool.uv.sources] overrides so reviewers and CI can resolve ``pipecat-ai>=1.1.0`` and ``pipecat-ai-subagents>=0.4.0`` from the open wire-format PRs before either is published to PyPI.

Companion PRs:
- pipecat-ai/pipecat#4407
- pipecat-ai/pipecat-subagents#18

uv strips [tool.uv.sources] when building the distribution, so this is install-time-only and does not affect the published demo. Drop this commit (or just the [tool.uv.sources] block) before merging once both upstreams are on PyPI.
Summary
A voice-driven music browsing app built on Pipecat and pipecat-subagents. Reference implementation that exercises the full UI Agent Protocol surface against a real-world domain (live Deezer-backed music catalog).
See README.md for the architecture diagram, feature list, things to try, and reference patterns.

Reference patterns exercised
- VoiceAgent (LLM, bridged) handles only the conversation; UIAgent (LLM, non-bridged) owns the navigation stack and screen state. Voice delegates every UI request via async with self.task("ui", payload={"query": query}). The UI agent completes with a speak field and the voice agent hands it verbatim to TTS without re-running its LLM.
- start_discovery(seed_artist) calls start_user_task_group("similar_artist", "genre", "two_hop", ..., cancellable=True). Each worker streams tracks via send_task_update(data={"kind": "track", ...}). The UI agent's on_task_update interception turns each into an add_track UI command so the Discovery grid fills as workers find tracks. Lifecycle envelopes (group_started / task_update / task_completed / group_completed) flow to the React client unchanged; useUITasks() renders the in-flight panel and Cancel button without app-specific wiring.
- <ui_state>: the React client calls useA11ySnapshot() near the app root, streaming the document's a11y tree as ui-snapshot RTVI messages. The UI agent stores the latest and auto-injects it as <ui_state> at the start of every task, so the LLM always reasons over the current screen.
- CatalogAgent is spawned as a runner peer (not per-connect), so its Deezer cache survives across clients and its expensive warm-up runs once per process.
- start_discovery pushes a placeholder Discovery screen and speaks the ack first, then resolves the seed, then re-emits the canonical record before firing workers. Cold catalog seeds can take several seconds; the user gets visible + audible feedback within 2-3s instead of a stalled tool.
- keep_history=True + auto context summarization: music browsing is naturally multi-turn ("show me Nirvana" → "play their best album" → "skip that one"), so deictic references resolve against prior exchanges. Summarization keeps the context bounded over long sessions; old <ui_state> snapshots compress into a system summary while the most recent turns stay verbatim.
- scroll_to(ref) / highlight(ref) @tool wrappers send the UI command, complete the in-flight task with no speak, and exit. The visual change on the client is the user-facing feedback; the voice agent stays quiet for that turn. (The SDK's bundled ReplyToolMixin doesn't fit this app's "each tool call IS the whole turn" shape, so the helpers are wrapped locally instead.)

What works end-to-end
<ui_state>, genre-scoped trending.

Notes for reviewers
[tool.uv.sources] pins in server/pyproject.toml before merging. Commit 842dff1 adds a temporary [tool.uv.sources] block that resolves pipecat-ai and pipecat-ai-subagents from the open wire-format PRs (feat(rtvi): add UI Agent Protocol as first-class RTVI message types pipecat-ai/pipecat#4407 and UIAgent pipecat-ai/pipecat-subagents#18). It exists so reviewers and CI can resolve the deps before those packages publish. Once both upstreams are on PyPI, drop that commit (or just the [tool.uv.sources] block) so the demo resolves from PyPI like any other consumer. The override is install-time-only (uv strips [tool.uv.sources] from the published distribution), but leaving it in the repo would mask a regression where the published packages fail to resolve.
- pipecat-ai/pipecat#4407: UI Agent Protocol as first-class RTVI message types (canonical wire format).
- pipecat-ai/pipecat-client-web#203: UIAgentClient, React idioms, standard handlers.
- pipecat-ai/pipecat-subagents#18: UIAgent, action helpers, start_user_task_group, attach_ui_bridge. This demo depends on it.

Test plan
No automated tests; the demo is exercised end-to-end against the React client. Manually verified:
842dff1 (or the [tool.uv.sources] block in server/pyproject.toml) once pipecat-ai 1.2.0 and pipecat-ai-subagents 0.4.0 are on PyPI