Use UIAgent by markbackman · Pull Request #1 · markbackman/pipecat-music-player

markbackman · 2026-04-25T01:24:18Z

Summary

A voice-driven music browsing app built on Pipecat and pipecat-subagents. Reference implementation that exercises the full UI Agent Protocol surface against a real-world domain (live Deezer-backed music catalog).

See README.md for the architecture diagram, feature list, things to try, and reference patterns.

Reference patterns exercised

Voice / UI separation of concerns: VoiceAgent (LLM, bridged) handles only the conversation; UIAgent (LLM, non-bridged) owns the navigation stack and screen state. Voice delegates every UI request via async with self.task("ui", payload={"query": query}). The UI agent completes with a speak field and the voice agent hands it verbatim to TTS without re-running its LLM.
Parallel fan-out with streaming results: start_discovery(seed_artist) calls start_user_task_group("similar_artist", "genre", "two_hop", ..., cancellable=True). Each worker streams tracks via send_task_update(data={"kind": "track", ...}). The UI agent's on_task_update interception turns each into an add_track UI command so the Discovery grid fills as workers find tracks. Lifecycle envelopes (group_started / task_update / task_completed / group_completed) flow to the React client unchanged; useUITasks() renders the in-flight panel and Cancel button without app-specific wiring.
Accessibility snapshots as <ui_state>: the React client calls useA11ySnapshot() near the app root, streaming the document's a11y tree as ui-snapshot RTVI messages. The UI agent stores the latest and auto-injects it as <ui_state> at the start of every task, so the LLM always reasons over the current screen.
Long-lived singleton agent: CatalogAgent is spawned as a runner peer (not per-connect), so its Deezer cache survives across clients and its expensive warm-up runs once per process.
Ack-first ordering for slow tools: start_discovery pushes a placeholder Discovery screen and speaks the ack first, then resolves the seed, then re-emits the canonical record before firing workers. Cold catalog seeds can take several seconds; the user gets visible + audible feedback within 2-3s instead of a stalled tool.
keep_history=True + auto context summarization: music browsing is naturally multi-turn ("show me Nirvana" → "play their best album" → "skip that one"), so deictic references resolve against prior exchanges. Summarization keeps the context bounded over long sessions; old <ui_state> snapshots compress into a system summary while the most recent turns stay verbatim.
Silent fire-and-forget action tools: local scroll_to(ref) / highlight(ref) @tool wrappers send the UI command, complete the in-flight task with no speak, and exit. The visual change on the client is the user-facing feedback; the voice agent stays quiet for that turn. (The SDK's bundled ReplyToolMixin doesn't fit this app's "each tool call IS the whole turn" shape, so the helpers are wrapped locally instead.)

What works end-to-end

Voice navigation, item selection, playback control, conversational Q&A grounded in <ui_state>, genre-scoped trending.
Multi-turn deixis ("play that one", "more like them", "the first one").
Discoveries flow with all three workers contributing tracks, in-flight panel, cancel support.

Notes for reviewers

⚠️ MERGE BLOCKER — revert the [tool.uv.sources] pins in server/pyproject.toml before merging. Commit 842dff1 adds a temporary [tool.uv.sources] block that resolves pipecat-ai and pipecat-ai-subagents from the open wire-format PRs (feat(rtvi): add UI Agent Protocol as first-class RTVI message types pipecat-ai/pipecat#4407 and UIAgent pipecat-ai/pipecat-subagents#18). It exists so reviewers and CI can resolve the deps before those packages publish. Once both upstreams are on PyPI, drop that commit (or just the [tool.uv.sources] block) so the demo resolves from PyPI like any other consumer. The override is install-time-only — uv strips [tool.uv.sources] from the published distribution — but leaving it in the repo would mask a regression where the published packages fail to resolve.
Companion PRs land the wire format on each side:
- pipecat-ai/pipecat#4407 — UI Agent Protocol as first-class RTVI message types (canonical wire format).
- pipecat-ai/pipecat-client-web#203 — UIAgentClient, React idioms, standard handlers.
- pipecat-ai/pipecat-subagents#18 — UIAgent, action helpers, start_user_task_group, attach_ui_bridge. This demo depends on it.

Test plan

No automated tests; the demo is exercised end-to-end against the React client. Manually verified:

Voice navigation, item resolution, playback (the prompts in the README's "Things to try" section)
Discoveries with all three workers contributing tracks; in-flight panel + Cancel
Multi-turn dialog with deictic references across turns
Long-session sanity check (auto-summarization fires; conversation stays coherent)
Reviewer pulls down, runs server + client per the README, hits a few of the prompts
Before merge: drop commit 842dff1 (or the [tool.uv.sources] block in server/pyproject.toml) once pipecat-ai 1.2.0 and pipecat-ai-subagents 0.4.0 are on PyPI

App root calls useA11ySnapshot so the client streams accessibility snapshots to the server on mount, DOM mutations, focus, scrollend, resize, and visibility change. Grid component adds role=grid + aria-colcount so the agent can resolve position references ("top right", "the first one") directly from [cols=N] in <ui_state>. Server: on_task_request injects the latest snapshot just-in-time, so the LLM always reasons over the current screen. Removed the stale _inject_ui_update calls from _emit_* and _respond, which had been re-injecting prose descriptions into the UI agent's own context one tick before the client re-rendered. UI agent prompt updated to describe the Playwright-MCP tree format and the [offscreen] / [cols=N] semantics.

UIAgent base now auto-injects <ui_state> on task request, so the music-player override no longer calls inject_ui_state() manually. System prompt's UI-context section replaced with the SDK-exported UI_STATE_PROMPT_GUIDE constant, so format updates flow through with the pipecat-subagents version. Deletes the spike-only helpers: _debug_snapshot, _inject_ui_update, and the four _describe_*_screen methods (home/artist/detail/trending) plus the _describe_grid helper. Tool return values for navigate_to_artist and select_item now use short strings that the voice agent paraphrases for confirmation; positional context lives in <ui_state> alone. Removes log_snapshots=True from the UIAgent constructor call (flag no longer exists on the base class).

When the voice agent dispatches ``answer_about_music("what's the best song on this album?", about="Starboy")``, the inner LLM call now sees which album the question is about. Previously the ``about`` field was only used for toast rendering and never reached the prompt, so the inner LLM asked the user to clarify even when the UI agent had already resolved "this album" correctly. - ``descriptions.answer_question`` takes ``about`` and optional ``about_tracks`` kwargs. The prompt templates gain a ``{focus_section}`` slot that renders a "Focus item" line plus, for albums, the tracklist. The inner LLM uses this to resolve deixis and reason about track-level questions. - ``_answer_question`` resolves the referenced album via the new ``_resolve_about_tracks`` helper. It prefers the already-cached tracks on the artist's album record (populated by ``_emit_detail``) and falls back to fetching from the catalog on demand.

Exercises the two SDK capabilities that weren't yet showcased: the LLM can now scroll offscreen items into view and visually point at elements by ref. Server (``server/ui_agent.py``) - ``UIAgent`` inherits ``ScrollToToolMixin`` and ``HighlightToolMixin`` alongside the base ``UIAgent``. MRO picks up both tools so the LLM sees ``scroll_to(ref)`` and ``highlight(ref)``. - System prompt documents when to call each: ``scroll_to`` for ``[offscreen]`` elements, ``highlight`` for "point at / which one is X" turns. - New ``_seed_demo_favorites()`` populates the Favorites grid at init with Radiohead's "In Rainbows" and Bad Bunny's "DeBÍ TiRAR MáS FOToS" so "scroll to my favorites" lands on content instead of the empty-state placeholder. Client - ``useServerMessages`` replaces the hand-rolled ``[data-scroll-target=...]`` handler with ``useStandardScrollToHandler({ block: "center", container: () => document.querySelector(".main") })``. The standard handler resolves ref first (what the LLM sends) and scrolls inside the overflow container so the sticky header is cleared. - Adds ``useStandardHighlightHandler({ className: "ui-highlight", defaultDurationMs: 2000, scrollIntoViewFirst: true })``. Offscreen targets auto-scroll into view before flashing. - ``.ui-highlight`` CSS: 3px gold ring with a glow + a subtle 3% scale pulse, fades over 2s. Bright enough to read on a shared screen. Demo script (6 turns, each hits a distinct capability): 1. "Show me Radiohead" — nav (existing) 2. "Which album is OK Computer?" — highlight 3. "Play the last song on this album" — scroll_to → play 4. "Go back" — nav (existing) 5. "Which one is The Weeknd?" — highlight on Trending 6. "Scroll to my favorites" — scroll_to (seeded content lands)

Drops the per-app task-tracking boilerplate now that the SDK's UIAgent records the in-flight task and exposes respond_to_task(). _respond is now a thin wrapper that sets the music-player's `description` field and delegates to respond_to_task. The local scroll_to/highlight overrides are gone; the SDK mixin tools complete the task silently and the voice agent has a third branch that emits no TTS when the response is empty. Also collapses answer_about_catalog/answer_about_music into a single answer(text, about=None) tool that writes the spoken reply inline, grounded by <ui_state> and the model's training knowledge. Removes the now-unused answer_question helper and prompts from descriptions.py.

…er())

…highlight The SDK action mixins (ScrollToToolMixin, HighlightToolMixin) are now pure chainable side effects: they dispatch a UI command and leave the task open so the LLM can chain another tool in the same turn. This app's prompt is "exactly one tool per turn", so the chainable shape doesn't fit. Drop the mixin imports and define scroll_to / highlight locally as silent terminators: send_command(...) + respond_to_task() + result_callback(None). Behavior is unchanged from before the SDK refactor; the visual change on the client is still the user-facing feedback and the voice agent stays quiet for that turn. README updated to reflect the local definition.

@tool

The SDK now exposes scroll_to(ref) and highlight(ref) as plain instance methods on UIAgent (wrapping send_command + the standard payload dataclasses). The local @tool overrides here can delegate to them via super(), dropping the direct send_command + dataclass construction. Drop the unused Highlight import (no longer referenced in this file). ScrollTo stays — it's still used elsewhere. The "one tool per turn" design unchanged: scroll_to and highlight remain @tool methods that call respond_to_task() to silently terminate. ReplyToolMixin is not composed here because this app's tool surface is many distinct named tools, not the canonical answer + visuals bundle.

Music browsing is multi-turn by nature. The user says "show me Nirvana" → "play their best album" → "skip that one and try the next" — each follow-up references something from the prior exchange. With keep_history=False the LLM saw only the current <ui_state> per turn and couldn't ground these references. Switch the UI agent to keep_history=True so it accumulates conversation history. To bound growth over long sessions, enable auto context summarization on the assistant aggregator (LLMAssistantAggregatorParams.enable_auto_context_summarization= True). Thresholds: 8000 tokens or 20 unsummarized messages trigger a summary; the summary targets 6000 tokens and keeps the last 4 messages verbatim so the most recent dialog stays intact for the model. The aggregator handles summarization internally — the LLM service that already runs on the agent generates the summary inline when the request frame fires. No pipeline override or extra processor needed; the right hook was the assistant aggregator's params, which LLMContextAgent already accepts as a constructor kwarg. Adds an on_summary_applied event handler in on_ready() that logs when a summary is applied, with before/after message counts. Useful for understanding session dynamics in long traces.

@tool

…y screen User says "find me music like Radiohead" and the UI agent fans out to three worker recommenders in parallel. Each worker pulls candidate artists from a different angle, fetches their top tracks through the existing CatalogAgent, and streams tracks back as they arrive. The UI agent translates each streamed track into an add_track UI command, which the client renders into a new Discovery screen. Server-side pieces: - discovery_workers.py: three BaseAgent subclasses (SimilarArtistRecommender, GenreRecommender, ChartRecommender), each overriding find_candidate_artists() to pull candidates from a different catalog action (related_artists, get_trending(genre), get_trending(None)). The base streams tracks via send_task_update with kind="track" so they arrive incrementally. - ui_agent.py: - new start_discovery(seed_artist) @tool that resolves the seed, pushes a Discovery NavFrame, and fires start_user_task_group(...) against the three worker names — fire-and-forget so the voice agent unblocks while workers run. - on_task_update interception: when a registered discovery group streams a track update, emit a custom add_track UI command carrying the track payload + source name. Other update kinds (free-form progress text) flow through unchanged via the standard task lifecycle forwarding. - new "discovery" Screen value + Discovery NavFrame fields (seed id + name) + _emit_discovery() that pushes the screen + dispatch in _emit_for_top so reconnects re-emit correctly. - new @on_ui_event("track_click") + _handle_discovery_track_click that resolves the artist, finds the song in its catalog record, and plays via _do_play. Re-clicking the active track stops playback (parity with the existing play_track handler). - prompt update: documents start_discovery so the LLM knows when to reach for it. - bot.py: registers the three workers alongside voice + ui in on_client_ready, so they're available as task targets when start_user_task_group dispatches. Workers don't talk to Deezer directly. Every catalog lookup goes through the long-lived CatalogAgent, the same data layer the rest of the app uses, so caching + rate-limiting stay centralized. This adds a real demonstration of start_user_task_group + the four ui.task envelopes + cancellation in a production-shaped app on top of real Deezer data — the SDK feature with the biggest gap in music-player's prior coverage. Client-side pieces (Discovery screen + add_track command handler + track_click event emit) follow in the next commit.

Companion to the server-side discoveries flow. The client now renders a Discovery screen, accumulates streamed tracks, exposes a Cancel button driven by useUITasks, and emits track_click events on card press. Type changes (types.ts): - New "discovery" Screen variant with seedArtist + backEnabled. - New DiscoveryTrack interface mirroring the server's add_track payload (with a "source" field naming the worker that surfaced it: similar_artist / genre / chart). - New "track_click" ClickEvent variant for discovery card clicks. State (hooks/useServerMessages.ts): - discoveryTracks: DiscoveryTrack[] accumulated from add_track custom commands. Deduped by track id (workers occasionally surface the same track from different angles). Cleared on a new screen=discovery push so each session starts empty. - "discovery" branch of the screen command handler stores the seed artist + clears discoveryTracks for the new session. - New AddTrackPayload type + useUICommandHandler<AddTrackPayload> registration for the add_track command. Screen component (screens/Discovery.tsx): - Reads in-flight task group state via useUITasks() (the React UITasksProvider subscribes to the SDK's ui.task envelopes). Picks the most recent group whose label starts with "Discoveries:" as the active session. Per-worker rows show running/completed/ cancelled status + the latest update text. Cancel button calls cancelTask(taskId). - Track grid fills from props as add_track commands arrive. Cards show cover, title, artist, and a per-source label. Click → fires the new track_click event. App.tsx: - Wraps the existing UIAgentProvider in UITasksProvider so useUITasks works. - Renders Discovery screen for screen.kind === "discovery". - Threads discoveryTracks down as a prop. - Includes "discovery" in the back-enabled and screenIdentity switches so the existing scroll-reset and back-button behaviors cover it too. Styles (index.css): self-contained .discovery-* classes for the header, in-flight panel, worker rows, track grid, and per-source accent colors.

- Workers now all answer "what's like X": drop chart recommender, add two-hop (related-of-related), fix genre worker by reading genre_id off the artist's releases (Deezer's top-tracks endpoint strips the album subobject, so my earlier path returned None for everyone). - Cap concurrent Deezer requests with a module-level semaphore so fan-out + warm-up can't trip the IP rate limit; raise per-task catalog timeout to 15s to absorb queueing during cold cascades. - Drop the heavy home warm-up. Build artist records lazily on click so user-initiated work owns the rate-limit budget. Cold home click is now ~1-3s; cold discovery is ~3-5s instead of ~12s. - start_discovery pushes a placeholder Discovery screen and speaks the ack first, resolves the seed second, fires workers third. The user gets visible+audible feedback within 2-3s instead of 13s of silence on a cold seed. - Tighten the UI agent prompt so a named seed routes to start_discovery; previously "show me artists like Radiohead" fell through to navigate_to_artist. - Grid goes to 8 columns; pull 3 tracks per artist for fuller fills; rename pill labels to "Genre peers" and "More like these".

Architecture diagram, features list, and reference patterns all ignored discovery; "things to try" was missing the new prompt. Adds: - Two-pattern framing in the opening (voice/UI separation + parallel fan-out). - Discovery workers branch in the architecture diagram, plus a bullet describing how on_task_update interception turns streamed task updates into add_track UI commands. - Discoveries feature entry plus a multi-turn-context entry covering keep_history + auto context summarization. - "Show me artists like Radiohead" in things to try. - Two new reference patterns: parallel fan-out with streaming results, and ack-first ordering for slow tools. Also replaces all em dashes with colons / commas to match the rest of the project's writing style.

Regression: with the user manually navigated to an album page, asking "tell me about this album" or "when did Nevermind come out?" made the voice agent answer from its own training knowledge, bypassing handle_request and the UI agent's screen-grounded answer path entirely. The Absolute routing rule already covered "any question about what's on screen", but that bullet wasn't strong enough against prompts that sounded like general trivia. Adds an explicit music-domain bullet with concrete trivia-style examples plus a "do not answer music questions from training knowledge" line, and biases the "when not to call the tool" section toward delegating on uncertainty.

The reference-patterns section described the SDK's action tools as 'chainable side effects,' which referred to the older SDK design that we replaced. The SDK now ships ReplyToolMixin as a single bundled reply() tool with a required answer argument. Updates the text to describe what the SDK actually exposes today and why this app's one-domain-tool-per-turn shape diverges from it. The inline comment in ui_agent.py around the local tool definitions was already accurate; this aligns the README with it.

Two tiny fixes for cross-project consistency with the SDK: - 'Voice / UI separation-of-concerns' (with spaces) → 'Voice/UI ...' to match the existing 'Voice/UI split via task dispatch' bullet in the same file and the SDK's 'voice/UI delegation pattern'. - 'UIAgent (LLM, not bridged)' → 'UIAgent (LLM, non-bridged)' to match the SDK demos README's 'a non-bridged UIAgent' canonical phrasing.

The UI Agent Protocol wire format moved into pipecat.processors.frameworks.rtvi.models with pipecat-ai 1.2.0; subagents no longer re-exports the payload models. Update the import to match the new canonical location.

Subagents moved the BusUI* carrier classes from bus.messages to agents.ui.ui_messages so they live next to the UIAgent that uses them. Update the import to the new path.

Pipecat 1.2.0 renamed the UI Agent Protocol type strings from dot-form (ui.event, ui.task, etc.) to kebab-case (ui-event, ui-task) to match the rest of the RTVI protocol. Music-player uses the constants by name so no functional change; just brings the inline doc references up to date.

Temporary [tool.uv.sources] overrides so reviewers and CI can resolve ``pipecat-ai>=1.1.0`` and ``pipecat-ai-subagents>=0.4.0`` from the open wire-format PRs before either is published to PyPI. Companion PRs: - pipecat-ai/pipecat#4407 - pipecat-ai/pipecat-subagents#18 uv strips [tool.uv.sources] when building the distribution, so this is install-time-only and does not affect the published demo. Drop this commit (or just the [tool.uv.sources] block) before merging once both upstreams are on PyPI.

markbackman added 15 commits May 1, 2026 17:51

music-player: introduce voice/UI separation with screen + click protocol

a0b8d6e

README: reflect new UIAgent surface (a11y snapshot, mixin tools, answ…

d1bc41d

…er())

markbackman force-pushed the mb/ui-agent branch from 7811865 to feaac89 Compare May 1, 2026 21:54

markbackman added 2 commits May 1, 2026 18:00

markbackman marked this pull request as ready for review May 1, 2026 22:07

markbackman added 2 commits May 2, 2026 09:07

Source UI command payloads (Toast, ScrollTo) directly from pipecat

08ca596

The UI Agent Protocol wire format moved into pipecat.processors.frameworks.rtvi.models with pipecat-ai 1.2.0; subagents no longer re-exports the payload models. Update the import to match the new canonical location.

Source BusUIEventMessage from agents.ui.ui_messages

18ee370

Subagents moved the BusUI* carrier classes from bus.messages to agents.ui.ui_messages so they live next to the UIAgent that uses them. Update the import to the new path.

markbackman force-pushed the mb/ui-agent branch from 26fb80e to f526ef2 Compare May 2, 2026 14:24

markbackman changed the title ~~UI agent POC~~ Use UIAgent May 2, 2026

markbackman added 2 commits May 2, 2026 12:37

markbackman force-pushed the mb/ui-agent branch from f526ef2 to 842dff1 Compare May 2, 2026 16:53

Update for latest RTVI protocol

53d5f69

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use UIAgent#1

Use UIAgent#1
markbackman wants to merge 22 commits into
mainfrom
mb/ui-agent

markbackman commented Apr 25, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

markbackman commented Apr 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Reference patterns exercised

What works end-to-end

Notes for reviewers

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

markbackman commented Apr 25, 2026 •

edited

Loading