Skip to content

Use UIAgent#1

Open
markbackman wants to merge 22 commits into
mainfrom
mb/ui-agent
Open

Use UIAgent#1
markbackman wants to merge 22 commits into
mainfrom
mb/ui-agent

Conversation

@markbackman
Copy link
Copy Markdown
Owner

@markbackman markbackman commented Apr 25, 2026

Summary

A voice-driven music browsing app built on Pipecat and pipecat-subagents. Reference implementation that exercises the full UI Agent Protocol surface against a real-world domain (live Deezer-backed music catalog).

See README.md for the architecture diagram, feature list, things to try, and reference patterns.

Reference patterns exercised

  • Voice / UI separation of concerns: VoiceAgent (LLM, bridged) handles only the conversation; UIAgent (LLM, non-bridged) owns the navigation stack and screen state. Voice delegates every UI request via async with self.task("ui", payload={"query": query}). The UI agent completes with a speak field and the voice agent hands it verbatim to TTS without re-running its LLM.
  • Parallel fan-out with streaming results: start_discovery(seed_artist) calls start_user_task_group("similar_artist", "genre", "two_hop", ..., cancellable=True). Each worker streams tracks via send_task_update(data={"kind": "track", ...}). The UI agent's on_task_update interception turns each into an add_track UI command so the Discovery grid fills as workers find tracks. Lifecycle envelopes (group_started / task_update / task_completed / group_completed) flow to the React client unchanged; useUITasks() renders the in-flight panel and Cancel button without app-specific wiring.
  • Accessibility snapshots as <ui_state>: the React client calls useA11ySnapshot() near the app root, streaming the document's a11y tree as ui-snapshot RTVI messages. The UI agent stores the latest and auto-injects it as <ui_state> at the start of every task, so the LLM always reasons over the current screen.
  • Long-lived singleton agent: CatalogAgent is spawned as a runner peer (not per-connect), so its Deezer cache survives across clients and its expensive warm-up runs once per process.
  • Ack-first ordering for slow tools: start_discovery pushes a placeholder Discovery screen and speaks the ack first, then resolves the seed, then re-emits the canonical record before firing workers. Cold catalog seeds can take several seconds; the user gets visible + audible feedback within 2-3s instead of a stalled tool.
  • keep_history=True + auto context summarization: music browsing is naturally multi-turn ("show me Nirvana" → "play their best album" → "skip that one"), so deictic references resolve against prior exchanges. Summarization keeps the context bounded over long sessions; old <ui_state> snapshots compress into a system summary while the most recent turns stay verbatim.
  • Silent fire-and-forget action tools: local scroll_to(ref) / highlight(ref) @tool wrappers send the UI command, complete the in-flight task with no speak, and exit. The visual change on the client is the user-facing feedback; the voice agent stays quiet for that turn. (The SDK's bundled ReplyToolMixin doesn't fit this app's "each tool call IS the whole turn" shape, so the helpers are wrapped locally instead.)

What works end-to-end

  • Voice navigation, item selection, playback control, conversational Q&A grounded in <ui_state>, genre-scoped trending.
  • Multi-turn deixis ("play that one", "more like them", "the first one").
  • Discoveries flow with all three workers contributing tracks, in-flight panel, cancel support.

Notes for reviewers

  • ⚠️ MERGE BLOCKER — revert the [tool.uv.sources] pins in server/pyproject.toml before merging. Commit 842dff1 adds a temporary [tool.uv.sources] block that resolves pipecat-ai and pipecat-ai-subagents from the open wire-format PRs (feat(rtvi): add UI Agent Protocol as first-class RTVI message types pipecat-ai/pipecat#4407 and UIAgent pipecat-ai/pipecat-subagents#18). It exists so reviewers and CI can resolve the deps before those packages publish. Once both upstreams are on PyPI, drop that commit (or just the [tool.uv.sources] block) so the demo resolves from PyPI like any other consumer. The override is install-time-only — uv strips [tool.uv.sources] from the published distribution — but leaving it in the repo would mask a regression where the published packages fail to resolve.
  • Companion PRs land the wire format on each side:

Test plan

No automated tests; the demo is exercised end-to-end against the React client. Manually verified:

  • Voice navigation, item resolution, playback (the prompts in the README's "Things to try" section)
  • Discoveries with all three workers contributing tracks; in-flight panel + Cancel
  • Multi-turn dialog with deictic references across turns
  • Long-session sanity check (auto-summarization fires; conversation stays coherent)
  • Reviewer pulls down, runs server + client per the README, hits a few of the prompts
  • Before merge: drop commit 842dff1 (or the [tool.uv.sources] block in server/pyproject.toml) once pipecat-ai 1.2.0 and pipecat-ai-subagents 0.4.0 are on PyPI

markbackman added 15 commits May 1, 2026 17:51
App root calls useA11ySnapshot so the client streams accessibility
snapshots to the server on mount, DOM mutations, focus, scrollend,
resize, and visibility change. Grid component adds role=grid +
aria-colcount so the agent can resolve position references ("top
right", "the first one") directly from [cols=N] in <ui_state>.

Server: on_task_request injects the latest snapshot just-in-time,
so the LLM always reasons over the current screen. Removed the
stale _inject_ui_update calls from _emit_* and _respond, which had
been re-injecting prose descriptions into the UI agent's own
context one tick before the client re-rendered. UI agent prompt
updated to describe the Playwright-MCP tree format and the
[offscreen] / [cols=N] semantics.
UIAgent base now auto-injects <ui_state> on task request, so the
music-player override no longer calls inject_ui_state() manually.

System prompt's UI-context section replaced with the SDK-exported
UI_STATE_PROMPT_GUIDE constant, so format updates flow through with
the pipecat-subagents version.

Deletes the spike-only helpers: _debug_snapshot, _inject_ui_update,
and the four _describe_*_screen methods (home/artist/detail/trending)
plus the _describe_grid helper. Tool return values for
navigate_to_artist and select_item now use short strings that the
voice agent paraphrases for confirmation; positional context lives
in <ui_state> alone. Removes log_snapshots=True from the UIAgent
constructor call (flag no longer exists on the base class).
When the voice agent dispatches ``answer_about_music("what's the best
song on this album?", about="Starboy")``, the inner LLM call now
sees which album the question is about. Previously the ``about``
field was only used for toast rendering and never reached the
prompt, so the inner LLM asked the user to clarify even when the
UI agent had already resolved "this album" correctly.

- ``descriptions.answer_question`` takes ``about`` and optional
  ``about_tracks`` kwargs. The prompt templates gain a
  ``{focus_section}`` slot that renders a "Focus item" line plus,
  for albums, the tracklist. The inner LLM uses this to resolve
  deixis and reason about track-level questions.
- ``_answer_question`` resolves the referenced album via the new
  ``_resolve_about_tracks`` helper. It prefers the already-cached
  tracks on the artist's album record (populated by ``_emit_detail``)
  and falls back to fetching from the catalog on demand.
Exercises the two SDK capabilities that weren't yet showcased:
the LLM can now scroll offscreen items into view and visually
point at elements by ref.

Server (``server/ui_agent.py``)
- ``UIAgent`` inherits ``ScrollToToolMixin`` and ``HighlightToolMixin``
  alongside the base ``UIAgent``. MRO picks up both tools so the
  LLM sees ``scroll_to(ref)`` and ``highlight(ref)``.
- System prompt documents when to call each: ``scroll_to`` for
  ``[offscreen]`` elements, ``highlight`` for "point at / which one
  is X" turns.
- New ``_seed_demo_favorites()`` populates the Favorites grid at
  init with Radiohead's "In Rainbows" and Bad Bunny's "DeBÍ TiRAR
  MáS FOToS" so "scroll to my favorites" lands on content instead
  of the empty-state placeholder.

Client
- ``useServerMessages`` replaces the hand-rolled
  ``[data-scroll-target=...]`` handler with
  ``useStandardScrollToHandler({ block: "center", container: () =>
  document.querySelector(".main") })``. The standard handler
  resolves ref first (what the LLM sends) and scrolls inside the
  overflow container so the sticky header is cleared.
- Adds ``useStandardHighlightHandler({ className: "ui-highlight",
  defaultDurationMs: 2000, scrollIntoViewFirst: true })``. Offscreen
  targets auto-scroll into view before flashing.
- ``.ui-highlight`` CSS: 3px gold ring with a glow + a subtle 3%
  scale pulse, fades over 2s. Bright enough to read on a shared
  screen.

Demo script (6 turns, each hits a distinct capability):
1. "Show me Radiohead" — nav (existing)
2. "Which album is OK Computer?" — highlight
3. "Play the last song on this album" — scroll_to → play
4. "Go back" — nav (existing)
5. "Which one is The Weeknd?" — highlight on Trending
6. "Scroll to my favorites" — scroll_to (seeded content lands)
Drops the per-app task-tracking boilerplate now that the SDK's UIAgent
records the in-flight task and exposes respond_to_task(). _respond is
now a thin wrapper that sets the music-player's `description` field
and delegates to respond_to_task. The local scroll_to/highlight
overrides are gone; the SDK mixin tools complete the task silently
and the voice agent has a third branch that emits no TTS when the
response is empty.

Also collapses answer_about_catalog/answer_about_music into a single
answer(text, about=None) tool that writes the spoken reply inline,
grounded by <ui_state> and the model's training knowledge. Removes
the now-unused answer_question helper and prompts from descriptions.py.
…highlight

The SDK action mixins (ScrollToToolMixin, HighlightToolMixin) are now
pure chainable side effects: they dispatch a UI command and leave the
task open so the LLM can chain another tool in the same turn.

This app's prompt is "exactly one tool per turn", so the chainable
shape doesn't fit. Drop the mixin imports and define scroll_to /
highlight locally as silent terminators: send_command(...) +
respond_to_task() + result_callback(None). Behavior is unchanged from
before the SDK refactor; the visual change on the client is still the
user-facing feedback and the voice agent stays quiet for that turn.

README updated to reflect the local definition.
The SDK now exposes scroll_to(ref) and highlight(ref) as plain
instance methods on UIAgent (wrapping send_command + the standard
payload dataclasses). The local @tool overrides here can delegate
to them via super(), dropping the direct send_command + dataclass
construction.

Drop the unused Highlight import (no longer referenced in this
file). ScrollTo stays — it's still used elsewhere.

The "one tool per turn" design unchanged: scroll_to and highlight
remain @tool methods that call respond_to_task() to silently
terminate. ReplyToolMixin is not composed here because this app's
tool surface is many distinct named tools, not the canonical
answer + visuals bundle.
Music browsing is multi-turn by nature. The user says "show me
Nirvana" → "play their best album" → "skip that one and try the
next" — each follow-up references something from the prior
exchange. With keep_history=False the LLM saw only the current
<ui_state> per turn and couldn't ground these references.

Switch the UI agent to keep_history=True so it accumulates
conversation history. To bound growth over long sessions, enable
auto context summarization on the assistant aggregator
(LLMAssistantAggregatorParams.enable_auto_context_summarization=
True). Thresholds: 8000 tokens or 20 unsummarized messages
trigger a summary; the summary targets 6000 tokens and keeps the
last 4 messages verbatim so the most recent dialog stays intact
for the model.

The aggregator handles summarization internally — the LLM service
that already runs on the agent generates the summary inline when
the request frame fires. No pipeline override or extra processor
needed; the right hook was the assistant aggregator's params,
which LLMContextAgent already accepts as a constructor kwarg.

Adds an on_summary_applied event handler in on_ready() that logs
when a summary is applied, with before/after message counts.
Useful for understanding session dynamics in long traces.
…y screen

User says "find me music like Radiohead" and the UI agent fans out
to three worker recommenders in parallel. Each worker pulls
candidate artists from a different angle, fetches their top tracks
through the existing CatalogAgent, and streams tracks back as they
arrive. The UI agent translates each streamed track into an
add_track UI command, which the client renders into a new
Discovery screen.

Server-side pieces:

- discovery_workers.py: three BaseAgent subclasses
  (SimilarArtistRecommender, GenreRecommender, ChartRecommender),
  each overriding find_candidate_artists() to pull candidates from
  a different catalog action (related_artists, get_trending(genre),
  get_trending(None)). The base streams tracks via send_task_update
  with kind="track" so they arrive incrementally.

- ui_agent.py:
  - new start_discovery(seed_artist) @tool that resolves the seed,
    pushes a Discovery NavFrame, and fires
    start_user_task_group(...) against the three worker names —
    fire-and-forget so the voice agent unblocks while workers run.
  - on_task_update interception: when a registered discovery group
    streams a track update, emit a custom add_track UI command
    carrying the track payload + source name. Other update kinds
    (free-form progress text) flow through unchanged via the
    standard task lifecycle forwarding.
  - new "discovery" Screen value + Discovery NavFrame fields
    (seed id + name) + _emit_discovery() that pushes the screen +
    dispatch in _emit_for_top so reconnects re-emit correctly.
  - new @on_ui_event("track_click") + _handle_discovery_track_click
    that resolves the artist, finds the song in its catalog
    record, and plays via _do_play. Re-clicking the active track
    stops playback (parity with the existing play_track handler).
  - prompt update: documents start_discovery so the LLM knows when
    to reach for it.

- bot.py: registers the three workers alongside voice + ui in
  on_client_ready, so they're available as task targets when
  start_user_task_group dispatches.

Workers don't talk to Deezer directly. Every catalog lookup goes
through the long-lived CatalogAgent, the same data layer the rest
of the app uses, so caching + rate-limiting stay centralized.

This adds a real demonstration of start_user_task_group + the four
ui.task envelopes + cancellation in a production-shaped app on top
of real Deezer data — the SDK feature with the biggest gap in
music-player's prior coverage.

Client-side pieces (Discovery screen + add_track command handler +
track_click event emit) follow in the next commit.
Companion to the server-side discoveries flow. The client now
renders a Discovery screen, accumulates streamed tracks, exposes a
Cancel button driven by useUITasks, and emits track_click events
on card press.

Type changes (types.ts):

- New "discovery" Screen variant with seedArtist + backEnabled.
- New DiscoveryTrack interface mirroring the server's add_track
  payload (with a "source" field naming the worker that surfaced
  it: similar_artist / genre / chart).
- New "track_click" ClickEvent variant for discovery card clicks.

State (hooks/useServerMessages.ts):

- discoveryTracks: DiscoveryTrack[] accumulated from add_track
  custom commands. Deduped by track id (workers occasionally
  surface the same track from different angles). Cleared on a new
  screen=discovery push so each session starts empty.
- "discovery" branch of the screen command handler stores the seed
  artist + clears discoveryTracks for the new session.
- New AddTrackPayload type + useUICommandHandler<AddTrackPayload>
  registration for the add_track command.

Screen component (screens/Discovery.tsx):

- Reads in-flight task group state via useUITasks() (the React
  UITasksProvider subscribes to the SDK's ui.task envelopes). Picks
  the most recent group whose label starts with "Discoveries:" as
  the active session. Per-worker rows show running/completed/
  cancelled status + the latest update text. Cancel button calls
  cancelTask(taskId).
- Track grid fills from props as add_track commands arrive. Cards
  show cover, title, artist, and a per-source label. Click → fires
  the new track_click event.

App.tsx:

- Wraps the existing UIAgentProvider in UITasksProvider so
  useUITasks works.
- Renders Discovery screen for screen.kind === "discovery".
- Threads discoveryTracks down as a prop.
- Includes "discovery" in the back-enabled and screenIdentity
  switches so the existing scroll-reset and back-button behaviors
  cover it too.

Styles (index.css): self-contained .discovery-* classes for the
header, in-flight panel, worker rows, track grid, and per-source
accent colors.
- Workers now all answer "what's like X": drop chart recommender, add
  two-hop (related-of-related), fix genre worker by reading genre_id
  off the artist's releases (Deezer's top-tracks endpoint strips the
  album subobject, so my earlier path returned None for everyone).
- Cap concurrent Deezer requests with a module-level semaphore so
  fan-out + warm-up can't trip the IP rate limit; raise per-task
  catalog timeout to 15s to absorb queueing during cold cascades.
- Drop the heavy home warm-up. Build artist records lazily on click
  so user-initiated work owns the rate-limit budget. Cold home click
  is now ~1-3s; cold discovery is ~3-5s instead of ~12s.
- start_discovery pushes a placeholder Discovery screen and speaks
  the ack first, resolves the seed second, fires workers third. The
  user gets visible+audible feedback within 2-3s instead of 13s of
  silence on a cold seed.
- Tighten the UI agent prompt so a named seed routes to
  start_discovery; previously "show me artists like Radiohead" fell
  through to navigate_to_artist.
- Grid goes to 8 columns; pull 3 tracks per artist for fuller fills;
  rename pill labels to "Genre peers" and "More like these".
Architecture diagram, features list, and reference patterns all
ignored discovery; "things to try" was missing the new prompt. Adds:

- Two-pattern framing in the opening (voice/UI separation + parallel
  fan-out).
- Discovery workers branch in the architecture diagram, plus a
  bullet describing how on_task_update interception turns streamed
  task updates into add_track UI commands.
- Discoveries feature entry plus a multi-turn-context entry covering
  keep_history + auto context summarization.
- "Show me artists like Radiohead" in things to try.
- Two new reference patterns: parallel fan-out with streaming
  results, and ack-first ordering for slow tools.

Also replaces all em dashes with colons / commas to match the rest
of the project's writing style.
Regression: with the user manually navigated to an album page,
asking "tell me about this album" or "when did Nevermind come out?"
made the voice agent answer from its own training knowledge,
bypassing handle_request and the UI agent's screen-grounded answer
path entirely.

The Absolute routing rule already covered "any question about
what's on screen", but that bullet wasn't strong enough against
prompts that sounded like general trivia. Adds an explicit
music-domain bullet with concrete trivia-style examples plus a
"do not answer music questions from training knowledge" line, and
biases the "when not to call the tool" section toward delegating
on uncertainty.
The reference-patterns section described the SDK's action tools as
'chainable side effects,' which referred to the older SDK design
that we replaced. The SDK now ships ReplyToolMixin as a single
bundled reply() tool with a required answer argument. Updates the
text to describe what the SDK actually exposes today and why this
app's one-domain-tool-per-turn shape diverges from it.

The inline comment in ui_agent.py around the local tool
definitions was already accurate; this aligns the README with it.
Two tiny fixes for cross-project consistency with the SDK:

- 'Voice / UI separation-of-concerns' (with spaces) → 'Voice/UI ...'
  to match the existing 'Voice/UI split via task dispatch' bullet
  in the same file and the SDK's 'voice/UI delegation pattern'.
- 'UIAgent (LLM, not bridged)' → 'UIAgent (LLM, non-bridged)' to
  match the SDK demos README's 'a non-bridged UIAgent' canonical
  phrasing.
@markbackman markbackman marked this pull request as ready for review May 1, 2026 22:07
The UI Agent Protocol wire format moved into
pipecat.processors.frameworks.rtvi.models with pipecat-ai 1.2.0;
subagents no longer re-exports the payload models. Update the
import to match the new canonical location.
Subagents moved the BusUI* carrier classes from bus.messages to
agents.ui.ui_messages so they live next to the UIAgent that uses
them. Update the import to the new path.
@markbackman markbackman changed the title UI agent POC Use UIAgent May 2, 2026
Pipecat 1.2.0 renamed the UI Agent Protocol type strings from
dot-form (ui.event, ui.task, etc.) to kebab-case (ui-event,
ui-task) to match the rest of the RTVI protocol. Music-player
uses the constants by name so no functional change; just
brings the inline doc references up to date.
Temporary [tool.uv.sources] overrides so reviewers and CI can resolve
``pipecat-ai>=1.1.0`` and ``pipecat-ai-subagents>=0.4.0`` from the open
wire-format PRs before either is published to PyPI.

Companion PRs:
- pipecat-ai/pipecat#4407
- pipecat-ai/pipecat-subagents#18

uv strips [tool.uv.sources] when building the distribution, so this
is install-time-only and does not affect the published demo.

Drop this commit (or just the [tool.uv.sources] block) before
merging once both upstreams are on PyPI.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant