UIAgent#18
Temporary [tool.uv.sources] overrides so reviewers and CI can resolve ``pipecat-ai>=1.1.0`` and ``pipecat-ai-subagents>=0.4.0`` from the open wire-format PRs before either is published to PyPI.
Companion PRs:
- pipecat-ai/pipecat#4407
- pipecat-ai/pipecat-subagents#18
uv strips [tool.uv.sources] when building the distribution, so this is install-time-only and does not affect the published demo.
Drop this commit (or just the [tool.uv.sources] block) before merging once both upstreams are on PyPI.
A model that emits a non-dict entry in ``fills`` (or a non-string ref in ``highlight`` / ``click``) would have crashed the tool body before ``respond_to_task`` ran. Because UIAgent acquires the single-flight task lock in ``on_task_request`` and only releases it via ``respond_to_task`` (or the cancellation path), an unhandled exception in the tool would have stranded the lock until the voice-agent's 30s task timeout fired ``on_task_cancelled`` — 30s of UI deadlock for what's almost always a transient model hiccup. Skip non-conforming entries instead. The fix is at the LLM-input boundary (the contents of the list arguments) rather than a broad try/finally so that real bugs in the helpers still surface as exceptions. Adds three regression tests covering the non-dict ``fills``, non-string ``highlight``, and non-string ``click`` cases. Each one asserts the critical invariant: ``respond_to_task`` and ``result_callback`` still run, so the lock is released. Reported by Codex review of #18.
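The skip-at-the-boundary idea in the commit message above could look roughly like the sketch below. The helper names `clean_fills` and `clean_refs` are hypothetical and exist only for illustration; the actual tool body may filter differently.

```python
from typing import Any


def clean_fills(fills: Any) -> list:
    """Keep only dict entries from a model-supplied ``fills`` list."""
    if not isinstance(fills, list):
        return []
    return [entry for entry in fills if isinstance(entry, dict)]


def clean_refs(refs: Any) -> list:
    """Keep only string refs from a model-supplied ``highlight`` / ``click`` list."""
    if not isinstance(refs, list):
        return []
    return [ref for ref in refs if isinstance(ref, str)]
```

Because only the list contents are filtered, an exception raised later by a helper would still surface rather than being swallowed, which matches the stated reason for avoiding a broad try/finally.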
),
)

async def on_task_request(self, message: BusTaskRequestMessage) -> None:
This could be:
@task(name="hello")
async def on_hello(self, message: BusTaskRequestMessage) -> None:
The call to await super().on_task_request(message) will happen automatically.
await super().on_ready()
# Route inbound UI events (incl. the reserved snapshot event)
# at HelloAgent — the snapshot is what HelloAgent reasons over.
attach_ui_bridge(self, target="hello")
Calling this attach_ui_bridge is a bit weird. We already have the notion of a bridge, which is a frame processor that outputs and inputs from the bus. Maybe we can also do this as a decorator that gets automatically called when the pipeline is ready.
What about:
@ui_agent(agent="hello")
class HelloRoot(BaseAgent):
Could there be multiple UI agents? I think it should be possible. If so, maybe:
@ui_agent(agents=["hello", "hello2"])
class HelloRoot(BaseAgent):
    ....
# auto_inject_ui_state is on) injects ``<ui_state>``.
# Then feed the user's query into the LLM context with
# ``run_llm=True`` so the LLM actually generates.
await super().on_task_request(message)
Maybe we document the @task(name="xxxx") instead?
``run_llm=True`` the context contains exactly: current
``<ui_state>`` + query.
"""
await self._task_lock.acquire()
This assumes a UIAgent can only run one task at a time. That makes it hard for the user to run additional tasks on a UIAgent. Need to think about this, but it feels a bit limiting.
Maybe we can create a @ui_task(name="name") decorator instead that blocks... Or maybe even more generic @task(name="hello", sync=True) meaning that this task will block other tasks with the same name. I'll think about it.
UIAgent(LLMAgent) dispatches BusUIEventMessage to @on_ui_event handlers (each runs in its own asyncio task so the bus dispatcher is never held open) and exposes send_command for server-to-client messages. attach_ui_bridge forwards RTVI client messages onto the bus and translates BusUICommandMessage back to RTVIServerMessageFrame pushed into the root agent's pipeline. Bus gains BusUIEventMessage, BusUICommandMessage, and the matching type constants; standard command payloads (Toast, Navigate, ScrollTo, Highlight, Focus) ship alongside. Events inject as <ui_event name="...">payload</ui_event> developer messages into the LLM context by default; override render_ui_event or set inject_events=False to opt out.
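As a rough illustration of the "each handler runs in its own asyncio task" behavior and the default `<ui_event>` rendering described above, here is a toy dispatcher. `EventDispatcher`, its `on` decorator, and the JSON serialization of the payload are assumptions for this sketch, not the SDK's `UIAgent` or `@on_ui_event` implementation.

```python
import asyncio
import json
from typing import Any, Awaitable, Callable, Dict, List, Set

Handler = Callable[[dict], Awaitable[None]]


class EventDispatcher:
    """Toy stand-in for the handler dispatch described above (not UIAgent)."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Handler]] = {}
        self._tasks: Set[asyncio.Task] = set()

    def on(self, name: str) -> Callable[[Handler], Handler]:
        def register(fn: Handler) -> Handler:
            self._handlers.setdefault(name, []).append(fn)
            return fn

        return register

    def dispatch(self, name: str, payload: dict) -> None:
        # Each handler gets its own task, so the caller returns immediately
        # and a slow handler never holds the dispatcher open.
        for fn in self._handlers.get(name, []):
            task = asyncio.create_task(fn(payload))
            self._tasks.add(task)
            task.add_done_callback(self._tasks.discard)

    async def drain(self) -> None:
        if self._tasks:
            await asyncio.gather(*list(self._tasks))


def render_ui_event(name: str, payload: Any) -> str:
    # Assumption: the payload is serialized as JSON inside the tag.
    return f'<ui_event name="{name}">{json.dumps(payload)}</ui_event>'
```

Keeping a reference to each task in a set until it finishes avoids the task being garbage-collected mid-flight, which is a general asyncio precaution rather than something the commit message specifies.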
Adds snapshot storage and LLM-context injection for the __ui_snapshot reserved event. UIAgent.render_ui_state() produces Playwright-MCP-style indented text with stable refs, offscreen tags, and grid dimensions. inject_ui_state() queues a developer message. visible_nodes() returns the non-offscreen subset. ScrollTo/Highlight/Focus commands now accept ref alongside target_id so the server can reference nodes it saw in <ui_state>.
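To make the indented-text rendering concrete, here is a minimal sketch of how a snapshot tree might be turned into lines with refs and offscreen tags. The node shape (`role`, `name`, `ref`, `offscreen`, `children`) and the exact output syntax are assumptions for illustration; the real `render_ui_state()` output may differ, and grid dimensions are omitted.

```python
from typing import Any, Dict, List


def render_ui_state(node: Dict[str, Any], depth: int = 0) -> List[str]:
    """Render one node and its children as indented lines."""
    line = f'{"  " * depth}- {node["role"]}'
    if node.get("name"):
        line += f' "{node["name"]}"'
    line += f' [ref={node["ref"]}]'
    if node.get("offscreen"):
        line += " [offscreen]"
    lines = [line]
    for child in node.get("children", []):
        lines.extend(render_ui_state(child, depth + 1))
    return lines


def visible_nodes(node: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Flatten the tree, skipping offscreen nodes (and, in this sketch, their subtrees)."""
    if node.get("offscreen"):
        return []
    out = [node]
    for child in node.get("children", []):
        out.extend(visible_nodes(child))
    return out
```

A ref that appears in the rendered text is what a `ScrollTo`, `Highlight`, or `Focus` command would then carry back to the client.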
UIAgent now injects the latest <ui_state> snapshot into the LLM context at the start of every task request, so the agent always reasons over the current screen. Controlled by the new auto_inject_ui_state constructor option (default True). Apps that override on_task_request pick up the behavior via super(). Adds UI_STATE_PROMPT_GUIDE, a canonical prompt fragment that documents the <ui_state> and <ui_event> wire format. Apps concatenate it into their system prompt so the LLM understands the SDK-managed developer messages it receives, and we can evolve the format in one place. Removes the spike-only log_snapshots option, _log_snapshot_receipt, _previous_snapshot_root_ref, and the now-unused _count_nodes helper.
Adds an opt-in ``log_snapshots`` constructor flag that emits a ``logger.debug`` line on every accessibility snapshot received (node count, char count, rough token estimate, and the full rendered ``<ui_state>``). Defaults to off. Useful in dev / staging for eyeballing what the LLM will see before the next inject. Ships ``ScrollToToolMixin`` alongside ``UIAgent`` in a new ``ui_tools`` module. Apps that want the LLM to be able to scroll offscreen elements into view inherit the mixin, which exposes a ``scroll_to(ref)`` tool. The tool dispatches a standard ``ScrollTo`` command by ref; keeping it as a mixin (not a base-class method) means single-screen apps don't get a ``scroll_to`` tool cluttering their LLM tool list.
Mirrors the ScrollTo mixin: apps that want the LLM to be able to
point at visible elements ("which one is Radiohead?") inherit
``HighlightToolMixin`` alongside ``UIAgent``. The mixin exposes a
``highlight(ref)`` tool that dispatches the standard ``Highlight``
command by snapshot ref. The client's
``useStandardHighlightHandler`` (or a custom one) does the visual
effect.
Like the scroll mixin, keeping this as a separate mixin means
agents that don't need visual highlighting don't pay the
tool-list cost.
``test_ui_tools`` grows from 3 to 7 cases: per-mixin coverage
(expose-tool, plain-agent-doesn't, dispatch-by-ref) plus a
combined-mixin check.
Adds `current_task` tracking and a `respond_to_task(...)` helper on UIAgent so `@tool` methods can complete the in-flight task without threading the task id through every call. The shipped mixin tools (`scroll_to`, `highlight`) now dispatch the UI command, complete the task with no `speak` field, and exit silently. The visual change on the client is the user-facing feedback; apps that want spoken narration override the mixin tool and pass `speak`.
The user asks the assistant to research a topic. The UI agent spawns a background asyncio task that runs user_task_group(...) across three worker agents (Wikipedia, news, scholar). The SDK auto-forwards every task lifecycle event to the client as ui.task envelopes — group_started, task_update, task_completed, group_completed — and the client renders an in-flight card with per-worker status. The user can cancel any group via a Cancel button on the card; the SDK ships the __cancel_task event back to the agent which calls cancel_task() on the registered group. The custom @tool reply has a research_query field. When set, the tool spawns the task group via create_asyncio_task and returns immediately with the spoken acknowledgement ("Researching X now"). The voice agent isn't blocked; it can handle follow-up turns while workers run in the background. Workers are simulated BaseAgent subclasses with on_task_request that emit a few send_task_update progress messages followed by a send_task_response with a canned summary. asyncio.sleep with randomized intervals makes the streaming UI come alive without needing real data sources. This rounds out the demo arc: hello-snapshot (read), pointing (visual point), deixis (text point), form-fill (state-changing actions), async-tasks (fan-out + streaming progress + cancel). Each demo is a one- or two-line composition of the SDK primitives plus a focused page.
The canonical fire-and-forget pattern with user_task_group required
ceremony: a separate method to host the async with, a unique
asyncio task name string, and a pass body for the context manager.
The async-tasks demo shows the pattern in full:
self.create_asyncio_task(self._run_research(query), f"...")
...
async def _run_research(self, query):
async with self.user_task_group(...):
pass
Replace with one call:
await self.start_user_task_group(
"wikipedia", "news", "scholar",
payload={"query": query},
label=f"Research: {query}",
)
Returns the task_id once the group_started envelope has fired
(so the client renders immediately) and runs the context to
completion in a background asyncio task the SDK manages. The
context-manager form stays available for callers that want to
consume worker events inline (async for event in tg).
Updates the async-tasks demo to use the new helper. Drops the
_run_research method and the create_asyncio_task ceremony from the
LLM tool body; the reply tool body is now ~10 lines shorter.
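The "return once the group has started, finish in the background" shape described above can be sketched with plain asyncio. `task_group` and `start_task_group` below are hypothetical stand-ins, not the SDK's `user_task_group` or `start_user_task_group`; the event strings only mimic the envelope names.

```python
import asyncio
import uuid
from contextlib import asynccontextmanager
from typing import List, Tuple


@asynccontextmanager
async def task_group(events: List[str], task_id: str):
    """Hypothetical stand-in for user_task_group: announces start and completion."""
    events.append(f"group_started:{task_id}")
    try:
        yield
        await asyncio.sleep(0.01)  # pretend to wait for workers
    finally:
        events.append(f"group_completed:{task_id}")


async def start_task_group(events: List[str]) -> Tuple[str, asyncio.Task]:
    """Return once the group has started; the rest runs in the background."""
    task_id = uuid.uuid4().hex
    started = asyncio.Event()

    async def runner() -> None:
        async with task_group(events, task_id):
            started.set()  # the start event has fired by the time the body runs

    background = asyncio.create_task(runner())
    waiter = asyncio.create_task(started.wait())
    await asyncio.wait({background, waiter}, return_when=asyncio.FIRST_COMPLETED)
    if not started.is_set():
        waiter.cancel()
        background.result()  # re-raise a failure that happened during start-up
    return task_id, background
```

Waiting on both the runner and the start event means a failure while entering the group surfaces to the caller instead of leaving it waiting forever.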
…ancelled
When the client cancels an in-flight user_task_group, the group's
wait() raises TaskGroupError on __aexit__ ("user requested" or
similar reason). For the context-manager form that bubbles to the
caller as expected. For start_user_task_group's background runner
it was bubbling to the asyncio task manager and getting logged as
an unexpected exception.
Cancellation is an expected exit for fire-and-forget groups: the
client already knows because it received the group_completed
envelope. Catch TaskGroupError around the __aexit__ call and log
at debug. Other exceptions still log at warning.
Also restructures the runner so iteration exceptions are forwarded
into __aexit__ correctly (not swallowed by a finally that calls
__aexit__(None, None, None) instead).
Adds a regression test that triggers the exact path:
start_user_task_group with a slow worker, cancel, and verify the
group_completed envelope still publishes and the task group is
cleaned up — without leaking TaskGroupError to the test runner.
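A minimal sketch of the runner restructuring described above: body exceptions are forwarded into `__aexit__`, and a cancellation raised on exit is treated as expected. `FakeGroup`, the local `TaskGroupError`, and `run_in_background` are stand-ins invented for this example, not the SDK's classes.

```python
import asyncio
import logging
import sys

logger = logging.getLogger("ui_agent_sketch")


class TaskGroupError(Exception):
    """Local stand-in for the SDK's TaskGroupError."""


class FakeGroup:
    """Hypothetical group whose exit raises when the client cancelled."""

    def __init__(self, cancelled: bool) -> None:
        self.cancelled = cancelled
        self.exit_args = None

    async def __aenter__(self) -> "FakeGroup":
        return self

    async def __aexit__(self, exc_type, exc, tb) -> bool:
        self.exit_args = (exc_type, exc, tb)
        if self.cancelled:
            raise TaskGroupError("user requested")
        return False


async def run_in_background(group: FakeGroup, body) -> str:
    await group.__aenter__()
    exc_info = (None, None, None)
    try:
        await body(group)
    except Exception:
        exc_info = sys.exc_info()  # forward the real failure into __aexit__
    try:
        await group.__aexit__(*exc_info)
    except TaskGroupError as e:
        logger.debug("task group cancelled: %s", e)  # expected exit path
        return "cancelled"
    except Exception:
        logger.warning("task group failed on exit", exc_info=True)
        return "failed"
    if exc_info[1] is not None:
        logger.warning("task group body failed", exc_info=exc_info)
        return "failed"
    return "ok"
```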
Combines every prior demo's pattern into one workspace where the
user reviews a draft article by voice. The user can:
- Select a paragraph and ask for review. ReviewAgent calls
start_review(answer, paragraph_ref, paragraph_text), which spawns
two specialist worker agents (clarity, tone) via
start_user_task_group. Workers stream progress to an in-flight
card. As each completes, on_task_response intercepts and emits an
add_note custom command, attaching the worker's feedback to the
paragraph as a note.
- Dictate notes by voice. Reply tool's fills + click drive the
textarea and Save button, same as form-fill.
- Ask "where does it talk about X" and the agent uses select_text
+ scroll_to to navigate, same as deixis write direction.
- Click any note in the panel. Client emits a note_click UI event;
the agent's @on_ui_event("note_click") handler dispatches
select_text to jump to the related paragraph. Round-trip
event/command pattern.
Demonstrates two patterns no prior demo touched:
- Custom UI command (add_note) registered locally on the client.
The server emits it via send_command; the client's handler
renders a note card. Apps register their own command names freely.
- Custom client-emitted event (note_click). When the user clicks a
note, the client calls ui.sendEvent("note_click", {ref}); the
agent's @on_ui_event handler reacts.
Two LLM tools coexist: reply (from ReplyToolMixin, for normal
turns) plus a custom start_review (for paragraph review kick-off).
The prompt steers the model to pick the right one. Single tool
call per turn — no chainable coordination problems.
Workers are simulated, like async-tasks: they compute simple text
metrics (word count, sentence count, presence of absolutist or
hedging words) and emit templated feedback that varies meaningfully
per paragraph. The demo is about orchestration, not real NLP.
A real app swaps the workers for LLMAgent subclasses without
changing anything else.
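For a sense of what such a simulated worker might compute, here is a sketch of the text metrics and a templated tone message. The word lists, thresholds, and function names are invented for this example and are not taken from the demo's source.

```python
import re
from typing import Dict

# Illustrative word lists; the demo's actual lists are not shown in this PR text.
ABSOLUTIST = {"always", "never", "all", "none", "every", "must"}
HEDGING = {"maybe", "perhaps", "possibly", "might", "somewhat"}


def paragraph_metrics(text: str) -> Dict[str, int]:
    """Count words, sentences, and absolutist / hedging words."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "words": len(words),
        "sentences": len(sentences),
        "absolutist": sum(w in ABSOLUTIST for w in words),
        "hedging": sum(w in HEDGING for w in words),
    }


def tone_feedback(text: str) -> str:
    """Pick a templated message from the metrics."""
    m = paragraph_metrics(text)
    if m["absolutist"] > m["hedging"]:
        return f"Tone reads absolutist ({m['absolutist']} strong claims); consider softening."
    if m["hedging"] > m["absolutist"]:
        return f"Tone reads hedged ({m['hedging']} hedges); consider committing to a claim."
    return "Tone reads balanced."
```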
The article is a 6-paragraph draft with deliberately uneven
paragraphs: ¶2 too dense, ¶3 too vague, ¶4 absolutist tone, ¶5/¶6
balanced. So the workers actually have something to flag.
The form-submit handler was trying to find the ref by walking up selection.anchorNode looking for a dataset.ref attribute the walker never sets. So notes always showed up unattached.
Fix:
- Track lastArticleRef on every selectionchange that lands inside the article column. Filtered to article ancestors so subsequent textarea selection (when the user or agent types) doesn't overwrite it.
- Use the new findRefForElement client SDK helper to resolve selection ancestors → ref. Walks up parentElement to find the closest snapshot-known container.
- Submit handler reads lastArticleRef instead of trying to derive the ref from the live selection at submit time. This works for both the manual flow (user types + clicks Save) and the voice flow (agent fills textarea + clicks Save) because the textarea focus doesn't clear lastArticleRef.
Result: select a paragraph, type a note (or dictate one), the note attaches to that paragraph. Click the note in the panel and the page jumps back to the paragraph it's attached to. Closes the loop between dictation and deixis.
Tightens the prompt's note-flow description to reflect the actual behavior: the client tracks selection across the textarea fill, so the agent doesn't need to thread the ref through the tool call.
Two related changes from real-session testing:
1. ReviewAgent now defaults to keep_history=True so the UI agent can resolve conversational deixis. The "can we have a note for that?" case fails when keep_history=False because the UI agent's context is fresh per turn, so "that" has no antecedent. With keep_history=True the UI agent sees its own prior replies and resolves the reference correctly. Constructor signature now accepts keep_history as an explicit kwarg so apps that want fresh-per-turn can override.
This is the right default for multi-turn interactive apps where the user and agent work through something together (review, iterate, refine). The original keep_history=False default on UIAgent stays correct for stateless-delegate apps (pointing, form-fill, async-tasks) where each turn is "given the current screen, do X" with no carryover.
2. UI prompt now starts with a "hard rule" requiring every turn to call exactly one tool (reply or start_review). The earlier prompt described both tools but never required calling one; for open questions like "how can we improve it?" gpt-4o-mini would sometimes produce 50 tokens of plain text and the voice agent's task() would time out. The hard rule plus an explicit "general questions go through reply" decision rule prevents that.
Caught in the same session: ReviewAgent's prior __init__ signature hardcoded keep_history=False with no kwarg passthrough, so call sites trying to flip it (or even pass it) would raise TypeError during add_agent inside on_client_ready and silently abort the rest of the handler. That left the registry without ui/clarity/tone agents and surfaced as "agents not ready within timeout" much later when the voice agent first delegated. Worth knowing that exceptions in RTVI event handlers don't propagate noisily.
Indexes the six demos in difficulty order (hello-snapshot → pointing → deixis → form-fill → async-tasks → document-review) with a one-paragraph summary of what each shows. Defers per-demo specifics to each demo's own README. Also covers shared concerns once: how to run any demo (the npm + uv pattern is the same everywhere), the API keys all demos need, and a quick map back to the SDK's public surface for readers exploring the directory cold.
The project uses changelog/<PR>.<type>.md fragments per PR (see existing 19.changed.md and 20.added.md); CHANGELOG.md gets compiled at release time. The earlier direct edit to CHANGELOG.md's [Unreleased] section short-circuited that flow. Moves the content to changelog/18.added.md and reverts CHANGELOG.md to match main.
Two correctness issues raised in code review:
1. Single _current_task slot races under concurrent dispatch. Two overlapping on_task_request calls overwrite each other's task handle, so the first task's tool calls respond_to_task() with the wrong task_id. Even if the slot were per-request (e.g. ContextVar), the agent has only one LLM context and one running pipeline; concurrent processing would still interleave context mutations and corrupt the conversation.
2. The keep_history=False reset wipes any messages pre-seeded via context= on LLMContextAgent's constructor, contradicting the inherited contract.
Fixes (1) by acquiring an asyncio.Lock in on_task_request and holding it until respond_to_task fires. Concurrent submissions queue and process in arrival order. The lock release lives in respond_to_task; a tool that forgets to call it will hang the agent on the next task, which is the correct fast-surfacing signal that something is wrong (no watchdog: it would mask the bug).
Fixes (2) via documentation in the keep_history docstring, calling out that persistent app instructions belong in the LLM's system_instruction setting (which lives outside the context message list and is unaffected by the reset).
Adds test_concurrent_task_requests_serialize covering the overlap case.
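The single-flight behavior from fix (1) can be sketched with a bare `asyncio.Lock`. `SingleFlightAgent` and `handle` are toy names for this example and carry none of the real `UIAgent` machinery.

```python
import asyncio
from typing import List, Optional


class SingleFlightAgent:
    """Toy agent: one task at a time, released by respond_to_task."""

    def __init__(self) -> None:
        self._task_lock = asyncio.Lock()
        self.current_task: Optional[str] = None
        self.log: List[str] = []

    async def on_task_request(self, task_id: str) -> None:
        await self._task_lock.acquire()  # later requests queue here
        self.current_task = task_id
        self.log.append(f"start:{task_id}")

    def respond_to_task(self) -> None:
        self.log.append(f"done:{self.current_task}")
        self.current_task = None
        self._task_lock.release()


async def handle(agent: SingleFlightAgent, task_id: str) -> None:
    await agent.on_task_request(task_id)
    await asyncio.sleep(0.01)  # stands in for the LLM turn and tool call
    agent.respond_to_task()
```

Since `asyncio.Lock` wakes waiters in the order they started waiting, two overlapping requests complete one after the other instead of interleaving.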
The constructor's context= arg doc previously said it was for seeding 'an initial system prompt or message history,' which conflicts with the default keep_history=False reset behavior. Updates to call out that seeded messages are part of mutable task history and get cleared, and points readers to system_instruction for anything durable. Same clarification in reset_context()'s docstring (seeded messages ARE affected by the reset). And keep_history's note now points at UI_STATE_PROMPT_GUIDE as a concrete example of what to put in system_instruction.
BaseAgent._handle_task_cancel sends the CANCELLED response directly via send_task_response and bypasses respond_to_task, which is where the lock release lives. Without an on_task_cancelled hook the lock would stay held after a cancellation and every subsequent UI task request would block at on_task_request's acquire forever.
Override on_task_cancelled to clear _current_task and release _task_lock when the cancelled task_id matches the in-flight one. Idempotent and race-safe: the current_task identity check makes it a no-op when respond_to_task fired first, and the locked() guard makes the release safe regardless of what cleared the slot.
Adds two focused tests:
- cancellation_releases_lock_for_subsequent_tasks: the bug Codex flagged; a follow-up task must not block.
- cancellation_for_unrelated_task_id_leaves_lock_held: confirms we only react to cancels that match the current task.
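A compact sketch of the idempotent release described above, using a toy class rather than the real `UIAgent`. The `_finish` helper is a name made up for this example; the identity check and the `locked()` guard mirror the two safeguards the commit message names.

```python
import asyncio
from typing import Optional


class CancellableAgent:
    """Toy agent showing a cancel path that releases the task lock safely."""

    def __init__(self) -> None:
        self._task_lock = asyncio.Lock()
        self._current_task: Optional[str] = None

    async def on_task_request(self, task_id: str) -> None:
        await self._task_lock.acquire()
        self._current_task = task_id

    def _finish(self) -> None:
        self._current_task = None
        if self._task_lock.locked():  # guard: safe even if already released
            self._task_lock.release()

    def respond_to_task(self) -> None:
        self._finish()

    def on_task_cancelled(self, task_id: str) -> None:
        # Only react to the in-flight task; unrelated or late cancels are no-ops.
        if task_id == self._current_task:
            self._finish()
```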
The UI Agent Protocol wire format (envelope-type strings, reserved event names, task-lifecycle kind discriminators, and the built-in command payload dataclasses) now lives in pipecat.processors.frameworks.rtvi.ui as of pipecat-ai 1.2.0. Single-LLM Pipecat apps and other frameworks can now target the same wire format without taking a subagents dependency. Subagents continues to re-export the same names from pipecat_subagents.bus and pipecat_subagents.agents so existing imports keep working; the canonical definitions just moved. Also replaces the inline 'group_started' / 'task_update' / etc. string literals in the UI bridge with the new UI_TASK_*_KIND constants from pipecat.
These bus messages are subagents-internal carriers exchanged only between UIAgent and the bridge installed by attach_ui_bridge. They have no use outside the UI subpackage. Co-locating them with the agent that consumes them removes a layering wart in bus/messages.py and matches the directory shape of the rest of the UI surface. Removes the BusUI* re-exports from pipecat_subagents.bus and adds them under pipecat_subagents.agents.ui (and at agents.ui.ui_messages for direct import). All internal callers and tests updated to the new path.
Pipecat-ai 1.2.0 promotes the UI Agent Protocol to first-class RTVI message types (ui-event, ui-command, ui-snapshot, ui-cancel-task, ui-task) instead of sub-types carried inside server-message / client-message. Update the bridge to match:
- attach_ui_bridge subscribes to on_ui_message on the RTVI processor (instead of on_client_message). UIEventMessage, UISnapshotMessage, and UICancelTaskMessage are translated onto the bus as BusUIEventMessage carriers; the snapshot and cancel-task carry subagents-internal event names so UIAgent's existing dispatch keeps working.
- Outbound: BusUICommandMessage and the four BusUITask* messages are emitted as RTVIServerTypedMessageFrame wrapping UICommandMessage / UITaskMessage envelopes (instead of RTVIServerMessageFrame with a dict).
Subagents-internal _UI_SNAPSHOT_BUS_EVENT_NAME and _UI_CANCEL_TASK_BUS_EVENT_NAME constants in agents/ui/ui_messages replace the wire-format reserved event names (which were public constants in pipecat). UIAgent dispatches on these internal names.
Tests updated: bridge tests cover the new UI message inputs and the typed frame outputs; the cancel-task tests now construct UICancelTaskMessage instead of forging a ui-event with a reserved name. All 282 subagents tests pass against the local pipecat checkout.
Pipecat's RTVI now ships RTVIUICommandFrame and RTVIUITaskFrame as domain-scoped pipeline frames, mirroring how RTVIServerMessageFrame and the LLM/TTS frames work: the frame carries domain data, the observer wraps it into the matching typed RTVI envelope before sending. Switch the bridge over to push these instead of the generic RTVIServerTypedMessageFrame. The generic typed-message frame is gone from pipecat. This is a better fit with the rest of the RTVI surface: a reader doesn't have to inspect what's inside the frame to know what's being sent, the frame name itself tells them. Symmetric with how LLM events flow (LLMFunctionCallStartedFrame produces an llm-function-call-started envelope inside the observer).
The constant is going away on the pipecat side (it was redundant with the Literal[...] field default on UICommandMessage). Drops the import and the corresponding test assertion, and trims the matching mention from the changelog fragment.
Temporary [tool.uv.sources] override so reviewers and CI can resolve ``pipecat-ai>=1.2.0`` from the open wire-format PR before pipecat 1.2.0 ships on PyPI. uv strips [tool.uv.sources] when building the distribution, so this is install-time-only and does not affect the published package. Companion PR: pipecat-ai/pipecat#4407 Drop this commit (or just the [tool.uv.sources] block) before merging once pipecat 1.2.0 is on PyPI.
A refresh of the v1 primer that lands the moving parts since the original was written:
- Wire format moved to Pipecat as canonical (single-LLM apps don't need subagents). Layered "the pieces" treatment of all four packages plus the reference app, with an architecture diagram.
- Full action vocabulary: select_text, click, set_input_value alongside the original scroll_to / highlight / focus / toast / navigate. Tied to a "what it enables" table that maps user capabilities to wire-format pieces.
- Task lifecycle protocol (start_user_task_group / ui-task / useUITasks) treated as a first-class deployment dimension.
- Two orthogonal deployment knobs (history mode + task shape) called out so the right corner is easy to pick.
- "When (not) to use this" rewritten as app-shape fit followed by deployment-shape decision (single LLM / voice+UI / multi-agent).
v1 stays in place at UI_AGENT_DESIGN.md. v2 keeps the same audience (internal team) and tone (conversational primer); v1 is referenced for the longer "How information flows" treatment and the other two sequence diagrams.
Summary
Adds `UIAgent` plus the orchestration helpers needed for AI agents that observe and drive a GUI app through a structured a11y-snapshot wire format. Six runnable demos exercise the patterns in isolation and in combination.
The wire format itself is now first-class in Pipecat (companion PR pipecat-ai/pipecat#4407): five new RTVI message types (`ui-event`, `ui-command`, `ui-snapshot`, `ui-cancel-task`, `ui-task`), paired pydantic envelope models, and the matching pipeline frames live in `pipecat.processors.frameworks.rtvi.models`. The matching client-side support lives in `@pipecat-ai/client-js` and `@pipecat-ai/client-react` (companion PR pipecat-ai/pipecat-client-web#203). This subagents PR builds the agent abstractions on top of that wire format. Single-LLM Pipecat apps that want UI Agent semantics without the subagents framework can target the wire format directly.
Bumps the minimum `pipecat-ai` dependency to `>=1.2.0`.
What's added
Core SDK (`src/pipecat_subagents/agents/ui/`):
- `UIAgent` (subclass of `LLMContextAgent`) that:
  - Injects `<ui_state>` at the start of every task.
  - Dispatches `ui-event` RTVI messages to `@on_ui_event(name)` handlers without running the LLM, for low latency.
  - Provides `respond_to_task(...)` and a `current_task` property so tools don't have to thread `task_id` manually.
  - Serializes tasks: `on_task_request` acquires a per-agent lock that is held until `respond_to_task` fires, so overlapping requests queue rather than interleaving their context mutations. The lock is also released on cancellation, so a cancelled task can't strand the agent.
  - Has a `keep_history` flag for multi-turn UIs (defaults to `False`, the canonical stateless-delegate pattern that pairs with the voice/UI separation).
  - Offers `send_command(name, payload)` for server-to-client UI commands, going out as first-class `ui-command` RTVI messages. Pairs with the standard payload models that ship in pipecat (`Toast`, `Navigate`, `ScrollTo`, `Highlight`, `Focus`, `SelectText`, `SetInputValue`, `Click`); apps publish their own command names freely.
- Action helpers on `UIAgent`: `scroll_to`, `highlight`, `select_text`, `click`, `set_input_value`. Plain instance methods (not LLM tools) that wrap `send_command` with the standard payloads.
- `ReplyToolMixin`: one bundled `reply(answer, scroll_to=None, highlight=None, select_text=None, fills=None, click=None)` LLM tool. Required `answer` argument keeps smaller models from omitting the spoken terminator (a real failure mode of the chainable-mixin shape we tried first). One tool call per turn, no chaining.
- `start_user_task_group(...)`: fire-and-forget counterpart to the `user_task_group` context manager. Dispatches a worker fan-out, returns the `task_id`, and lets workers run in a background asyncio task that the SDK manages.
- `attach_ui_bridge(root_agent)` that wires the new first-class UI RTVI channels to the agent bus in both directions:
  - Subscribes to `RTVIProcessor.on_ui_message`. `ui-event` and `ui-snapshot` from the client become `BusUIEventMessage` on the bus (the snapshot is routed to `UIAgent` for `<ui_state>` injection; events fan out to handlers).
  - `BusUICommandMessage` from any agent leaves the bus as an `RTVIUICommandFrame` (UI commands) or `RTVIUITaskFrame` (task lifecycle envelopes), which the RTVI observer wraps into the matching `UICommandMessage` / `UITaskMessage` envelopes on the wire.
- `<selection>` block in `<ui_state>` for read-side deixis (text the user has highlighted in the client).
- `UI_STATE_PROMPT_GUIDE` constant: canonical prompt fragment that documents the `<ui_state>` / `<ui_event>` context tags the LLM sees. Apps concatenate it into their system prompt.
New bus message types: `BusUIEventMessage`, `BusUICommandMessage` (in `agents/ui/ui_messages.py`).
Six demos (`examples/local/ui-agent/`), each isolating one concept:
- `hello-snapshot`
- `pointing`: `highlight` action grounded by `<ui_state>` refs
- `deixis`: `<selection>` block
- `form-fill`
- `async-tasks`: `start_user_task_group`, streaming task updates
- `document-review`
For reviewers
- Drop the `[tool.uv.sources]` pin in `pyproject.toml` before merging. Commit `4aa3fbd` adds a temporary `[tool.uv.sources]` block that resolves `pipecat-ai>=1.2.0` from the open wire-format PR (feat(rtvi): add UI Agent Protocol as first-class RTVI message types pipecat#4407). It exists so reviewers and CI can resolve the dep before pipecat 1.2.0 ships on PyPI. Once 1.2.0 lands, drop that commit (or the block) so the published package and downstream installs resolve from PyPI. The override is install-time-only (uv strips `[tool.uv.sources]` from the published distribution), but leaving it in the repo would mask a regression where 1.2.0 fails to resolve cleanly.
- … `UIAgentClient`, React idioms, standard handlers). All three are additive; no existing wire shapes change. The RTVI `PROTOCOL_VERSION` bumps from `1.2.0` to `1.3.0`: a minor bump, so the major-version compat check still passes.
- … `LLMContextAgent` extension, then ergonomics iteration (chainable mixins to bundled `ReplyToolMixin`), then the orchestration primitives, then the wire-format migration to first-class RTVI types, then example/feature pairs (each demo paired with the SDK change it exercises). The top-level `examples/local/ui-agent/README.md` is a good entry point for the demo side.
Test plan
- `uv run pytest` passes (281 tests)
- Drop `4aa3fbd` (or the `[tool.uv.sources]` block in `pyproject.toml`) once pipecat 1.2.0 is on PyPI