UIAgent #18

Open
markbackman wants to merge 57 commits into main from mb/ui-agent

Contributor

@markbackman markbackman commented Apr 25, 2026

Summary

Adds UIAgent plus the orchestration helpers needed for AI agents that observe and drive a GUI app through a structured a11y-snapshot wire format. Six runnable demos exercise the patterns in isolation and in combination.

The wire format itself is now first-class in Pipecat (companion PR pipecat-ai/pipecat#4407): five new RTVI message types (ui-event, ui-command, ui-snapshot, ui-cancel-task, ui-task), paired pydantic envelope models, and the matching pipeline frames live in pipecat.processors.frameworks.rtvi.models. The matching client-side support lives in @pipecat-ai/client-js and @pipecat-ai/client-react (companion PR pipecat-ai/pipecat-client-web#203). This subagents PR builds the agent abstractions on top of that wire format. Single-LLM Pipecat apps that want UI Agent semantics without the subagents framework can target the wire format directly.

Bumps the minimum pipecat-ai dependency to >=1.2.0.

⚠️ Requires a Pipecat release (1.2.0) before this can be merged.

What's added

Core SDK (src/pipecat_subagents/agents/ui/):

  • UIAgent (subclass of LLMContextAgent) that:

    • Stores the latest accessibility snapshot from the client and auto-injects it as <ui_state> at the start of every task.
    • Routes inbound ui-event RTVI messages to @on_ui_event(name) handlers without running the LLM, for low latency.
    • Provides respond_to_task(...) and a current_task property so tools don't have to thread task_id manually.
    • Single-flight task semantics: on_task_request acquires a per-agent lock that is held until respond_to_task fires, so overlapping requests queue rather than interleaving their context mutations. The lock is also released on cancellation, so a cancelled task can't strand the agent.
    • Has a keep_history flag for multi-turn UIs (defaults to False, the canonical stateless-delegate pattern that pairs with the voice/UI separation).
  • send_command(name, payload) for server-to-client UI commands, going out as first-class ui-command RTVI messages. Pairs with the standard payload models that ship in pipecat (Toast, Navigate, ScrollTo, Highlight, Focus, SelectText, SetInputValue, Click); apps publish their own command names freely.

  • Action helpers on UIAgent: scroll_to, highlight, select_text, click, set_input_value. Plain instance methods (not LLM tools) that wrap send_command with the standard payloads.

  • ReplyToolMixin: one bundled reply(answer, scroll_to=None, highlight=None, select_text=None, fills=None, click=None) LLM tool. Required answer argument keeps smaller models from omitting the spoken terminator (a real failure mode of the chainable-mixin shape we tried first). One tool call per turn, no chaining.

  • start_user_task_group(...): fire-and-forget counterpart to the user_task_group context manager. Dispatches a worker fan-out, returns the task_id, and lets workers run in a background asyncio task that the SDK manages.

  • attach_ui_bridge(root_agent) that wires the new first-class UI RTVI channels to the agent bus in both directions:

    • Inbound: subscribes to RTVIProcessor.on_ui_message. ui-event and ui-snapshot from the client become BusUIEventMessage on the bus (the snapshot is routed to UIAgent for <ui_state> injection; events fan out to handlers).
    • Outbound: BusUICommandMessage from any agent leaves the bus as an RTVIUICommandFrame (UI commands) or RTVIUITaskFrame (task lifecycle envelopes), which the RTVI observer wraps into the matching UICommandMessage / UITaskMessage envelopes on the wire.
  • <selection> block in <ui_state> for read-side deixis (text the user has highlighted in the client).

  • UI_STATE_PROMPT_GUIDE constant: canonical prompt fragment that documents the <ui_state> / <ui_event> context tags the LLM sees. Apps concatenate it into their system prompt.

  • New bus message types: BusUIEventMessage, BusUICommandMessage (in agents/ui/ui_messages.py).

Six demos (examples/local/ui-agent/), each isolating one concept:

| Demo | Pattern |
|---|---|
| hello-snapshot | Foundational: a11y snapshot streaming + UIAgent task dispatch |
| pointing | highlight action grounded by <ui_state> refs |
| deixis | Read-side text-selection grounding via <selection> block |
| form-fill | Input fill + click actions, multi-field tools |
| async-tasks | Parallel fan-out via start_user_task_group, streaming task updates |
| document-review | Synthesis demo combining all of the above |

For reviewers

  • ⚠️ MERGE BLOCKER — revert the [tool.uv.sources] pin in pyproject.toml before merging. Commit 4aa3fbd adds a temporary [tool.uv.sources] block that resolves pipecat-ai>=1.2.0 from the open wire-format PR (feat(rtvi): add UI Agent Protocol as first-class RTVI message types pipecat#4407). It exists so reviewers and CI can resolve the dep before pipecat 1.2.0 ships on PyPI. Once 1.2.0 lands, drop that commit (or the block) so the published package and downstream installs resolve from PyPI. The override is install-time-only — uv strips [tool.uv.sources] from the published distribution — but leaving it in the repo would mask a regression where 1.2.0 fails to resolve cleanly.
  • Companion PRs land the wire format on each side. feat(rtvi): add UI Agent Protocol as first-class RTVI message types pipecat#4407 (canonical RTVI types and pipeline frames) and Client updates for UI agent pipecat-client-web#203 (UIAgentClient, React idioms, standard handlers). All three are additive; no existing wire shapes change. The RTVI PROTOCOL_VERSION bumps from 1.2.0 to 1.3.0 — minor bump, major-version compat check still passes.
  • Reading order suggestion: the SDK foundations land first (commits up through the action commands), then the subpackage refactor and LLMContextAgent extension, then ergonomics iteration (chainable mixins to bundled ReplyToolMixin), then the orchestration primitives, then the wire-format migration to first-class RTVI types, then example/feature pairs (each demo paired with the SDK change it exercises). The top-level examples/local/ui-agent/README.md is a good entry point for the demo side.
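
To make the "major-version compat check still passes" point concrete, here is a minimal sketch of what such a check could look like. The function names are hypothetical and this is not the actual RTVI implementation; it only assumes that compatibility is decided by the major component alone.

```python
def major_version(version: str) -> int:
    """Return the major component of a dotted version string."""
    return int(version.split(".")[0])


def is_compatible(server_version: str, client_version: str) -> bool:
    """Hypothetical check: two peers are compatible when their majors match."""
    return major_version(server_version) == major_version(client_version)
```

Under that assumption a 1.2.0 client and a 1.3.0 server still agree, while a 2.x peer would not.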

Test plan

  • uv run pytest passes (281 tests)
  • All six demos run end-to-end against their React clients
  • Reviewer verifies one or two demos locally per their READMEs
  • Before merge: drop commit 4aa3fbd (or the [tool.uv.sources] block in pyproject.toml) once pipecat 1.2.0 is on PyPI

codecov-commenter commented Apr 25, 2026

Codecov Report

❌ Patch coverage is 95.16129% with 21 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/pipecat_subagents/agents/ui/ui_agent.py | 93.23% | 18 Missing ⚠️ |
| src/pipecat_subagents/agents/ui/ui_task_context.py | 93.33% | 2 Missing ⚠️ |
| src/pipecat_subagents/agents/ui/ui_tools.py | 97.22% | 1 Missing ⚠️ |

| Files with missing lines | Coverage Δ |
|---|---|
| src/pipecat_subagents/agents/__init__.py | 100.00% <100.00%> (ø) |
| src/pipecat_subagents/agents/ui/__init__.py | 100.00% <100.00%> (ø) |
| src/pipecat_subagents/agents/ui/ui_bridge.py | 100.00% <100.00%> (ø) |
| .../pipecat_subagents/agents/ui/ui_event_decorator.py | 100.00% <100.00%> (ø) |
| src/pipecat_subagents/agents/ui/ui_messages.py | 100.00% <100.00%> (ø) |
| src/pipecat_subagents/agents/ui/ui_prompts.py | 100.00% <100.00%> (ø) |
| src/pipecat_subagents/bus/__init__.py | 100.00% <ø> (ø) |
| src/pipecat_subagents/agents/ui/ui_tools.py | 97.22% <97.22%> (ø) |
| src/pipecat_subagents/agents/ui/ui_task_context.py | 93.33% <93.33%> (ø) |
| src/pipecat_subagents/agents/ui/ui_agent.py | 93.23% <93.23%> (ø) |

... and 3 files with indirect coverage changes


@markbackman markbackman force-pushed the mb/ui-agent branch 6 times, most recently from 74c6a39 to 8208302 May 1, 2026 21:43
@markbackman markbackman requested a review from aconchillo May 1, 2026 21:45
@markbackman markbackman marked this pull request as ready for review May 1, 2026 21:45
Comment thread src/pipecat_subagents/bus/messages.py Outdated
@markbackman markbackman force-pushed the mb/ui-agent branch 2 times, most recently from 5189cd9 to 1394ec4 May 2, 2026 13:17
@markbackman markbackman changed the title from "UI agent POC" to "UIAgent" May 2, 2026
markbackman added a commit to markbackman/pipecat-music-player that referenced this pull request May 2, 2026
Temporary [tool.uv.sources] overrides so reviewers and CI can resolve
``pipecat-ai>=1.1.0`` and ``pipecat-ai-subagents>=0.4.0`` from the open
wire-format PRs before either is published to PyPI.

Companion PRs:
- pipecat-ai/pipecat#4407
- pipecat-ai/pipecat-subagents#18

uv strips [tool.uv.sources] when building the distribution, so this
is install-time-only and does not affect the published demo.

Drop this commit (or just the [tool.uv.sources] block) before
merging once both upstreams are on PyPI.
markbackman added a commit that referenced this pull request May 2, 2026
A model that emits a non-dict entry in ``fills`` (or a non-string
ref in ``highlight`` / ``click``) would have crashed the tool body
before ``respond_to_task`` ran. Because UIAgent acquires the
single-flight task lock in ``on_task_request`` and only releases
it via ``respond_to_task`` (or the cancellation path), an
unhandled exception in the tool would have stranded the lock until
the voice-agent's 30s task timeout fired ``on_task_cancelled`` —
30s of UI deadlock for what's almost always a transient model
hiccup.

Skip non-conforming entries instead. The fix is at the LLM-input
boundary (the contents of the list arguments) rather than a broad
try/finally so that real bugs in the helpers still surface as
exceptions.

Adds three regression tests covering the non-dict ``fills``,
non-string ``highlight``, and non-string ``click`` cases. Each one
asserts the critical invariant: ``respond_to_task`` and
``result_callback`` still run, so the lock is released.

Reported by Codex review of #18.
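
The "skip non-conforming entries" fix can be sketched as two small filters applied at the LLM-input boundary. This is an illustration only: the helper names are hypothetical, and treating a non-list argument as empty is an extra assumption not stated in the commit.

```python
from typing import Any


def keep_dicts(entries: Any) -> list:
    """Keep only dict entries (e.g. for ``fills``); drop anything else."""
    if not isinstance(entries, list):
        return []  # assumption: a missing or malformed list means "no entries"
    return [entry for entry in entries if isinstance(entry, dict)]


def keep_strings(refs: Any) -> list:
    """Keep only string refs (e.g. for ``highlight`` / ``click``)."""
    if not isinstance(refs, list):
        return []
    return [ref for ref in refs if isinstance(ref, str)]
```

Filtering only the list contents, rather than wrapping the whole tool body in a broad try/finally, leaves genuine bugs in the helpers free to raise.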
@markbackman markbackman force-pushed the mb/ui-agent branch 2 times, most recently from bf4ced9 to ba9b84f May 6, 2026 21:42
),
)

async def on_task_request(self, message: BusTaskRequestMessage) -> None:
Contributor

this could be:

@task(name="hello")
async def on_hello(self, message: BusTaskRequestMessage) -> None:
    ...

Contributor

The call to await super().on_task_request(message) will happen automatically.

await super().on_ready()
# Route inbound UI events (incl. the reserved snapshot event)
# at HelloAgent — the snapshot is what HelloAgent reasons over.
attach_ui_bridge(self, target="hello")
Contributor

Calling this attach_ui_bridge is a bit weird. We already have the notion of a bridge, which is a frame processor that outputs and inputs from the bus. Maybe we can also do this as a decorator that gets automatically called when the pipeline is ready.

What about?

@ui_agent(agent="hello")
class HelloRoot(BaseAgent):
    ...

Could there be multiple UI agents? I think it should be possible. If so, maybe:

@ui_agent(agents=["hello", "hello2"])
class HelloRoot(BaseAgent):
    ...

# auto_inject_ui_state is on) injects ``<ui_state>``.
# Then feed the user's query into the LLM context with
# ``run_llm=True`` so the LLM actually generates.
await super().on_task_request(message)
Contributor

Maybe we document the @task(name="xxxx") instead?

``run_llm=True`` the context contains exactly: current
``<ui_state>`` + query.
"""
await self._task_lock.acquire()
Contributor

This assumes a UIAgent can only run one task at a time. That is, it is hard for the user to run additional tasks on a UIAgent. Need to think about this, but it feels a bit limiting.

Contributor

Maybe we can create a @ui_task(name="name") decorator instead that blocks... Or maybe even more generic @task(name="hello", sync=True) meaning that this task will block other tasks with the same name. I'll think about it.

UIAgent(LLMAgent) dispatches BusUIEventMessage to @on_ui_event
handlers (each runs in its own asyncio task so the bus dispatcher is
never held open) and exposes send_command for server-to-client
messages. attach_ui_bridge forwards RTVI client messages onto the bus
and translates BusUICommandMessage back to RTVIServerMessageFrame
pushed into the root agent's pipeline. Bus gains BusUIEventMessage,
BusUICommandMessage, and the matching type constants; standard
command payloads (Toast, Navigate, ScrollTo, Highlight, Focus) ship
alongside.

Events inject as <ui_event name="...">payload</ui_event> developer
messages into the LLM context by default; override render_ui_event
or set inject_events=False to opt out.
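
A rough, self-contained sketch of the dispatch shape described above: a decorator tags handlers by event name, and each inbound event runs its handler in its own asyncio task so the dispatcher returns immediately. `SketchUIAgent`, `NoteAgent`, and the `_ui_event_name` attribute are hypothetical stand-ins, not the SDK's classes or internals.

```python
import asyncio
from typing import Any


def on_ui_event(name: str):
    """Tag a coroutine method as the handler for the named UI event."""
    def decorator(func):
        func._ui_event_name = name
        return func
    return decorator


class SketchUIAgent:
    def __init__(self) -> None:
        self._handlers = {}
        self._tasks = set()
        # Collect every method that the decorator tagged.
        for attr in dir(self):
            method = getattr(self, attr)
            event_name = getattr(method, "_ui_event_name", None)
            if event_name is not None:
                self._handlers[event_name] = method

    def dispatch(self, name: str, payload: Any) -> bool:
        """Start the handler in its own task; never await it here."""
        handler = self._handlers.get(name)
        if handler is None:
            return False
        task = asyncio.create_task(handler(payload))
        self._tasks.add(task)  # keep a reference until the task finishes
        task.add_done_callback(self._tasks.discard)
        return True


class NoteAgent(SketchUIAgent):
    def __init__(self) -> None:
        self.clicked = []
        super().__init__()

    @on_ui_event("note_click")
    async def handle_note_click(self, payload: Any) -> None:
        self.clicked.append(payload["ref"])
```

Because `dispatch` only schedules the handler, a slow handler cannot hold the caller open; no LLM call is involved on this path.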
Adds snapshot storage and LLM-context injection for the
__ui_snapshot reserved event. UIAgent.render_ui_state() produces
Playwright-MCP-style indented text with stable refs, offscreen
tags, and grid dimensions. inject_ui_state() queues a developer
message. visible_nodes() returns the non-offscreen subset.

ScrollTo/Highlight/Focus commands now accept ref alongside
target_id so the server can reference nodes it saw in <ui_state>.
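
For orientation, an indented snapshot rendering of this kind might look like the sketch below. The node schema (`role`, `name`, `ref`, `offscreen`, `children`) and the exact text layout are assumptions made for illustration; the real `render_ui_state()` output may differ, and grid dimensions are left out.

```python
from typing import Any, Dict, List

Node = Dict[str, Any]


def render_ui_state(node: Node, depth: int = 0) -> str:
    """Render a snapshot tree as indented text, one line per node."""
    tags = f" [ref={node['ref']}]"
    if node.get("offscreen"):
        tags += " [offscreen]"
    name = node.get("name")
    label = f' "{name}"' if name else ""
    line = f"{'  ' * depth}- {node['role']}{label}{tags}"
    children = [render_ui_state(c, depth + 1) for c in node.get("children", [])]
    return "\n".join([line, *children])


def visible_nodes(node: Node) -> List[Node]:
    """Return every node in the tree that is not tagged offscreen."""
    found = [] if node.get("offscreen") else [node]
    for child in node.get("children", []):
        found.extend(visible_nodes(child))
    return found
```

The refs are what a later ScrollTo, Highlight, or Focus command would point at, which is why they need to stay stable across snapshots.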
UIAgent now injects the latest <ui_state> snapshot into the LLM
context at the start of every task request, so the agent always
reasons over the current screen. Controlled by the new
auto_inject_ui_state constructor option (default True). Apps that
override on_task_request pick up the behavior via super().

Adds UI_STATE_PROMPT_GUIDE, a canonical prompt fragment that
documents the <ui_state> and <ui_event> wire format. Apps
concatenate it into their system prompt so the LLM understands the
SDK-managed developer messages it receives, and we can evolve the
format in one place.

Removes the spike-only log_snapshots option, _log_snapshot_receipt,
_previous_snapshot_root_ref, and the now-unused _count_nodes helper.

Adds an opt-in ``log_snapshots`` constructor flag that emits a
``logger.debug`` line on every accessibility snapshot received
(node count, char count, rough token estimate, and the full
rendered ``<ui_state>``). Defaults to off. Useful in dev / staging
for eyeballing what the LLM will see before the next inject.

Ships ``ScrollToToolMixin`` alongside ``UIAgent`` in a new
``ui_tools`` module. Apps that want the LLM to be able to scroll
offscreen elements into view inherit the mixin, which exposes a
``scroll_to(ref)`` tool. The tool dispatches a standard ``ScrollTo``
command by ref; keeping it as a mixin (not a base-class method)
means single-screen apps don't get a ``scroll_to`` tool
cluttering their LLM tool list.
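
The mixin-versus-base-class trade-off can be illustrated with a toy tool registry. Everything here is a stand-in: the `tool` marker, `BaseSketchAgent`, and the `"scroll_to"` command name are hypothetical, and the real tools are presumably async.

```python
def tool(func):
    """Hypothetical marker for methods exposed to the LLM as tools."""
    func._is_tool = True
    return func


class BaseSketchAgent:
    def __init__(self) -> None:
        self.sent = []

    def send_command(self, name: str, payload: dict) -> None:
        self.sent.append((name, payload))

    def tool_names(self) -> list:
        """List the tool-marked methods visible on this agent's class."""
        cls = type(self)
        return sorted(
            attr for attr in dir(cls)
            if getattr(getattr(cls, attr, None), "_is_tool", False)
        )


class ScrollToSketchMixin:
    @tool
    def scroll_to(self, ref: str) -> None:
        """Dispatch a scroll command addressed by snapshot ref."""
        self.send_command("scroll_to", {"ref": ref})


class SingleScreenAgent(BaseSketchAgent):
    pass


class ScrollingAgent(ScrollToSketchMixin, BaseSketchAgent):
    pass
```

Only `ScrollingAgent` advertises `scroll_to`; the single-screen agent's tool list stays empty.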
Mirrors the ScrollTo mixin: apps that want the LLM to be able to
point at visible elements ("which one is Radiohead?") inherit
``HighlightToolMixin`` alongside ``UIAgent``. The mixin exposes a
``highlight(ref)`` tool that dispatches the standard ``Highlight``
command by snapshot ref. The client's
``useStandardHighlightHandler`` (or a custom one) does the visual
effect.

Like the scroll mixin, keeping this as a separate mixin means
agents that don't need visual highlighting don't pay the
tool-list cost.

``test_ui_tools`` grows from 3 to 7 cases: per-mixin coverage
(expose-tool, plain-agent-doesn't, dispatch-by-ref) plus a
combined-mixin check.

Adds `current_task` tracking and a `respond_to_task(...)` helper on
UIAgent so `@tool` methods can complete the in-flight task without
threading the task id through every call. The shipped mixin tools
(`scroll_to`, `highlight`) now dispatch the UI command, complete the
task with no `speak` field, and exit silently. The visual change on
the client is the user-facing feedback; apps that want spoken
narration override the mixin tool and pass `speak`.

The user asks the assistant to research a topic. The UI agent
spawns a background asyncio task that runs user_task_group(...)
across three worker agents (Wikipedia, news, scholar). The SDK
auto-forwards every task lifecycle event to the client as ui.task
envelopes — group_started, task_update, task_completed,
group_completed — and the client renders an in-flight card with
per-worker status. The user can cancel any group via a Cancel
button on the card; the SDK ships the __cancel_task event back to
the agent which calls cancel_task() on the registered group.

The custom @tool reply has a research_query field. When set, the
tool spawns the task group via create_asyncio_task and returns
immediately with the spoken acknowledgement ("Researching X now").
The voice agent isn't blocked; it can handle follow-up turns
while workers run in the background.

Workers are simulated BaseAgent subclasses with on_task_request
that emit a few send_task_update progress messages followed by a
send_task_response with a canned summary. asyncio.sleep with
randomized intervals makes the streaming UI come alive without
needing real data sources.

This rounds out the demo arc: hello-snapshot (read), pointing
(visual point), deixis (text point), form-fill (state-changing
actions), async-tasks (fan-out + streaming progress + cancel).
Each demo is a one- or two-line composition of the SDK primitives
plus a focused page.
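
A minimal sketch of the simulated-worker idea, using plain coroutines instead of BaseAgent subclasses. The `emit` callback, the event dict shape, and the function names are assumptions for illustration; only the `task_update` / `task_completed` kinds and the three worker names come from the description above.

```python
import asyncio
import random


async def simulated_worker(name, query, emit, steps=3, max_delay=0.01):
    """Emit a few progress updates, then a canned final summary."""
    for step in range(1, steps + 1):
        # Randomized pauses make the streamed progress feel live.
        await asyncio.sleep(random.uniform(0, max_delay))
        await emit({"kind": "task_update", "worker": name, "step": step})
    await emit({
        "kind": "task_completed",
        "worker": name,
        "summary": f"Canned summary for {query!r} from {name}",
    })


async def fan_out(query, workers=("wikipedia", "news", "scholar")):
    """Run all workers concurrently and collect events in arrival order."""
    events = []

    async def emit(event):
        events.append(event)

    await asyncio.gather(*(simulated_worker(w, query, emit) for w in workers))
    return events
```

Events from different workers interleave, but each worker's own updates always precede its completion.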
The canonical fire-and-forget pattern with user_task_group required
ceremony: a separate method to host the async with, a unique
asyncio task name string, and a pass body for the context manager.
The async-tasks demo shows the pattern in full:

  self.create_asyncio_task(self._run_research(query), f"...")
  ...
  async def _run_research(self, query):
      async with self.user_task_group(...):
          pass

Replace with one call:

  await self.start_user_task_group(
      "wikipedia", "news", "scholar",
      payload={"query": query},
      label=f"Research: {query}",
  )

Returns the task_id once the group_started envelope has fired
(so the client renders immediately) and runs the context to
completion in a background asyncio task the SDK manages. The
context-manager form stays available for callers that want to
consume worker events inline (async for event in tg).

Updates the async-tasks demo to use the new helper. Drops the
_run_research method and the create_asyncio_task ceremony from the
LLM tool body; the reply tool body is now ~10 lines shorter.

…ancelled

When the client cancels an in-flight user_task_group, the group's
wait() raises TaskGroupError on __aexit__ ("user requested" or
similar reason). For the context-manager form that bubbles to the
caller as expected. For start_user_task_group's background runner
it was bubbling to the asyncio task manager and getting logged as
an unexpected exception.

Cancellation is an expected exit for fire-and-forget groups: the
client already knows because it received the group_completed
envelope. Catch TaskGroupError around the __aexit__ call and log
at debug. Other exceptions still log at warning.

Also restructures the runner so iteration exceptions are forwarded
into __aexit__ correctly (not swallowed by a finally that calls
__aexit__(None, None, None) instead).

Adds a regression test that triggers the exact path:
start_user_task_group with a slow worker, cancel, and verify the
group_completed envelope still publishes and the task group is
cleaned up — without leaking TaskGroupError to the test runner.
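
The runner restructuring can be sketched roughly as follows. `SketchGroup` and `run_in_background` are hypothetical, and `TaskGroupError` here is a local stand-in for the SDK's exception. The point is the shape: body exceptions are forwarded into `__aexit__`, and a cancellation raised on exit is treated as an expected outcome.

```python
import asyncio
import logging
import sys

logger = logging.getLogger("sketch")


class TaskGroupError(Exception):
    """Local stand-in for the error a cancelled group raises on exit."""


class SketchGroup:
    """Hypothetical async context manager that may be cancelled."""

    def __init__(self, cancelled: bool = False) -> None:
        self.cancelled = cancelled
        self.exit_args = None

    async def __aenter__(self):
        return self

    async def __aexit__(self, exc_type, exc, tb):
        self.exit_args = (exc_type, exc, tb)
        if self.cancelled:
            raise TaskGroupError("user requested")
        return False


async def run_in_background(group, body) -> str:
    tg = await group.__aenter__()
    exc_info = (None, None, None)
    try:
        await body(tg)
    except Exception:
        # Forward the real exception to __aexit__ instead of hiding it.
        exc_info = sys.exc_info()
    try:
        await group.__aexit__(*exc_info)
    except TaskGroupError as error:
        logger.debug("task group cancelled: %s", error)  # expected exit
        return "cancelled"
    except Exception:
        logger.warning("task group failed on exit", exc_info=True)
        return "failed"
    if exc_info[0] is not None:
        logger.warning("task group body failed", exc_info=exc_info)
        return "failed"
    return "completed"
```

The returned status string exists only so the sketch is easy to check; a real runner would more likely just log.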
Combines every prior demo's pattern into one workspace where the
user reviews a draft article by voice. The user can:

- Select a paragraph and ask for review. ReviewAgent calls
  start_review(answer, paragraph_ref, paragraph_text), which spawns
  two specialist worker agents (clarity, tone) via
  start_user_task_group. Workers stream progress to an in-flight
  card. As each completes, on_task_response intercepts and emits an
  add_note custom command, attaching the worker's feedback to the
  paragraph as a note.

- Dictate notes by voice. Reply tool's fills + click drive the
  textarea and Save button, same as form-fill.

- Ask "where does it talk about X" and the agent uses select_text
  + scroll_to to navigate, same as deixis write direction.

- Click any note in the panel. Client emits a note_click UI event;
  the agent's @on_ui_event("note_click") handler dispatches
  select_text to jump to the related paragraph. Round-trip
  event/command pattern.

Demonstrates two patterns no prior demo touched:

- Custom UI command (add_note) registered locally on the client.
  The server emits it via send_command; the client's handler
  renders a note card. Apps register their own command names freely.

- Custom client-emitted event (note_click). When the user clicks a
  note, the client calls ui.sendEvent("note_click", {ref}); the
  agent's @on_ui_event handler reacts.

Two LLM tools coexist: reply (from ReplyToolMixin, for normal
turns) plus a custom start_review (for paragraph review kick-off).
The prompt steers the model to pick the right one. Single tool
call per turn — no chainable coordination problems.

Workers are simulated, like async-tasks: they compute simple text
metrics (word count, sentence count, presence of absolutist or
hedging words) and emit templated feedback that varies meaningfully
per paragraph. The demo is about orchestration, not real NLP.
A real app swaps the workers for LLMAgent subclasses without
changing anything else.

The article is a 6-paragraph draft with deliberately uneven
paragraphs: ¶2 too dense, ¶3 too vague, ¶4 absolutist tone, ¶5/¶6
balanced. So the workers actually have something to flag.
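
The kind of simple text metrics the simulated workers compute could look like this sketch. The word lists and the function name are invented for illustration; the demo's actual lists and feedback templates are not shown here.

```python
import re

# Hypothetical word lists; the demo's own lists may differ.
ABSOLUTIST = {"always", "never", "everyone", "nobody", "completely"}
HEDGING = {"maybe", "perhaps", "somewhat", "possibly", "arguably"}


def paragraph_metrics(text: str) -> dict:
    """Compute word count, sentence count, and flagged words for a paragraph."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "words": len(words),
        "sentences": len(sentences),
        "absolutist": sorted(set(words) & ABSOLUTIST),
        "hedging": sorted(set(words) & HEDGING),
    }
```

A clarity worker could template feedback from the word and sentence counts, and a tone worker from the two flagged-word lists.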
The form-submit handler was trying to find the ref by walking up
selection.anchorNode looking for a dataset.ref attribute the walker
never sets. So notes always showed up unattached.

Fix:

- Track lastArticleRef on every selectionchange that lands inside
  the article column. Filtered to article ancestors so subsequent
  textarea selection (when the user or agent types) doesn't
  overwrite it.
- Use the new findRefForElement client SDK helper to resolve
  selection ancestors → ref. Walks up parentElement to find the
  closest snapshot-known container.
- Submit handler reads lastArticleRef instead of trying to derive
  the ref from the live selection at submit time. This works for
  both the manual flow (user types + clicks Save) and the voice
  flow (agent fills textarea + clicks Save) because the textarea
  focus doesn't clear lastArticleRef.

Result: select a paragraph, type a note (or dictate one), the note
attaches to that paragraph. Click the note in the panel and the
page jumps back to the paragraph it's attached to. Closes the loop
between dictation and deixis.

Tightens the prompt's note-flow description to reflect the actual
behavior — the client tracks selection across the textarea fill,
so the agent doesn't need to thread the ref through the tool call.

Two related changes from real-session testing:

1. ReviewAgent now defaults to keep_history=True so the UI agent
   can resolve conversational deixis. The "can we have a note for
   that?" case fails when keep_history=False because the UI agent's
   context is fresh per turn — "that" has no antecedent. With
   keep_history=True the UI agent sees its own prior replies and
   resolves the reference correctly. Constructor signature now
   accepts keep_history as an explicit kwarg so apps that want
   fresh-per-turn can override.

   This is the right default for multi-turn interactive apps where
   the user and agent work through something together (review,
   iterate, refine). The original keep_history=False default on
   UIAgent stays correct for stateless-delegate apps (pointing,
   form-fill, async-tasks) where each turn is "given the current
   screen, do X" with no carryover.

2. UI prompt now starts with a "hard rule" requiring every turn to
   call exactly one tool (reply or start_review). The earlier
   prompt described both tools but never required calling one;
   for open questions like "how can we improve it?" gpt-4o-mini
   would sometimes produce 50 tokens of plain text and the voice
   agent's task() would time out. The hard rule plus an explicit
   "general questions go through reply" decision rule prevents that.

Caught in the same session: ReviewAgent's prior __init__ signature
hardcoded keep_history=False with no kwarg passthrough, so call
sites trying to flip it (or even pass it) would TypeError during
add_agent inside on_client_ready and silently abort the rest of
the handler — which left the registry without ui/clarity/tone
agents and surfaced as "agents not ready within timeout" much
later when the voice agent first delegated. Worth knowing that
exceptions in RTVI event handlers don't propagate noisily.

Indexes the six demos in difficulty order (hello-snapshot →
pointing → deixis → form-fill → async-tasks → document-review)
with a one-paragraph summary of what each shows. Defers per-demo
specifics to each demo's own README.

Also covers shared concerns once: how to run any demo (the npm +
uv pattern is the same everywhere), the API keys all demos need,
and a quick map back to the SDK's public surface for readers
exploring the directory cold.

The project uses changelog/<PR>.<type>.md fragments per PR (see
existing 19.changed.md and 20.added.md); CHANGELOG.md gets compiled
at release time. The earlier direct edit to CHANGELOG.md's
[Unreleased] section short-circuited that flow. Moves the content
to changelog/18.added.md and reverts CHANGELOG.md to match main.

Two correctness issues raised in code review:

1. Single _current_task slot races under concurrent dispatch. Two
   overlapping on_task_request calls overwrite each other's task
   handle, so the first task's tool calls respond_to_task() with
   the wrong task_id. Even if the slot were per-request (e.g.
   ContextVar), the agent has only one LLM context and one running
   pipeline; concurrent processing would still interleave context
   mutations and corrupt the conversation.

2. The keep_history=False reset wipes any messages pre-seeded via
   context= on LLMContextAgent's constructor, contradicting the
   inherited contract.

Fixes (1) by acquiring an asyncio.Lock in on_task_request and
holding it until respond_to_task fires. Concurrent submissions
queue and process in arrival order. The lock release lives in
respond_to_task; a tool that forgets to call it will hang the
agent on the next task, which is the correct fast-surfacing
signal that something is wrong (no watchdog: it would mask the
bug).

Fixes (2) via documentation in the keep_history docstring,
calling out that persistent app instructions belong in the LLM's
system_instruction setting (which lives outside the context
message list and is unaffected by the reset).

Adds test_concurrent_task_requests_serialize covering the
overlap case.
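
The single-flight behaviour in fix (1) can be reproduced in isolation with a plain asyncio.Lock. `SingleFlightAgent` and `handle` are hypothetical stand-ins that keep only the locking shape: acquire in `on_task_request`, release in `respond_to_task`.

```python
import asyncio


class SingleFlightAgent:
    def __init__(self) -> None:
        self._task_lock = asyncio.Lock()
        self._current_task = None
        self.log = []

    async def on_task_request(self, task_id: str) -> None:
        # Overlapping requests wait here until the in-flight task responds.
        await self._task_lock.acquire()
        self._current_task = task_id
        self.log.append(("start", task_id))

    def respond_to_task(self) -> None:
        self.log.append(("end", self._current_task))
        self._current_task = None
        self._task_lock.release()


async def handle(agent: SingleFlightAgent, task_id: str) -> None:
    await agent.on_task_request(task_id)
    await asyncio.sleep(0.01)  # stands in for the LLM turn and tool call
    agent.respond_to_task()


async def demo() -> list:
    agent = SingleFlightAgent()
    await asyncio.gather(handle(agent, "a"), handle(agent, "b"))
    return agent.log
```

Both requests are submitted together, yet the log shows task "a" finishing before task "b" starts, so their context mutations cannot interleave.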
The constructor's context= arg doc previously said it was for
seeding 'an initial system prompt or message history,' which
conflicts with the default keep_history=False reset behavior.
Updates to call out that seeded messages are part of mutable task
history and get cleared, and points readers to system_instruction
for anything durable.

Same clarification in reset_context()'s docstring (seeded messages
ARE affected by the reset). And keep_history's note now points at
UI_STATE_PROMPT_GUIDE as a concrete example of what to put in
system_instruction.

BaseAgent._handle_task_cancel sends the CANCELLED response
directly via send_task_response and bypasses respond_to_task,
which is where the lock release lives. Without an on_task_cancelled
hook the lock would stay held after a cancellation and every
subsequent UI task request would block at on_task_request's
acquire forever.

Override on_task_cancelled to clear _current_task and release
_task_lock when the cancelled task_id matches the in-flight one.
Idempotent and race-safe: the current_task identity check makes
it a no-op when respond_to_task fired first, and the locked()
guard makes the release safe regardless of what cleared the slot.

Adds two focused tests:
- cancellation_releases_lock_for_subsequent_tasks: the bug Codex
  flagged; a follow-up task must not block.
- cancellation_for_unrelated_task_id_leaves_lock_held: confirms
  we only react to cancels that match the current task.
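
A compact sketch of the cancellation guard described above, again with hypothetical names and a synchronous `on_task_cancelled` for brevity (the real hook may be async). It shows the two checks that make the release idempotent: the task-id identity check and the `locked()` guard.

```python
import asyncio


class CancelAwareAgent:
    def __init__(self) -> None:
        self._task_lock = asyncio.Lock()
        self._current_task = None

    async def on_task_request(self, task_id: str) -> None:
        await self._task_lock.acquire()
        self._current_task = task_id

    def _release(self) -> None:
        self._current_task = None
        if self._task_lock.locked():  # safe even if already released
            self._task_lock.release()

    def respond_to_task(self) -> None:
        self._release()

    def on_task_cancelled(self, task_id: str) -> None:
        if task_id != self._current_task:
            return  # unrelated task, or respond_to_task already fired
        self._release()
```

A cancel for the in-flight task frees the lock for the next request, while a cancel for any other task id leaves it held.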
The UI Agent Protocol wire format (envelope-type strings, reserved
event names, task-lifecycle kind discriminators, and the built-in
command payload dataclasses) now lives in
pipecat.processors.frameworks.rtvi.ui as of pipecat-ai 1.2.0.
Single-LLM Pipecat apps and other frameworks can now target the
same wire format without taking a subagents dependency.

Subagents continues to re-export the same names from
pipecat_subagents.bus and pipecat_subagents.agents so existing
imports keep working; the canonical definitions just moved.

Also replaces the inline 'group_started' / 'task_update' / etc.
string literals in the UI bridge with the new
UI_TASK_*_KIND constants from pipecat.

These bus messages are subagents-internal carriers exchanged only
between UIAgent and the bridge installed by attach_ui_bridge. They
have no use outside the UI subpackage. Co-locating them with the
agent that consumes them removes a layering wart in bus/messages.py
and matches the directory shape of the rest of the UI surface.

Removes the BusUI* re-exports from pipecat_subagents.bus and adds
them under pipecat_subagents.agents.ui (and at agents.ui.ui_messages
for direct import). All internal callers and tests updated to the
new path.

Pipecat-ai 1.2.0 promotes the UI Agent Protocol to first-class
RTVI message types (ui-event, ui-command, ui-snapshot,
ui-cancel-task, ui-task) instead of sub-types carried inside
server-message / client-message. Update the bridge to match:

- attach_ui_bridge subscribes to on_ui_message on the RTVI
  processor (instead of on_client_message). UIEventMessage,
  UISnapshotMessage, and UICancelTaskMessage are translated onto
  the bus as BusUIEventMessage carriers; the snapshot and
  cancel-task carry subagents-internal event names so UIAgent's
  existing dispatch keeps working.
- Outbound: BusUICommandMessage and the four BusUITask* messages
  are emitted as RTVIServerTypedMessageFrame wrapping
  UICommandMessage / UITaskMessage envelopes (instead of
  RTVIServerMessageFrame with a dict).

Subagents-internal _UI_SNAPSHOT_BUS_EVENT_NAME and
_UI_CANCEL_TASK_BUS_EVENT_NAME constants in agents/ui/ui_messages
replace the wire-format reserved event names (which were public
constants in pipecat). UIAgent dispatches on these internal names.

Tests updated: bridge tests cover the new UI message inputs and
the typed frame outputs; the cancel-task tests now construct
UICancelTaskMessage instead of forging a ui-event with a reserved
name.

All 282 subagents tests pass against the local pipecat checkout.

Pipecat's RTVI now ships RTVIUICommandFrame and RTVIUITaskFrame as
domain-scoped pipeline frames, mirroring how RTVIServerMessageFrame
and the LLM/TTS frames work: the frame carries domain data, the
observer wraps it into the matching typed RTVI envelope before
sending. Switch the bridge over to push these instead of the
generic RTVIServerTypedMessageFrame.

The generic typed-message frame is gone from pipecat. This is a
better fit with the rest of the RTVI surface: a reader doesn't
have to inspect what's inside the frame to know what's being sent,
the frame name itself tells them. Symmetric with how LLM events
flow (LLMFunctionCallStartedFrame produces an
llm-function-call-started envelope inside the observer).
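
The "frame carries domain data, observer wraps it" split might be pictured like this. The dataclasses, their field names, and the envelope dict shape are all hypothetical; only the `ui-command` and `ui-task` type strings come from this PR, and pipecat's actual frames and envelope models are not reproduced here.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class SketchUICommandFrame:
    """Domain-scoped frame: a UI command and its payload."""
    name: str
    payload: Dict[str, Any] = field(default_factory=dict)


@dataclass
class SketchUITaskFrame:
    """Domain-scoped frame: one task lifecycle event."""
    kind: str
    task_id: str


def wrap_for_wire(frame: Any) -> Dict[str, Any]:
    """Observer-side step: choose the envelope type from the frame type."""
    if isinstance(frame, SketchUICommandFrame):
        return {
            "type": "ui-command",
            "data": {"name": frame.name, "payload": frame.payload},
        }
    if isinstance(frame, SketchUITaskFrame):
        return {
            "type": "ui-task",
            "data": {"kind": frame.kind, "task_id": frame.task_id},
        }
    raise TypeError(f"unsupported frame: {type(frame).__name__}")
```

With one frame class per message kind, the frame's type alone tells a reader what will go out on the wire.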
The constant is going away on the pipecat side (it was redundant
with the Literal[...] field default on UICommandMessage). Drops
the import and the corresponding test assertion, and trims the
matching mention from the changelog fragment.

Temporary [tool.uv.sources] override so reviewers and CI can resolve
``pipecat-ai>=1.2.0`` from the open wire-format PR before pipecat
1.2.0 ships on PyPI. uv strips [tool.uv.sources] when building the
distribution, so this is install-time-only and does not affect the
published package.

Companion PR: pipecat-ai/pipecat#4407

Drop this commit (or just the [tool.uv.sources] block) before
merging once pipecat 1.2.0 is on PyPI.

A refresh of the v1 primer that lands the moving parts since the
original was written:

- Wire format moved to Pipecat as canonical (single-LLM apps don't
  need subagents). Layered "the pieces" treatment of all four
  packages plus the reference app, with an architecture diagram.
- Full action vocabulary: select_text, click, set_input_value
  alongside the original scroll_to / highlight / focus / toast /
  navigate. Tied to a "what it enables" table that maps user
  capabilities to wire-format pieces.
- Task lifecycle protocol (start_user_task_group / ui-task /
  useUITasks) treated as a first-class deployment dimension.
- Two orthogonal deployment knobs (history mode + task shape) called
  out so the right corner is easy to pick.
- "When (not) to use this" rewritten as app-shape fit followed by
  deployment-shape decision (single LLM / voice+UI / multi-agent).

v1 stays in place at UI_AGENT_DESIGN.md. v2 keeps the same audience
(internal team) and tone (conversational primer); v1 is referenced
for the longer "How information flows" treatment and the other two
sequence diagrams.
@markbackman markbackman force-pushed the mb/ui-agent branch 2 times, most recently from d67ffbf to 03ecd79 May 21, 2026 13:01
3 participants