You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Today NBI has one deeply integrated coding agent: Claude Code (Claude mode). Every other agent (OpenAI Codex, opencode, Pi, GitHub Copilot CLI) is a launcher tile that opens a terminal and runs a command. The goal is a first-class Codex mode that behaves like Claude mode for the core agent UX: the agent drives the chat panel with the same streaming tool-call cards with diffs, per-tool approval, session resume, a settings tab, and skills management, instead of a terminal tile. Some Claude-specific surfaces (the permission-mode selector, plan-mode approval, the AskUserQuestion form, the heartbeat stall indicator, and managed-settings clamping) have no guaranteed analog and are tracked as open parity questions below. And a path to additional agent modes that behave the same way, without rebuilding the full depth of Claude mode for each one.
This proposes a shared agent-mode framework inside NBI that owns the agent UX (chat routing, tool-call cards, permission/confirmation cards, session resume, settings), fed by two kinds of backend:
Claude Code stays native, on claude_agent_sdk, with no behavior change. It is shipped, deeply integrated, and just gained the permission-mode selector. Migrating it would be regression risk for no change a user would notice.
Additional agents are reached through the Agent Client Protocol (ACP), a JSON-RPC-over-stdio standard (originated by Zed) that gives us tool-call streaming with diffs, per-tool permission requests, session resume, and MCP-server passing across many agents through one integration.
Codex's placement (native vs ACP) is left to a spike (Phase 0), because ACP's permission flow removes most of the reason Codex would otherwise need a bespoke native integration.
The framework and ACP are the means; the deliverable is agent modes that feel like Claude mode, the first of which is Codex.
What this issue asks: agreement on the direction (Claude native, ACP for additional agents) and a green light for the Phase 0 spike. Phases 1 and beyond would be split into their own issues once the spike reports.
Goal: a first-class Codex mode (and more like it)
The user-facing target is that switching NBI into Codex mode feels like Claude mode for the core agent UX: the agent drives the chat panel with streaming, tool-call cards, approvals, and resume, not a terminal tile. Each Claude-mode surface maps as follows. "Mechanism" is what delivers it; "Notes" is the honest caveat, including roughly how much NBI-side work it is.
ACP tool_call / tool_call_update with diff content (path/oldText/newText); the React cards are reused
needs a per-backend server-side shim mapping the agent's tool vocabulary and diff shape onto NBI's kind/label/diff model (Claude's is ~150 lines); collapsing into groups is an existing NBI renderer affordance, not from ACP
cooperative: in-process or auto-policy tools may skip it; ACP's allow_always means "remember" (persisted), which is not Claude's "approve for this session", so that label changes
ACP session/load for the resume itself; the listing and previews are NBI-side
medium: Claude's picker parses Claude's JSONL transcript format to build the list and previews, so each agent needs its own discovery/parse adapter, not a path swap
Settings tab (model selection, etc.)
a new codex_settings, modeled on claude_settings
medium: claude_settings threads through policy application, string overrides, credential scrubbing, ~10 capabilities fields, a save path, and a ~460-line React panel; mechanical but not a dict copy
Skills management
NBI's skills UI manages the agent's own skills directory; the agent's binary loads them
medium and spike-gated: see Skills below
Capabilities and policy gating, off by default
the existing feature-policy framework
medium-large: the repo documents a "seven places per policy" checklist, and Claude has roughly nine such policies; the managed-settings clamp is Claude-Code-specific with no Codex analog
Permission-mode selector (Default / Accept Edits / Plan / Bypass)
none portable
stays Claude-specific: ACP has a structured but agent-defined modes concept (session/set_mode), not this enum
Known parity gaps (surfaces with no guaranteed analog). These are real Claude-mode behaviors a user notices, and parity for them is uncertain, not assumed:
Plan-mode approval flow. Claude's present-the-plan, approve-or-continue, then reset-mode flow depends on the ExitPlanMode tool. Whether Codex exposes an equivalent approval step is unverified.
AskUserQuestion form. Claude can render a structured multiple-choice question mid-turn and feed the answer back. ACP's request_permission is allow/reject only, with no portable channel for an arbitrary question form.
Heartbeat and "still working" stall indicator. Claude mode emits a server-side heartbeat that drives the elapsed timer and the slow-server copy. ACP agents emit no such heartbeat, so this needs a new signal or it will not fire.
Mid-session skills hot-reload. Claude reloads skills into the running session and shows a "Skills reloaded" banner. Whether a Codex/ACP session can hot-reload skills mid-turn is open (the Skills section covers on-disk management, not live reload).
Smaller items to design rather than assume: custom spinner verbs, the process-tree cancel teardown (ACP cancel is a session/cancel notification, not a subprocess-tree kill), and the TodoWrite task-list and @-mention context surfaces.
So "behaves like Claude mode" is accurate for the core agent loop and the reusable React surfaces, and the framework plus ACP is what makes that affordable. It is not full parity: the NBI-side surfaces are each a medium build modeled on Claude's, and the gaps above are genuine open questions, not free wins.
Background: two integration layers, not one
NBI already has two separate "connect to a thing" layers, and this proposal only touches the second:
LLM providers: four registered adapters (GitHub Copilot, OpenAI-compatible, LiteLLM-compatible, Ollama). These are HTTP API clients used for default-mode chat, inline chat, and autocomplete. No changes proposed here. (Anthropic/Claude is not one of these adapters; it is reached through the agent-mode layer below, by driving the claude binary.)
Agent modes: a CLI agent drives the chat panel with its own tools, streaming, and approvals. Claude mode is the only one today. ACP slots in here, parallel to Claude mode.
ACP is about the agent layer, not the provider adapters.
Current state
Claude mode (notebook_intelligence/claude.py, 1,819 lines plus UI): forces chat routing to ClaudeCodeChatParticipant, runs the claude binary as a streaming subprocess via claude_agent_sdk, gates tool calls through a can_use_tool permission handler (custom_permission_handler), renders tool-call cards with inline diffs, supports plan mode and the permission-mode selector, resumes sessions from ~/.claude/projects, and exposes a settings tab and capabilities fields.
Codex / opencode / Pi / Copilot CLI: launcher tiles, registered through one shared helper (registerAgentCliLauncher in src/index.ts). Each resolves a CLI path and, on click, opens a Jupyter terminal and runs the command. This is good, deliberate design for "give me a terminal with the agent running," but it caps every non-Claude agent at that: no in-panel tool-call cards, no per-tool approvals, no session resume. First-class Codex is a request to move past the tile.
So "parity between Claude mode and Codex mode" is not tweaking two peers; it is building a second agent mode. ACP is what makes that affordable and extensible to more agents.
Why ACP
ACP standardizes the editor/agent boundary the way LSP standardized editor/language-server. The pieces that matter for NBI (all verified against the v1 schema):
Per-tool permission requests.session/request_permission is a client method the agent calls before running a tool, with standardized option kinds (allow_once / allow_always / reject_once / reject_always; note allow_always means remember the choice, not scope it to the session) and the tool call attached. This is the closest standardized analog to the confirmation card we built for Claude. It is the thing that lets a non-Claude agent reuse our approval UX instead of each one needing a bespoke handler.
Rich tool-call reporting.tool_call / tool_call_update carry id, title, kind (read / edit / delete / move / search / execute / think / fetch / switch_mode / other), status (pending / in_progress / completed / failed), and content including file diffs (path, oldText, newText) and live terminal output. Enough to drive tool-call cards with diffs.
Session resume via session/load (capability-gated).
MCP-server passing in session/new.
Reasoning streams and slash-command lists via agent_thought_chunk and available_commands_update.
Breadth. Gemini CLI speaks ACP natively; opencode and Goose natively; Claude Code via @zed-industries/claude-code-acp; Codex via @zed-industries/codex-acp; GitHub Copilot is in public preview. One ACP backend lights up that set.
We would drive ACP from the Jupyter server extension in Python. The Python SDK (agent-client-protocol on PyPI, currently ~0.10.x, community-maintained under the agentclientprotocol org and treated as canonical by the ACP docs) is the surface; it trails the core schema (Rust crate ~0.14.x), which is a real maturity caveat the spike must check.
The shape: shared UX, backends translate
The framework owns everything that should look the same regardless of agent: forced chat routing while an agent mode is active, the client lifecycle (connect, stream, cancel, reconnect), tool-call cards, the permission/confirmation cards and approve/reject flow, the session-resume picker, and the settings/capabilities scaffold. Backends translate to a specific agent: a claude backend (claude_agent_sdk) and an acp backend (the ACP Python SDK, parameterized by which agent binary to launch).
Two design realities make this less symmetric than "two backends behind one interface," and the framework must be designed around them rather than discovering them late:
The two permission models are control-flow-inverted. With claude_agent_sdk, the SDK owns the loop and calls ourcan_use_tool coroutine, which renders the card, blocks until the user answers, and returns the decision. With ACP, we own the JSON-RPC loop; the agent sends a session/request_permission request that arrives on a separate channel from the tool_call stream and must be correlated by tool-call id. The honest seam is therefore not "a callback both backends implement." The framework should own the orchestration (render card, await the user, map the answer), and a backend should expose only "normalize this native tool event into a common ToolCall" and "apply this common Decision." That interface has to be shaped around ACP's request/response model (the stricter one), because the callback model retrofits onto request/response more easily than the reverse.
Claude can do something ACP cannot, and the common type must not erase it.can_use_tool can return updated_input, rewriting the tool's input (NBI uses this for AskUserQuestion and ExitPlanMode plan approval). ACP's request_permission is allow/reject only. The shared Decision type must still carry updated_input even though only the native backend can honor it, or we silently drop a Claude capability we already ship. This is a lowest-common-denominator trap to avoid by design.
The point of the framework is to write the cards and the approval flow once. The point of naming the above is that the interface must be derived from both models, not extracted from Claude alone.
Claude native, Codex by spike
Claude stays native. Zed's @zed-industries/claude-code-acp adapter exists and is maintained, so Claude could ride ACP later, which makes this a reversible decision. But migrating shipped, deeply integrated software onto a pre-1.0 protocol now is risk for no behavior change a user would notice. Keep claude_agent_sdk.
Codex is greenfield, so its placement is a real choice. Codex's approval model is policy plus sandbox, and it also exposes a granular per-action permission mechanism that the codex-acp adapter already maps onto ACP's request_permission for exec and patch-apply tools. So the value ACP adds is a uniform approval surface across agents, not that Codex otherwise lacks one. The codex-acp adapter is real, Zed-maintained, and shipping every few weeks (it is what Zed itself uses for Codex). The open question is therefore not whether per-tool approval routes through ACP (it demonstrably does for those categories) but which tool categories surface for approval versus skip it, and under which approval policy. The spike maps that.
If Codex goes ACP, do not over-abstract the native side. A "pluggable native backend" interface with exactly one implementor (Claude, possibly forever) is ceremony, not abstraction. In that branch, the cleaner shape is: Claude stays a concrete special case that feeds the shared UX layer, and the framework is essentially an ACP multiplexer plus the shared cards/approval UI. Only if Codex (or another agent) goes native does a real native-backend interface earn its keep. The spike's Codex outcome therefore selects between two framework shapes, which is why Phase 1's interface cannot be finalized before it.
If Codex ends up native, ensure it is not also exposed as a Codex agent through the ACP path, to avoid two ways to run the same agent.
Phase 0: the spike
A standalone Python script (outside the extension) that drives Codex through codex-acp using the ACP Python SDK. Done when all of these are answered:
Permission coverage (load-bearing).codex-acp already routes exec and patch-apply approvals through session/request_permission. Map which tool categories surface for approval (exec, patch-apply, MCP, granular request_permissions) versus which skip it (pure in-process tools, or auto/never approval policies), and how that shifts with the approval policy. The aim is to know exactly what we can promise users about confirmation, not to discover whether approval routes at all.
Python SDK adequacy. Does agent-client-protocol (trailing the core schema) actually expose session/request_permission, tool_call / tool_call_update with diffs, session/load, and session/new with mcpServers? Pin exact package and version, and note openai-codex (the official Codex Python SDK, currently beta) only if we end up comparing the native path.
Embedding, not just protocol (the real risk). Drive the ACP client from a worker thread with its own event loop, with the permission answer arriving from a different thread, reproducing in miniature the cross-thread marshaling Claude mode needs (wait_for_chat_user_input polling a signal while the websocket handler answers on the Tornado loop). A green light from a plain asyncio.run() script proves the protocol works but says little about whether it survives the extension's threading model, which is where these integrations actually fail.
Bidirectional client load. ACP agents call back into the client (fs/*, terminal/*) concurrently with streaming tool updates, so the ACP backend is a bidirectional server, not a "send query, read stream" client like the Claude SDK hides. Confirm the Python SDK's loop services concurrent inbound requests cleanly.
MCP transport. Claude's JupyterUITools is an in-process SDK MCP server (create_sdk_mcp_server). ACP's session/newmcpServers expects stdio or socket servers, so the same tools likely need to run as a real stdio MCP server for ACP. Confirm we can pass and invoke them.
Skills loading under headless drive. Confirm a Codex driven through codex-acp actually loads skills from its skills directory (~/.agents/skills), so NBI's skills management has the same effect for Codex mode that it has for Claude mode.
Interface sketch as a deliverable. Produce a first cut of the backend interface derived from both the ACP shape observed here and the existing Claude shape, so the interface is informed by two models before any extraction.
The spike de-risks the Codex native-vs-ACP decision, the protocol adequacy, and (most importantly) the embedding, on an agent we actually care about, before framework code is written.
Phased plan and sequencing
The order matters and involves a real trade. The lower-risk order is spike, then a working ACP integration as a second concrete implementation alongside Claude, then extract the shared framework once two real implementations exist to factor against. Extracting the framework from Claude alone first is tempting (it has independent value as cleaner code and is reviewable in isolation) but risks encoding Claude's inverted, callback-driven assumptions as the interface and then bending ACP to fit. We propose to treat any interface drawn before the ACP integration exists as provisional.
Phase 0: spike + this proposal reviewed. Outcome: go/no-go on ACP, Codex's home, and a provisional interface sketch. Size: small, a few days of exploratory scripting, no production code.
Phase 1: a minimal ACP-driven agent mode as a second concrete integration (initially allowed to duplicate some card/approval wiring), validated end to end inside the extension. Size: medium, mostly new code behind an off-by-default toggle. Success: one ACP agent (Codex or Gemini) drives the chat panel with working tool-call cards and per-tool approval.
Phase 2: extract the shared UX/approval layer so Claude and the ACP backend feed one implementation; in the ACP-for-Codex branch, keep Claude concrete rather than abstracting a one-implementor native interface. Size: medium refactor, gated by review since it touches shipped Claude code. Success: Claude mode behaves identically to today, now feeding the shared layer, with existing tests green.
(If the team prefers the extract-first order for the independent-cleanup value, that is defensible, but the interface should still be treated as provisional until the ACP integration lands.)
Skills
Two different things hide under "skills," and only one is deferred:
Skills support in Codex mode is in scope and part of behaving like Claude mode. Skills are read by each agent's own binary, and Claude and Codex both build on the same open Agent Skills standard (agentskills.io, originated by Anthropic and moving into Linux Foundation governance under the Agentic AI Foundation): both consume a SKILL.md directory with required name and description frontmatter, and both reference the standard in their own first-party docs. The core format is genuinely shared; the differences are at the edges. The directory differs (Claude under ~/.claude/skills, Codex under ~/.agents/skills), and each vendor layers its own extensions (Claude adds frontmatter fields like allowed-tools plus invocation control; Codex declares tool dependencies and policy in a separate agents/openai.yaml). So NBI's existing skills UI (the Skills tab, SkillManager, the managed-skills reconciler) becomes agent-aware and manages the agent's own directory. This is a medium NBI-side change rather than a path swap: the skill scope axis today is user/project, not agent, and a single SkillManager is wired in. It is also spike-gated, because writing into ~/.agents/skills only helps if Codex's runtime discovers it under headless drive. The spike should confirm two things: that a Codex driven through codex-acp actually loads its skills directory, and that it tolerates the Claude-only allowed-tools field (marked experimental in the standard, and not documented by Codex) harmlessly rather than erroring. Mid-session skills hot-reload (the "Skills reloaded" banner) is a further open question, separate from on-disk management.
A single skill set shared identically across agents is deferred. Authoring one skill used transparently by both Claude and Codex is feasible given the shared standard, but the differing directories and per-vendor extensions make it a reconciler-and-mapping change with ongoing maintenance, and it is not required for either mode to have working skills. Out of scope for this work.
Rulesets remain the portable cross-agent instruction primitive, with a caveat. NBI injects rulesets (~/.jupyter/nbi/rules) into the LLM-provider chat path's system prompt, provider-agnostically. Note that Claude mode does not currently receive injected rulesets (its system prompt is built separately), so extending rulesets to the agent-mode layer would be new work, not parity with existing behavior. If we want one set of instructions that applies across agent modes (distinct from skills), rulesets are the cleaner path than a shared-skill bridge (Phase 3, optional).
Security considerations
ACP agents execute tools (edits, shell commands, reads), so this deserves its own heading:
Permission requests are cooperative. Approval routing exists for exec and edit tools via codex-acp, but the agent only may call session/request_permission: an agent in an auto policy, or one running a pure in-process tool, can act without asking. We must not present ACP-agent approvals with the same guarantee we give Claude until the spike maps the coverage.
The client-side backstop is real but partial. Filesystem writes and terminal commands route back through fs/* and terminal/* methods we control, a second chokepoint, but not full mediation of in-process tools.
Claude's managed-settings clamp has no ACP analog. The Claude path reads enterprise managed settings (disableBypassPermissionsMode, defaultMode) to clamp permission modes. ACP agents have no equivalent, so the framework must not assume that admin control surface exists for them.
Default to least privilege. Ship ACP agents off by default, behind the same policy gating as Claude, and make clear which agent is acting.
Alternatives considered
Use MCP everywhere instead of ACP. MCP standardizes tool providers, not the editor/agent control loop. It does not give tool-call streaming, per-tool permission requests, diff reporting, or session resume from the agent side. MCP and ACP are complementary (we pass MCP servers over ACP), not substitutes.
Put Claude on ACP too, for a single backend. Tempting for uniformity, but it migrates shipped, deeply integrated software onto a pre-1.0 protocol for no user-visible gain. Kept as a future option, not this work.
Wait for ACP to reach 1.0. The breadth it unlocks is available now, and the spike de-risks the version question before we commit framework code. Waiting forgoes the one integration that replaces several bespoke ones.
Keep launcher tiles for every non-Claude agent. The honest baseline; it works today but caps non-Claude agents at "opens a terminal," which is exactly what first-class Codex is asking to move past.
Write a second bespoke native integration for Codex (its official SDK). Viable, and the spike keeps it as the fallback. Rejected as the default because it does not generalize to a third or fourth agent, and ACP gives a uniform approval and streaming surface that a bespoke native integration would otherwise hand-build per agent.
Compatibility
No change for existing Claude users. Claude mode keeps claude_agent_sdk, the same UI, settings, and permission-mode selector. The eventual shared-layer extraction is a behind-the-scenes change with identical behavior; ACP agents are new, opt-in, and off by default.
Risks and caveats
ACP is pre-1.0. Protocol version is negotiated as integer v1; the core schema/Rust crate is ~0.14.x and still ships occasional breaking changes behind an unstable-to-stable promotion path. We would track a moving target.
The Python SDK trails the core schema (~0.10.x vs ~0.14.x). Confirming it exposes what we need is the top spike item.
Permission enforcement is cooperative (see Security).
ACP modes are not a permission-mode selector. ACP has a structured but agent-defined modes concept (session/set_mode, current_mode_update) distinct from request_permission; it is free-form, not the Default / Accept Edits / Plan / Bypass enum. That selector is Claude-native; we should not promise it behaves identically across ACP agents.
Inline-chat/autocomplete for a Codex mode is not free. A Codex mode's non-agent paths would want an OpenAI-style provider, but Codex-class models in this repo today require /responses routing (implemented only in the Copilot path), so the existing OpenAI-compatible adapter may not reach them without work. This is an assumption to validate, not a given.
Adapter dependency. Codex-over-ACP relies on Zed's codex-acp. We avoid this for Claude by staying native.
Maintenance surface. A shared layer plus an ACP backend (plus possibly a native Codex backend) is real surface. It only pays off if the shared layer genuinely unifies the UX.
Non-goals
No changes to the LLM provider layer.
No migration of Claude mode off claude_agent_sdk in this work.
No single shared skill set used identically across agents (Codex mode's own skills management is in scope; sharing one skill set between agents is not).
No commitment to every agent in the ACP registry; we light up the ones we choose.
Open questions for the team
Agreement that Claude stays native and ACP is the path for additional agents?
Is Codex's native-vs-ACP placement acceptable to decide by spike?
Preferred sequencing: ACP-as-second-concrete then extract (lower interface risk), or extract-from-Claude first (independent cleanup value, provisional interface)?
Appetite for tracking a pre-1.0 protocol given the breadth it buys?
Which agents beyond Codex to light up first (Gemini, opencode, Goose)?
Portable context via rulesets (Phase 3) as the framing, or is there demand for a real skills bridge?
Summary
Today NBI has one deeply integrated coding agent: Claude Code (Claude mode). Every other agent (OpenAI Codex, opencode, Pi, GitHub Copilot CLI) is a launcher tile that opens a terminal and runs a command. The goal is a first-class Codex mode that behaves like Claude mode for the core agent UX: the agent drives the chat panel with the same streaming tool-call cards with diffs, per-tool approval, session resume, a settings tab, and skills management, instead of a terminal tile. Some Claude-specific surfaces (the permission-mode selector, plan-mode approval, the AskUserQuestion form, the heartbeat stall indicator, and managed-settings clamping) have no guaranteed analog and are tracked as open parity questions below. And a path to additional agent modes that behave the same way, without rebuilding the full depth of Claude mode for each one.
This proposes a shared agent-mode framework inside NBI that owns the agent UX (chat routing, tool-call cards, permission/confirmation cards, session resume, settings), fed by two kinds of backend:
claude_agent_sdk, with no behavior change. It is shipped, deeply integrated, and just gained the permission-mode selector. Migrating it would be regression risk for no change a user would notice.The framework and ACP are the means; the deliverable is agent modes that feel like Claude mode, the first of which is Codex.
What this issue asks: agreement on the direction (Claude native, ACP for additional agents) and a green light for the Phase 0 spike. Phases 1 and beyond would be split into their own issues once the spike reports.
Goal: a first-class Codex mode (and more like it)
The user-facing target is that switching NBI into Codex mode feels like Claude mode for the core agent UX: the agent drives the chat panel with streaming, tool-call cards, approvals, and resume, not a terminal tile. Each Claude-mode surface maps as follows. "Mechanism" is what delivers it; "Notes" is the honest caveat, including roughly how much NBI-side work it is.
session/prompt+session/update(or native SDK)tool_call/tool_call_updatewith diff content (path/oldText/newText); the React cards are reusedsession/request_permission(allow_once/allow_always/reject_once/reject_always)allow_alwaysmeans "remember" (persisted), which is not Claude's "approve for this session", so that label changesagent_thought_chunk,available_commands_updatesession/loadfor the resume itself; the listing and previews are NBI-sidecodex_settings, modeled onclaude_settingsclaude_settingsthreads through policy application, string overrides, credential scrubbing, ~10 capabilities fields, a save path, and a ~460-line React panel; mechanical but not a dict copysession/set_mode), not this enumKnown parity gaps (surfaces with no guaranteed analog). These are real Claude-mode behaviors a user notices, and parity for them is uncertain, not assumed:
ExitPlanModetool. Whether Codex exposes an equivalent approval step is unverified.request_permissionis allow/reject only, with no portable channel for an arbitrary question form.session/cancelnotification, not a subprocess-tree kill), and the TodoWrite task-list and @-mention context surfaces.So "behaves like Claude mode" is accurate for the core agent loop and the reusable React surfaces, and the framework plus ACP is what makes that affordable. It is not full parity: the NBI-side surfaces are each a medium build modeled on Claude's, and the gaps above are genuine open questions, not free wins.
Background: two integration layers, not one
NBI already has two separate "connect to a thing" layers, and this proposal only touches the second:
claudebinary.)ACP is about the agent layer, not the provider adapters.
Current state
notebook_intelligence/claude.py, 1,819 lines plus UI): forces chat routing toClaudeCodeChatParticipant, runs theclaudebinary as a streaming subprocess viaclaude_agent_sdk, gates tool calls through acan_use_toolpermission handler (custom_permission_handler), renders tool-call cards with inline diffs, supports plan mode and the permission-mode selector, resumes sessions from~/.claude/projects, and exposes a settings tab and capabilities fields.registerAgentCliLauncherinsrc/index.ts). Each resolves a CLI path and, on click, opens a Jupyter terminal and runs the command. This is good, deliberate design for "give me a terminal with the agent running," but it caps every non-Claude agent at that: no in-panel tool-call cards, no per-tool approvals, no session resume. First-class Codex is a request to move past the tile.So "parity between Claude mode and Codex mode" is not tweaking two peers; it is building a second agent mode. ACP is what makes that affordable and extensible to more agents.
Why ACP
ACP standardizes the editor/agent boundary the way LSP standardized editor/language-server. The pieces that matter for NBI (all verified against the v1 schema):
session/request_permissionis a client method the agent calls before running a tool, with standardized option kinds (allow_once/allow_always/reject_once/reject_always; noteallow_alwaysmeans remember the choice, not scope it to the session) and the tool call attached. This is the closest standardized analog to the confirmation card we built for Claude. It is the thing that lets a non-Claude agent reuse our approval UX instead of each one needing a bespoke handler.tool_call/tool_call_updatecarry id, title, kind (read / edit / delete / move / search / execute / think / fetch / switch_mode / other), status (pending / in_progress / completed / failed), and content including file diffs (path,oldText,newText) and live terminal output. Enough to drive tool-call cards with diffs.session/load(capability-gated).session/new.agent_thought_chunkandavailable_commands_update.@zed-industries/claude-code-acp; Codex via@zed-industries/codex-acp; GitHub Copilot is in public preview. One ACP backend lights up that set.We would drive ACP from the Jupyter server extension in Python. The Python SDK (
agent-client-protocolon PyPI, currently ~0.10.x, community-maintained under theagentclientprotocolorg and treated as canonical by the ACP docs) is the surface; it trails the core schema (Rust crate ~0.14.x), which is a real maturity caveat the spike must check.The shape: shared UX, backends translate
The framework owns everything that should look the same regardless of agent: forced chat routing while an agent mode is active, the client lifecycle (connect, stream, cancel, reconnect), tool-call cards, the permission/confirmation cards and approve/reject flow, the session-resume picker, and the settings/capabilities scaffold. Backends translate to a specific agent: a
claudebackend (claude_agent_sdk) and anacpbackend (the ACP Python SDK, parameterized by which agent binary to launch).Two design realities make this less symmetric than "two backends behind one interface," and the framework must be designed around them rather than discovering them late:
claude_agent_sdk, the SDK owns the loop and calls ourcan_use_toolcoroutine, which renders the card, blocks until the user answers, and returns the decision. With ACP, we own the JSON-RPC loop; the agent sends asession/request_permissionrequest that arrives on a separate channel from thetool_callstream and must be correlated by tool-call id. The honest seam is therefore not "a callback both backends implement." The framework should own the orchestration (render card, await the user, map the answer), and a backend should expose only "normalize this native tool event into a common ToolCall" and "apply this common Decision." That interface has to be shaped around ACP's request/response model (the stricter one), because the callback model retrofits onto request/response more easily than the reverse.can_use_toolcan returnupdated_input, rewriting the tool's input (NBI uses this forAskUserQuestionandExitPlanModeplan approval). ACP'srequest_permissionis allow/reject only. The shared Decision type must still carryupdated_inputeven though only the native backend can honor it, or we silently drop a Claude capability we already ship. This is a lowest-common-denominator trap to avoid by design.The point of the framework is to write the cards and the approval flow once. The point of naming the above is that the interface must be derived from both models, not extracted from Claude alone.
Claude native, Codex by spike
@zed-industries/claude-code-acpadapter exists and is maintained, so Claude could ride ACP later, which makes this a reversible decision. But migrating shipped, deeply integrated software onto a pre-1.0 protocol now is risk for no behavior change a user would notice. Keepclaude_agent_sdk.codex-acpadapter already maps onto ACP'srequest_permissionfor exec and patch-apply tools. So the value ACP adds is a uniform approval surface across agents, not that Codex otherwise lacks one. Thecodex-acpadapter is real, Zed-maintained, and shipping every few weeks (it is what Zed itself uses for Codex). The open question is therefore not whether per-tool approval routes through ACP (it demonstrably does for those categories) but which tool categories surface for approval versus skip it, and under which approval policy. The spike maps that.Phase 0: the spike
A standalone Python script (outside the extension) that drives Codex through
codex-acpusing the ACP Python SDK. Done when all of these are answered:codex-acpalready routes exec and patch-apply approvals throughsession/request_permission. Map which tool categories surface for approval (exec, patch-apply, MCP, granularrequest_permissions) versus which skip it (pure in-process tools, or auto/never approval policies), and how that shifts with the approval policy. The aim is to know exactly what we can promise users about confirmation, not to discover whether approval routes at all.agent-client-protocol(trailing the core schema) actually exposesession/request_permission,tool_call/tool_call_updatewith diffs,session/load, andsession/newwithmcpServers? Pin exact package and version, and noteopenai-codex(the official Codex Python SDK, currently beta) only if we end up comparing the native path.wait_for_chat_user_inputpolling a signal while the websocket handler answers on the Tornado loop). A green light from a plainasyncio.run()script proves the protocol works but says little about whether it survives the extension's threading model, which is where these integrations actually fail.fs/*,terminal/*) concurrently with streaming tool updates, so the ACP backend is a bidirectional server, not a "send query, read stream" client like the Claude SDK hides. Confirm the Python SDK's loop services concurrent inbound requests cleanly.create_sdk_mcp_server). ACP'ssession/newmcpServersexpects stdio or socket servers, so the same tools likely need to run as a real stdio MCP server for ACP. Confirm we can pass and invoke them.codex-acpactually loads skills from its skills directory (~/.agents/skills), so NBI's skills management has the same effect for Codex mode that it has for Claude mode.The spike de-risks the Codex native-vs-ACP decision, the protocol adequacy, and (most importantly) the embedding, on an agent we actually care about, before framework code is written.
Phased plan and sequencing
The order matters and involves a real trade. The lower-risk order is spike, then a working ACP integration as a second concrete implementation alongside Claude, then extract the shared framework once two real implementations exist to factor against. Extracting the framework from Claude alone first is tempting (it has independent value as cleaner code and is reviewable in isolation) but risks encoding Claude's inverted, callback-driven assumptions as the interface and then bending ACP to fit. We propose to treat any interface drawn before the ACP integration exists as provisional.
(If the team prefers the extract-first order for the independent-cleanup value, that is defensible, but the interface should still be treated as provisional until the ACP integration lands.)
Skills
Two different things hide under "skills," and only one is deferred:
SKILL.mddirectory with requirednameanddescriptionfrontmatter, and both reference the standard in their own first-party docs. The core format is genuinely shared; the differences are at the edges. The directory differs (Claude under~/.claude/skills, Codex under~/.agents/skills), and each vendor layers its own extensions (Claude adds frontmatter fields likeallowed-toolsplus invocation control; Codex declares tool dependencies and policy in a separateagents/openai.yaml). So NBI's existing skills UI (the Skills tab,SkillManager, the managed-skills reconciler) becomes agent-aware and manages the agent's own directory. This is a medium NBI-side change rather than a path swap: the skill scope axis today is user/project, not agent, and a singleSkillManageris wired in. It is also spike-gated, because writing into~/.agents/skillsonly helps if Codex's runtime discovers it under headless drive. The spike should confirm two things: that a Codex driven throughcodex-acpactually loads its skills directory, and that it tolerates the Claude-onlyallowed-toolsfield (marked experimental in the standard, and not documented by Codex) harmlessly rather than erroring. Mid-session skills hot-reload (the "Skills reloaded" banner) is a further open question, separate from on-disk management.Rulesets remain the portable cross-agent instruction primitive, with a caveat. NBI injects rulesets (
~/.jupyter/nbi/rules) into the LLM-provider chat path's system prompt, provider-agnostically. Note that Claude mode does not currently receive injected rulesets (its system prompt is built separately), so extending rulesets to the agent-mode layer would be new work, not parity with existing behavior. If we want one set of instructions that applies across agent modes (distinct from skills), rulesets are the cleaner path than a shared-skill bridge (Phase 3, optional).Security considerations
ACP agents execute tools (edits, shell commands, reads), so this deserves its own heading:
codex-acp, but the agent only may callsession/request_permission: an agent in an auto policy, or one running a pure in-process tool, can act without asking. We must not present ACP-agent approvals with the same guarantee we give Claude until the spike maps the coverage.fs/*andterminal/*methods we control, a second chokepoint, but not full mediation of in-process tools.disableBypassPermissionsMode,defaultMode) to clamp permission modes. ACP agents have no equivalent, so the framework must not assume that admin control surface exists for them.Alternatives considered
Compatibility
No change for existing Claude users. Claude mode keeps
claude_agent_sdk, the same UI, settings, and permission-mode selector. The eventual shared-layer extraction is a behind-the-scenes change with identical behavior; ACP agents are new, opt-in, and off by default.Risks and caveats
session/set_mode,current_mode_update) distinct fromrequest_permission; it is free-form, not the Default / Accept Edits / Plan / Bypass enum. That selector is Claude-native; we should not promise it behaves identically across ACP agents./responsesrouting (implemented only in the Copilot path), so the existing OpenAI-compatible adapter may not reach them without work. This is an assumption to validate, not a given.codex-acp. We avoid this for Claude by staying native.Non-goals
claude_agent_sdkin this work.Open questions for the team