Skip to content

fix(frontend): make disconnect indicator clickable to reconnect#9410

Open
bestvibes wants to merge 2 commits intomarimo-team:mainfrom
bestvibes:fix/clickable-disconnect-indicator
Open

fix(frontend): make disconnect indicator clickable to reconnect#9410
bestvibes wants to merge 2 commits intomarimo-team:mainfrom
bestvibes:fix/clickable-disconnect-indicator

Conversation

@bestvibes
Copy link
Copy Markdown

📝 Summary

Closes #9385

This pull request was primarily authored by a coding agent.

Problem

When marimo is accessed over an SSH-forwarded tunnel (or any flaky link), the kernel WebSocket drops on tunnel reset or laptop sleep. The browser shows the broken-link indicator in the top-left, and the only recovery is a full page reload — disruptive because it resets scroll position, cell focus, and any unsaved client state, despite the kernel surviving the WS drop in edit mode and the server already supporting session resume on a fresh connection (marimo/_server/api/endpoints/ws/ws_session_connector.py).

The frontend was forcing the reload because of two compounding issues:

  1. One-shot reconnect gate: useMarimoKernelConnection.shouldTryReconnecting only resets on a successful onOpen. After partysocket's 10-retry budget is exhausted, the gate stays closed and ws.reconnect() is never called again from our code.
  2. Upstream partysocket bug: in partysocket@1.1.10, _connectLock leaks in the _connect() max-retries early-return — even if the gate were reset, ws.reconnect() would silently no-op because the lock is permanently held. Fixed upstream in partykit#322, released as partysocket 1.1.13.
  3. Stuck spinner: the onClose default branch always sets state to CONNECTING, even after partysocket has exhausted its retry budget — the spinner outlives the actual retry loop with no escape.

The broken-link icon also gives the visual impression of being clickable but is a no-op <div> today.

What changed

This is two commits — a dependency bump (precondition) followed by the frontend fix.

chore(frontend): bump partysocket 1.1.10 → 1.1.13

Patch-version bump pulling in the upstream _connectLock fix. No API-surface changes — verified that retryCount, reconnect(), and maxRetries signatures are identical between 1.1.10 and 1.1.13. The lockfile diff is exactly 6 lines, all about partysocket.

fix(frontend): make disconnect indicator clickable to reconnect

  • Disconnect indicator becomes a real <button> (status.tsx). Native button gives Enter/Space activation and accessibility; previously it was a <div> with a tooltip. When the connection is in a recoverable CLOSED state, clicking calls a new reconnect() exposed by useMarimoKernelConnection. When the connection is CLOSED for any other reason, the button is disabled.
  • reconnect() probes the runtime first via the existing runtimeManager.isHealthy() (/health). If the server is genuinely unreachable (e.g. the SSH tunnel is dead), we transition straight back to CLOSED + KERNEL_DISCONNECTED in ~1s instead of burning partysocket's full retry budget on the silent click. The probe also resets the one-shot reconnect gate so subsequent automatic retries work.
  • Surface CLOSED when partysocket has given up. In the onClose default branch, when ws.retryCount >= MAX_RETRIES, transition to CLOSED + KERNEL_DISCONNECTED instead of CONNECTING. This makes the disconnect icon reappear and become clickable again, so the user can manually retry without reloading. Without this, the spinner stayed up forever even after partysocket had stopped trying.
  • Refactor: the inline switch on event.reason in onClose is extracted into a pure classifyCloseEvent helper (close-handler.ts) so the regression behavior can be unit-tested. The hook is now thin glue: classify → setConnection → optionally close transport or call tryReconnecting.
  • Tests: close-handler.test.ts (15 cases) covers each terminal reason plus the retry-budget-exhaustion path. status.test.tsx (3 cases) covers the click → callback wiring and the disabled state.

Heads-up for review

  • runtimeManager.isHealthy() has a side effect (setDOMBaseUri() on success). It was previously only triggered during initial startup; now it can fire on each manual reconnect. Almost certainly a no-op in practice (URL hasn't changed) but flagging in case you want it gated.
  • IConnectionTransport.retryCount was added to the transport interface so the close handler can detect partysocket give-up. This leaks a partysocket-specific concept into a transport-agnostic interface; an alternative would be to have the transport emit a synthetic close event with a "gave-up" reason. Open to that direction if preferred — current approach was chosen for minimal diff.
  • No automatic-behavior changes for users on a healthy connection. The only paths that exercise the new code are user-initiated clicks and the previously-stuck post-give-up case.
  • No new user-facing strings. "kernel not found" / tooltip text reuse existing copy; only new string is the tooltip suffix "— click to reconnect".

📋 Pre-Review Checklist

  • For large changes, or changes that affect the public API: this change was discussed or approved through an issue, on Discord, or the community discussions (Please provide a link if applicable).
  • Any AI generated code has been reviewed line-by-line by the human PR author, who stands by it.
  • Video or media evidence is provided for any visual changes (optional).

✅ Merge Checklist

  • I have read the contributor guidelines.
  • Documentation has been updated where applicable, including docstrings for API changes.
  • Tests have been added for the changes made.

I have read the CLA Document and I hereby sign the CLA

Pulls in upstream fix partykit/partykit#322: `_connectLock` was leaked
in the max-retries early-return of `_connect()`, so once partysocket
exhausted its retry budget every subsequent `reconnect()` call was a
silent no-op because the lock was permanently held. This made any
manual or programmatic reconnect attempt fail after a long outage,
forcing a full page reload to recover.

Patch-version bump, no API surface changes.
When marimo runs over an SSH-forwarded tunnel and the WebSocket drops
(tunnel reset, laptop sleep, NAT idle reaping), the only way to
recover today is a full page reload. The kernel survives the WS drop
in edit mode and the server already supports session resume on a
fresh connection — only the frontend was forcing the reload.

- Disconnect indicator is now a `<button>` when the connection is in
  a recoverable CLOSED state. Click invokes a new `reconnect()`
  exposed by `useMarimoKernelConnection` that resets the one-shot
  reconnect gate and kicks off `ws.reconnect()`.
- `reconnect()` probes the runtime via `runtimeManager.isHealthy()`
  before reopening the socket so a click while the tunnel is dead
  fails in ~1s instead of burning partysocket's full retry budget.
- The `onClose` default branch now transitions to CLOSED +
  KERNEL_DISCONNECTED when partysocket has exhausted `maxRetries`,
  instead of leaving the UI stuck on the spinner indefinitely.
- Close-event classification extracted into a pure helper
  (`classifyCloseEvent`) with unit tests covering each terminal
  reason and the retry-budget exhaustion path.

Closes marimo-team#9385
@vercel
Copy link
Copy Markdown

vercel Bot commented Apr 28, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
marimo-docs Ready Ready Preview, Comment Apr 28, 2026 2:54am

Request Review

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 28, 2026

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

@bestvibes
Copy link
Copy Markdown
Author

I have read the CLA Document and I hereby sign the CLA

@bestvibes bestvibes marked this pull request as ready for review May 6, 2026 02:23
Copy link
Copy Markdown
Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No issues found across 13 files

Architecture diagram
sequenceDiagram
    participant User as User (click)
    participant Status as StatusOverlay (status.tsx)
    participant Hook as useMarimoKernelConnection
    participant CloseHandler as close-handler.ts
    participant Transport as ReconnectingWebSocket (partysocket)
    participant Runtime as Runtime Manager (/health)
    participant Backend as Backend Server

    Note over User,Backend: Manual Reconnect Flow (NEW)

    User->>Status: click disconnect icon (<button>)
    Status->>Status: NEW: button is rendered, onReconnect passed
    Status->>Hook: NEW: reconnect()

    Hook->>Hook: NEW: shouldTryReconnecting = true
    Hook->>Hook: NEW: setConnection(CONNECTING)
    Hook->>Runtime: NEW: isHealthy() → GET /health

    alt Health check succeeds
        Runtime-->>Hook: 200 OK
        Note over Runtime,Hook: CHANGED: setDOMBaseUri() side effect
        Hook->>Transport: NEW: ws.reconnect()
        Transport->>Transport: NEW: partysocket bugfix (1.1.13) unlocks _connectLock
        Transport->>Backend: WebSocket upgrade
        Backend-->>Transport: onOpen
        Transport-->>Hook: onOpen callback
        Hook->>Hook: shouldTryReconnecting = false
        Hook-->>Status: setConnection(OPEN)
    else Health check fails (server unreachable)
        Runtime-->>Hook: 4xx/5xx/timeout
        Hook->>Hook: NEW: setConnection(CLOSED, KERNEL_DISCONNECTED)
        Hook-->>Status: button stays disabled, icon shown
    end

    Note over Transport,Hook: Automatic Retry with Budget Tracking (CHANGED)

    Transport->>Backend: WebSocket close event
    Transport-->>Hook: onClose(event)
    Hook->>CloseHandler: classifyCloseEvent(event, {retryCount, maxRetries})

    alt Terminal reason (MARIMO_ALREADY_CONNECTED, MARIMO_SHUTDOWN, etc.)
        CloseHandler-->>Hook: { kind: "terminal", status: CLOSED, closeTransport }
        Hook->>Transport: ws.close() to prevent further reconnects
        Hook-->>Status: setConnection(CLOSED with appropriate reason)
    else Retry budget exhausted (retryCount >= MAX_RETRIES=10)
        CloseHandler-->>Hook: { kind: "gave-up", status: CLOSED, KERNEL_DISCONNECTED }
        Hook-->>Status: CHANGED: setConnection(CLOSED) instead of CONNECTING
        Note over Hook,Status: Spinner replaced by clickable disconnect icon
    else Transient close (retries remaining)
        CloseHandler-->>Hook: { kind: "retry", status: CONNECTING }
        Hook->>Transport: tryReconnecting(code, reason)
        Transport->>Backend: automatic partysocket retry
    end

    Note over Status,User: Disconnect Icon Behavior (CHANGED)

    alt Connection is CLOSED + KERNEL_DISCONNECTED + onReconnect provided
        Status-->>User: NEW: <button> with "click to reconnect" tooltip
    else Connection is CLOSED + no onReconnect
        Status-->>User: disabled <button>
    else Connection is CLOSED + ALREADY_RUNNING (canTakeover)
        Status-->>User: LockedIcon (no reconnect button)
    end
Loading

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Disconnect indicator unclickable; reconnect requires page reload after WS retries exhausted

1 participant