fix(esc): mid-stream cancellation for OpenAI-compatible + Minimax providers#145

Merged
ericleepi314 merged 1 commit into main from fix/esc-cancel-openai-compatible on May 15, 2026
Conversation

@ericleepi314
Collaborator

Summary

User-reported symptom

After #144 merged, a user testing with LiteLLM → Anthropic Claude Opus 4.7 reported ESC still took ~10s during the model's "Thinking…" phase. The call path went through openai_compatible.py (LiteLLM exposes an OpenAI-compatible API), not anthropic_provider.py, so #144's listener never registered. This PR closes that gap.

Changes

  • src/providers/openai_compatible.py — two defenses in chat_stream_response:

    • Response-close listener: registered on abort_signal via add_listener(..., once=True). Calls stream.response.close() to close the underlying httpx socket, so the SDK's blocking next-chunk read raises immediately. Handles the user's exact case (a long gap between chunks during extended thinking / tool_use generation).
    • In-loop abort check: if abort_signal.aborted: break at the top of each for chunk in stream: iteration. Catches the SDK-prefetched-chunks case where the listener's close lands one iteration late.
    • Register-then-recheck race-safe ordering (matches the pattern from #144, "fix(esc): cancel streaming API call when abort signal trips").
    • Signal-state-authoritative exception translation in the except Exception block.
    • finally block detaches the listener so long-lived controllers don't accumulate listeners.
  • src/providers/minimax_provider.py — Minimax uses the anthropic SDK against its compatible endpoint, so it gets the AnthropicProvider treatment (response-close listener; no in-loop check needed because the with ... as stream: only exposes text_stream).

  • tests/test_openai_compat_abort_signal.py (new, 6 tests):

    • Pre-abort fast-path skips client.chat.completions.create (leaf-level assert_not_called())
    • Mid-stream close: synthetic 500ms stream + mid-stream trip returns within 1s, verifies stream.response.close() was called
    • Load-bearing in-loop check: passes an on_text_chunk callback, asserts seen == ["first"] not ["first", "second"]. Mutation-tested by deleting the in-loop check and watching the test fail with the exact expected message.
    • Normal-completion regression check
    • abort_signal=None legacy parity
    • Listener detachment after normal completion
  • tests/test_minimax_abort_signal.py (new, 4 tests): same shape as the Anthropic tests, with _ensure_client as the fast-path sentinel.
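The two defenses above can be sketched as a minimal, self-contained toy. ``AbortSignal``, ``OperationAborted``, and the fakes are illustrative stand-ins, not the project's real types; only ``add_listener(..., once=True)``, ``.aborted``, and the ordering of the defenses are taken from the PR text:

```python
class OperationAborted(Exception):
    pass

class AbortSignal:
    """Toy stand-in for the project's abort signal."""
    def __init__(self):
        self.aborted = False
        self._listeners = []

    def add_listener(self, fn, once=False):
        self._listeners.append((fn, once))

    def remove_listener(self, fn):
        self._listeners = [(f, o) for f, o in self._listeners if f is not fn]

    def abort(self):
        self.aborted = True
        fired = list(self._listeners)                  # snapshot
        self._listeners = [(f, o) for f, o in fired if not o]
        for fn, _ in fired:
            fn()

def stream_with_abort(stream, abort_signal, on_text_chunk):
    close_cb = lambda: stream.response.close()
    abort_signal.add_listener(close_cb, once=True)     # defense 1: close listener
    try:
        if abort_signal.aborted:                       # register-then-recheck
            raise OperationAborted()
        for chunk in stream:
            if abort_signal.aborted:                   # defense 2: in-loop check
                break
            on_text_chunk(chunk)
    except OperationAborted:
        raise
    except Exception:
        if abort_signal.aborted:                       # signal is authoritative
            raise OperationAborted()
        raise
    finally:
        abort_signal.remove_listener(close_cb)         # no listener accumulation

# Toy run: the chunk callback trips the signal after the first chunk,
# mimicking ESC landing mid-stream.
class FakeResponse:
    closed = False
    def close(self):
        self.closed = True

class FakeStream:
    def __init__(self, chunks):
        self.response = FakeResponse()
        self._chunks = chunks
    def __iter__(self):
        return iter(self._chunks)

signal = AbortSignal()
seen = []
def on_chunk(chunk):
    seen.append(chunk)
    signal.abort()

fake = FakeStream(["first", "second"])
stream_with_abort(fake, signal, on_chunk)
# seen holds only "first": the in-loop check stopped the second chunk.
```

The in-loop check is what keeps ``seen`` at ``["first"]`` even though the close listener has already fired by the time the second chunk is dequeued.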

Test plan

  • 6 new OpenAI-compat tests pass (1 mutation-verified load-bearing)
  • 4 new Minimax tests pass
  • 15 total provider abort tests pass (5 Anthropic + 4 Minimax + 6 OpenAI-compat)
  • 4419 broader related tests pass
  • Critic subagent review: APPROVE (after fixing a stale docstring on this round and a non-load-bearing in-loop test on the prior round)
  • Manual: kick off a long prompt on the LiteLLM-proxied stack, press ESC during the model's "Thinking…" phase, should unwind in ~50ms instead of 10s
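The mid-stream timing bound can be illustrated with a hedged, self-contained sketch: a fake stream blocks on a queue the way the SDK's iterator blocks on a socket read, and "closing" it unblocks that read almost immediately. ``FakeBlockingStream`` and the use of ``ConnectionError`` are assumptions for illustration; the PR notes that real SDKs raise varying exception classes when the response is closed mid-read:

```python
import queue
import threading
import time

class FakeBlockingStream:
    def __init__(self):
        self._q = queue.Queue()

    def close(self):
        self._q.put(None)                      # sentinel: socket closed

    def __iter__(self):
        while True:
            item = self._q.get()               # blocks like a next-chunk read
            if item is None:
                raise ConnectionError("stream closed mid-read")
            yield item

stream = FakeBlockingStream()
threading.Timer(0.1, stream.close).start()     # "ESC" trips ~100ms in
start = time.monotonic()
aborted_cleanly = False
try:
    for _chunk in stream:
        pass
except ConnectionError:
    aborted_cleanly = True
elapsed = time.monotonic() - start             # well under the 1s bound
```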

Follow-up (deferred)

Three-way duplication of the response-close-listener pattern across AnthropicProvider, MinimaxProvider, and OpenAICompatibleProvider. The three contexts differ enough (Anthropic has the watchdog + non-streaming fallback, OpenAI-compat has bare iterator + in-loop check, Minimax has with-block + get_final_message) that premature extraction would either grow the helper to a 4-knob API or leak abstraction. Will file as a separate refactor PR.

🤖 Generated with Claude Code

…viders

PR #144 added abort-signal-aware streaming for ``AnthropicProvider``,
but ``OpenAICompatibleProvider`` and ``MinimaxProvider`` got only the
pre-call fast-path. Users running through LiteLLM / GLM / OpenAI /
DeepSeek — the most common "OpenAI-compatible proxy → Claude" stack —
still saw ESC wait the full model latency before the post-API abort
check fired. Same 20+ second symptom from before #144.

Port the response-close listener pattern from #144's
AnthropicProvider:

* Register a listener on the abort signal that calls
  ``stream.response.close()`` to close the underlying HTTP socket.
  Closes interrupt the SDK's blocking next-chunk read so the iterator
  raises immediately, even when the model is in a multi-second gap
  between chunks (extended thinking, tool_use generation).
* For OpenAI-compatible providers, additionally add an in-loop
  ``if abort_signal.aborted: break`` check at the top of each
  ``for chunk in stream`` iteration. Covers the case where chunks
  arrive back-to-back fast enough that the listener's close lands
  one iteration late, or where the SDK has already prefetched
  chunks past the close point.
* Signal-state-authoritative exception translation in the
  ``except Exception`` block — different SDK versions raise
  different exception classes when the response is closed
  mid-read, so the signal is the only stable abort indicator.
* Register-then-recheck ordering closes the sub-microsecond race
  where ``_fire`` can snapshot the listener list and silently drop
  a freshly-appended listener.
* ``finally`` block detaches the listener so long-lived
  controllers (the REPL engine's, reused across many turns) don't
  accumulate dead listeners.
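The register-then-recheck bullet can be made concrete with a deterministic toy. ``Signal`` and ``_fire`` here are illustrative stand-ins; the point is only that a fire which snapshots the listener list will silently drop a listener added after the snapshot, and that rechecking ``.aborted`` after registering is what saves the caller:

```python
class Signal:
    """Toy signal whose _fire snapshots the listener list."""
    def __init__(self):
        self.aborted = False
        self._listeners = []

    def add_listener(self, fn):
        self._listeners.append(fn)

    def _fire(self):
        self.aborted = True
        for fn in list(self._listeners):       # snapshot
            fn()

# Buggy ordering: check first, then register. An abort landing in the
# gap fires against a snapshot that lacks the close callback.
buggy_fired = []
sig = Signal()
if not sig.aborted:
    sig._fire()                                # abort lands in the gap
    sig.add_listener(lambda: buggy_fired.append("close"))
# buggy_fired is still empty: the close callback was silently dropped.

# Safe ordering: register first, then recheck .aborted.
safe_fired = []
sig2 = Signal()
sig2.add_listener(lambda: safe_fired.append("close"))
sig2._fire()                                   # same abort timing
recheck_caught = sig2.aborted                  # the recheck observes it too
```

With register-first ordering, at least one of the two paths (listener fire or recheck) always observes the abort.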

Minimax wraps the anthropic SDK against its compatible endpoint,
so it gets the AnthropicProvider treatment (no in-loop check — the
``with client.messages.stream(...) as stream:`` pattern only exposes
``text_stream``, not a generic iterator).
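A hedged sketch of the Minimax-side shape, where the ``with``-block exposes only ``text_stream`` and the close listener is therefore the sole mid-stream cancellation path. ``AbortSignal``, ``OperationAborted``, ``minimax_stream_text``, and the fakes standing in for the anthropic SDK client are all illustrative assumptions:

```python
from contextlib import contextmanager
from types import SimpleNamespace

class OperationAborted(Exception):
    pass

class AbortSignal:
    """Toy stand-in for the project's abort signal."""
    def __init__(self):
        self.aborted = False
        self._listeners = []
    def add_listener(self, fn, once=False):
        self._listeners.append(fn)
    def remove_listener(self, fn):
        if fn in self._listeners:
            self._listeners.remove(fn)

def minimax_stream_text(client, abort_signal, on_text, **kwargs):
    with client.messages.stream(**kwargs) as stream:
        close_cb = lambda: stream.response.close()
        abort_signal.add_listener(close_cb, once=True)
        try:
            if abort_signal.aborted:           # register-then-recheck
                raise OperationAborted()
            for text in stream.text_stream:    # the only iterator exposed
                on_text(text)
        except OperationAborted:
            raise
        except Exception:
            if abort_signal.aborted:           # signal is authoritative
                raise OperationAborted()
            raise
        finally:
            abort_signal.remove_listener(close_cb)

# Toy run with fakes standing in for the anthropic SDK client.
class FakeStream:
    def __init__(self, texts):
        self.text_stream = iter(texts)
        self.response = SimpleNamespace(closed=False)
        self.response.close = lambda: setattr(self.response, "closed", True)

@contextmanager
def fake_stream_cm(**kwargs):
    yield FakeStream(["hello", " world"])

client = SimpleNamespace(messages=SimpleNamespace(stream=fake_stream_cm))
signal = AbortSignal()
collected = []
minimax_stream_text(client, signal, collected.append)
```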

Ten regression tests pin the contract:

* ``test_openai_compat_abort_signal.py`` (6 tests) — pre-abort
  fast-path with leaf-level ``assert_not_called()``, mid-stream
  close via response.close + timing bound, **load-bearing**
  in-loop check (asserts ``on_text_chunk`` saw only "first" not
  "second" — mutation-verified by deleting the in-loop check and
  watching the test fail), normal-completion regression check,
  ``abort_signal=None`` legacy parity, listener detachment.
* ``test_minimax_abort_signal.py`` (4 tests) — same shape as
  AnthropicProvider, with ``_ensure_client`` as the fast-path
  sentinel.

Three-way duplication of the close-listener pattern (Anthropic,
Minimax, OpenAI-compat) is acknowledged. Extracting a shared helper
is left as a follow-up — the three providers' surrounding contexts
differ enough (Anthropic has the watchdog + non-streaming fallback,
OpenAI-compat has bare-iterator semantics, Minimax has the
``with``-block + ``get_final_message``) that a premature extraction
would either grow the helper to a 4-knob API or leak abstraction.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
