
[follow-up #179] claude-agent-sdk runtime multimodal — wire images via AsyncIterable<SDKUserMessage> #259

Description

@s2agi

Context

In #179 M5b must-fix 2-C (commit b875a16 on feat/179-feishu-agent-sdk-channel), the claude-agent-sdk runtime now accepts images?: string[] in processWithClaude for symmetry with the codex / grok branches, but downgrades non-empty images to a text-only prompt plus a warn line (mirroring the Grok behavior already in the codebase).

The downgrade was chosen to keep M5b's blast radius scoped — the proper multimodal fix touches the SDK call shape itself.

What this issue tracks

Real multimodal wiring for claude-agent-sdk: send images as actual content blocks the LLM can see.

Sketch

claude-agent-sdk query({prompt}) accepts prompt: string | AsyncIterable<SDKUserMessage>. Each SDKUserMessage carries a MessageParam (Anthropic spec) whose content can be an array of blocks including:

{ type: "image", source: { type: "base64", media_type: "image/png", data: "<b64>" } }
{ type: "text",  text: "..." }

processWithClaude would switch from query({prompt: task}) to query({prompt: asyncIter([{type:"user", message:{role:"user", content:[textBlock, ...imageBlocks]}}])}) when images are non-empty.
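
A minimal sketch of that message construction, for illustration only. The types below are local stand-ins that mirror the block shapes listed above, not the SDK's own typings; the real SDKUserMessage may require additional fields (for example a session id), so check the installed typings. The helper names (mediaTypeFor, imageBlock, buildUserMessage, loadImageBlocks, singleMessage) and the extension table are hypothetical.

```typescript
import { readFile } from "node:fs/promises";
import { extname } from "node:path";

// Local stand-ins for the SDK / Anthropic message types (assumption: shapes only).
type TextBlock = { type: "text"; text: string };
type ImageBlock = {
  type: "image";
  source: { type: "base64"; media_type: string; data: string };
};
type UserMessageSketch = {
  type: "user";
  message: { role: "user"; content: Array<TextBlock | ImageBlock> };
};

// Hypothetical extension-to-media-type table; extend as needed.
const MEDIA_TYPES: Record<string, string> = {
  ".png": "image/png",
  ".jpg": "image/jpeg",
  ".jpeg": "image/jpeg",
  ".gif": "image/gif",
  ".webp": "image/webp",
};

// Returns undefined for extensions that are not in the table.
function mediaTypeFor(path: string): string | undefined {
  return MEDIA_TYPES[extname(path).toLowerCase()];
}

// Wraps raw image bytes in a base64 image block.
function imageBlock(mediaType: string, bytes: Uint8Array): ImageBlock {
  const data = Buffer.from(bytes).toString("base64");
  return { type: "image", source: { type: "base64", media_type: mediaType, data } };
}

// One user message: the text block first, then every image block.
function buildUserMessage(task: string, images: ImageBlock[]): UserMessageSketch {
  return {
    type: "user",
    message: { role: "user", content: [{ type: "text", text: task }, ...images] },
  };
}

// Reads each path from disk; unknown extensions are skipped here,
// where a real implementation would probably log a warning instead.
async function loadImageBlocks(paths: string[]): Promise<ImageBlock[]> {
  const blocks: ImageBlock[] = [];
  for (const p of paths) {
    const mediaType = mediaTypeFor(p);
    if (!mediaType) continue;
    blocks.push(imageBlock(mediaType, await readFile(p)));
  }
  return blocks;
}

// Async iterable that yields exactly one message.
async function* singleMessage(msg: UserMessageSketch): AsyncGenerator<UserMessageSketch> {
  yield msg;
}
```

With the SDK installed, the non-empty-images branch would then look roughly like query({prompt: singleMessage(buildUserMessage(task, await loadImageBlocks(images)))}), assuming the local type lines up with SDKUserMessage.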

Blocking work to verify first

  1. Vendor multimodal support: every Anthropic-compat endpoint the wizard currently lists (intern / MiniMax / mimo / deepseek / Anthropic native) needs a verify pass — does it accept image blocks? deepseek-v4-pro via https://api.deepseek.com/anthropic is the user-facing concern; verify via real curl with a small base64 image before landing.
  2. Cross-runtime regression: all current processWithClaude callers go through think(); flipping the prompt shape changes the SDK call signature for the common path, not just feishu. Need a verify pass on commhub-inbox + /loop wakes + standalone agent-node smoke.
  3. Memory / size: base64-encoded image data lives in memory during the SDK call. Uploads are already capped at 12MB per file ([feature][P1] hub file upload endpoint POST /api/upload + /api/task attachments REST alignment (dependency for #220 APP image/file sending) #221), so a single image is bounded; a multi-image message scales linearly. Confirm that no SDK-side limit is hit.
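
For items 1 and 3, a probe along these lines could stand in for the curl check. It is a sketch under assumptions: the /v1/messages path appended to the vendor base URL, and the x-api-key / anthropic-version headers, follow the native Anthropic Messages API and may differ per compat vendor. buildProbeBody, probeVendor and base64Length are hypothetical names.

```typescript
// Encoded length of n raw bytes in padded base64: 4 output chars per 3 input bytes.
function base64Length(rawBytes: number): number {
  return 4 * Math.ceil(rawBytes / 3);
}

// Request body for a one-shot "what's in this image?" probe.
function buildProbeBody(model: string, mediaType: string, b64: string) {
  return {
    model,
    max_tokens: 64,
    messages: [
      {
        role: "user",
        content: [
          { type: "image", source: { type: "base64", media_type: mediaType, data: b64 } },
          { type: "text", text: "What's in this image?" },
        ],
      },
    ],
  };
}

// Sends the probe and returns the raw status and body for manual inspection.
// Assumption: the endpoint lives at `${baseUrl}/v1/messages`.
async function probeVendor(
  baseUrl: string,
  apiKey: string,
  model: string,
  mediaType: string,
  b64: string,
): Promise<{ status: number; body: string }> {
  const res = await fetch(`${baseUrl}/v1/messages`, {
    method: "POST",
    headers: {
      "content-type": "application/json",
      "x-api-key": apiKey,
      "anthropic-version": "2023-06-01",
    },
    body: JSON.stringify(buildProbeBody(model, mediaType, b64)),
  });
  return { status: res.status, body: await res.text() };
}
```

On the size side, if the 12MB cap means 12 MiB, base64Length(12 * 1024 * 1024) is 16777216, so one maximum-size upload becomes 16 MiB of base64 text before JSON framing, and N images cost roughly N times that.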

Acceptance criteria

  • processWithClaude(task, from, [imagePath]) with non-empty images → LLM sees the image and can describe / reason about it (verify with a "what's in this image?" probe).
  • Existing text-only path byte-identical (query({prompt: task}) shape preserved when images is empty).
  • Per-vendor verification matrix (which vendors accept image blocks; warn-and-downgrade for vendors that don't).
  • Refresh the quickstart doc for #179 ([feature][roadmap] Agent Network IM compatibility layer: integrate Feishu / WhatsApp / WeCom / Slack and other IM platforms): drop the "agent-sdk runtime currently sends text-only" caveat once landed.
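
The second and third criteria amount to a three-way decision on the prompt shape. A sketch, assuming the per-vendor capability is available as a boolean taken from the verification matrix; decidePrompt and PromptDecision are hypothetical names, not existing code:

```typescript
type PromptDecision =
  | { kind: "text"; prompt: string }
  | { kind: "multimodal"; task: string; images: string[] };

// Chooses the prompt shape for one call.
// - no images: plain string prompt, so the existing text-only path is unchanged
// - images but vendor cannot take image blocks: warn and downgrade to text
// - images and a capable vendor: multimodal message
function decidePrompt(
  task: string,
  images: string[] | undefined,
  vendorAcceptsImages: boolean,
  warn: (msg: string) => void = console.warn,
): PromptDecision {
  if (!images || images.length === 0) {
    return { kind: "text", prompt: task };
  }
  if (!vendorAcceptsImages) {
    warn(`dropping ${images.length} image(s): vendor does not accept image blocks`);
    return { kind: "text", prompt: task };
  }
  return { kind: "multimodal", task, images };
}
```

Returning a tagged decision rather than calling the SDK directly keeps the text-only branch trivially comparable to the current query({prompt: task}) call, and makes the downgrade path easy to unit test.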

Refs
