@zxzinn zxzinn commented Dec 11, 2025

Description

This PR exposes thinking/reasoning content from Claude 4 Sonnet with extended thinking mode through additional_kwargs["thinking_delta"] during streaming in BedrockConverse.

Problem

When using BedrockConverse with Claude 4 Sonnet (model ID: us.anthropic.claude-sonnet-4-20250514-v1:0) with thinking mode enabled, the thinking/reasoning content is only available in the raw response object via event.raw['contentBlockDelta']['delta']['reasoningContent']['text']. This makes it difficult for downstream consumers (like FunctionAgent) to access thinking content in a standard way.

Solution

Store thinking delta in ChatResponse.additional_kwargs["thinking_delta"] during streaming, following the existing pattern used by other LLMs for metadata like tool_calls and annotations.

Key changes:

  • Modified stream_chat() to populate additional_kwargs["thinking_delta"] when reasoning content is present
  • Modified astream_chat() with the same logic
  • Thinking content is still accumulated in ThinkingBlock in the final message
  • Added comprehensive unit tests
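The routing logic described above can be sketched as a simplified, self-contained simulation. This is not the actual BedrockConverse source: the event dict mirrors the `event.raw['contentBlockDelta']` shape quoted in the Problem section, and the plain `response` dict is a hypothetical stand-in for `ChatResponse`.

```python
# Sketch of the contentBlockDelta handling added to stream_chat().
# The response dict is a stand-in for ChatResponse (not the real class).

def handle_delta(event):
    """Route one stream event into either delta or thinking_delta."""
    delta = event["contentBlockDelta"]["delta"]
    response = {"delta": None, "additional_kwargs": {}}
    if "reasoningContent" in delta:
        # Thinking tokens: surfaced via additional_kwargs["thinking_delta"]
        response["additional_kwargs"]["thinking_delta"] = (
            delta["reasoningContent"]["text"]
        )
    elif "text" in delta:
        # Regular answer tokens: surfaced via the normal delta field
        response["delta"] = delta["text"]
    return response

thinking_event = {
    "contentBlockDelta": {"delta": {"reasoningContent": {"text": "Let me add..."}}}
}
text_event = {"contentBlockDelta": {"delta": {"text": "42"}}}

print(handle_delta(thinking_event)["additional_kwargs"]["thinking_delta"])
print(handle_delta(text_event)["delta"])
```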

Example Usage

Basic Streaming with Thinking Content

from llama_index.llms.bedrock_converse import BedrockConverse
from llama_index.core.base.llms.types import ChatMessage, MessageRole

llm = BedrockConverse(
    model="us.anthropic.claude-sonnet-4-20250514-v1:0",
    thinking={"type": "enabled", "budget_tokens": 1024},
    temperature=1,  # Required for thinking mode
)

messages = [ChatMessage(role=MessageRole.USER, content="What is 15 + 27?")]

# Stream and access thinking deltas in real-time
for response in llm.stream_chat(messages):
    # Access thinking content (reasoning process)
    if "thinking_delta" in response.additional_kwargs:
        thinking = response.additional_kwargs["thinking_delta"]
        print(f"Thinking: {thinking}", end="", flush=True)

    # Access text content (final answer)
    if response.delta:
        print(f"Answer: {response.delta}", end="", flush=True)

Collecting Complete Thinking Content

from llama_index.core.base.llms.types import ThinkingBlock

responses = list(llm.stream_chat(messages))

# Collect all thinking deltas
thinking_content = [
    r.additional_kwargs["thinking_delta"]
    for r in responses
    if "thinking_delta" in r.additional_kwargs
]
full_thinking = "".join(thinking_content)

# Or access accumulated thinking from final message
final_response = responses[-1]
thinking_blocks = [
    b for b in final_response.message.blocks
    if isinstance(b, ThinkingBlock)
]
if thinking_blocks:
    accumulated_thinking = thinking_blocks[0].content

Async Streaming

import asyncio

async def main():
    # astream_chat must be awaited to obtain the async generator
    async for response in await llm.astream_chat(messages):
        if "thinking_delta" in response.additional_kwargs:
            print(f"Thinking: {response.additional_kwargs['thinking_delta']}")
        if response.delta:
            print(f"Answer: {response.delta}")

asyncio.run(main())

Important Notes:

  • temperature=1 is required when using extended thinking mode
  • thinking_delta and delta are mutually exclusive in each response
  • Final message's ThinkingBlock contains the complete accumulated thinking content
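The mutual-exclusivity note above means a consumer can route each response into exactly one of two buffers. A minimal self-contained sketch, using plain dicts as hypothetical stand-ins for the streamed `ChatResponse` objects:

```python
# Simulated stream: each response carries either a thinking delta or a
# text delta, never both (per the mutual-exclusivity guarantee above).

def split_stream(responses):
    """Separate thinking tokens and answer tokens into two strings."""
    thinking, answer = [], []
    for r in responses:
        if "thinking_delta" in r["additional_kwargs"]:
            thinking.append(r["additional_kwargs"]["thinking_delta"])
        elif r["delta"]:
            answer.append(r["delta"])
    return "".join(thinking), "".join(answer)

stream = [
    {"delta": None, "additional_kwargs": {"thinking_delta": "15 + 27 "}},
    {"delta": None, "additional_kwargs": {"thinking_delta": "= 42"}},
    {"delta": "The answer is 42.", "additional_kwargs": {}},
]
print(split_stream(stream))  # ('15 + 27 = 42', 'The answer is 42.')
```

The joined thinking deltas should match the complete content of the final message's ThinkingBlock.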

New Package?

  • [ ] Yes
  • [x] No

Version Bump?

  • [x] Yes (0.12.2 → 0.12.3)
  • [ ] No

Type of Change

  • [ ] Bug fix (non-breaking change which fixes an issue)
  • [x] New feature (non-breaking change which adds functionality)
  • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • [ ] This change requires a documentation update

How Has This Been Tested?

  • I added new unit tests to cover this change
    • Unit tests (tests/test_thinking_delta.py):
      • test_thinking_delta_populated_in_stream_chat: Verifies thinking_delta is correctly populated in additional_kwargs
      • test_thinking_delta_none_for_non_thinking_content: Ensures None for regular text without thinking
      • test_thinking_block_in_message_blocks: Validates ThinkingBlock accumulation in final message
    • Integration test (tests/test_llms_bedrock_converse.py):
      • test_bedrock_converse_thinking_delta_in_additional_kwargs: Real AWS Bedrock API test verifying thinking_delta in both sync and async streaming

Suggested Checklist:

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added Google Colab support for the newly added notebooks.
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I ran uv run make format; uv run make lint to appease the lint gods

Add thinking_delta field to ChatResponse to capture
reasoning content deltas during streaming. Extract reasoning text
into a separate variable for clarity and populate thinking_delta in
stream responses. Update version to 0.12.3 and add comprehensive
tests for thinking delta functionality.
@dosubot dosubot bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Dec 11, 2025
message: ChatMessage
raw: Optional[Any] = None
delta: Optional[str] = None
thinking_delta: Optional[str] = Field(
Collaborator commented:

Id rather not add this. Lets abuse additional_kwargs for now like other llms do. At some point we need to expose streaming content blocks instead

Contributor Author (@zxzinn) replied:

@logan-markewich
I've updated the implementation and added the usage example in the PR description.

Move thinking_delta from a dedicated field to additional_kwargs
to simplify the ChatResponse structure. Update bedrock_converse
streaming methods and tests to access thinking_delta via
additional_kwargs instead of the direct field.
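The access-pattern change described in that follow-up commit can be illustrated with a stand-in object (`SimpleNamespace` here is illustrative, not the real `ChatResponse` class):

```python
from types import SimpleNamespace

# Stand-in for a streamed ChatResponse after the refactor: no dedicated
# thinking_delta attribute, only the additional_kwargs dict.
response = SimpleNamespace(additional_kwargs={"thinking_delta": "step 1..."})

# Old (removed): response.thinking_delta
# New: look up the key in additional_kwargs; .get() returns None when
# the chunk carries regular text instead of thinking content.
thinking = response.additional_kwargs.get("thinking_delta")
print(thinking)  # step 1...
```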