Skip to content

refactor: handoff test#56

Merged
billybonks merged 4 commits into
mainfrom
refactor/handoff-test
May 18, 2026
Merged

refactor: handoff test#56
billybonks merged 4 commits into
mainfrom
refactor/handoff-test

Conversation

@billybonks
Copy link
Copy Markdown
Contributor

simplify and modernize the provider integration surface, especially around streaming handoff tests, while cleaning up legacy gateway/test code and improving provider option mapping.

all modern llms should support stream test but they may not do reasoning
since we have a reasoning test we can do that seperately
i want the hand off tests to automatically test the ouput of our generated
pairs

in order to do that we record the result and the handoff test will take all the results
and test all the messages to each pair
currently if we add a provider we do a lot of steps in testing to
add them the existing test and include them ot the handoff test

this pr makes it so that the handoff test relies on the provider test
Copy link
Copy Markdown

@managerbot-app managerbot-app Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed Changes

Managerbot reviewed 9 out of 65 changed files in this pull request and generated 8 comments.

Files not reviewed (56)
  • .pi/skills/live-provider-testing/SKILL.md
  • test/fixtures/handoff/stream_image_test/anthropic_apikey_messages_claude-sonnet-4-20250514.json
  • test/fixtures/handoff/stream_image_test/anthropic_oauth_messages_claude-sonnet-4-20250514.json
  • test/fixtures/handoff/stream_image_test/openai_apikey_completions_gpt-5.1.json
  • test/fixtures/handoff/stream_image_test/openai_apikey_responses_gpt-5.4.json
  • test/fixtures/handoff/stream_image_test/openai_oauth_codex_gpt-5.4.json
  • test/fixtures/handoff/stream_reasoning_test/anthropic_apikey_messages_claude-sonnet-4-20250514.json
  • test/fixtures/handoff/stream_reasoning_test/anthropic_oauth_messages_claude-sonnet-4-20250514.json
  • test/fixtures/handoff/stream_reasoning_test/openai_apikey_completions_gpt-5.1.json
  • test/fixtures/handoff/stream_reasoning_test/openai_apikey_responses_gpt-5.4.json
  • test/fixtures/handoff/stream_reasoning_test/openai_oauth_codex_gpt-5.4.json
  • test/fixtures/handoff/stream_test/anthropic_apikey_messages_claude-sonnet-4-20250514.json
  • test/fixtures/handoff/stream_test/anthropic_oauth_messages_claude-sonnet-4-20250514.json
  • test/fixtures/handoff/stream_test/openai_apikey_completions_gpt-5.1.json
  • test/fixtures/handoff/stream_test/openai_apikey_responses_gpt-5.4.json
  • test/fixtures/handoff/stream_test/openai_oauth_codex_gpt-5.4.json
  • test/fixtures/vcr_cassettes/integration/live/handoff_media_test_rb/test_live_handoff_media_anthropic_apikey_messages_claude-sonnet-4-20250514.yml
  • test/fixtures/vcr_cassettes/integration/live/handoff_media_test_rb/test_live_handoff_media_openai_apikey_completions_gpt-5_1.yml
  • test/fixtures/vcr_cassettes/integration/live/handoff_media_test_rb/test_live_handoff_media_openai_apikey_responses_gpt-5_4.yml
  • test/fixtures/vcr_cassettes/integration/live/handoff_media_test_rb/test_live_handoff_media_openai_oauth_codex_gpt-5_4.yml
  • test/fixtures/vcr_cassettes/integration/live/handoff_stream_image_test_rb/test_handoff_stream_image__anthropic_apikey_messages_claude-sonnet-4-20250514.yml
  • test/fixtures/vcr_cassettes/integration/live/handoff_stream_image_test_rb/test_handoff_stream_image__anthropic_oauth_messages_claude-sonnet-4-20250514.yml
  • test/fixtures/vcr_cassettes/integration/live/handoff_stream_image_test_rb/test_handoff_stream_image__openai_apikey_completions_gpt-5_1.yml
  • test/fixtures/vcr_cassettes/integration/live/handoff_stream_image_test_rb/test_handoff_stream_image__openai_apikey_responses_gpt-5_4.yml
  • test/fixtures/vcr_cassettes/integration/live/handoff_stream_image_test_rb/test_handoff_stream_image__openai_oauth_codex_gpt-5_4.yml
  • test/fixtures/vcr_cassettes/integration/live/handoff_stream_reasoning_test_rb/test_handoff_stream_reasoning__anthropic_apikey_messages_claude-sonnet-4-20250514.yml
  • test/fixtures/vcr_cassettes/integration/live/handoff_stream_reasoning_test_rb/test_handoff_stream_reasoning__anthropic_oauth_messages_claude-sonnet-4-20250514.yml
  • test/fixtures/vcr_cassettes/integration/live/handoff_stream_reasoning_test_rb/test_handoff_stream_reasoning__openai_apikey_completions_gpt-5_1.yml
  • test/fixtures/vcr_cassettes/integration/live/handoff_stream_reasoning_test_rb/test_handoff_stream_reasoning__openai_apikey_responses_gpt-5_4.yml
  • test/fixtures/vcr_cassettes/integration/live/handoff_stream_reasoning_test_rb/test_handoff_stream_reasoning__openai_oauth_codex_gpt-5_4.yml
  • test/fixtures/vcr_cassettes/integration/live/handoff_stream_test_rb/test_handoff_stream__anthropic_apikey_messages_claude-sonnet-4-20250514.yml
  • test/fixtures/vcr_cassettes/integration/live/handoff_stream_test_rb/test_handoff_stream__anthropic_oauth_messages_claude-sonnet-4-20250514.yml
  • test/fixtures/vcr_cassettes/integration/live/handoff_stream_test_rb/test_handoff_stream__openai_apikey_completions_gpt-5_1.yml
  • test/fixtures/vcr_cassettes/integration/live/handoff_stream_test_rb/test_handoff_stream__openai_apikey_responses_gpt-5_4.yml
  • test/fixtures/vcr_cassettes/integration/live/handoff_stream_test_rb/test_handoff_stream__openai_oauth_codex_gpt-5_4.yml
  • test/fixtures/vcr_cassettes/integration/live/handoff_test_rb/test_live_handoff_anthropic_apikey_messages_claude-sonnet-4-20250514.yml
  • test/fixtures/vcr_cassettes/integration/live/handoff_test_rb/test_live_handoff_openai_apikey_completions_gpt-5_1.yml
  • test/fixtures/vcr_cassettes/integration/live/handoff_test_rb/test_live_handoff_openai_apikey_responses_gpt-5_4.yml
  • test/fixtures/vcr_cassettes/integration/live/handoff_test_rb/test_live_handoff_openai_oauth_codex_gpt-5_4.yml
  • test/fixtures/vcr_cassettes/integration/live/stream_test_rb/test_live_basic_tool_call_anthropic_apikey_messages_claude-sonnet-4-20250514.yml
  • test/fixtures/vcr_cassettes/integration/live/stream_test_rb/test_live_basic_tool_call_anthropic_oauth_messages_claude-sonnet-4-20250514.yml
  • test/fixtures/vcr_cassettes/integration/live/stream_test_rb/test_live_basic_tool_call_openai_apikey_completions_gpt-5_1.yml
  • test/fixtures/vcr_cassettes/integration/live/stream_test_rb/test_live_basic_tool_call_openai_apikey_responses_gpt-5_4.yml
  • test/fixtures/vcr_cassettes/integration/live/stream_test_rb/test_live_basic_tool_call_openai_oauth_codex_gpt-5_4.yml
  • test/fixtures/vcr_cassettes/integration/live/stream_test_rb/test_live_multi_turn_tool_streaming_anthropic_apikey_messages_claude-sonnet-4-20250514.yml
  • test/fixtures/vcr_cassettes/integration/live/stream_test_rb/test_live_multi_turn_tool_streaming_anthropic_oauth_messages_claude-sonnet-4-20250514.yml
  • test/fixtures/vcr_cassettes/integration/live/stream_test_rb/test_live_multi_turn_tool_streaming_openai_apikey_completions_gpt-5_1.yml
  • test/fixtures/vcr_cassettes/integration/live/stream_test_rb/test_live_multi_turn_tool_streaming_openai_apikey_responses_gpt-5_4.yml
  • test/fixtures/vcr_cassettes/integration/live/stream_test_rb/test_live_multi_turn_tool_streaming_openai_oauth_codex_gpt-5_4.yml
  • test/fixtures/vcr_cassettes/integration/live/stream_test_rb/test_live_text_streaming_anthropic_apikey_messages_claude-sonnet-4-20250514.yml
  • test/fixtures/vcr_cassettes/integration/live/stream_test_rb/test_live_text_streaming_anthropic_oauth_messages_claude-sonnet-4-20250514.yml
  • test/fixtures/vcr_cassettes/integration/live/stream_test_rb/test_live_text_streaming_openai_apikey_completions_gpt-5_1.yml
  • test/fixtures/vcr_cassettes/integration/live/stream_test_rb/test_live_text_streaming_openai_apikey_responses_gpt-5_4.yml
  • test/fixtures/vcr_cassettes/integration/live/stream_test_rb/test_live_text_streaming_openai_oauth_codex_gpt-5_4.yml
  • scripts/generate_handoff_live_fixture.rb
  • scripts/generate_handoff_media_fixture.rb
Comments suppressed due to low confidence (1)
    test/integration/live/stream_test.rb:130
  • (test_completeness) Removing `reasoning: "high"` from `multi_turn_tool_stream_test` means reasoning is no longer exercised in this multi-turn test, which reduces test coverage for reasoning behavior in multi-turn scenarios.

SOURCE_TEST_PATH = File.expand_path("stream_image_test.rb", __dir__)
FIXTURE_DIR = File.expand_path("../../fixtures/handoff/stream_image_test", __dir__)

PAIRS = eval(File.read(SOURCE_TEST_PATH).match(/PAIRS = (\[.*?\])\s*\.freeze/m)[1]).freeze
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using eval to parse Ruby code from another file is a significant security risk and a fragile design. If the source file's syntax changes slightly, this will break silently or raise cryptic errors. Consider extracting the PAIRS constant into a shared module or configuration file that both tests can require directly.

SOURCE_TEST_PATH = File.expand_path("stream_reasoning_test.rb", __dir__)
FIXTURE_DIR = File.expand_path("../../fixtures/handoff/stream_reasoning_test", __dir__)

PAIRS = eval(File.read(SOURCE_TEST_PATH).match(/PAIRS = (\[.*?\])\s*\.freeze/m)[1]).freeze
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using eval to extract the PAIRS constant from a source file is dangerous and brittle. The regex match could fail silently returning nil, causing a NoMethodError on [1]. Extract PAIRS into a shared constant or module instead.

SOURCE_TEST_PATH = File.expand_path("stream_test.rb", __dir__)
FIXTURE_DIR = File.expand_path("../../fixtures/handoff/stream_test", __dir__)

PAIRS = eval(File.read(SOURCE_TEST_PATH).match(/PAIRS = (\[.*?\])\s*\.freeze/m)[1]).freeze
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using eval to extract the PAIRS constant from a source file is dangerous and brittle. The regex match could fail silently returning nil, causing a NoMethodError on [1]. Extract PAIRS into a shared constant or module instead.

pair_name = "#{provider}_#{model}".gsub(/[^A-Za-z0-9_.-]+/, "_")
path = File.join(fixture_dir, "#{pair_name}.json")
payload = File.exist?(path) ? JSON.parse(File.read(path)) : {}
payload[name.sub(/^test_/, "")] = jsonable_live_result(result)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name method here refers to the test instance's name (e.g., test_handoff_stream__openai_...). The sub(/^test_/, "") strips the prefix, but if the test name contains characters that are not valid JSON keys or filesystem-safe, this could cause issues. Consider sanitizing the key similarly to how pair_name is sanitized.


pair_name = "#{provider}_#{model}".gsub(/[^A-Za-z0-9_.-]+/, "_")
path = File.join(fixture_dir, "#{pair_name}.json")
payload = File.exist?(path) ? JSON.parse(File.read(path)) : {}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fixture file is read and parsed every time record_live_handoff_result is called within the same test run. If multiple tests for the same pair run concurrently, there is a race condition where one test could overwrite another's results. Consider whether this is a concern in CI environments.

when Hash
value.each_with_object({}) { |(key, item), acc| acc[key.to_s] = jsonable_live_result(item) }
else
value.respond_to?(:to_h) ? jsonable_live_result(value.to_h) : value
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value.respond_to?(:to_h) will match many Ruby objects (e.g., Array responds to to_h in Ruby 2.1+), potentially causing unexpected recursion or data loss. Consider being more explicit about which types should be converted via to_h (e.g., checking for a specific interface or using is_a? checks).


text = response.content.select { |block| block.type == "text" }.map(&:text).join(" ").downcase
assert_includes text, records.length.to_s
assert_includes text, "42"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The assertion assert_includes text, "42" is very likely to produce false positives since "42" is a common substring. For example, if the response contains "142" or "420", the assertion would still pass. Consider asserting on the full expected value in context, or use a more specific check.

@billybonks billybonks merged commit 21bbcd1 into main May 18, 2026
1 check passed
@billybonks billybonks deleted the refactor/handoff-test branch May 18, 2026 13:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant