refactor: handoff test#56
Conversation
all modern llms should support stream test but they may not do reasoning since we have a reasoning test we can do that seperately
i want the hand off tests to automatically test the ouput of our generated pairs in order to do that we record the result and the handoff test will take all the results and test all the messages to each pair
currently if we add a provider we do a lot of steps in testing to add them the existing test and include them ot the handoff test this pr makes it so that the handoff test relies on the provider test
There was a problem hiding this comment.
Reviewed Changes
Managerbot reviewed 9 out of 65 changed files in this pull request and generated 8 comments.
Files not reviewed (56)
- .pi/skills/live-provider-testing/SKILL.md
- test/fixtures/handoff/stream_image_test/anthropic_apikey_messages_claude-sonnet-4-20250514.json
- test/fixtures/handoff/stream_image_test/anthropic_oauth_messages_claude-sonnet-4-20250514.json
- test/fixtures/handoff/stream_image_test/openai_apikey_completions_gpt-5.1.json
- test/fixtures/handoff/stream_image_test/openai_apikey_responses_gpt-5.4.json
- test/fixtures/handoff/stream_image_test/openai_oauth_codex_gpt-5.4.json
- test/fixtures/handoff/stream_reasoning_test/anthropic_apikey_messages_claude-sonnet-4-20250514.json
- test/fixtures/handoff/stream_reasoning_test/anthropic_oauth_messages_claude-sonnet-4-20250514.json
- test/fixtures/handoff/stream_reasoning_test/openai_apikey_completions_gpt-5.1.json
- test/fixtures/handoff/stream_reasoning_test/openai_apikey_responses_gpt-5.4.json
- test/fixtures/handoff/stream_reasoning_test/openai_oauth_codex_gpt-5.4.json
- test/fixtures/handoff/stream_test/anthropic_apikey_messages_claude-sonnet-4-20250514.json
- test/fixtures/handoff/stream_test/anthropic_oauth_messages_claude-sonnet-4-20250514.json
- test/fixtures/handoff/stream_test/openai_apikey_completions_gpt-5.1.json
- test/fixtures/handoff/stream_test/openai_apikey_responses_gpt-5.4.json
- test/fixtures/handoff/stream_test/openai_oauth_codex_gpt-5.4.json
- test/fixtures/vcr_cassettes/integration/live/handoff_media_test_rb/test_live_handoff_media_anthropic_apikey_messages_claude-sonnet-4-20250514.yml
- test/fixtures/vcr_cassettes/integration/live/handoff_media_test_rb/test_live_handoff_media_openai_apikey_completions_gpt-5_1.yml
- test/fixtures/vcr_cassettes/integration/live/handoff_media_test_rb/test_live_handoff_media_openai_apikey_responses_gpt-5_4.yml
- test/fixtures/vcr_cassettes/integration/live/handoff_media_test_rb/test_live_handoff_media_openai_oauth_codex_gpt-5_4.yml
- test/fixtures/vcr_cassettes/integration/live/handoff_stream_image_test_rb/test_handoff_stream_image__anthropic_apikey_messages_claude-sonnet-4-20250514.yml
- test/fixtures/vcr_cassettes/integration/live/handoff_stream_image_test_rb/test_handoff_stream_image__anthropic_oauth_messages_claude-sonnet-4-20250514.yml
- test/fixtures/vcr_cassettes/integration/live/handoff_stream_image_test_rb/test_handoff_stream_image__openai_apikey_completions_gpt-5_1.yml
- test/fixtures/vcr_cassettes/integration/live/handoff_stream_image_test_rb/test_handoff_stream_image__openai_apikey_responses_gpt-5_4.yml
- test/fixtures/vcr_cassettes/integration/live/handoff_stream_image_test_rb/test_handoff_stream_image__openai_oauth_codex_gpt-5_4.yml
- test/fixtures/vcr_cassettes/integration/live/handoff_stream_reasoning_test_rb/test_handoff_stream_reasoning__anthropic_apikey_messages_claude-sonnet-4-20250514.yml
- test/fixtures/vcr_cassettes/integration/live/handoff_stream_reasoning_test_rb/test_handoff_stream_reasoning__anthropic_oauth_messages_claude-sonnet-4-20250514.yml
- test/fixtures/vcr_cassettes/integration/live/handoff_stream_reasoning_test_rb/test_handoff_stream_reasoning__openai_apikey_completions_gpt-5_1.yml
- test/fixtures/vcr_cassettes/integration/live/handoff_stream_reasoning_test_rb/test_handoff_stream_reasoning__openai_apikey_responses_gpt-5_4.yml
- test/fixtures/vcr_cassettes/integration/live/handoff_stream_reasoning_test_rb/test_handoff_stream_reasoning__openai_oauth_codex_gpt-5_4.yml
- test/fixtures/vcr_cassettes/integration/live/handoff_stream_test_rb/test_handoff_stream__anthropic_apikey_messages_claude-sonnet-4-20250514.yml
- test/fixtures/vcr_cassettes/integration/live/handoff_stream_test_rb/test_handoff_stream__anthropic_oauth_messages_claude-sonnet-4-20250514.yml
- test/fixtures/vcr_cassettes/integration/live/handoff_stream_test_rb/test_handoff_stream__openai_apikey_completions_gpt-5_1.yml
- test/fixtures/vcr_cassettes/integration/live/handoff_stream_test_rb/test_handoff_stream__openai_apikey_responses_gpt-5_4.yml
- test/fixtures/vcr_cassettes/integration/live/handoff_stream_test_rb/test_handoff_stream__openai_oauth_codex_gpt-5_4.yml
- test/fixtures/vcr_cassettes/integration/live/handoff_test_rb/test_live_handoff_anthropic_apikey_messages_claude-sonnet-4-20250514.yml
- test/fixtures/vcr_cassettes/integration/live/handoff_test_rb/test_live_handoff_openai_apikey_completions_gpt-5_1.yml
- test/fixtures/vcr_cassettes/integration/live/handoff_test_rb/test_live_handoff_openai_apikey_responses_gpt-5_4.yml
- test/fixtures/vcr_cassettes/integration/live/handoff_test_rb/test_live_handoff_openai_oauth_codex_gpt-5_4.yml
- test/fixtures/vcr_cassettes/integration/live/stream_test_rb/test_live_basic_tool_call_anthropic_apikey_messages_claude-sonnet-4-20250514.yml
- test/fixtures/vcr_cassettes/integration/live/stream_test_rb/test_live_basic_tool_call_anthropic_oauth_messages_claude-sonnet-4-20250514.yml
- test/fixtures/vcr_cassettes/integration/live/stream_test_rb/test_live_basic_tool_call_openai_apikey_completions_gpt-5_1.yml
- test/fixtures/vcr_cassettes/integration/live/stream_test_rb/test_live_basic_tool_call_openai_apikey_responses_gpt-5_4.yml
- test/fixtures/vcr_cassettes/integration/live/stream_test_rb/test_live_basic_tool_call_openai_oauth_codex_gpt-5_4.yml
- test/fixtures/vcr_cassettes/integration/live/stream_test_rb/test_live_multi_turn_tool_streaming_anthropic_apikey_messages_claude-sonnet-4-20250514.yml
- test/fixtures/vcr_cassettes/integration/live/stream_test_rb/test_live_multi_turn_tool_streaming_anthropic_oauth_messages_claude-sonnet-4-20250514.yml
- test/fixtures/vcr_cassettes/integration/live/stream_test_rb/test_live_multi_turn_tool_streaming_openai_apikey_completions_gpt-5_1.yml
- test/fixtures/vcr_cassettes/integration/live/stream_test_rb/test_live_multi_turn_tool_streaming_openai_apikey_responses_gpt-5_4.yml
- test/fixtures/vcr_cassettes/integration/live/stream_test_rb/test_live_multi_turn_tool_streaming_openai_oauth_codex_gpt-5_4.yml
- test/fixtures/vcr_cassettes/integration/live/stream_test_rb/test_live_text_streaming_anthropic_apikey_messages_claude-sonnet-4-20250514.yml
- test/fixtures/vcr_cassettes/integration/live/stream_test_rb/test_live_text_streaming_anthropic_oauth_messages_claude-sonnet-4-20250514.yml
- test/fixtures/vcr_cassettes/integration/live/stream_test_rb/test_live_text_streaming_openai_apikey_completions_gpt-5_1.yml
- test/fixtures/vcr_cassettes/integration/live/stream_test_rb/test_live_text_streaming_openai_apikey_responses_gpt-5_4.yml
- test/fixtures/vcr_cassettes/integration/live/stream_test_rb/test_live_text_streaming_openai_oauth_codex_gpt-5_4.yml
- scripts/generate_handoff_live_fixture.rb
- scripts/generate_handoff_media_fixture.rb
Comments suppressed due to low confidence (1)
-
test/integration/live/stream_test.rb:130
- (test_completeness) Removing `reasoning: "high"` from `multi_turn_tool_stream_test` means reasoning is no longer exercised in this multi-turn test, which reduces test coverage for reasoning behavior in multi-turn scenarios.
| SOURCE_TEST_PATH = File.expand_path("stream_image_test.rb", __dir__) | ||
| FIXTURE_DIR = File.expand_path("../../fixtures/handoff/stream_image_test", __dir__) | ||
|
|
||
| PAIRS = eval(File.read(SOURCE_TEST_PATH).match(/PAIRS = (\[.*?\])\s*\.freeze/m)[1]).freeze |
There was a problem hiding this comment.
Using eval to parse Ruby code from another file is a significant security risk and a fragile design. If the source file's syntax changes slightly, this will break silently or raise cryptic errors. Consider extracting the PAIRS constant into a shared module or configuration file that both tests can require directly.
| SOURCE_TEST_PATH = File.expand_path("stream_reasoning_test.rb", __dir__) | ||
| FIXTURE_DIR = File.expand_path("../../fixtures/handoff/stream_reasoning_test", __dir__) | ||
|
|
||
| PAIRS = eval(File.read(SOURCE_TEST_PATH).match(/PAIRS = (\[.*?\])\s*\.freeze/m)[1]).freeze |
There was a problem hiding this comment.
Using eval to extract the PAIRS constant from a source file is dangerous and brittle. The regex match could fail silently returning nil, causing a NoMethodError on [1]. Extract PAIRS into a shared constant or module instead.
| SOURCE_TEST_PATH = File.expand_path("stream_test.rb", __dir__) | ||
| FIXTURE_DIR = File.expand_path("../../fixtures/handoff/stream_test", __dir__) | ||
|
|
||
| PAIRS = eval(File.read(SOURCE_TEST_PATH).match(/PAIRS = (\[.*?\])\s*\.freeze/m)[1]).freeze |
There was a problem hiding this comment.
Using eval to extract the PAIRS constant from a source file is dangerous and brittle. The regex match could fail silently returning nil, causing a NoMethodError on [1]. Extract PAIRS into a shared constant or module instead.
| pair_name = "#{provider}_#{model}".gsub(/[^A-Za-z0-9_.-]+/, "_") | ||
| path = File.join(fixture_dir, "#{pair_name}.json") | ||
| payload = File.exist?(path) ? JSON.parse(File.read(path)) : {} | ||
| payload[name.sub(/^test_/, "")] = jsonable_live_result(result) |
There was a problem hiding this comment.
The name method here refers to the test instance's name (e.g., test_handoff_stream__openai_...). The sub(/^test_/, "") strips the prefix, but if the test name contains characters that are not valid JSON keys or filesystem-safe, this could cause issues. Consider sanitizing the key similarly to how pair_name is sanitized.
|
|
||
| pair_name = "#{provider}_#{model}".gsub(/[^A-Za-z0-9_.-]+/, "_") | ||
| path = File.join(fixture_dir, "#{pair_name}.json") | ||
| payload = File.exist?(path) ? JSON.parse(File.read(path)) : {} |
There was a problem hiding this comment.
The fixture file is read and parsed every time record_live_handoff_result is called within the same test run. If multiple tests for the same pair run concurrently, there is a race condition where one test could overwrite another's results. Consider whether this is a concern in CI environments.
| when Hash | ||
| value.each_with_object({}) { |(key, item), acc| acc[key.to_s] = jsonable_live_result(item) } | ||
| else | ||
| value.respond_to?(:to_h) ? jsonable_live_result(value.to_h) : value |
There was a problem hiding this comment.
value.respond_to?(:to_h) will match many Ruby objects (e.g., Array responds to to_h in Ruby 2.1+), potentially causing unexpected recursion or data loss. Consider being more explicit about which types should be converted via to_h (e.g., checking for a specific interface or using is_a? checks).
|
|
||
| text = response.content.select { |block| block.type == "text" }.map(&:text).join(" ").downcase | ||
| assert_includes text, records.length.to_s | ||
| assert_includes text, "42" |
There was a problem hiding this comment.
The assertion assert_includes text, "42" is very likely to produce false positives since "42" is a common substring. For example, if the response contains "142" or "420", the assertion would still pass. Consider asserting on the full expected value in context, or use a more specific check.
simplify and modernize the provider integration surface, especially around streaming handoff tests, while cleaning up legacy gateway/test code and improving provider option mapping.