refactor: handoff test by billybonks · Pull Request #56 · Hyper-Unearthing/llm_gateway

billybonks · 2026-05-18T13:47:05Z

simplify and modernize the provider integration surface, especially around streaming handoff tests, while cleaning up legacy gateway/test code and improving provider option mapping.

all modern llms should support stream test but they may not do reasoning since we have a reasoning test we can do that seperately

i want the hand off tests to automatically test the ouput of our generated pairs in order to do that we record the result and the handoff test will take all the results and test all the messages to each pair

currently if we add a provider we do a lot of steps in testing to add them the existing test and include them ot the handoff test this pr makes it so that the handoff test relies on the provider test

managerbot-app

Reviewed Changes

Managerbot reviewed 9 out of 65 changed files in this pull request and generated 8 comments.

Files not reviewed (56)

.pi/skills/live-provider-testing/SKILL.md
test/fixtures/handoff/stream_image_test/anthropic_apikey_messages_claude-sonnet-4-20250514.json
test/fixtures/handoff/stream_image_test/anthropic_oauth_messages_claude-sonnet-4-20250514.json
test/fixtures/handoff/stream_image_test/openai_apikey_completions_gpt-5.1.json
test/fixtures/handoff/stream_image_test/openai_apikey_responses_gpt-5.4.json
test/fixtures/handoff/stream_image_test/openai_oauth_codex_gpt-5.4.json
test/fixtures/handoff/stream_reasoning_test/anthropic_apikey_messages_claude-sonnet-4-20250514.json
test/fixtures/handoff/stream_reasoning_test/anthropic_oauth_messages_claude-sonnet-4-20250514.json
test/fixtures/handoff/stream_reasoning_test/openai_apikey_completions_gpt-5.1.json
test/fixtures/handoff/stream_reasoning_test/openai_apikey_responses_gpt-5.4.json
test/fixtures/handoff/stream_reasoning_test/openai_oauth_codex_gpt-5.4.json
test/fixtures/handoff/stream_test/anthropic_apikey_messages_claude-sonnet-4-20250514.json
test/fixtures/handoff/stream_test/anthropic_oauth_messages_claude-sonnet-4-20250514.json
test/fixtures/handoff/stream_test/openai_apikey_completions_gpt-5.1.json
test/fixtures/handoff/stream_test/openai_apikey_responses_gpt-5.4.json
test/fixtures/handoff/stream_test/openai_oauth_codex_gpt-5.4.json
test/fixtures/vcr_cassettes/integration/live/handoff_media_test_rb/test_live_handoff_media_anthropic_apikey_messages_claude-sonnet-4-20250514.yml
test/fixtures/vcr_cassettes/integration/live/handoff_media_test_rb/test_live_handoff_media_openai_apikey_completions_gpt-5_1.yml
test/fixtures/vcr_cassettes/integration/live/handoff_media_test_rb/test_live_handoff_media_openai_apikey_responses_gpt-5_4.yml
test/fixtures/vcr_cassettes/integration/live/handoff_media_test_rb/test_live_handoff_media_openai_oauth_codex_gpt-5_4.yml
test/fixtures/vcr_cassettes/integration/live/handoff_stream_image_test_rb/test_handoff_stream_image__anthropic_apikey_messages_claude-sonnet-4-20250514.yml
test/fixtures/vcr_cassettes/integration/live/handoff_stream_image_test_rb/test_handoff_stream_image__anthropic_oauth_messages_claude-sonnet-4-20250514.yml
test/fixtures/vcr_cassettes/integration/live/handoff_stream_image_test_rb/test_handoff_stream_image__openai_apikey_completions_gpt-5_1.yml
test/fixtures/vcr_cassettes/integration/live/handoff_stream_image_test_rb/test_handoff_stream_image__openai_apikey_responses_gpt-5_4.yml
test/fixtures/vcr_cassettes/integration/live/handoff_stream_image_test_rb/test_handoff_stream_image__openai_oauth_codex_gpt-5_4.yml
test/fixtures/vcr_cassettes/integration/live/handoff_stream_reasoning_test_rb/test_handoff_stream_reasoning__anthropic_apikey_messages_claude-sonnet-4-20250514.yml
test/fixtures/vcr_cassettes/integration/live/handoff_stream_reasoning_test_rb/test_handoff_stream_reasoning__anthropic_oauth_messages_claude-sonnet-4-20250514.yml
test/fixtures/vcr_cassettes/integration/live/handoff_stream_reasoning_test_rb/test_handoff_stream_reasoning__openai_apikey_completions_gpt-5_1.yml
test/fixtures/vcr_cassettes/integration/live/handoff_stream_reasoning_test_rb/test_handoff_stream_reasoning__openai_apikey_responses_gpt-5_4.yml
test/fixtures/vcr_cassettes/integration/live/handoff_stream_reasoning_test_rb/test_handoff_stream_reasoning__openai_oauth_codex_gpt-5_4.yml
test/fixtures/vcr_cassettes/integration/live/handoff_stream_test_rb/test_handoff_stream__anthropic_apikey_messages_claude-sonnet-4-20250514.yml
test/fixtures/vcr_cassettes/integration/live/handoff_stream_test_rb/test_handoff_stream__anthropic_oauth_messages_claude-sonnet-4-20250514.yml
test/fixtures/vcr_cassettes/integration/live/handoff_stream_test_rb/test_handoff_stream__openai_apikey_completions_gpt-5_1.yml
test/fixtures/vcr_cassettes/integration/live/handoff_stream_test_rb/test_handoff_stream__openai_apikey_responses_gpt-5_4.yml
test/fixtures/vcr_cassettes/integration/live/handoff_stream_test_rb/test_handoff_stream__openai_oauth_codex_gpt-5_4.yml
test/fixtures/vcr_cassettes/integration/live/handoff_test_rb/test_live_handoff_anthropic_apikey_messages_claude-sonnet-4-20250514.yml
test/fixtures/vcr_cassettes/integration/live/handoff_test_rb/test_live_handoff_openai_apikey_completions_gpt-5_1.yml
test/fixtures/vcr_cassettes/integration/live/handoff_test_rb/test_live_handoff_openai_apikey_responses_gpt-5_4.yml
test/fixtures/vcr_cassettes/integration/live/handoff_test_rb/test_live_handoff_openai_oauth_codex_gpt-5_4.yml
test/fixtures/vcr_cassettes/integration/live/stream_test_rb/test_live_basic_tool_call_anthropic_apikey_messages_claude-sonnet-4-20250514.yml
test/fixtures/vcr_cassettes/integration/live/stream_test_rb/test_live_basic_tool_call_anthropic_oauth_messages_claude-sonnet-4-20250514.yml
test/fixtures/vcr_cassettes/integration/live/stream_test_rb/test_live_basic_tool_call_openai_apikey_completions_gpt-5_1.yml
test/fixtures/vcr_cassettes/integration/live/stream_test_rb/test_live_basic_tool_call_openai_apikey_responses_gpt-5_4.yml
test/fixtures/vcr_cassettes/integration/live/stream_test_rb/test_live_basic_tool_call_openai_oauth_codex_gpt-5_4.yml
test/fixtures/vcr_cassettes/integration/live/stream_test_rb/test_live_multi_turn_tool_streaming_anthropic_apikey_messages_claude-sonnet-4-20250514.yml
test/fixtures/vcr_cassettes/integration/live/stream_test_rb/test_live_multi_turn_tool_streaming_anthropic_oauth_messages_claude-sonnet-4-20250514.yml
test/fixtures/vcr_cassettes/integration/live/stream_test_rb/test_live_multi_turn_tool_streaming_openai_apikey_completions_gpt-5_1.yml
test/fixtures/vcr_cassettes/integration/live/stream_test_rb/test_live_multi_turn_tool_streaming_openai_apikey_responses_gpt-5_4.yml
test/fixtures/vcr_cassettes/integration/live/stream_test_rb/test_live_multi_turn_tool_streaming_openai_oauth_codex_gpt-5_4.yml
test/fixtures/vcr_cassettes/integration/live/stream_test_rb/test_live_text_streaming_anthropic_apikey_messages_claude-sonnet-4-20250514.yml
test/fixtures/vcr_cassettes/integration/live/stream_test_rb/test_live_text_streaming_anthropic_oauth_messages_claude-sonnet-4-20250514.yml
test/fixtures/vcr_cassettes/integration/live/stream_test_rb/test_live_text_streaming_openai_apikey_completions_gpt-5_1.yml
test/fixtures/vcr_cassettes/integration/live/stream_test_rb/test_live_text_streaming_openai_apikey_responses_gpt-5_4.yml
test/fixtures/vcr_cassettes/integration/live/stream_test_rb/test_live_text_streaming_openai_oauth_codex_gpt-5_4.yml
scripts/generate_handoff_live_fixture.rb
scripts/generate_handoff_media_fixture.rb

Comments suppressed due to low confidence (1)

test/integration/live/stream_test.rb:130

(test_completeness) Removing `reasoning: "high"` from `multi_turn_tool_stream_test` means reasoning is no longer exercised in this multi-turn test, which reduces test coverage for reasoning behavior in multi-turn scenarios.

managerbot-app · 2026-05-18T13:48:33Z

+  SOURCE_TEST_PATH = File.expand_path("stream_image_test.rb", __dir__)
+  FIXTURE_DIR = File.expand_path("../../fixtures/handoff/stream_image_test", __dir__)
+
+  PAIRS = eval(File.read(SOURCE_TEST_PATH).match(/PAIRS = (\[.*?\])\s*\.freeze/m)[1]).freeze


Using eval to parse Ruby code from another file is a significant security risk and a fragile design. If the source file's syntax changes slightly, this will break silently or raise cryptic errors. Consider extracting the PAIRS constant into a shared module or configuration file that both tests can require directly.

managerbot-app · 2026-05-18T13:48:33Z

+  SOURCE_TEST_PATH = File.expand_path("stream_reasoning_test.rb", __dir__)
+  FIXTURE_DIR = File.expand_path("../../fixtures/handoff/stream_reasoning_test", __dir__)
+
+  PAIRS = eval(File.read(SOURCE_TEST_PATH).match(/PAIRS = (\[.*?\])\s*\.freeze/m)[1]).freeze


Using eval to extract the PAIRS constant from a source file is dangerous and brittle. The regex match could fail silently returning nil, causing a NoMethodError on [1]. Extract PAIRS into a shared constant or module instead.

managerbot-app · 2026-05-18T13:48:33Z

+  SOURCE_TEST_PATH = File.expand_path("stream_test.rb", __dir__)
+  FIXTURE_DIR = File.expand_path("../../fixtures/handoff/stream_test", __dir__)
+
+  PAIRS = eval(File.read(SOURCE_TEST_PATH).match(/PAIRS = (\[.*?\])\s*\.freeze/m)[1]).freeze


Using eval to extract the PAIRS constant from a source file is dangerous and brittle. The regex match could fail silently returning nil, causing a NoMethodError on [1]. Extract PAIRS into a shared constant or module instead.

managerbot-app · 2026-05-18T13:48:33Z

+    pair_name = "#{provider}_#{model}".gsub(/[^A-Za-z0-9_.-]+/, "_")
+    path = File.join(fixture_dir, "#{pair_name}.json")
+    payload = File.exist?(path) ? JSON.parse(File.read(path)) : {}
+    payload[name.sub(/^test_/, "")] = jsonable_live_result(result)


The name method here refers to the test instance's name (e.g., test_handoff_stream__openai_...). The sub(/^test_/, "") strips the prefix, but if the test name contains characters that are not valid JSON keys or filesystem-safe, this could cause issues. Consider sanitizing the key similarly to how pair_name is sanitized.

managerbot-app · 2026-05-18T13:48:33Z

+
+    pair_name = "#{provider}_#{model}".gsub(/[^A-Za-z0-9_.-]+/, "_")
+    path = File.join(fixture_dir, "#{pair_name}.json")
+    payload = File.exist?(path) ? JSON.parse(File.read(path)) : {}


The fixture file is read and parsed every time record_live_handoff_result is called within the same test run. If multiple tests for the same pair run concurrently, there is a race condition where one test could overwrite another's results. Consider whether this is a concern in CI environments.

managerbot-app · 2026-05-18T13:48:33Z

+    when Hash
+      value.each_with_object({}) { |(key, item), acc| acc[key.to_s] = jsonable_live_result(item) }
+    else
+      value.respond_to?(:to_h) ? jsonable_live_result(value.to_h) : value


value.respond_to?(:to_h) will match many Ruby objects (e.g., Array responds to to_h in Ruby 2.1+), potentially causing unexpected recursion or data loss. Consider being more explicit about which types should be converted via to_h (e.g., checking for a specific interface or using is_a? checks).

managerbot-app · 2026-05-18T13:48:33Z

+
+    text = response.content.select { |block| block.type == "text" }.map(&:text).join(" ").downcase
+    assert_includes text, records.length.to_s
+    assert_includes text, "42"


The assertion assert_includes text, "42" is very likely to produce false positives since "42" is a common substring. For example, if the response contains "142" or "420", the assertion would still pass. Consider asserting on the full expected value in context, or use a more specific check.

billybonks added 4 commits May 18, 2026 21:20

refactor(tests): stream test should be focused on basic features

927dbcc

all modern llms should support stream test but they may not do reasoning since we have a reasoning test we can do that seperately

feat(test): prepare stream test for new handoff tests

51e8627

i want the hand off tests to automatically test the ouput of our generated pairs in order to do that we record the result and the handoff test will take all the results and test all the messages to each pair

refactor(tests): improve handoff tests

3b89be9

currently if we add a provider we do a lot of steps in testing to add them the existing test and include them ot the handoff test this pr makes it so that the handoff test relies on the provider test

feat(developer) add skill on provider testing

417bdd2

managerbot-app Bot reviewed May 18, 2026

View reviewed changes

billybonks merged commit 21bbcd1 into main May 18, 2026
1 check passed

billybonks deleted the refactor/handoff-test branch May 18, 2026 13:48

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor: handoff test#56

refactor: handoff test#56
billybonks merged 4 commits into
mainfrom
refactor/handoff-test

billybonks commented May 18, 2026

Uh oh!

managerbot-app Bot left a comment

Uh oh!

managerbot-app Bot May 18, 2026

Uh oh!

managerbot-app Bot May 18, 2026

Uh oh!

managerbot-app Bot May 18, 2026

Uh oh!

managerbot-app Bot May 18, 2026

Uh oh!

managerbot-app Bot May 18, 2026

Uh oh!

managerbot-app Bot May 18, 2026

Uh oh!

managerbot-app Bot May 18, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

billybonks commented May 18, 2026

Uh oh!

managerbot-app Bot left a comment

Choose a reason for hiding this comment

Reviewed Changes

Uh oh!

managerbot-app Bot May 18, 2026

Choose a reason for hiding this comment

Uh oh!

managerbot-app Bot May 18, 2026

Choose a reason for hiding this comment

Uh oh!

managerbot-app Bot May 18, 2026

Choose a reason for hiding this comment

Uh oh!

managerbot-app Bot May 18, 2026

Choose a reason for hiding this comment

Uh oh!

managerbot-app Bot May 18, 2026

Choose a reason for hiding this comment

Uh oh!

managerbot-app Bot May 18, 2026

Choose a reason for hiding this comment

Uh oh!

managerbot-app Bot May 18, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant