Add text-only testing framework for agents #364
Text-only testing SDK for Vision-Agents agents. Includes TestSession, fluent assertion API (RunResult/RunAssert), LLM-based judge, mock_tools, unit tests, integration test examples for 00_example and 01_simple_agent_example.
Covers all public classes, methods, event types, recommended patterns, and architecture rationale for testing at the LLM level.
Replace cursor-based assertion classes (RunAssert, EventAssert, etc.) with scenario-style methods on TestEval: user_says, agent_calls, agent_responds, no_more_events. ~500 lines removed.
Align with core LLM method naming and clarify intent.
TestEval now only handles lifecycle and LLM communication. TestResponse holds data (output, events, function_calls, duration_ms) and assertion methods (agent_calls, judge, no_more_events).
Reflect current API: TestResponse with function_called/function_output/judge, simple_response returns TestResponse, assertions on response not session.
Add table of contents, align tables, consolidate architecture section, add second quick start example with tool calls.
- Remove unused TestSession alias from __init__.py
- Remove duplicate docstring, unused logging/os imports from _session.py
- Deduplicate _evals_verbose (import from _run_result instead of redefining)
- Fix class docstring/__test__ ordering in TestEval and TestResponse
- Add -> None return type to _on_tool_start and _on_tool_end
- Make _advance_to_type generic via TypeVar, remove redundant assert isinstance()
- Remove unreachable RuntimeError, annotate _raise_with_debug_info as NoReturn
- Remove dead else branch in _format_events
- Replace except Exception with specific exceptions in _judge.py
- Remove from __future__ import annotations, use quoted forward refs
- Move evaluate_intent import to module level in _run_result.py
Drop VISION_AGENTS_EVALS_VERBOSE env var, associated print() calls, and _evals_verbose flag — premature for v1, avoids documenting and potentially deprecating a public env variable. Remove from __future__ import annotations from _events.py and _mock_tools.py (no forward references, Python 3.10+).
Replace two identical branches with a _VERDICTS mapping lookup, removing duplicated string slicing logic.
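The mapping-lookup refactor described above can be sketched in miniature; the names (`_VERDICTS`, `parse_verdict`) mirror the commit, but the bodies and the `PASS:`/`FAIL:` reply format are assumptions, not the repository's code:

```python
# Two identical branches that each sliced the prefix off a "PASS: ..." /
# "FAIL: ..." reply collapse into a single loop over a verdict mapping.
_VERDICTS: dict[str, bool] = {"PASS": True, "FAIL": False}


def parse_verdict(reply: str) -> tuple[bool, str]:
    """Map a 'PASS: reason' / 'FAIL: reason' reply to (success, reason)."""
    for prefix, success in _VERDICTS.items():
        if reply.startswith(prefix):
            # The prefix/separator stripping now lives in one place.
            reason = reply[len(prefix):].lstrip(":").strip()
            return success, reason
    raise ValueError(f"Unparseable judge reply: {reply!r}")


print(parse_verdict("PASS: greeting matches intent"))
```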
Extract _truncate, _format_event, _format_events as static methods on TestResponse — they only serve _raise_with_debug_info. Move magic numbers into documented module-level constants.
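A minimal sketch of that extraction, with the magic number promoted to a documented module-level constant; the constant's value and the method body are illustrative, not taken from the PR:

```python
# Maximum characters of event content shown in debug output before clipping.
_PREVIEW_MAX_LEN = 80


class TestResponse:
    @staticmethod
    def _truncate(text: str) -> str:
        """Clip text to _PREVIEW_MAX_LEN, appending '...' when clipped."""
        if len(text) <= _PREVIEW_MAX_LEN:
            return text
        return text[:_PREVIEW_MAX_LEN] + "..."


print(TestResponse._truncate("x" * 100))
```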
Docs live in a separate repo, so the in-package README is redundant. Move essential usage examples and key exports summary into the module docstring where help() and IDE tooltips can surface them.
Note: Reviews paused. This branch is under active development, so CodeRabbit has automatically paused this review to avoid an influx of comments from new commits.
📝 Walkthrough: Adds a new vision_agents.testing package providing event dataclasses, a TestSession async test harness, TestResponse assertion helpers, an LLM-based intent judge, a mock_tools context manager, updated example wiring, and unit/integration tests for the testing primitives.
Sequence diagram:

```mermaid
sequenceDiagram
    participant User as User
    participant TestSession as TestSession
    participant LLM as LLM
    participant EventMgr as EventManager
    participant Conv as InMemoryConversation
    participant Judge as JudgeLLM
    User->>TestSession: __aenter__/start()
    TestSession->>LLM: apply test instructions
    TestSession->>EventMgr: subscribe ToolStart/ToolEnd
    TestSession->>Conv: attach conversation
    User->>TestSession: simple_response("text")
    TestSession->>LLM: send user message
    LLM->>EventMgr: emit ToolStart / ToolEnd
    EventMgr->>TestSession: _on_tool_start / _on_tool_end
    TestSession->>TestSession: record FunctionCallEvent & FunctionCallOutputEvent
    TestSession->>Conv: append assistant message
    TestSession-->>User: return TestResponse (events, output, duration, judge)
    User->>TestResponse: .function_called / .function_output / .judge
    alt .judge uses JudgeLLM
        TestResponse->>Judge: evaluate_intent(message, intent)
        Judge-->>TestResponse: PASS/FAIL verdict
    end
    User->>TestSession: __aexit__/close()
    TestSession->>EventMgr: unsubscribe events
    TestSession->>LLM: restore original instructions
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 2 passed, ❌ 1 failed (warning)
Switch test imports from internal modules (_events, _run_result, _mock_tools) to the public vision_agents.testing API. Replace direct _advance_to_type() call with function_output() in test_explicit_output_check.
Actionable comments posted: 2
🧹 Nitpick comments (14)
examples/00_example/agent.py (1)
1-4: Add module-level logger

The file is missing a `logging` import and `logger = logging.getLogger(__name__)`, which is required for all Python modules per the coding guidelines.

♻️ Proposed addition

```diff
+import logging
+
 from dotenv import load_dotenv
 from vision_agents.core import Agent, AgentLauncher, User, Runner
 from vision_agents.plugins import getstream, gemini
+
+logger = logging.getLogger(__name__)
```

As per coding guidelines: "Use module-level `logger = logging.getLogger(__name__)`."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/00_example/agent.py` around lines 1 - 4, Add a module-level logger by importing the logging module and creating logger = logging.getLogger(__name__) at the top of the module (near the other imports) so this file (which defines/uses Agent, AgentLauncher, User, Runner and imports getstream/gemini) follows the project coding guidelines for module-level logging.

examples/01_simple_agent_example/simple_agent_example.py (1)
2-2: Replace deprecated `Dict` with built-in `dict`

`Dict` from `typing` is deprecated since Python 3.9. The guideline requires modern generic syntax.

♻️ Proposed fix

```diff
-from typing import Any, Dict
+from typing import Any

-async def get_weather(location: str) -> Dict[str, Any]:
+async def get_weather(location: str) -> dict[str, Any]:
```

As per coding guidelines: "Use modern syntax: ... `dict[str, T]` generics".

Also applies to: 38-38
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/01_simple_agent_example/simple_agent_example.py` at line 2, Replace the deprecated typing.Dict usage with the built-in generic dict: update the import line to remove Dict (keep Any if used) and change type annotations like Dict[...] to dict[...] (e.g., in simple_agent_example.py replace any Dict[str, Any] or Dict[...] at the import and at the usage on line 38 with dict[str, Any] or the appropriate dict[...] form); ensure all occurrences of Dict are removed from imports and replaced in annotations while preserving Any and other types.

tests/test_testing/test_eval.py (1)
168-172: Prefer the public API over direct `_cursor` mutation

`response._cursor = 1` reaches into the private state of `TestResponse`. Use the public assertion API to advance the cursor instead, which also makes the test's intent clearer.

♻️ Proposed refactor

```diff
-    def test_pass_at_end(self):
+    async def test_pass_at_end(self):
         response = _make_response(_simple_events())
-        response._cursor = 1
+        await response.judge()  # consumes the only event, advances cursor past it
         response.no_more_events()
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_testing/test_eval.py` around lines 168 - 172, Replace the direct private-state mutation response._cursor = 1 with the public API that advances or asserts consumption on TestResponse; instead of setting _cursor directly in test_pass_at_end, call the appropriate public method on response (for example response.consume_event() or response.assert_events_consumed(1) / response.advance(n) depending on the available API) so the test uses _make_response/_simple_events and response.no_more_events() without touching private attributes.

agents-core/vision_agents/testing/_run_result.py (3)
220-232: `_format_event` for `ChatMessageEvent` truncates without `_truncate()`, losing the `...` suffix

Line 223 uses `event.content[:_PREVIEW_MAX_LEN]` directly, while `_format_event` for `FunctionCallOutputEvent` (line 230) delegates to `_truncate()`, which appends `"..."` when the text exceeds the limit. Using `_truncate` consistently avoids confusing debug output where a long chat message is silently clipped without any visual indicator.

♻️ Suggested fix

```diff
     if isinstance(event, ChatMessageEvent):
-        preview = event.content[:_PREVIEW_MAX_LEN].replace("\n", "\\n")
+        preview = TestResponse._truncate(
+            event.content.replace("\n", "\\n")
+        )
         return f"ChatMessageEvent(role='{event.role}', content='{preview}')"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@agents-core/vision_agents/testing/_run_result.py` around lines 220 - 232, The ChatMessageEvent branch in TestResponse._format_event currently slices event.content directly (event.content[:_PREVIEW_MAX_LEN]) which silently truncates without the "..." suffix; change it to call TestResponse._truncate(event.content) (and still replace "\n" with "\\n" on the truncated result) so that long chat messages get the same "..." indicator as FunctionCallOutputEvent; update the ChatMessageEvent handling in _format_event to use TestResponse._truncate and preserve event.role and newline escaping.
195-206: `_advance_to_type` silently skips non-matching events — document this behavior

The while loop on line 198 advances past any event that doesn't match `expected_type`. If a user calls `function_called()` but a `ChatMessageEvent` precedes the `FunctionCallEvent`, it is silently consumed. This is consistent with the cursor-based design, but a brief note in the class docstring (or the method docstring) would help users understand they cannot "go back" to skipped events.
Verify each finding against the current code and only fix it if needed. In `@agents-core/vision_agents/testing/_run_result.py` around lines 195 - 206, The _advance_to_type method silently consumes non-matching events while advancing the internal cursor, which can cause earlier events (e.g., a ChatMessageEvent) to be skipped before a later FunctionCallEvent; update the documentation to make this behavior explicit by adding a sentence to the class docstring or the _advance_to_type docstring that states the method advances the cursor forward, skips any events that don't match expected_type, and that skipped events cannot be revisited (no backtracking), and reference the method name _advance_to_type and the cursor semantics (self._cursor) so callers know to check event order before calling helpers like function_called().
24-40: `_judge_llm` typed as `Any` — consider `LLM | None` for type safety

The field is always either an `LLM` instance or `None`. Typing it as `Any` suppresses type-checker feedback on misuse. If avoiding a circular import is the concern, a `TYPE_CHECKING`-guarded import would let you annotate precisely.

As per coding guidelines, "Use type annotations everywhere. Use modern syntax: `X | Y` unions, `dict[str, T]` generics".
Verify each finding against the current code and only fix it if needed. In `@agents-core/vision_agents/testing/_run_result.py` around lines 24 - 40, The _judge_llm field is currently typed as Any—change it to a precise LLM | None annotation to restore type-safety: add a TYPE_CHECKING-guarded import (from typing import TYPE_CHECKING; if TYPE_CHECKING: from <appropriate_module> import LLM) to avoid circular imports, then update the TestResponse field declaration from "_judge_llm: Any = field(default=None, repr=False)" to "_judge_llm: LLM | None = field(default=None, repr=False)" (keeping default and repr settings intact) so type checkers see the correct type while runtime imports remain safe.

agents-core/vision_agents/testing/_session.py (3)
1-15: Imports that serve only type annotations could be guarded under `TYPE_CHECKING`

`EventManager` (line 5) and `RunEvent` (line 13) are used solely for type hints. Per coding guidelines, use the `TYPE_CHECKING` guard for imports only needed by type annotations. This would require adding `from __future__ import annotations` to defer evaluation.

As per coding guidelines, "Use `TYPE_CHECKING` guard for imports only needed by type annotations".
Verify each finding against the current code and only fix it if needed. In `@agents-core/vision_agents/testing/_session.py` around lines 1 - 15, Imports used only for type annotations (EventManager and RunEvent) should be guarded by TYPE_CHECKING and defer evaluation: add "from __future__ import annotations" at the top, import "from typing import TYPE_CHECKING", then move the EventManager and RunEvent imports into an "if TYPE_CHECKING:" block; update any references to those symbols (EventManager, RunEvent) so they remain as forward-referenced types. Ensure runtime behavior is unchanged and only annotation-only imports are moved.
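For readers unfamiliar with the pattern, here is a self-contained sketch; it uses quoted forward references (an alternative to `from __future__ import annotations` with the same effect for the guarded import), and the stdlib `Decimal` stands in for the project's annotation-only imports:

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Imported only while type checking; invisible at runtime.
    from decimal import Decimal


def halve(value: "Decimal") -> "Decimal":
    # Quoted forward references keep the runtime free of the import
    # while type checkers still resolve the name from the guarded block.
    return value / 2


# The annotation is stored as a plain string, so no import ever happens.
print(halve.__annotations__["value"])
```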
125-126: Hardcoded 5-second event wait timeout.
`await self._event_manager.wait(timeout=5.0)` uses a fixed timeout that may be too tight for slow LLM providers or unnecessarily long for fast mocks. Consider making it configurable via `__init__` or a `simple_response` parameter with a sensible default.
Verify each finding against the current code and only fix it if needed. In `@agents-core/vision_agents/testing/_session.py` around lines 125 - 126, The hardcoded 5.0s in await self._event_manager.wait(timeout=5.0) should be made configurable: add an event wait timeout parameter (e.g., event_wait_timeout: float = 5.0) to the class __init__ (or to the simple_response entry point if more appropriate), store it as self._event_wait_timeout, and replace the literal 5.0 in the _event_manager.wait call with self._event_wait_timeout; update any instantiations/tests that rely on the previous behavior to pass a shorter timeout for mocks or leave the default for real providers.
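A minimal sketch of the configurable-timeout suggestion, with the session internals stubbed out; only the timeout plumbing is shown, and the parameter name is the one proposed in the comment:

```python
import asyncio


class TestSession:
    def __init__(self, event_wait_timeout: float = 5.0) -> None:
        # Stored once; every wait call reads this instead of a literal 5.0.
        self._event_wait_timeout = event_wait_timeout

    async def _wait_for_events(self) -> float:
        # Stand-in for `await self._event_manager.wait(timeout=...)`.
        await asyncio.sleep(0)
        return self._event_wait_timeout


session = TestSession(event_wait_timeout=0.5)  # shorter timeout for fast mocks
print(asyncio.run(session._wait_for_events()))
```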
76-85: `close()` does not reset `_conversation` or `_event_manager`

After `close()`, `_started` is `False`, so `start()` can run again, and `start()` unconditionally reassigns `_conversation` and `_event_manager` — so re-entry works and there is no bug here. However, holding references to stale objects after close may keep resources alive longer than necessary.

♻️ Optional cleanup

```diff
 async def close(self) -> None:
     """Clean up resources."""
     if not self._started:
         return
     if self._event_manager is not None:
         self._event_manager.unsubscribe(self._on_tool_start)
         self._event_manager.unsubscribe(self._on_tool_end)
+    self._event_manager = None
+    self._conversation = None
+    self._captured_events.clear()
     self._started = False
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@agents-core/vision_agents/testing/_session.py` around lines 76 - 85, The close() method unsubscribes handlers and flips _started but leaves references to _conversation and _event_manager, which can keep resources alive; update close() to also set self._conversation = None and self._event_manager = None (after unsubscribing _on_tool_start/_on_tool_end) so the session releases stale objects and can fully clean up before a restart; retain existing unsubscribe logic for _on_tool_start and _on_tool_end and only nullify after those calls.

examples/00_example/test_agent.py (4)
30-44: Repetitive LLM/judge instantiation across tests — consider a fixture

Every test creates `gemini.LLM(MODEL)` identically for both the agent LLM and the judge. A function-scoped fixture would DRY this up:

♻️ Sketch

```python
@pytest.fixture
def llm():
    return gemini.LLM(MODEL)


@pytest.fixture
def judge_llm():
    return gemini.LLM(MODEL)
```

Also applies to: 47-63, 66-82, 85-105, 108-139, 142-160
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/00_example/test_agent.py` around lines 30 - 44, Multiple tests repeatedly instantiate gemini.LLM(MODEL); introduce pytest fixtures to DRY this up by creating a function-scoped fixture (e.g., llm) that returns gemini.LLM(MODEL) and another fixture (e.g., judge_llm) for the judge LLM, then update tests like test_greeting to accept llm and judge_llm as parameters and pass those into TestEval instead of constructing gemini.LLM(MODEL) inline; ensure fixtures are imported/defined in the test module so all tests that use TestEval (and other tests noted) reuse the fixtures.
129-131: The generator-throw lambda is fragile and hard to read

`lambda location: (_ for _ in ()).throw(RuntimeError(...))` is a clever trick but obscure. A plain `async def` is clearer and also matches the async signature of the original tool, avoiding any sync/async mismatch:

♻️ Suggested alternative

```diff
-    with mock_tools(
-        llm,
-        {
-            "get_weather": lambda location: (_ for _ in ()).throw(
-                RuntimeError("Service unavailable")
-            )
-        },
-    ):
+    async def _failing_weather(location: str) -> dict[str, str]:
+        raise RuntimeError("Service unavailable")
+
+    with mock_tools(llm, {"get_weather": _failing_weather}):
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/00_example/test_agent.py` around lines 129 - 131, Replace the obscure generator-throw lambda used for "get_weather" with an async function that simply raises RuntimeError to match the original tool's async signature; locate the "get_weather" entry in the test agent setup (the lambda: (_ for _ in ()).throw(RuntimeError("Service unavailable"))) and convert it to an async def get_weather(...) that raises RuntimeError("Service unavailable") so the stub is readable and has the correct async behavior.
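For readers unfamiliar with the trick, a runnable comparison of both forms (tool names illustrative): `.throw()` on a fresh generator raises the exception immediately in the caller, so the lambda raises synchronously, while the `async def` raises on await, matching the original tool's signature:

```python
import asyncio

# The obscure form: calling the lambda throws into a just-created
# generator, which propagates the RuntimeError to the caller.
failing_sync = lambda location: (_ for _ in ()).throw(
    RuntimeError("Service unavailable")
)


# The clearer form: an async function that simply raises when awaited.
async def failing_weather(location: str) -> dict[str, str]:
    raise RuntimeError("Service unavailable")


for call in (lambda: failing_sync("Berlin"),
             lambda: asyncio.run(failing_weather("Berlin"))):
    try:
        call()
    except RuntimeError as exc:
        print(exc)
```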
25-27: Consider using a pytest fixture or `skipif` for the API key check

Every test calls `_skip_if_no_key()` manually. A session-scoped fixture or `pytest.mark.skipif` at module level would reduce repetition and ensure the skip can't be accidentally omitted in a new test.

♻️ Suggested alternative

```diff
+@pytest.fixture(autouse=True)
+def _require_google_api_key():
+    if not os.getenv("GOOGLE_API_KEY"):
+        pytest.skip("GOOGLE_API_KEY not set")
+
-def _skip_if_no_key():
-    if not os.getenv("GOOGLE_API_KEY"):
-        pytest.skip("GOOGLE_API_KEY not set")
```

Then remove the `_skip_if_no_key()` calls from each test body.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/00_example/test_agent.py` around lines 25 - 27, Replace the manual _skip_if_no_key() calls with a centralized pytest skip so tests can't forget to check the env var: either add a module-level pytestmark = pytest.mark.skipif(not os.getenv("GOOGLE_API_KEY"), reason="GOOGLE_API_KEY not set") or create a session-scoped fixture (e.g., require_google_api_key) that checks os.getenv("GOOGLE_API_KEY") and calls pytest.skip(...) if missing, then remove all calls to _skip_if_no_key() from individual test functions; keep references to the existing helper name _skip_if_no_key in case you want to deprecate/redirect it to the fixture for backwards compatibility.
93-94: Use parameterized generic `dict` in return type annotations

The coding guidelines require modern type syntax with full generics. `dict` should be `dict[str, Any]` (or more specific) for the return types.

As per coding guidelines, "Use modern syntax: `X | Y` unions, `dict[str, T]` generics, full `Callable` signatures".

Also applies to: 118-119
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/00_example/test_agent.py` around lines 93 - 94, The return type for get_weather uses an unparameterized dict; update its annotation to a parameterized generic such as dict[str, Any] (or a more specific mapping like dict[str, int | str]) and import Any from typing; also find any other functions in this file that currently return bare dict (e.g., the later async function returning a dict) and update their return annotations similarly to use dict[str, Any] or a more specific type to follow the modern typing guidelines.

agents-core/vision_agents/testing/_judge.py (1)
59-60: Add public getter for `instructions` to the `LLM` class to avoid accessing private `_instructions` directly

Line 59 reads `llm._instructions` (a private attribute) and stores it to restore later. The `LLM` class provides `set_instructions()` as a public method but has no getter — requiring this code to reach into internal state. Add a read-only property `instructions` to the `LLM` base class:

```python
@property
def instructions(self) -> str:
    """Get the current instructions."""
    return self._instructions
```

Then replace line 59 with `original_instructions = llm.instructions`. This maintains the same semantics while respecting encapsulation.
Verify each finding against the current code and only fix it if needed. In `@agents-core/vision_agents/testing/_judge.py` around lines 59 - 60, Add a read-only public property instructions to the LLM class that returns the internal _instructions (i.e., implement property instructions -> return self._instructions) and then update the caller in _judge.py to use llm.instructions instead of accessing llm._instructions directly; keep existing set_instructions(...) behavior intact so callers like set_instructions(_JUDGE_SYSTEM_PROMPT) still work and original_instructions is obtained via the new instructions property.
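The proposed property can be demonstrated on a minimal stand-in for the `LLM` base class (the real class has far more machinery; only the getter/setter pair is shown):

```python
class LLM:
    def __init__(self, instructions: str = "") -> None:
        self._instructions = instructions

    def set_instructions(self, instructions: str) -> None:
        self._instructions = instructions

    @property
    def instructions(self) -> str:
        """Read-only view of the current instructions."""
        return self._instructions


llm = LLM("You are a helpful agent.")
original = llm.instructions           # no private-attribute access
llm.set_instructions("Judge prompt")  # temporarily repurpose the LLM
llm.set_instructions(original)        # restore, as _judge.py does
print(llm.instructions)
```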
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@agents-core/vision_agents/testing/_mock_tools.py`:
- Around line 44-57: The loop that validates and swaps functions mutates
registry._functions before entering the try/finally, so a KeyError halfway
leaves some tools swapped; change mock_tools to first validate all tool names
against registry._functions (using func_def = registry._functions.get(tool_name)
and raising KeyError if any missing) and only after successful validation
populate originals and perform the swaps (assign func_def.function = mock_fn)
inside the try block so the finally restoration loop (iterating originals and
resetting func_def.function = original_fn) always runs; ensure you still use the
same identifiers (registry._functions, originals, func_def.function, mocks) so
existing logic and the restoration in the finally block remain unchanged.
In `@tests/test_testing/test_mock_tools.py`:
- Around line 9-13: The test stub _FakeLLM subclasses LLM but doesn't implement
all abstract methods and simple_response lacks type annotations; update it so it
either (A) becomes a plain helper function (e.g., fake_simple_response(text: str
= "", **kwargs) -> LLMResponseEvent) used by tests instead of subclassing LLM,
or (B) fully implements all abstract LLM methods and add proper type annotations
to simple_response as async def simple_response(self, text: str = "", **kwargs)
-> LLMResponseEvent so the class can be instantiated; reference _FakeLLM, LLM,
simple_response, and LLMResponseEvent when making the change.
_FakeLLM is a fake (working substitute with simplified logic), not a mock (call recorder with verification). The "never mock" guideline refers to unittest.mock / mock.patch, not test fakes. No change needed there, but the missing type annotations were a valid gap.
When mocks contained a valid tool followed by an unregistered one, the valid tool was already replaced before the KeyError was raised. Since the error happened before the try block, finally never ran and the LLM was left in a permanently mutated state. Fix: validate all tool names before swapping any implementations.
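The validate-then-swap ordering can be sketched with a simplified registry (a plain dict standing in for the real function registry); the point is that no mutation happens until every name has been checked, and the swap itself sits inside `try` so `finally` always restores:

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def mock_tools(registry: dict[str, Callable], mocks: dict[str, Callable]) -> Iterator[None]:
    # Phase 1: validate all tool names first -- no mutation yet, so a bad
    # name can no longer leave the registry half swapped.
    for name in mocks:
        if name not in registry:
            raise KeyError(f"Tool {name!r} is not registered")
    originals = {name: registry[name] for name in mocks}
    try:
        # Phase 2: swap only after validation succeeded.
        for name, mock_fn in mocks.items():
            registry[name] = mock_fn
        yield
    finally:
        for name, original in originals.items():
            registry[name] = original


registry = {"get_weather": lambda loc: "real"}
try:
    # One valid name, one bogus name: nothing gets swapped.
    with mock_tools(registry, {"get_weather": lambda loc: "mock", "bogus": lambda: None}):
        pass
except KeyError:
    pass
print(registry["get_weather"]("Berlin"))  # still the real implementation
```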
Remove _cursor field, _advance_to_type(), no_more_events(), and auto-skip of FunctionCallOutputEvent. Assertion methods now do a simple linear scan of self.events for the first matching event type.
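A minimal sketch of the cursor-free design (event classes are simplified stand-ins for the PR's dataclasses): assertion helpers just scan `self.events` for the first matching event type, so event order no longer consumes anything:

```python
from dataclasses import dataclass, field


@dataclass
class FunctionCallEvent:
    name: str


@dataclass
class ChatMessageEvent:
    content: str


@dataclass
class TestResponse:
    events: list = field(default_factory=list)

    def function_called(self, name: str) -> FunctionCallEvent:
        # Linear scan: skipped events stay visible to later assertions.
        for event in self.events:
            if isinstance(event, FunctionCallEvent) and event.name == name:
                return event
        raise AssertionError(f"No call to {name!r} found")


response = TestResponse(events=[ChatMessageEvent("hi"), FunctionCallEvent("get_weather")])
print(response.function_called("get_weather").name)
```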
Judge is no longer wired through TestSession → TestResponse. Users create judge instances directly and call judge.evaluate(event, intent). TestResponse.judge() is replaced by assistant_message() which returns the first ChatMessageEvent. Judge.evaluate now accepts ChatMessageEvent instead of a raw content string.
Avoids hardcoding the class name inside its own methods — build, _truncate, _format_event, _format_events now use cls instead of referring to TestResponse directly.
Replace tuple[bool, str] with JudgeVerdict dataclass for named access (verdict.success, verdict.reason) and easier extensibility. Switch LLMJudge to JSON-based prompt, move _parse_verdict into LLMJudge, and build prompt outside try block.
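The shape of the change can be sketched like this; the JSON keys are illustrative assumptions, not the exact prompt contract:

```python
import json
from dataclasses import dataclass

@dataclass
class JudgeVerdict:
    """Named fields instead of an anonymous tuple[bool, str]."""
    success: bool
    reason: str

def parse_verdict(raw: str) -> JudgeVerdict:
    # Parse a JSON verdict such as {"pass": true, "reason": "..."}.
    data = json.loads(raw)
    return JudgeVerdict(success=bool(data["pass"]), reason=str(data["reason"]))
```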
Wrap tool callables in AsyncMock(side_effect=...) so tests can use standard unittest.mock assertions (assert_called_once, assert_called_with, etc.) alongside the existing event-based assertions on TestResponse.
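The wrapping mechanism itself is plain `unittest.mock`; here is a self-contained demonstration with an illustrative `get_weather` coroutine (not the SDK's):

```python
import asyncio
from unittest.mock import AsyncMock

async def get_weather(location: str) -> dict:
    # Stand-in "real" tool implementation.
    return {"location": location, "temp": 70}

# AsyncMock(side_effect=...) preserves the behavior while recording calls.
weather_mock = AsyncMock(side_effect=get_weather)

async def main() -> dict:
    return await weather_mock("Berlin")

result = asyncio.run(main())

# Standard unittest.mock assertions now work alongside event assertions.
weather_mock.assert_called_once_with("Berlin")
```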
…om TestResponse Call-counting assertions now live on the mock layer via AsyncMock (assert_called_once, assert_not_called, etc.). Event-based assertions (assert_function_called, assert_function_output) remain on TestResponse.
Pre-compute chat_messages list in build(), consistent with function_calls. Removes the method that searched events and raised on missing — callers now index into the list directly.
…ut for output Remove redundant event-based assert_function_called from test_weather_tool_call_mocked — mock layer already covers call verification. Keep assert_function_output since AsyncMock doesn't store side_effect return values.
Add pytest.ini with pythonpath and pytest/pytest-asyncio dependencies so users can cd into the example and run uv run py.test directly. Switch to absolute import for simple_agent_example module.
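A plausible `pytest.ini` along these lines (only `pythonpath` is stated above; `asyncio_mode = auto` matches the design decisions, and the `integration` marker is an assumption based on the `-m integration` invocation):

```ini
[pytest]
pythonpath = .
asyncio_mode = auto
markers =
    integration: tests that hit a real LLM
```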
Previously the method raised on the first FunctionCallEvent if name or arguments didn't match. Now it scans all function_calls looking for a match, only raising if none is found. Fixes false failures when multiple calls to the same tool have different arguments.
Same scan-all fix as assert_function_called. Also make name a required positional argument so callers explicitly specify which tool's output they are checking.
Examples:
- Expected a call to 'get_weather(location='Berlin')', but no matching call was found.
- Expected an output {'temp': 70} from 'get_weather', but no matching output was found.
Motivation
Testing conversational AI agents today requires spinning up audio/video infrastructure, edge connections, and real model calls for every assertion. This makes tests hard to write.
`vision_agents.testing` provides a lightweight, text-only testing layer that lets you verify agent behavior — tool calls, arguments, responses, and intent — using familiar pytest patterns.

What's included
Core API
- `TestSession` — async context manager that wraps an LLM for testing. Manages session lifecycle, captures events (tool calls + outputs + messages), and returns structured results.
- `TestResponse` — returned by `simple_response()`. Carries events, timing, output text, and assertion methods:
  - `assert_function_called(name, arguments=)` — assert a tool was called with expected args (partial match)
  - `assert_function_output(output=, is_error=)` — assert tool output
  - `function_calls` — pre-computed list of `FunctionCallEvent`
  - `chat_messages` — pre-computed list of `ChatMessageEvent`
- `mock_tools(llm, {...})` — context manager to temporarily swap tool implementations without changing the schema visible to the LLM
- `mock_functions(llm, {...})` — context manager that wraps tools in `AsyncMock(side_effect=callable)` for call tracking via standard `unittest.mock` assertions (`assert_called_once`, `assert_called_with`, `assert_not_called`, etc.)
- `TestSession.mock_functions({...})` — convenience method, equivalent to `mock_functions(session.llm, {...})`

LLM-as-judge
- `Judge` — protocol for intent evaluation strategies
- `JudgeVerdict` — dataclass with `success: bool` and `reason: str`
- `LLMJudge` — default implementation backed by a separate LLM instance. Sends the agent's message plus a target intent, gets a structured PASS/FAIL verdict.

Event types
- `ChatMessageEvent`, `FunctionCallEvent`, `FunctionCallOutputEvent` — normalized dataclasses representing what happened during a turn
- `RunEvent` — union type of all three

Usage
Verify a greeting:
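A sketch inferred from the API described above — the import path, `llm` fixture, and exact attribute names are assumptions, not verified against the SDK:

```python
from vision_agents.testing import TestSession

async def test_greeting(llm):
    async with TestSession(llm) as session:
        response = await session.simple_response("Hello!")
        assert response.chat_messages  # the agent said something
        assert "hello" in response.output.lower()
```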
Verify tool calls and response:
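Along the same lines (again a hedged sketch; the tool name, arguments, and signatures are illustrative):

```python
from vision_agents.testing import TestSession

async def test_weather(llm):
    async with TestSession(llm) as session:
        response = await session.simple_response("What's the weather in Berlin?")
        response.assert_function_called("get_weather",
                                        arguments={"location": "Berlin"})
        response.assert_function_output("get_weather", output={"temp": 70})
```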
Mock tools with call tracking:
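A possible shape, assuming `mock_functions` yields a mapping of tool name to `AsyncMock` (that return shape, like the rest of the snippet, is an assumption):

```python
from vision_agents.testing import TestSession

async def fake_weather(location: str) -> dict:
    return {"temp": 70}

async def test_weather_mocked(llm):
    async with TestSession(llm) as session:
        with session.mock_functions({"get_weather": fake_weather}) as mocks:
            response = await session.simple_response("Weather in Berlin?")
        mocks["get_weather"].assert_called_once()
        response.assert_function_output("get_weather", output={"temp": 70})
```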
Design decisions
- Event-level assertions on `TestResponse` (`assert_function_called`, `assert_function_output`) check what the LLM requested. Mock-based assertions via `AsyncMock` (`assert_called_once`, `assert_called_with`) check that the function was actually invoked. Both layers work independently and can be combined.
- `function_calls` and `chat_messages` are built once in `TestResponse.build()` — no lazy searching or hidden raises.
- The judge is not wired into `TestResponse` — you call `judge.evaluate(event, intent=)` explicitly.
- `assert_function_called("tool", arguments={"key": "val"})` only checks the specified keys. Extra arguments are ignored.
- No `@pytest.mark.asyncio` needed (`asyncio_mode = auto`); clean tracebacks via `__tracebackhide__`.
Test plan
- `uv run py.test tests/test_testing/ -m "not integration" -n auto` — all unit tests pass
- `uv run ruff check .` — no lint issues
- `uv run ruff format --check .` — formatted
- `uv run mypy` — no type errors
- `uv run py.test examples/01_simple_agent_example/ -m integration`

Summary by CodeRabbit
New Features
Tests