
Add text-only testing framework for agents #364

Merged
aliev merged 52 commits into main from testing-framework
Feb 24, 2026

Conversation


@aliev aliev commented Feb 20, 2026

Motivation

Testing conversational AI agents today requires spinning up audio/video infrastructure, edge connections, and real model calls for every assertion. This makes tests hard to write.

vision_agents.testing provides a lightweight, text-only testing layer that lets you verify agent behavior — tool calls, arguments, responses, and intent — using familiar pytest patterns.

What's included

Core API

  • TestSession — async context manager that wraps an LLM for testing. Manages session lifecycle, captures events (tool calls + outputs + messages), and returns structured results.
  • TestResponse — returned by simple_response(). Carries events, timing, output text, and assertion methods:
    • assert_function_called(name, arguments=) — assert a tool was called with expected args (partial match)
    • assert_function_output(output=, is_error=) — assert tool output
    • function_calls — pre-computed list of FunctionCallEvent
    • chat_messages — pre-computed list of ChatMessageEvent
  • mock_tools(llm, {...}) — context manager to temporarily swap tool implementations without changing the schema visible to the LLM
  • mock_functions(llm, {...}) — context manager that wraps tools in AsyncMock(side_effect=callable) for call tracking via standard unittest.mock assertions (assert_called_once, assert_called_with, assert_not_called, etc.)
  • TestSession.mock_functions({...}) — convenience method, equivalent to mock_functions(session.llm, {...})
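The swap-and-restore behavior behind mock_tools can be sketched in a few lines. This is a hypothetical stand-in, not the package's code: swap_tools and the plain-dict registry stand in for the real FunctionRegistry internals, and the validate-before-swap ordering ensures a bad tool name leaves nothing half-swapped.

```python
from contextlib import contextmanager
from typing import Any, Callable


@contextmanager
def swap_tools(registry: dict[str, Callable[..., Any]], mocks: dict[str, Callable[..., Any]]):
    # Validate every name first so an unknown tool leaves nothing half-swapped.
    missing = [name for name in mocks if name not in registry]
    if missing:
        raise KeyError(f"Unknown tools: {missing}")
    originals = {name: registry[name] for name in mocks}
    registry.update(mocks)
    try:
        yield registry
    finally:
        registry.update(originals)  # always restore, even if the body raises


registry = {"get_weather": lambda location: {"temp": 55}}
with swap_tools(registry, {"get_weather": lambda location: {"temp": 70}}):
    assert registry["get_weather"]("Tokyo") == {"temp": 70}  # mock active
assert registry["get_weather"]("Tokyo") == {"temp": 55}      # original restored
```

Because only the callable is swapped, the tool schema the LLM sees stays unchanged, which is the property the real mock_tools preserves.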

LLM-as-judge

  • Judge — protocol for intent evaluation strategies
  • JudgeVerdict — dataclass with success: bool and reason: str
  • LLMJudge — default implementation backed by a separate LLM instance. Sends the agent's message + a target intent, gets a structured PASS/FAIL verdict.
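The judge's single-line PASS/FAIL reply is turned into a JudgeVerdict-style result. A minimal sketch of that parsing, where the Verdict class and parse_verdict helper are illustrative and may differ from the package's actual code in _judge.py:

```python
from dataclasses import dataclass


@dataclass
class Verdict:  # stands in for JudgeVerdict (success: bool, reason: str)
    success: bool
    reason: str


def parse_verdict(raw: str) -> Verdict:
    # Expect a single line such as "PASS: friendly greeting" or "FAIL: off-topic".
    line = raw.strip().splitlines()[0]
    label, _, reason = line.partition(":")
    label = label.strip().upper()
    if label not in ("PASS", "FAIL"):
        raise ValueError(f"Unparseable verdict: {line!r}")
    return Verdict(success=(label == "PASS"), reason=reason.strip())


v = parse_verdict("PASS: greeted the user warmly")
assert v.success and v.reason == "greeted the user warmly"
```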

Event types

  • ChatMessageEvent, FunctionCallEvent, FunctionCallOutputEvent — normalized dataclasses representing what happened during a turn
  • RunEvent — union type of all three

Usage

Verify a greeting:

async def test_greeting():
    judge = LLMJudge(gemini.LLM(MODEL))
    async with TestSession(llm=llm, instructions="Be friendly") as session:
        response = await session.simple_response("Hello")
        assert response.function_calls == []
        verdict = await judge.evaluate(response.chat_messages[0], intent="Friendly greeting")
        assert verdict.success, verdict.reason

Verify tool calls and response:

async def test_weather():
    judge = LLMJudge(gemini.LLM(MODEL))
    async with TestSession(llm=llm, instructions="...") as session:
        response = await session.simple_response("Weather in Tokyo?")
        response.assert_function_called("get_weather", arguments={"location": "Tokyo"})
        verdict = await judge.evaluate(response.chat_messages[0], intent="Reports weather for Tokyo")
        assert verdict.success, verdict.reason

Mock tools with call tracking:

async def test_with_mock():
    async with TestSession(llm=llm, instructions="...") as session:
        with session.mock_functions({"get_weather": lambda **_: {"temp": 70}}) as mocked:
            response = await session.simple_response("Weather in Berlin?")
            mocked["get_weather"].assert_called_once()
            mocked["get_weather"].assert_called_with(location="Berlin")

Design decisions

  • Two assertion layers — event-based assertions on TestResponse (assert_function_called, assert_function_output) check what the LLM requested. Mock-based assertions via AsyncMock (assert_called_once, assert_called_with) check that the function was actually invoked. Both layers work independently and can be combined.
  • Pre-computed lists — function_calls and chat_messages are built once in TestResponse.build(); no lazy searching or hidden raises.
  • Separate judge LLM — the judge gets its own LLM instance to avoid polluting the agent's conversation history. It's decoupled from TestResponse — you call judge.evaluate(event, intent=) explicitly.
  • Partial argument matching — assert_function_called("tool", arguments={"key": "val"}) only checks the specified keys. Extra arguments are ignored.
  • pytest-native — no custom runner, no @pytest.mark.asyncio needed (asyncio_mode = auto), clean tracebacks via __tracebackhide__.
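The partial-matching rule can be stated in a few lines; partial_match here is a hypothetical helper describing the semantics, not the package's code:

```python
from typing import Any


def partial_match(actual: dict[str, Any], expected: dict[str, Any]) -> bool:
    # Only keys named in `expected` are checked; extra keys in `actual` are ignored.
    return all(key in actual and actual[key] == value for key, value in expected.items())


call_args = {"location": "Tokyo", "units": "metric"}
assert partial_match(call_args, {"location": "Tokyo"})       # subset passes
assert not partial_match(call_args, {"location": "Berlin"})  # value mismatch fails
assert partial_match(call_args, {})                          # empty expectation always passes
```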

Files

agents-core/vision_agents/testing/
├── __init__.py          # public API + module docstring
├── _events.py           # ChatMessageEvent, FunctionCallEvent, FunctionCallOutputEvent
├── _judge.py            # Judge protocol, JudgeVerdict, LLMJudge
├── _mock_tools.py       # mock_tools() and mock_functions() context managers
├── _run_result.py       # TestResponse dataclass + assertion methods
└── _session.py          # TestSession session lifecycle

tests/test_testing/
├── test_eval.py         # unit tests for TestResponse assertions
└── test_mock_tools.py   # unit tests for mock_tools and mock_functions

examples/01_simple_agent_example/
└── test_simple_agent.py # integration test example

Test plan

  • uv run py.test tests/test_testing/ -m "not integration" -n auto — all unit tests pass
  • uv run ruff check . — no lint issues
  • uv run ruff format --check . — formatted
  • uv run mypy — no type errors
  • uv run py.test examples/01_simple_agent_example/ -m integration

Summary by CodeRabbit

  • New Features

    • Lightweight testing framework: session-based test runner, structured events, TestResponse assertions, and automated intent judging for agent responses and tool calls.
    • Context manager to mock tool behavior during tests while preserving public tool interfaces.
    • Example updates: reusable LLM setup and instructions for sample agent.
  • Tests

    • Comprehensive unit and integration tests covering assertions, tool mocking, intent judging, and example workflows.

aliev added 16 commits February 19, 2026 15:09
Text-only testing SDK for Vision-Agents agents. Includes TestSession,
fluent assertion API (RunResult/RunAssert), LLM-based judge, mock_tools,
unit tests, integration test examples for 00_example and 01_simple_agent_example.
Covers all public classes, methods, event types, recommended patterns,
and architecture rationale for testing at the LLM level.
Replace cursor-based assertion classes (RunAssert, EventAssert, etc.)
with scenario-style methods on TestEval: user_says, agent_calls,
agent_responds, no_more_events. ~500 lines removed.
Align with core LLM method naming and clarify intent.
TestEval now only handles lifecycle and LLM communication.
TestResponse holds data (output, events, function_calls, duration_ms)
and assertion methods (agent_calls, judge, no_more_events).
Reflect current API: TestResponse with function_called/function_output/judge,
simple_response returns TestResponse, assertions on response not session.
Add table of contents, align tables, consolidate architecture section,
add second quick start example with tool calls.
- Remove unused TestSession alias from __init__.py
- Remove duplicate docstring, unused logging/os imports from _session.py
- Deduplicate _evals_verbose (import from _run_result instead of redefining)
- Fix class docstring/__test__ ordering in TestEval and TestResponse
- Add -> None return type to _on_tool_start and _on_tool_end
- Make _advance_to_type generic via TypeVar, remove redundant assert isinstance()
- Remove unreachable RuntimeError, annotate _raise_with_debug_info as NoReturn
- Remove dead else branch in _format_events
- Replace except Exception with specific exceptions in _judge.py
- Remove from __future__ import annotations, use quoted forward refs
- Move evaluate_intent import to module level in _run_result.py
Drop VISION_AGENTS_EVALS_VERBOSE env var, associated print() calls,
and _evals_verbose flag — premature for v1, avoids documenting and
potentially deprecating a public env variable.

Remove from __future__ import annotations from _events.py and
_mock_tools.py (no forward references, Python 3.10+).
Replace two identical branches with a _VERDICTS mapping lookup,
removing duplicated string slicing logic.
Extract _truncate, _format_event, _format_events as static methods
on TestResponse — they only serve _raise_with_debug_info. Move magic
numbers into documented module-level constants.
Docs live in a separate repo, so the in-package README is redundant.
Move essential usage examples and key exports summary into the module
docstring where help() and IDE tooltips can surface them.
@coderabbitai

coderabbitai bot commented Feb 20, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Adds a new vision_agents.testing package providing event dataclasses, a TestSession async test harness, TestResponse assertion helpers, an LLM-based intent judge, a mock_tools context manager, updated example wiring, and unit/integration tests for the testing primitives.

Changes

  • Public package init (agents-core/vision_agents/testing/__init__.py) — Adds package docstring and re-exports: TestSession, TestResponse, event types (ChatMessageEvent, FunctionCallEvent, FunctionCallOutputEvent, RunEvent) and mock_tools.
  • Event types (agents-core/vision_agents/testing/_events.py) — New dataclasses ChatMessageEvent, FunctionCallEvent, FunctionCallOutputEvent and union RunEvent for normalized test events.
  • Intent evaluation, judge (agents-core/vision_agents/testing/_judge.py) — New async evaluate_intent that runs a judge LLM with a strict system prompt, parses single-line PASS/FAIL verdicts, and restores the original LLM instructions; includes parsing and robust error handling.
  • Tool mocking utility (agents-core/vision_agents/testing/_mock_tools.py) — New mock_tools context manager to temporarily replace tool callables in an LLM's FunctionRegistry, validating presence and restoring originals on exit.
  • Session & runtime capture (agents-core/vision_agents/testing/_session.py) — New TestSession async context manager wiring an InMemoryConversation to an LLM, subscribing to tool start/end events, capturing FunctionCallEvent/FunctionCallOutputEvent, and returning TestResponse via simple_response.
  • Test response & assertions (agents-core/vision_agents/testing/_run_result.py) — New TestResponse dataclass with build() and assertion helpers: function_called, function_output, judge, and no_more_events, plus debug/formatting helpers and cursor-driven traversal.
  • Examples, refactor (examples/01_simple_agent_example/simple_agent_example.py) — Refactors the example to use setup_llm and a module-level INSTRUCTIONS constant, registering a weather tool via setup_llm.
  • Integration tests (examples/01_simple_agent_example/test_simple_agent.py) — Adds integration tests using the testing utilities to validate a greeting and a weather tool call (uses setup_llm and a judge LLM).
  • Unit tests (tests/test_testing/test_eval.py, tests/test_testing/test_mock_tools.py) — Adds comprehensive unit tests for TestResponse assertions and mock_tools behavior, covering success paths, edge cases, and restoration semantics.
  • Docs/rules (CLAUDE.md) — Adds a single-line Python rule: "Never use from __future__ import annotations".

Sequence Diagram(s)

sequenceDiagram
    participant User as User
    participant TestSession as TestSession
    participant LLM as LLM
    participant EventMgr as EventManager
    participant Conv as InMemoryConversation
    participant Judge as JudgeLLM

    User->>TestSession: __aenter__/start()
    TestSession->>LLM: apply test instructions
    TestSession->>EventMgr: subscribe ToolStart/ToolEnd
    TestSession->>Conv: attach conversation
    User->>TestSession: simple_response("text")
    TestSession->>LLM: send user message
    LLM->>EventMgr: emit ToolStart / ToolEnd
    EventMgr->>TestSession: _on_tool_start / _on_tool_end
    TestSession->>TestSession: record FunctionCallEvent & FunctionCallOutputEvent
    TestSession->>Conv: append assistant message
    TestSession-->>User: return TestResponse (events, output, duration, judge)
    User->>TestResponse: .function_called / .function_output / .judge
    alt .judge uses JudgeLLM
      TestResponse->>Judge: evaluate_intent(message, intent)
      Judge-->>TestResponse: PASS/FAIL verdict
    end
    User->>TestSession: __aexit__/close()
    TestSession->>EventMgr: unsubscribe events
    TestSession->>LLM: restore original instructions

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

The verdict opens like a slit — a single line,
PASS or FAIL, a cold incision in the script;
mocked hands replace the living ones, return like ash,
events stacked and named, small coffins on the page,
the test exhales: a precise, inevitable absence.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning — Docstring coverage is 32.50%, below the required threshold of 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Title check ✅ Passed — The title "Add text-only testing framework for agents" accurately summarizes the main objective of this PR, which introduces a comprehensive testing framework for conversational agents.
  • Description check ✅ Passed — Check skipped; CodeRabbit's high-level summary is enabled.


Switch test imports from internal modules (_events, _run_result,
_mock_tools) to the public vision_agents.testing API. Replace direct
_advance_to_type() call with function_output() in test_explicit_output_check.
@aliev aliev changed the title Testing framework Add text-only testing framework for agents Feb 20, 2026
@aliev aliev marked this pull request as ready for review February 20, 2026 16:36
@aliev aliev requested a review from dangusev February 20, 2026 16:37

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (14)
examples/00_example/agent.py (1)

1-4: Add module-level logger

The file is missing a logging import and logger = logging.getLogger(__name__), which is required for all Python modules per the coding guidelines.

♻️ Proposed addition
+import logging
+
 from dotenv import load_dotenv
 from vision_agents.core import Agent, AgentLauncher, User, Runner
 from vision_agents.plugins import getstream, gemini
+
+logger = logging.getLogger(__name__)

As per coding guidelines: "Use module-level logger = logging.getLogger(__name__)."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/00_example/agent.py` around lines 1 - 4, Add a module-level logger
by importing the logging module and creating logger =
logging.getLogger(__name__) at the top of the module (near the other imports) so
this file (which defines/uses Agent, AgentLauncher, User, Runner and imports
getstream/gemini) follows the project coding guidelines for module-level
logging.
examples/01_simple_agent_example/simple_agent_example.py (1)

2-2: Replace deprecated Dict with built-in dict

Dict from typing is deprecated since Python 3.9. The guideline requires modern generic syntax.

♻️ Proposed fix
-from typing import Any, Dict
+from typing import Any
-    async def get_weather(location: str) -> Dict[str, Any]:
+    async def get_weather(location: str) -> dict[str, Any]:

As per coding guidelines: "Use modern syntax: ... dict[str, T] generics".

Also applies to: 38-38

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/01_simple_agent_example/simple_agent_example.py` at line 2, Replace
the deprecated typing.Dict usage with the built-in generic dict: update the
import line to remove Dict (keep Any if used) and change type annotations like
Dict[...] to dict[...] (e.g., in simple_agent_example.py replace any Dict[str,
Any] or Dict[...] at the import and at the usage on line 38 with dict[str, Any]
or the appropriate dict[...] form); ensure all occurrences of Dict are removed
from imports and replaced in annotations while preserving Any and other types.
tests/test_testing/test_eval.py (1)

168-172: Prefer the public API over direct _cursor mutation

response._cursor = 1 reaches into the private state of TestResponse. Use the public assertion API to advance the cursor instead, which also makes the test's intent clearer.

♻️ Proposed refactor
-    def test_pass_at_end(self):
+    async def test_pass_at_end(self):
         response = _make_response(_simple_events())
-        response._cursor = 1
+        await response.judge()  # consumes the only event, advances cursor past it
         response.no_more_events()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_testing/test_eval.py` around lines 168 - 172, Replace the direct
private-state mutation response._cursor = 1 with the public API that advances or
asserts consumption on TestResponse; instead of setting _cursor directly in
test_pass_at_end, call the appropriate public method on response (for example
response.consume_event() or response.assert_events_consumed(1) /
response.advance(n) depending on the available API) so the test uses
_make_response/_simple_events and response.no_more_events() without touching
private attributes.
agents-core/vision_agents/testing/_run_result.py (3)

220-232: _format_event for ChatMessageEvent truncates without _truncate(), losing the ... suffix.

Line 223 uses event.content[:_PREVIEW_MAX_LEN] directly, while _format_event for FunctionCallOutputEvent (line 230) delegates to _truncate() which appends "..." when the text exceeds the limit. Using _truncate consistently avoids confusing debug output where a long chat message is silently clipped without any visual indicator.

♻️ Suggested fix
     if isinstance(event, ChatMessageEvent):
-        preview = event.content[:_PREVIEW_MAX_LEN].replace("\n", "\\n")
+        preview = TestResponse._truncate(
+            event.content.replace("\n", "\\n")
+        )
         return f"ChatMessageEvent(role='{event.role}', content='{preview}')"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@agents-core/vision_agents/testing/_run_result.py` around lines 220 - 232, The
ChatMessageEvent branch in TestResponse._format_event currently slices
event.content directly (event.content[:_PREVIEW_MAX_LEN]) which silently
truncates without the "..." suffix; change it to call
TestResponse._truncate(event.content) (and still replace "\n" with "\\n" on the
truncated result) so that long chat messages get the same "..." indicator as
FunctionCallOutputEvent; update the ChatMessageEvent handling in _format_event
to use TestResponse._truncate and preserve event.role and newline escaping.

195-206: _advance_to_type silently skips non-matching events — document this behavior.

The while loop on line 198 advances past any event that doesn't match expected_type. If a user calls function_called() but a ChatMessageEvent precedes the FunctionCallEvent, it is silently consumed. This is consistent with the cursor-based design, but a brief note in the class docstring (or the method docstring) would help users understand they cannot "go back" to skipped events.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@agents-core/vision_agents/testing/_run_result.py` around lines 195 - 206, The
_advance_to_type method silently consumes non-matching events while advancing
the internal cursor, which can cause earlier events (e.g., a ChatMessageEvent)
to be skipped before a later FunctionCallEvent; update the documentation to make
this behavior explicit by adding a sentence to the class docstring or the
_advance_to_type docstring that states the method advances the cursor forward,
skips any events that don't match expected_type, and that skipped events cannot
be revisited (no backtracking), and reference the method name _advance_to_type
and the cursor semantics (self._cursor) so callers know to check event order
before calling helpers like function_called().

24-40: _judge_llm typed as Any — consider LLM | None for type safety.

The field is always either an LLM instance or None. Typing it as Any suppresses type-checker feedback on misuse. If avoiding a circular import is the concern, a TYPE_CHECKING-guarded import would let you annotate precisely.

As per coding guidelines, "Use type annotations everywhere. Use modern syntax: X | Y unions, dict[str, T] generics".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@agents-core/vision_agents/testing/_run_result.py` around lines 24 - 40, The
_judge_llm field is currently typed as Any—change it to a precise LLM | None
annotation to restore type-safety: add a TYPE_CHECKING-guarded import (from
typing import TYPE_CHECKING; if TYPE_CHECKING: from <appropriate_module> import
LLM) to avoid circular imports, then update the TestResponse field declaration
from "_judge_llm: Any = field(default=None, repr=False)" to "_judge_llm: LLM |
None = field(default=None, repr=False)" (keeping default and repr settings
intact) so type checkers see the correct type while runtime imports remain safe.
agents-core/vision_agents/testing/_session.py (3)

1-15: Imports that serve only type annotations could be guarded under TYPE_CHECKING.

EventManager (line 5) and RunEvent (line 13) are used solely for type hints. Per coding guidelines, use the TYPE_CHECKING guard for imports only needed by type annotations. This would require adding from __future__ import annotations to defer evaluation.

As per coding guidelines, "Use TYPE_CHECKING guard for imports only needed by type annotations".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@agents-core/vision_agents/testing/_session.py` around lines 1 - 15, Imports
used only for type annotations (EventManager and RunEvent) should be guarded by
TYPE_CHECKING and defer evaluation: add "from __future__ import annotations" at
the top, import "from typing import TYPE_CHECKING", then move the EventManager
and RunEvent imports into an "if TYPE_CHECKING:" block; update any references to
those symbols (EventManager, RunEvent) so they remain as forward-referenced
types. Ensure runtime behavior is unchanged and only annotation-only imports are
moved.

125-126: Hardcoded 5-second event wait timeout.

await self._event_manager.wait(timeout=5.0) uses a fixed timeout that may be too tight for slow LLM providers or unnecessarily long for fast mocks. Consider making it configurable via __init__ or simple_response parameter with a sensible default.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@agents-core/vision_agents/testing/_session.py` around lines 125 - 126, The
hardcoded 5.0s in await self._event_manager.wait(timeout=5.0) should be made
configurable: add an event wait timeout parameter (e.g., event_wait_timeout:
float = 5.0) to the class __init__ (or to the simple_response entry point if
more appropriate), store it as self._event_wait_timeout, and replace the literal
5.0 in the _event_manager.wait call with self._event_wait_timeout; update any
instantiations/tests that rely on the previous behavior to pass a shorter
timeout for mocks or leave the default for real providers.

76-85: close() does not reset _conversation or _event_manager, preventing clean restart.

After close(), _started is False, so start() can run again. But start() only runs when not self._started, and it unconditionally reassigns _conversation and _event_manager. So re-entry works — no bug here. However, holding references to stale objects after close may keep resources alive longer than necessary.

♻️ Optional cleanup
     async def close(self) -> None:
         """Clean up resources."""
         if not self._started:
             return
 
         if self._event_manager is not None:
             self._event_manager.unsubscribe(self._on_tool_start)
             self._event_manager.unsubscribe(self._on_tool_end)
 
+        self._event_manager = None
+        self._conversation = None
+        self._captured_events.clear()
         self._started = False
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@agents-core/vision_agents/testing/_session.py` around lines 76 - 85, The
close() method unsubscribes handlers and flips _started but leaves references to
_conversation and _event_manager, which can keep resources alive; update close()
to also set self._conversation = None and self._event_manager = None (after
unsubscribing _on_tool_start/_on_tool_end) so the session releases stale objects
and can fully clean up before a restart; retain existing unsubscribe logic for
_on_tool_start and _on_tool_end and only nullify after those calls.
examples/00_example/test_agent.py (4)

30-44: Repetitive LLM/judge instantiation across tests — consider a fixture.

Every test creates gemini.LLM(MODEL) and gemini.LLM(MODEL) identically. A function-scoped fixture would DRY this up:

♻️ Sketch
@pytest.fixture
def llm():
    return gemini.LLM(MODEL)

@pytest.fixture
def judge_llm():
    return gemini.LLM(MODEL)

Also applies to: 47-63, 66-82, 85-105, 108-139, 142-160

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/00_example/test_agent.py` around lines 30 - 44, Multiple tests
repeatedly instantiate gemini.LLM(MODEL); introduce pytest fixtures to DRY this
up by creating a function-scoped fixture (e.g., llm) that returns
gemini.LLM(MODEL) and another fixture (e.g., judge_llm) for the judge LLM, then
update tests like test_greeting to accept llm and judge_llm as parameters and
pass those into TestEval instead of constructing gemini.LLM(MODEL) inline;
ensure fixtures are imported/defined in the test module so all tests that use
TestEval (and other tests noted) reuse the fixtures.

129-131: The generator-throw lambda is fragile and hard to read.

lambda location: (_ for _ in ()).throw(RuntimeError(...)) is a clever trick but obscure. A plain async def is clearer and also matches the async signature of the original tool, avoiding any sync/async mismatch:

♻️ Suggested alternative
-        with mock_tools(
-            llm,
-            {
-                "get_weather": lambda location: (_ for _ in ()).throw(
-                    RuntimeError("Service unavailable")
-                )
-            },
-        ):
+        async def _failing_weather(location: str) -> dict[str, str]:
+            raise RuntimeError("Service unavailable")
+
+        with mock_tools(llm, {"get_weather": _failing_weather}):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/00_example/test_agent.py` around lines 129 - 131, Replace the
obscure generator-throw lambda used for "get_weather" with an async function
that simply raises RuntimeError to match the original tool's async signature;
locate the "get_weather" entry in the test agent setup (the lambda: (_ for _ in
()).throw(RuntimeError("Service unavailable"))) and convert it to an async def
get_weather(...) that raises RuntimeError("Service unavailable") so the stub is
readable and has the correct async behavior.

25-27: Consider using a pytest fixture or skipUnless for the API key check.

Every test calls _skip_if_no_key() manually. A session-scoped fixture or pytest.mark.skipif at module level would reduce repetition and ensure the skip can't be accidentally omitted in a new test.

♻️ Suggested alternative
+@pytest.fixture(autouse=True)
+def _require_google_api_key():
+    if not os.getenv("GOOGLE_API_KEY"):
+        pytest.skip("GOOGLE_API_KEY not set")
+
-def _skip_if_no_key():
-    if not os.getenv("GOOGLE_API_KEY"):
-        pytest.skip("GOOGLE_API_KEY not set")

Then remove the _skip_if_no_key() calls from each test body.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/00_example/test_agent.py` around lines 25 - 27, Replace the manual
_skip_if_no_key() calls with a centralized pytest skip so tests can't forget to
check the env var: either add a module-level pytestmark = pytest.mark.skipif(not
os.getenv("GOOGLE_API_KEY"), reason="GOOGLE_API_KEY not set") or create a
session-scoped fixture (e.g., require_google_api_key) that checks
os.getenv("GOOGLE_API_KEY") and calls pytest.skip(...) if missing, then remove
all calls to _skip_if_no_key() from individual test functions; keep references
to the existing helper name _skip_if_no_key in case you want to
deprecate/redirect it to the fixture for backwards compatibility.

93-94: Use parameterized generic dict in return type annotations.

The coding guidelines require modern type syntax with full generics. dict should be dict[str, Any] (or more specific) for the return types.

As per coding guidelines, "Use modern syntax: X | Y unions, dict[str, T] generics, full Callable signatures".

Also applies to: 118-119

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/00_example/test_agent.py` around lines 93 - 94, The return type for
get_weather uses an unparameterized dict; update its annotation to a
parameterized generic such as dict[str, Any] (or a more specific mapping like
dict[str, int | str]) and import Any from typing; also find any other functions
in this file that currently return bare dict (e.g., the later async function
returning a dict) and update their return annotations similarly to use dict[str,
Any] or a more specific type to follow the modern typing guidelines.
agents-core/vision_agents/testing/_judge.py (1)

59-60: Add public getter for instructions to the LLM class to avoid accessing private _instructions directly.

Line 59 reads llm._instructions (a private attribute) and stores it to restore later. The LLM class provides set_instructions() as a public method but has no getter—requiring this code to reach into internal state. Add a read-only property instructions to the LLM base class:

`@property`
def instructions(self) -> str:
    """Get the current instructions."""
    return self._instructions

Then replace line 59 with original_instructions = llm.instructions. This maintains the same semantics while respecting encapsulation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@agents-core/vision_agents/testing/_judge.py` around lines 59 - 60, Add a
read-only public property instructions to the LLM class that returns the
internal _instructions (i.e., implement property instructions -> return
self._instructions) and then update the caller in _judge.py to use
llm.instructions instead of accessing llm._instructions directly; keep existing
set_instructions(...) behavior intact so callers like
set_instructions(_JUDGE_SYSTEM_PROMPT) still work and original_instructions is
obtained via the new instructions property.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@agents-core/vision_agents/testing/_mock_tools.py`:
- Around line 44-57: The loop that validates and swaps functions mutates
registry._functions before entering the try/finally, so a KeyError halfway
leaves some tools swapped; change mock_tools to first validate all tool names
against registry._functions (using func_def = registry._functions.get(tool_name)
and raising KeyError if any missing) and only after successful validation
populate originals and perform the swaps (assign func_def.function = mock_fn)
inside the try block so the finally restoration loop (iterating originals and
resetting func_def.function = original_fn) always runs; ensure you still use the
same identifiers (registry._functions, originals, func_def.function, mocks) so
existing logic and the restoration in the finally block remain unchanged.

In `@tests/test_testing/test_mock_tools.py`:
- Around line 9-13: The test stub _FakeLLM subclasses LLM but doesn't implement
all abstract methods and simple_response lacks type annotations; update it so it
either (A) becomes a plain helper function (e.g., fake_simple_response(text: str
= "", **kwargs) -> LLMResponseEvent) used by tests instead of subclassing LLM,
or (B) fully implements all abstract LLM methods and add proper type annotations
to simple_response as async def simple_response(self, text: str = "", **kwargs)
-> LLMResponseEvent so the class can be instantiated; reference _FakeLLM, LLM,
simple_response, and LLMResponseEvent when making the change.

---

Nitpick comments:

In `@agents-core/vision_agents/testing/_run_result.py`:
- Around line 220-232: The ChatMessageEvent branch in TestResponse._format_event
currently slices event.content directly (event.content[:_PREVIEW_MAX_LEN]) which
silently truncates without the "..." suffix; change it to call
TestResponse._truncate(event.content) (and still replace "\n" with "\\n" on the
truncated result) so that long chat messages get the same "..." indicator as
FunctionCallOutputEvent; update the ChatMessageEvent handling in _format_event
to use TestResponse._truncate and preserve event.role and newline escaping.
- Around line 195-206: The _advance_to_type method silently consumes
non-matching events while advancing the internal cursor, which can cause earlier
events (e.g., a ChatMessageEvent) to be skipped before a later
FunctionCallEvent; update the documentation to make this behavior explicit by
adding a sentence to the class docstring or the _advance_to_type docstring that
states the method advances the cursor forward, skips any events that don't match
expected_type, and that skipped events cannot be revisited (no backtracking),
and reference the method name _advance_to_type and the cursor semantics
(self._cursor) so callers know to check event order before calling helpers like
function_called().
- Around line 24-40: The _judge_llm field is currently typed as Any—change it to
a precise LLM | None annotation to restore type-safety: add a
TYPE_CHECKING-guarded import (from typing import TYPE_CHECKING; if
TYPE_CHECKING: from <appropriate_module> import LLM) to avoid circular imports,
then update the TestResponse field declaration from "_judge_llm: Any =
field(default=None, repr=False)" to "_judge_llm: LLM | None =
field(default=None, repr=False)" (keeping default and repr settings intact) so
type checkers see the correct type while runtime imports remain safe.
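The shared-truncation fix from the first `_run_result.py` nitpick can be sketched as a pair of helpers (names and the 60-character limit are assumptions for illustration):

```python
_PREVIEW_MAX_LEN = 60


def _truncate(text: str, max_len: int = _PREVIEW_MAX_LEN) -> str:
    """Shorten long previews, appending '...' so truncation is visible."""
    if len(text) <= max_len:
        return text
    return text[:max_len] + "..."


def format_chat_preview(content: str) -> str:
    # Truncate first, then escape newlines, matching the suggested fix.
    return _truncate(content).replace("\n", "\\n")
```

Routing both event types through one `_truncate` keeps the "..." indicator consistent across previews.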

In `@agents-core/vision_agents/testing/_session.py`:
- Around line 1-15: Imports used only for type annotations (EventManager and
RunEvent) should be guarded by TYPE_CHECKING and defer evaluation: add "from
__future__ import annotations" at the top, import "from typing import
TYPE_CHECKING", then move the EventManager and RunEvent imports into an "if
TYPE_CHECKING:" block; update any references to those symbols (EventManager,
RunEvent) so they remain as forward-referenced types. Ensure runtime behavior is
unchanged and only annotation-only imports are moved.
- Around line 125-126: The hardcoded 5.0s in await
self._event_manager.wait(timeout=5.0) should be made configurable: add an event
wait timeout parameter (e.g., event_wait_timeout: float = 5.0) to the class
__init__ (or to the simple_response entry point if more appropriate), store it
as self._event_wait_timeout, and replace the literal 5.0 in the
_event_manager.wait call with self._event_wait_timeout; update any
instantiations/tests that rely on the previous behavior to pass a shorter
timeout for mocks or leave the default for real providers.
- Around line 76-85: The close() method unsubscribes handlers and flips _started
but leaves references to _conversation and _event_manager, which can keep
resources alive; update close() to also set self._conversation = None and
self._event_manager = None (after unsubscribing _on_tool_start/_on_tool_end) so
the session releases stale objects and can fully clean up before a restart;
retain existing unsubscribe logic for _on_tool_start and _on_tool_end and only
nullify after those calls.

In `@examples/00_example/agent.py`:
- Around line 1-4: Add a module-level logger by importing the logging module and
creating logger = logging.getLogger(__name__) at the top of the module (near the
other imports) so this file (which defines/uses Agent, AgentLauncher, User,
Runner and imports getstream/gemini) follows the project coding guidelines for
module-level logging.

In `@examples/00_example/test_agent.py`:
- Around line 30-44: Multiple tests repeatedly instantiate gemini.LLM(MODEL);
introduce pytest fixtures to DRY this up by creating a function-scoped fixture
(e.g., llm) that returns gemini.LLM(MODEL) and another fixture (e.g., judge_llm)
for the judge LLM, then update tests like test_greeting to accept llm and
judge_llm as parameters and pass those into TestEval instead of constructing
gemini.LLM(MODEL) inline; ensure fixtures are imported/defined in the test
module so all tests that use TestEval (and other tests noted) reuse the
fixtures.
- Around line 129-131: Replace the obscure generator-throw lambda used for
"get_weather" with an async function that simply raises RuntimeError to match
the original tool's async signature; locate the "get_weather" entry in the test
agent setup (the lambda: (_ for _ in ()).throw(RuntimeError("Service
unavailable"))) and convert it to an async def get_weather(...) that raises
RuntimeError("Service unavailable") so the stub is readable and has the correct
async behavior.
- Around line 25-27: Replace the manual _skip_if_no_key() calls with a
centralized pytest skip so tests can't forget to check the env var: either add a
module-level pytestmark = pytest.mark.skipif(not os.getenv("GOOGLE_API_KEY"),
reason="GOOGLE_API_KEY not set") or create a session-scoped fixture (e.g.,
require_google_api_key) that checks os.getenv("GOOGLE_API_KEY") and calls
pytest.skip(...) if missing, then remove all calls to _skip_if_no_key() from
individual test functions; keep references to the existing helper name
_skip_if_no_key in case you want to deprecate/redirect it to the fixture for
backwards compatibility.

In `@examples/01_simple_agent_example/simple_agent_example.py`:
- Line 2: Replace the deprecated typing.Dict usage with the built-in generic
dict: update the import line to remove Dict (keep Any if used) and change type
annotations like Dict[...] to dict[...] (e.g., in simple_agent_example.py
replace any Dict[str, Any] or Dict[...] at the import and at the usage on line
38 with dict[str, Any] or the appropriate dict[...] form); ensure all
occurrences of Dict are removed from imports and replaced in annotations while
preserving Any and other types.

In `@tests/test_testing/test_eval.py`:
- Around line 168-172: Replace the direct private-state mutation
response._cursor = 1 with the public API that advances or asserts consumption on
TestResponse; instead of setting _cursor directly in test_pass_at_end, call the
appropriate public method on response (for example response.consume_event() or
response.assert_events_consumed(1) / response.advance(n) depending on the
available API) so the test uses _make_response/_simple_events and
response.no_more_events() without touching private attributes.

_FakeLLM is a fake (a working substitute with simplified logic), not a
mock (a call recorder with verification). The "never mock" guideline
refers to unittest.mock / mock.patch, not to test fakes. No change is
needed there, but the missing type annotations were a valid gap.
When mocks contained a valid tool followed by an unregistered one, the
valid tool had already been replaced by the time the KeyError was raised.
Because the error occurred before the try block, the finally clause never
ran and the LLM was left permanently mutated.

Fix: validate all tool names before swapping any implementations.
aliev added 19 commits February 23, 2026 22:25
Remove _cursor field, _advance_to_type(), no_more_events(), and
auto-skip of FunctionCallOutputEvent. Assertion methods now do a
simple linear scan of self.events for the first matching event type.
Judge is no longer wired through TestSession → TestResponse. Users
create judge instances directly and call judge.evaluate(event, intent).
TestResponse.judge() is replaced by assistant_message() which returns
the first ChatMessageEvent. Judge.evaluate now accepts ChatMessageEvent
instead of a raw content string.
Avoids hardcoding the class name inside its own methods — build,
_truncate, _format_event, _format_events now use cls instead of
referring to TestResponse directly.
Replace tuple[bool, str] with JudgeVerdict dataclass for named access
(verdict.success, verdict.reason) and easier extensibility. Switch
LLMJudge to JSON-based prompt, move _parse_verdict into LLMJudge,
and build prompt outside try block.
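The `JudgeVerdict` change described above amounts to a small dataclass; a minimal sketch:

```python
from dataclasses import dataclass


@dataclass
class JudgeVerdict:
    """Named fields replace the anonymous tuple[bool, str]."""
    success: bool
    reason: str


verdict = JudgeVerdict(success=True, reason="Reply matches the target intent.")
```

Named access (`verdict.success`, `verdict.reason`) reads better at call sites and leaves room to add fields later without breaking callers.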
Wrap tool callables in AsyncMock(side_effect=...) so tests can use
standard unittest.mock assertions (assert_called_once, assert_called_with,
etc.) alongside the existing event-based assertions on TestResponse.
…om TestResponse

Call-counting assertions now live on the mock layer via AsyncMock
(assert_called_once, assert_not_called, etc.). Event-based assertions
(assert_function_called, assert_function_output) remain on TestResponse.
Pre-compute chat_messages list in build(), consistent with
function_calls. Removes the method that searched events and raised
on missing — callers now index into the list directly.
…ut for output

Remove redundant event-based assert_function_called from
test_weather_tool_call_mocked — mock layer already covers call
verification. Keep assert_function_output since AsyncMock doesn't
store side_effect return values.
Add pytest.ini with pythonpath and pytest/pytest-asyncio dependencies
so users can cd into the example and run uv run py.test directly.
Switch to absolute import for simple_agent_example module.
Previously the method raised on the first FunctionCallEvent if name or
arguments didn't match. Now it scans all function_calls looking for a
match, only raising if none is found. Fixes false failures when multiple
calls to the same tool have different arguments.
Same scan-all fix as assert_function_called. Also make name a required
positional argument so callers explicitly specify which tool's output
they are checking.
Examples:
- Expected a call to 'get_weather(location='Berlin')', but no matching call was found.
- Expected an output {'temp': 70} from 'get_weather', but no matching output was found.
@aliev aliev merged commit 7843a7a into main Feb 24, 2026
10 checks passed
@aliev aliev deleted the testing-framework branch February 24, 2026 18:40