docs: add LLM-to-LLM conversation eval example by deepujain · Pull Request #4041 · UKGovernmentBEIS/inspect_ai

deepujain · 2026-05-26T04:10:07Z

This PR contains:

What is the current behavior? (You can also link to an open issue here)

The multi-agent docs describe handoffs, tools, and explicit agent workflows, but they do not include a compact example for fixed-turn LLM-to-LLM evaluations where two agents converse and a third model judges the transcript. Users trying to build client-agent/customer-service-agent style evals have to infer the pattern from lower-level agent APIs.

What is the new behavior?

The multi-agent docs now include an LLM-to-LLM conversation example that:

- Alternates a client agent and service agent through an explicit solver workflow.
- Uses AgentState and run() so the two agents share and extend the same conversation history.
- Adds a third judge model scorer over the transcript.
- Notes that submit=False is useful when the outer workflow controls turn-taking, and points users to message/token/time limits for tool-using agents that may run too long.
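
For reviewers who want the shape of the pattern without opening the docs diff, here is a minimal, library-free sketch of the control flow described above. It is not the example added by this PR and it does not use the inspect_ai API: `Message`, `converse`, `judge_transcript`, the stub speakers, and the `max_messages` cap are all hypothetical names chosen for illustration. In the actual docs example the speakers would be agents invoked via run() over a shared AgentState, and the judge would be a model-graded scorer.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Message:
    role: str      # "client" or "service" (hypothetical role labels)
    content: str


# A speaker reads the shared history so far and returns its next utterance.
Speaker = Callable[[list[Message]], str]


def converse(
    client: Speaker,
    service: Speaker,
    opening: str,
    turns: int,
    max_messages: int = 50,
) -> list[Message]:
    """Alternate service/client replies over one shared history.

    The outer loop owns turn-taking, so neither speaker decides when the
    conversation ends. `max_messages` is a stand-in for a message limit.
    """
    history = [Message("client", opening)]
    for i in range(turns):
        if len(history) >= max_messages:
            break  # bound the run instead of letting it grow unchecked
        # Even turns go to the service side, odd turns to the client side.
        role, speaker = ("service", service) if i % 2 == 0 else ("client", client)
        history.append(Message(role, speaker(history)))
    return history


def judge_transcript(history: list[Message], grader: Callable[[str], bool]) -> bool:
    """Render the transcript as text and hand it to a separate grader."""
    transcript = "\n".join(f"{m.role}: {m.content}" for m in history)
    return grader(transcript)


# Deterministic stand-ins for the two models (assumptions, not real agents).
def stub_client(history: list[Message]) -> str:
    return f"client follow-up {len(history)}"


def stub_service(history: list[Message]) -> str:
    return f"service reply {len(history)}: your refund is being processed"
```

Keeping the turn loop outside the speakers mirrors the reason given for submit=False: when the workflow decides whose turn it is and when to stop, the individual agents do not need their own termination signal.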

Does this PR introduce a breaking change? (What changes might users need to make in their application due to this PR?)

No. This is a docs-only addition for existing agent and scorer APIs.

Other information:

Validation:

Targeted:
- uv run pytest tests/agent/test_agent_execute.py -v
  - 31 passed, 6 skipped locally.
  - Confirms the run() execution path used by the example.
  - Confirms agent invocation as solver/tool/handoff and limit handling, which supports the guidance about bounded agent runs.
- Reviewed the docs example against the current react, run, AgentState, solver, and scorer APIs after rebasing onto current origin/main.
Regression:
- uv run make check
- uv run make test
- Commit-time pre-commit hooks passed: large-file, JSON/YAML, debug-statement, private-key, requirements, and docs typo checks.

CI/CD coverage expected:

Standard Build workflow should cover ruff, ruff format, mypy, pre-commit, package build, and pytest on Python 3.10 and 3.11.
Docs are excluded from ruff's source lint, but the pre-commit typos hook covers markdown/docs text.
Log viewer and sandbox-tools special gates should not require extra artifacts because this PR only changes multi-agent docs.

Closes #2803.

docs: add llm conversation eval example

48d400d

deepujain wants to merge 1 commit into UKGovernmentBEIS:main from deepujain:issue-2803-llm-conversation-docs
