docs: add LLM-to-LLM conversation eval example #4041

Draft
deepujain wants to merge 1 commit into UKGovernmentBEIS:main from deepujain:issue-2803-llm-conversation-docs
@deepujain
Contributor

This PR contains:

  • New features
  • Changes to dev-tools e.g. CI config / github tooling
  • Docs
  • Bug fixes
  • Code refactor

What is the current behavior? (You can also link to an open issue here)

Fixes #2803.

The multi-agent docs describe handoffs, tools, and explicit agent workflows, but they do not include a compact example for fixed-turn LLM-to-LLM evaluations where two agents converse and a third model judges the transcript. Users trying to build client-agent/customer-service-agent style evals have to infer the pattern from lower-level agent APIs.

What is the new behavior?

The multi-agent docs now include an LLM-to-LLM conversation example that:

  • Alternates a client agent and service agent through an explicit solver workflow.
  • Uses AgentState and run() so the two agents share and extend the same conversation history.
  • Adds a third judge model scorer over the transcript.
  • Notes that submit=False is useful when the outer workflow controls turn-taking, and points users to message, token, and time limits for tool-using agents that might otherwise run too long.
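
For reviewers who want the control flow at a glance, the pattern described above can be sketched roughly as follows. This is a library-agnostic illustration, not the example added to the docs: `Conversation`, `converse`, `judge`, and the two stub agents are hypothetical names, and plain Python callables stand in for the model-backed agents, `AgentState`, and `run()`. The fixed `turns` count plays the role of the outer workflow that controls turn-taking, which is why the individual agents do not need their own submit step.

```python
from dataclasses import dataclass, field
from typing import Callable

# A message is (speaker, text); a stand-in for a real chat message type.
Message = tuple[str, str]
# Hypothetical agent signature: reads the shared history, returns a reply.
Agent = Callable[[list[Message]], str]


@dataclass
class Conversation:
    """Shared history that both agents read and extend."""

    messages: list[Message] = field(default_factory=list)


def client_agent(history: list[Message]) -> str:
    # Stub: a real client agent would call a model here.
    spoken = sum(1 for speaker, _ in history if speaker == "client")
    return "My order never arrived." if spoken == 0 else "Thanks, that resolves it."


def service_agent(history: list[Message]) -> str:
    # Stub: a real service agent would call a (possibly different) model.
    return "Sorry to hear that. A refund is on its way."


def converse(agents: dict[str, Agent], turns: int) -> Conversation:
    """Alternate the agents for a fixed number of turns over one history."""
    convo = Conversation()
    names = list(agents)
    for turn in range(turns):
        name = names[turn % len(names)]
        reply = agents[name](convo.messages)
        convo.messages.append((name, reply))
    return convo


def judge(convo: Conversation) -> bool:
    # Stub judge: a real judge would be a third model grading the transcript.
    return any(
        speaker == "service" and "refund" in text.lower()
        for speaker, text in convo.messages
    )


convo = converse({"client": client_agent, "service": service_agent}, turns=4)
passed = judge(convo)
```

In this sketch the turn bound is the only stopping condition; agents that call tools inside a turn would additionally need the message, token, or time limits mentioned above.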

Does this PR introduce a breaking change? (What changes might users need to make in their application due to this PR?)

No. This is a docs-only addition for existing agent and scorer APIs.

Other information:

Validation:

  • Targeted:
    • uv run pytest tests/agent/test_agent_execute.py -v
      • 31 passed, 6 skipped locally.
      • Confirms the run() execution path used by the example.
      • Confirms agent invocation as solver/tool/handoff and limit handling, which supports the guidance about bounded agent runs.
    • Reviewed the docs example against the current react, run, AgentState, solver, and scorer APIs after rebasing onto current origin/main.
  • Regression:
    • uv run make check
    • uv run make test
    • Commit-time pre-commit hooks passed: large-file, JSON/YAML, debug-statement, private-key, requirements, and docs typo checks.

CI/CD coverage expected:

  • Standard Build workflow should cover ruff, ruff format, mypy, pre-commit, package build, and pytest on Python 3.10 and 3.11.
  • Docs are excluded from ruff's source lint, but the pre-commit typos hook covers markdown/docs text.
  • Log viewer and sandbox-tools special gates should not require extra artifacts because this PR only changes multi-agent docs.

Closes #2803.

Development

Successfully merging this pull request may close these issues.

Multi-Turn LLM to LLM conversations toy example evaluations