feat(examples): add EvalHub adapter for running arksim evals #183

Open
itsarbit wants to merge 2 commits into main from feat/evalhub-adapter

Summary

Adds a proof-of-concept adapter that runs arksim's multi-turn agent simulation
and evaluation as an EvalHub benchmark provider,
under examples/integrations/evalhub/.

EvalHub points at a target agent endpoint; the adapter drives it with simulated
users across scenarios, scores the transcripts, and reports aggregate metrics to
EvalHub plus the transcripts and HTML report to MLflow.

How it maps

| EvalHub | arksim |
| --- | --- |
| JobSpec.model.url / .name | target agent endpoint (chat_completions or a2a) |
| JobSpec.model.auth.secret_ref -> /var/run/secrets/model/api-key | bearer token on the agent's requests |
| JobSpec.parameters (freeform JSON) | scenarios + simulator/judge config (ArksimJobParameters) |
| scalar metrics | overall_agent_score, goal_completion_score, turn_success_ratio, num_conversations |
| MLflow artifacts | simulation.json, evaluation.json, final_report.html |
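
To make the mapping above concrete, here is a minimal sketch of what a pure transform in each direction could look like. It is not the code in arksim_evalhub/mapping.py: `AgentTarget`, `job_spec_to_target`, and `evaluation_to_metrics` are hypothetical names, and the job spec is modeled as a plain dict rather than the SDK's JobSpec type. Only the field names and metric names come from the table.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional

# Metric names taken from the mapping table; everything else is illustrative.
METRIC_KEYS = (
    "overall_agent_score",
    "goal_completion_score",
    "turn_success_ratio",
    "num_conversations",
)


@dataclass(frozen=True)
class AgentTarget:
    """Hypothetical stand-in for the arksim-side description of the agent."""

    url: str
    name: str
    bearer_token: Optional[str] = None


def job_spec_to_target(job_spec: Dict[str, Any], api_key: Optional[str]) -> AgentTarget:
    """Map the model section of a job spec (as a dict) to an agent target."""
    model = job_spec["model"]
    return AgentTarget(url=model["url"], name=model["name"], bearer_token=api_key)


def evaluation_to_metrics(evaluation: Dict[str, Any]) -> Dict[str, float]:
    """Keep only the known scalar metrics and coerce them to floats."""
    return {key: float(evaluation[key]) for key in METRIC_KEYS if key in evaluation}
```

Keeping both functions free of I/O means they can be unit-tested with plain dicts, which is the property the PR calls out for its mapping module.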

What's included

  • arksim_evalhub/mapping.py - pure EvalHub<->arksim transforms (no I/O), unit-tested.
  • arksim_evalhub/adapter.py - ArksimAdapter(FrameworkAdapter), credential resolution, the sim+eval run, and main().
  • main.py / run_local.py - container and local entrypoints.
  • job.example.json, provider.yaml, Containerfile, requirements.txt, README.md.
  • Tests for the mapping and adapter wiring (no LLM calls; the sim/eval step is stubbed).
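
As an illustration of the stubbed wiring tests mentioned in the last bullet, the sketch below replaces the expensive sim/eval step with a canned result using `unittest.mock`. `SketchAdapter`, `run_sim_and_eval`, and `run_job` are hypothetical names for this example only; the actual ArksimAdapter and its tests may be structured differently.

```python
from unittest import mock


class SketchAdapter:
    """Hypothetical adapter used only for this example (not ArksimAdapter)."""

    def run_sim_and_eval(self, scenarios):
        # In a real adapter this step would drive the agent and call LLMs.
        raise RuntimeError("sim/eval must be stubbed in unit tests")

    def run_job(self, scenarios):
        evaluation = self.run_sim_and_eval(scenarios)
        return {
            "overall_agent_score": evaluation["overall_agent_score"],
            "num_conversations": len(scenarios),
        }


def check_wiring():
    """Run the job with the sim/eval step replaced by a canned result."""
    adapter = SketchAdapter()
    canned = {"overall_agent_score": 0.75}
    with mock.patch.object(adapter, "run_sim_and_eval", return_value=canned) as stub:
        metrics = adapter.run_job(["scenario-1", "scenario-2"])
    # The stub records the call, so the wiring can be asserted without any LLM.
    stub.assert_called_once_with(["scenario-1", "scenario-2"])
    return metrics
```

A test written this way exercises how results flow into the reported metrics while making no network calls.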

Verified

  • Real end-to-end local run (simulation -> evaluation -> metrics -> artifacts) against OpenAI and a local MLflow server; 5 metrics and 3 artifacts confirmed in MLflow.
  • 36 example tests pass; ruff check + format clean; the full tests/unit suite is unaffected.

Out of scope (follow-ups)

  • Real-cluster contract (sidecar callbacks, mounted secret, MLflow via the sidecar proxy, OCI push, provider.yaml schema) - needs a live EvalHub deployment.
  • Richer per-metric / behavior-failure metrics (the data is already available from arksim's evaluation output).
  • TLS ca_cert for custom-CA target endpoints.

Note on credentials

Contrary to what the EvalHub model-authentication docs suggest, the SDK only
auto-attaches the ServiceAccount token to control-plane callbacks. The adapter
therefore reads the model api-key itself (read_model_auth_key("api-key")) and
attaches it to the agent's outbound requests. This is worth confirming with the
EvalHub team.
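
A minimal sketch of the credential flow described here, assuming the key is mounted as a file under /var/run/secrets/model (the path from the mapping table). `read_mounted_key` and `auth_headers` are hypothetical helpers written for illustration; they are not the SDK's read_model_auth_key, whose signature and behavior are not shown in this PR.

```python
from pathlib import Path
from typing import Dict, Optional

# Mount directory taken from the mapping table; treat it as an assumption.
DEFAULT_SECRET_DIR = Path("/var/run/secrets/model")


def read_mounted_key(name: str, secret_dir: Path = DEFAULT_SECRET_DIR) -> Optional[str]:
    """Return the stripped contents of a mounted secret file, or None if unavailable."""
    try:
        value = (secret_dir / name).read_text(encoding="utf-8").strip()
    except OSError:
        # Missing mount, missing file, or unreadable file: treat as "no key".
        return None
    return value or None


def auth_headers(api_key: Optional[str]) -> Dict[str, str]:
    """Build the bearer header for outbound agent requests, if a key is present."""
    return {"Authorization": f"Bearer {api_key}"} if api_key else {}
```

Returning an empty header dict when no key is found lets the same code path serve unauthenticated local runs.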

Runs arksim's multi-turn agent simulation and evaluation as an EvalHub benchmark provider (local-mode proof of concept). EvalHub points at a target agent endpoint; the adapter drives it with simulated users, scores the transcripts, and reports aggregate metrics to EvalHub plus the transcripts and HTML report to MLflow.

Includes the FrameworkAdapter, a pure JobSpec<->arksim mapping module, local runner, sample job spec, provider template, Containerfile, and tests for the mapping and adapter wiring. Verified end-to-end against OpenAI and a local MLflow server.

itsarbit requested a review from a team as a code owner on June 25, 2026, 23:16

Add a dedicated 'Eval platform integrations' section to the docs and a README pointer (EvalHub hosts arksim as a provider, so it isn't a framework connector). Convert the example README to the house Files table and enrich the sample scenarios to match sibling style.
mintlify Bot commented Jun 25, 2026

Preview deployment for your docs. Learn more about Mintlify Previews.

| Project | Status | Preview | Updated (UTC) |
| --- | --- | --- | --- |
| arklex-ca4e8217 | 🟡 Building | | Jun 25, 2026, 11:26 PM |
