feat(examples): add EvalHub adapter for running arksim evals by itsarbit · Pull Request #183 · arklexai/arksim

itsarbit · 2026-06-25T23:16:47Z

Summary

Adds a proof-of-concept adapter that runs arksim's multi-turn agent simulation
and evaluation as an EvalHub benchmark provider,
under examples/integrations/evalhub/.

EvalHub points at a target agent endpoint; the adapter drives it with simulated
users across scenarios, scores the transcripts, reports aggregate metrics to
EvalHub, and uploads the transcripts and HTML report to MLflow.

How it maps

| EvalHub | arksim |
| --- | --- |
| `JobSpec.model.url` / `.name` | target agent endpoint (`chat_completions` or `a2a`) |
| `JobSpec.model.auth.secret_ref` -> `/var/run/secrets/model/api-key` | bearer token on the agent's requests |
| `JobSpec.parameters` (freeform JSON) | scenarios + simulator/judge config (`ArksimJobParameters`) |
| scalar metrics | `overall_agent_score`, `goal_completion_score`, `turn_success_ratio`, `num_conversations` |
| MLflow artifacts | `simulation.json`, `evaluation.json`, `final_report.html` |
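
The mapping above can be sketched as a pair of pure functions. This is only an illustration, not the code in `arksim_evalhub/mapping.py`: the `AgentTarget` class, the function names, and the `protocol` key inside `JobSpec.parameters` are hypothetical, and the job spec is treated as a plain dict rather than the SDK's own type.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional

# Scalar metric names taken from the mapping table above.
SCALAR_METRICS = (
    "overall_agent_score",
    "goal_completion_score",
    "turn_success_ratio",
    "num_conversations",
)


@dataclass
class AgentTarget:
    """Hypothetical arksim-side description of the agent under test."""

    url: str
    name: str
    protocol: str = "chat_completions"  # or "a2a"
    headers: Dict[str, str] = field(default_factory=dict)


def job_spec_to_target(job_spec: Dict[str, Any], api_key: Optional[str] = None) -> AgentTarget:
    """Map JobSpec.model.url / .name (plus an optional key) to an agent target."""
    model = job_spec["model"]
    # Assumption: the protocol is chosen via a "protocol" key in the freeform parameters.
    protocol = job_spec.get("parameters", {}).get("protocol", "chat_completions")
    if protocol not in ("chat_completions", "a2a"):
        raise ValueError(f"unsupported protocol: {protocol!r}")
    headers = {"Authorization": f"Bearer {api_key}"} if api_key else {}
    return AgentTarget(url=model["url"], name=model["name"], protocol=protocol, headers=headers)


def to_scalar_metrics(evaluation: Dict[str, Any]) -> Dict[str, float]:
    """Keep only the scalar metrics reported to EvalHub, coerced to float."""
    return {k: float(evaluation[k]) for k in SCALAR_METRICS if k in evaluation}
```

Keeping both directions free of I/O is what lets them be unit-tested without an agent, an LLM, or an EvalHub deployment.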

What's included

- arksim_evalhub/mapping.py - pure EvalHub<->arksim transforms (no I/O), unit-tested.
- arksim_evalhub/adapter.py - ArksimAdapter(FrameworkAdapter), credential resolution, the sim+eval run, and main().
- main.py / run_local.py - container and local entrypoints.
- job.example.json, provider.yaml, Containerfile, requirements.txt, README.md.
- Tests for the mapping and adapter wiring (no LLM calls; the sim/eval step is stubbed).

Verified

- Real end-to-end local run (simulation -> evaluation -> metrics -> artifacts) against OpenAI and a local MLflow server; 5 metrics and 3 artifacts confirmed in MLflow.
- 36 example tests pass; ruff check + format clean; the full tests/unit suite is unaffected.

Out of scope (follow-ups)

- Real-cluster contract (sidecar callbacks, mounted secret, MLflow via the sidecar proxy, OCI push, provider.yaml schema) - needs a live EvalHub deployment.
- Richer per-metric / behavior-failure metrics (the data is already available from arksim's evaluation output).
- TLS ca_cert for custom-CA target endpoints.

Note on credentials

Despite the EvalHub model-authentication docs, the SDK only auto-attaches the
ServiceAccount token to control-plane callbacks. The adapter reads the model
api-key itself (read_model_auth_key("api-key")) and attaches it to the agent's
outbound requests. Worth confirming with the EvalHub team.
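
For reviewers unfamiliar with the mounted-secret pattern, the behaviour described here could look roughly like the sketch below. It is a stand-in, not the SDK's `read_model_auth_key`: the helper names `read_mounted_key` and `auth_headers` are hypothetical, and the only detail taken from this PR is the mount path `/var/run/secrets/model/api-key`.

```python
from pathlib import Path
from typing import Dict, Optional

# Mount directory taken from the mapping table in this PR.
MODEL_SECRET_DIR = Path("/var/run/secrets/model")


def read_mounted_key(name: str = "api-key", secret_dir: Path = MODEL_SECRET_DIR) -> Optional[str]:
    """Return the secret stored in <secret_dir>/<name>, or None if absent or empty."""
    try:
        value = (secret_dir / name).read_text(encoding="utf-8").strip()
    except OSError:
        # Missing mount, missing file, or unreadable file: treat as "no key".
        return None
    return value or None


def auth_headers(api_key: Optional[str]) -> Dict[str, str]:
    """Build the bearer header attached to the agent's outbound requests."""
    return {"Authorization": f"Bearer {api_key}"} if api_key else {}
```

Returning `None` instead of raising keeps local runs without a mounted secret working; whether the real adapter should fail loudly in-cluster is a separate decision.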

Runs arksim's multi-turn agent simulation and evaluation as an EvalHub benchmark provider (local-mode proof of concept). EvalHub points at a target agent endpoint; the adapter drives it with simulated users, scores the transcripts, and reports aggregate metrics to EvalHub plus the transcripts and HTML report to MLflow. Includes the FrameworkAdapter, a pure JobSpec<->arksim mapping module, local runner, sample job spec, provider template, Containerfile, and tests for the mapping and adapter wiring. Verified end-to-end against OpenAI and a local MLflow server.

Add a dedicated 'Eval platform integrations' section to the docs and a README pointer (EvalHub hosts arksim as a provider, so it isn't a framework connector). Convert the example README to the house Files table and enrich the sample scenarios to match sibling style.

mintlify · 2026-06-25T23:26:59Z

Preview deployment for your docs. Learn more about Mintlify Previews.

| Project | Status | Preview | Updated (UTC) |
| --- | --- | --- | --- |
| arklex-ca4e8217 | 🟡 Building | – | Jun 25, 2026, 11:26 PM |


itsarbit requested a review from a team as a code owner June 25, 2026 23:16

itsarbit wants to merge 2 commits into main from feat/evalhub-adapter