M1: Runner API, canonical artifacts, CLI, and notebook #5

Open
guru-code-expert wants to merge 8 commits into AgentOpt:main from guru-code-expert:m1/deliverable

Conversation

@guru-code-expert

Implements the M1 milestone for Trace-Bench:

CLI surface:

  • trace-bench list-tasks, list-trainers, validate --config --strict, run, ui
  • Strict validation: trainer kwarg checking, optimizer/guide/logger resolution, trainable parameter detection, matrix expansion with manifest output

Runner & training:

  • BenchRunner with deterministic SHA256-based job IDs
  • Algorithm-aware kwarg mapping (PrioritySearch vs GEPA-Base/UCB/Beam)
  • DummyLLM stub mode for offline testing
  • Training error capture in feedback field

Canonical artifact layout:

  • meta/config.snapshot.yaml, manifest.json, env.json (redacted), git.json
  • Per-job: job_meta.json, results.json, events.jsonl, artifacts/, tb/
  • Run-level: results.csv (16 columns) + summary.json

Task coverage:

  • 4 internal types (code_param, numeric_param, multi_param, non_trainable)
  • trace_examples:greeting_stub
  • llm4ad:circle_packing (bounded timeout)
  • veribench:smoke_placeholder (NotImplementedError stub)

Trainer coverage:

  • PrioritySearch + GEPA-Base exercised in real mode
  • GEPA-UCB + GEPA-Beam configured (M4 scope)

Tests: 30 pass, 2 skipped (m0 smoke, m1 artifacts, matrix e2e, internal tasks, opentrace examples, trainer config, veribench CLI)

Notebook: 01_m1_minimal_api.ipynb with Colab badge, auto-detect API key (real/stub mode), 2x2 matrix smoke (4/4 ok), executed outputs committed.

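Since the PR highlights deterministic SHA256-based job IDs, here is a minimal sketch of how such IDs can be derived from a job spec. This is an illustration, not the actual BenchRunner code: the function name `job_id_for`, the field choices, and the 12-character truncation are all assumptions.

```python
import hashlib
import json

def job_id_for(task: str, trainer: str, kwargs: dict) -> str:
    # Hypothetical sketch: serialize the job spec canonically (sorted keys,
    # fixed separators) so the same (task, trainer, kwargs) always hashes
    # to the same ID across runs and machines.
    payload = json.dumps(
        {"task": task, "trainer": trainer, "kwargs": kwargs},
        sort_keys=True,
        separators=(",", ":"),
    )
    # Truncate the SHA256 digest to a short, stable hex ID.
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
```

The key design point is canonical serialization: without `sort_keys=True`, two dicts with the same contents could hash differently and break rerun/resume logic.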
return names


def discover_trainers() -> List[TrainerSpec]:
Member

We only have a finite number of trainers. This is a cool, automated way to grab trainers and stay in sync with the opto-trace package, but it grabbed:

=== List trainers ===
AggregatedUpdate	available
BasicSearchAlgorithm	available
BeamSearch	available
BeamsearchAlgorithm	available
BeamsearchHistoryAlgorithm	available
GEPA-Base	available
GEPA-Beam	available
GEPA-UCB	available
Minibatch	available
MinibatchAlgorithm	available
PrioritySearch	available
PrioritySearch_with_Regressor	available
SearchTemplate	available
SequentialSearch	available
SequentialUpdate	available
StreamingPrioritySearch	available
UCBSearchAlgorithm	available

SearchTemplate is not a full trainer.

@doxav can we maybe add a list of actual trainers to a __init__.py in the OpenTrace package (under experimental branch)?

Collaborator

Maybe we could instead exclude certain trainer classes, since the others are trainers? We could simply start with an argument to discover_trainers(), e.g. excluded = ["SearchTemplate"]. Hard-coding trainers in an __init__.py would force us to maintain a list and would prevent automatically discovering new trainers.

"status",
"score_initial",
"score_final",
"score_best",
Member

We actually need the full list of scores for each iteration. Do we track those somewhere? time_seconds will need to be tracked for each step as well.

Does tb_logdir work? Can the Colab notebook show how I load and display this?

return getattr(model, "data", model)


def _evaluate_bundle(bundle: Dict[str, Any]) -> Dict[str, Any]:
Member

@doxav

Can we rename these? bundle is a reserved keyword in Trace (it denotes an abstraction for custom operations), so this should be called something else.
