M1: Runner API, canonical artifacts, CLI, and notebook #5

Open
guru-code-expert wants to merge 8 commits into AgentOpt:main from guru-code-expert:m1/deliverable

Conversation

@guru-code-expert

Implements the M1 milestone for Trace-Bench:

CLI surface:

  • trace-bench list-tasks, list-trainers, validate --config --strict, run, ui
  • Strict validation: trainer kwarg checking, optimizer/guide/logger resolution, trainable parameter detection, matrix expansion with manifest output

Runner & training:

  • BenchRunner with deterministic SHA256-based job IDs
  • Algorithm-aware kwarg mapping (PrioritySearch vs GEPA-Base/UCB/Beam)
  • DummyLLM stub mode for offline testing
  • Training error capture in feedback field

Canonical artifact layout:

  • meta/config.snapshot.yaml, manifest.json, env.json (redacted), git.json
  • Per-job: job_meta.json, results.json, events.jsonl, artifacts/, tb/
  • Run-level: results.csv (16 columns) + summary.json

Task coverage:

  • 4 internal types (code_param, numeric_param, multi_param, non_trainable)
  • trace_examples:greeting_stub
  • llm4ad:circle_packing (bounded timeout)
  • veribench:smoke_placeholder (NotImplementedError stub)

Trainer coverage:

  • PrioritySearch + GEPA-Base exercised in real mode
  • GEPA-UCB + GEPA-Beam configured (M4 scope)

Tests: 30 pass, 2 skipped (m0 smoke, m1 artifacts, matrix e2e, internal tasks, opentrace examples, trainer config, veribench CLI)

Notebook: 01_m1_minimal_api.ipynb with Colab badge, auto-detect API key (real/stub mode), 2x2 matrix smoke (4/4 ok), executed outputs committed.

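Since the PR highlights deterministic SHA256-based job IDs, here is a minimal sketch of how such IDs can be derived from a job spec. This is an illustration, not the actual BenchRunner code: the function name `job_id_for`, the field choices, and the 12-character truncation are all assumptions.

```python
import hashlib
import json

def job_id_for(task: str, trainer: str, kwargs: dict) -> str:
    # Hypothetical sketch: serialize the job spec canonically (sorted keys,
    # fixed separators) so the same (task, trainer, kwargs) always hashes
    # to the same ID across runs and machines.
    payload = json.dumps(
        {"task": task, "trainer": trainer, "kwargs": kwargs},
        sort_keys=True,
        separators=(",", ":"),
    )
    # Truncate the SHA256 digest to a short, stable hex ID.
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
```

The key design point is canonical serialization: without `sort_keys=True`, two dicts with the same contents could hash differently and break rerun/resume logic.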
return names


def discover_trainers() -> List[TrainerSpec]:
Member

We only have a finite number of trainers. This is a cool, automated way to grab trainers and stay in sync with the opto-trace package, but it grabbed:

=== List trainers ===
AggregatedUpdate	available
BasicSearchAlgorithm	available
BeamSearch	available
BeamsearchAlgorithm	available
BeamsearchHistoryAlgorithm	available
GEPA-Base	available
GEPA-Beam	available
GEPA-UCB	available
Minibatch	available
MinibatchAlgorithm	available
PrioritySearch	available
PrioritySearch_with_Regressor	available
SearchTemplate	available
SequentialSearch	available
SequentialUpdate	available
StreamingPrioritySearch	available
UCBSearchAlgorithm	available

SearchTemplate is not a full trainer.

@doxav can we maybe add a list of actual trainers to a __init__.py in the OpenTrace package (under experimental branch)?

Collaborator

Maybe we could instead exclude certain trainer classes, since the others are trainers? We could simply start with an argument to discover_trainers(), e.g. excluded = ["SearchTemplate"]. Hard-coding trainers in an __init__.py would force us to maintain a list and would prevent automatically discovering new trainers.

"status",
"score_initial",
"score_final",
"score_best",
Member

We actually need the full list of scores for each iteration. Do we track those somewhere? time_seconds will need to be tracked for each step as well.

Does tb_logdir work? Can the Colab notebook show how I load and display this?

return getattr(model, "data", model)


def _evaluate_bundle(bundle: Dict[str, Any]) -> Dict[str, Any]:
Member

@doxav

Can we rename these? bundle is a reserved keyword in Trace (it denotes an abstraction for custom operations), so this should be called something else.
