
M3: finalize gradio ui, traceability artifacts, and logger override support #7

Open
guru-code-expert wants to merge 46 commits into AgentOpt:main from guru-code-expert:m3/deliverable
Conversation

@guru-code-expert

Summary

M3 delivery: Gradio launch/monitor UI, traceability artifacts, and logger override compatibility.

What’s included

  • Gradio UI now supports dynamic list/run/resume/display/edit flow (non-hardcoded).
  • Provider selector added: custom | openai | openrouter with base URL/key auto-fill behavior.
  • Launch UX improvements:
    • auto-load tasks/trainers/configs
    • bench-change task refresh
    • explicit config source (picker | upload | editor)
    • default mode override = real
  • Browse/Inspector improvements:
    • recursive run discovery from parent runs_dir
    • clearer run/job selection from lists
    • TensorBoard action returns URL or explicit command/instructions
  • Traceability outputs:
    • prints run paths after trace-bench run
    • generates leaderboard.csv (excluding failed jobs)
    • generates meta/files_index.json
  • Trainable state artifacts per job:
    • initial_state, best_state, final_state (yaml + json)
    • state_history.jsonl
    • visible in Job Inspector
  • LLM metadata/token scope added to artifacts/results:
    • provider/model/base_url
    • token_scope=trace_optimization_only
    • token fields for trace optimization usage
  • Logger override support:
    • CLI: trace-bench run --logger ...
    • UI dropdown: default | none | <logger>
    • robust fallback for unknown/unsupported logger paths
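The logger override accepts `default`, `none`, or a logger path, with a robust fallback for paths that don't resolve. A minimal sketch of how such resolution could work (the `resolve_logger` helper and its return conventions are hypothetical, not the actual trace-bench implementation):

```python
import importlib


def resolve_logger(spec):
    """Resolve a CLI/UI logger spec into a logger (hypothetical sketch).

    "default" (or no value) -> sentinel meaning "use the trainer's built-in
    logger"; "none" -> logging explicitly disabled; anything else is treated
    as a dotted import path. Unknown or unimportable paths fall back to the
    default instead of failing the run.
    """
    if spec in (None, "default"):
        return "default"
    if spec == "none":
        return None
    try:
        module_path, _, attr = spec.rpartition(".")
        return getattr(importlib.import_module(module_path), attr)
    except (ImportError, AttributeError, ValueError):
        # Robust fallback: unsupported logger paths degrade to the default.
        return "default"
```

The key design point is that a bad `--logger` value degrades gracefully rather than aborting an otherwise valid run.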

Compatibility

Validation

  • PYTEST_DISABLE_PLUGIN_AUTOLOAD=1 pytest -q tests/m3 passing (42 passed, 1 skipped)
  • Updated notebook with reproducible setup and executed M3 flow:
    • notebooks/03_ui_launch_monitor.ipynb

Implements the M1 milestone for Trace-Bench:

CLI surface:
- trace-bench list-tasks, list-trainers, validate --config --strict, run, ui
- Strict validation: trainer kwarg checking, optimizer/guide/logger resolution,
  trainable parameter detection, matrix expansion with manifest output
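The matrix expansion behind `validate --strict` turns a parameter grid into the concrete job list recorded in the manifest. A minimal sketch of the idea, assuming a simple `{param: [values]}` matrix shape (the helper name and config shape are illustrative, not the project's actual API):

```python
import itertools


def expand_matrix(matrix):
    """Expand a {param: [values]} grid into concrete job configs.

    Keys are sorted so expansion order is deterministic regardless of
    dict insertion order, which matters for reproducible manifests.
    """
    keys = sorted(matrix)
    return [
        dict(zip(keys, combo))
        for combo in itertools.product(*(matrix[k] for k in keys))
    ]
```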

Runner & training:
- BenchRunner with deterministic SHA256-based job IDs
- Algorithm-aware kwarg mapping (PrioritySearch vs GEPA-Base/UCB/Beam)
- DummyLLM stub mode for offline testing
- Training error capture in feedback field
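Deterministic SHA256-based job IDs mean the same config always maps to the same ID, so reruns and resumes land in the same job directory. A sketch of the usual pattern (canonical-JSON hashing; the `job_id` helper and 12-character truncation are assumptions, not the runner's exact scheme):

```python
import hashlib
import json


def job_id(config):
    """Derive a deterministic job ID from a job config.

    Serializing with sorted keys and fixed separators gives a canonical
    byte string, so dict ordering never changes the resulting hash.
    """
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]
```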

Canonical artifact layout:
- meta/config.snapshot.yaml, manifest.json, env.json (redacted), git.json
- Per-job: job_meta.json, results.json, events.jsonl, artifacts/, tb/
- Run-level: results.csv (16 columns) + summary.json
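The per-job layout above can be sketched as a small writer, assuming JSON-serializable meta/results and an append-only event stream (`write_job_artifacts` is a hypothetical helper illustrating the directory shape, not the runner's code):

```python
import json
from pathlib import Path


def write_job_artifacts(job_dir, meta, results, events):
    """Lay out one job directory in the canonical per-job shape:
    job_meta.json, results.json, events.jsonl, artifacts/, tb/."""
    job_dir = Path(job_dir)
    (job_dir / "artifacts").mkdir(parents=True, exist_ok=True)
    (job_dir / "tb").mkdir(exist_ok=True)
    (job_dir / "job_meta.json").write_text(json.dumps(meta, indent=2))
    (job_dir / "results.json").write_text(json.dumps(results, indent=2))
    # events.jsonl is append-only: one JSON object per line.
    with open(job_dir / "events.jsonl", "a") as fh:
        for event in events:
            fh.write(json.dumps(event) + "\n")
```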

Task coverage:
- 4 internal types (code_param, numeric_param, multi_param, non_trainable)
- trace_examples:greeting_stub
- llm4ad:circle_packing (bounded timeout)
- veribench:smoke_placeholder (NotImplementedError stub)

Trainer coverage:
- PrioritySearch + GEPA-Base exercised in real mode
- GEPA-UCB + GEPA-Beam configured (M4 scope)

Tests: 30 pass, 2 skipped (m0 smoke, m1 artifacts, matrix e2e, internal tasks,
opentrace examples, trainer config, veribench CLI)

Notebook: 01_m1_minimal_api.ipynb with Colab badge, auto-detect API key
(real/stub mode), 2x2 matrix smoke (4/4 ok), executed outputs committed.
# Conflicts:
#	LLM4AD/benchmark_tasks/science_discovery_bactgrow/__init__.py
#	LLM4AD/benchmark_tasks/science_discovery_ode_1d/__init__.py
#	LLM4AD/benchmark_tasks/science_discovery_oscillator1/__init__.py
#	LLM4AD/benchmark_tasks/science_discovery_oscillator2/__init__.py
#	LLM4AD/benchmark_tasks/science_discovery_stresstrain/__init__.py
#	configs/m2_coverage.yaml
#	notebooks/02_m2_coverage.ipynb
#	notebooks/04_m2_full_coverage.ipynb
#	trace_bench/runner.py
guru-code-expert and others added 16 commits on February 27, 2026 at 16:32
- Runner passthrough for objective_config
- 3 benchmark tasks: convex (SixHumpCamel), BBEH (boolean_expressions), GSM8K
- Each task supports weighted/pareto mode switching via eval_kwargs
- 3 notebooks comparing BasicSearch/Beamsearch/PrioritySearch x weighted/pareto
- BBEH + GSM8K notebooks compare two models with auto-detection (OpenRouter or direct API keys)
- UsageTrackingLLM with __deepcopy__ support for ContextVar
- 23 pytest tests (15 unit, 4 feature, 2 non-regression)
- Config: m3_multiobjective.yaml (18-job matrix, max_workers=6)
- End-to-end validated: convex 6/6, BBEH 10/12, GSM8K 6/12
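Weighted/pareto mode switching boils down to two ways of comparing objective tuples: collapse them to one scalar, or compare by Pareto dominance. A minimal sketch, assuming all objectives are maximized (the `dominates`/`score` helpers are illustrative, not the tasks' actual eval code):

```python
def dominates(a, b):
    """True if candidate a Pareto-dominates b: at least as good on every
    objective and strictly better on at least one (maximization)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def score(objectives, mode="weighted", weights=None):
    """Collapse an objective tuple under the selected eval mode.

    "weighted" returns a single scalar; in "pareto" mode the tuple is kept
    as-is and callers compare candidates with dominates() instead.
    """
    if mode == "weighted":
        weights = weights or [1.0] * len(objectives)
        return sum(w * o for w, o in zip(weights, objectives))
    return tuple(objectives)
```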
m3: notebook with output