
M3: finalize gradio ui, traceability artifacts, and logger override support #7

Open
guru-code-expert wants to merge 46 commits into AgentOpt:main from guru-code-expert:m3/deliverable
Conversation

@guru-code-expert

Summary

M3 delivery: Gradio launch/monitor UI, traceability artifacts, and logger override compatibility.

What’s included

  • Gradio UI now supports dynamic list/run/resume/display/edit flow (non-hardcoded).
  • Provider selector added: custom | openai | openrouter with base URL/key auto-fill behavior.
  • Launch UX improvements:
    • auto-load tasks/trainers/configs
    • bench-change task refresh
    • explicit config source (picker | upload | editor)
    • default mode override = real
  • Browse/Inspector improvements:
    • recursive run discovery from parent runs_dir
    • clearer run/job selection from lists
    • TensorBoard action returns URL or explicit command/instructions
  • Traceability outputs:
    • prints run paths after trace-bench run
    • generates leaderboard.csv (excluding failed jobs)
    • generates meta/files_index.json
  • Trainable state artifacts per job:
    • initial_state, best_state, final_state (yaml + json)
    • state_history.jsonl
    • visible in Job Inspector
  • LLM metadata/token scope added to artifacts/results:
    • provider/model/base_url
    • token_scope=trace_optimization_only
    • token fields for trace optimization usage
  • Logger override support:
    • CLI: trace-bench run --logger ...
    • UI dropdown: default | none | <logger>
    • robust fallback for unknown/unsupported logger paths
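The logger override accepts `default`, `none`, or a logger path, with a robust fallback for paths that don't resolve. A minimal sketch of how such resolution could work (the `resolve_logger` helper and its return conventions are hypothetical, not the actual trace-bench implementation):

```python
import importlib


def resolve_logger(spec):
    """Resolve a CLI/UI logger spec into a logger (hypothetical sketch).

    "default" (or no value) -> sentinel meaning "use the trainer's built-in
    logger"; "none" -> logging explicitly disabled; anything else is treated
    as a dotted import path. Unknown or unimportable paths fall back to the
    default instead of failing the run.
    """
    if spec in (None, "default"):
        return "default"
    if spec == "none":
        return None
    try:
        module_path, _, attr = spec.rpartition(".")
        return getattr(importlib.import_module(module_path), attr)
    except (ImportError, AttributeError, ValueError):
        # Robust fallback: unsupported logger paths degrade to the default.
        return "default"
```

The key design point is that a bad `--logger` value degrades gracefully rather than aborting an otherwise valid run.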

Compatibility

Validation

  • PYTEST_DISABLE_PLUGIN_AUTOLOAD=1 pytest -q tests/m3 passing (42 passed, 1 skipped)
  • Updated notebook with reproducible setup and executed M3 flow:
    • notebooks/03_ui_launch_monitor.ipynb

Implements the M1 milestone for Trace-Bench:

CLI surface:
- trace-bench list-tasks, list-trainers, validate --config --strict, run, ui
- Strict validation: trainer kwarg checking, optimizer/guide/logger resolution,
  trainable parameter detection, matrix expansion with manifest output
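The matrix expansion behind `validate --strict` turns a parameter grid into the concrete job list recorded in the manifest. A minimal sketch of the idea, assuming a simple `{param: [values]}` matrix shape (the helper name and config shape are illustrative, not the project's actual API):

```python
import itertools


def expand_matrix(matrix):
    """Expand a {param: [values]} grid into concrete job configs.

    Keys are sorted so expansion order is deterministic regardless of
    dict insertion order, which matters for reproducible manifests.
    """
    keys = sorted(matrix)
    return [
        dict(zip(keys, combo))
        for combo in itertools.product(*(matrix[k] for k in keys))
    ]
```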

Runner & training:
- BenchRunner with deterministic SHA256-based job IDs
- Algorithm-aware kwarg mapping (PrioritySearch vs GEPA-Base/UCB/Beam)
- DummyLLM stub mode for offline testing
- Training error capture in feedback field
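Deterministic SHA256-based job IDs mean the same config always maps to the same ID, so reruns and resumes land in the same job directory. A sketch of the usual pattern (canonical-JSON hashing; the `job_id` helper and 12-character truncation are assumptions, not the runner's exact scheme):

```python
import hashlib
import json


def job_id(config):
    """Derive a deterministic job ID from a job config.

    Serializing with sorted keys and fixed separators gives a canonical
    byte string, so dict ordering never changes the resulting hash.
    """
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]
```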

Canonical artifact layout:
- meta/config.snapshot.yaml, manifest.json, env.json (redacted), git.json
- Per-job: job_meta.json, results.json, events.jsonl, artifacts/, tb/
- Run-level: results.csv (16 columns) + summary.json
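The per-job layout above can be sketched as a small writer, assuming JSON-serializable meta/results and an append-only event stream (`write_job_artifacts` is a hypothetical helper illustrating the directory shape, not the runner's code):

```python
import json
from pathlib import Path


def write_job_artifacts(job_dir, meta, results, events):
    """Lay out one job directory in the canonical per-job shape:
    job_meta.json, results.json, events.jsonl, artifacts/, tb/."""
    job_dir = Path(job_dir)
    (job_dir / "artifacts").mkdir(parents=True, exist_ok=True)
    (job_dir / "tb").mkdir(exist_ok=True)
    (job_dir / "job_meta.json").write_text(json.dumps(meta, indent=2))
    (job_dir / "results.json").write_text(json.dumps(results, indent=2))
    # events.jsonl is append-only: one JSON object per line.
    with open(job_dir / "events.jsonl", "a") as fh:
        for event in events:
            fh.write(json.dumps(event) + "\n")
```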

Task coverage:
- 4 internal types (code_param, numeric_param, multi_param, non_trainable)
- trace_examples:greeting_stub
- llm4ad:circle_packing (bounded timeout)
- veribench:smoke_placeholder (NotImplementedError stub)

Trainer coverage:
- PrioritySearch + GEPA-Base exercised in real mode
- GEPA-UCB + GEPA-Beam configured (M4 scope)

Tests: 30 pass, 2 skipped (m0 smoke, m1 artifacts, matrix e2e, internal tasks,
opentrace examples, trainer config, veribench CLI)

Notebook: 01_m1_minimal_api.ipynb with Colab badge, auto-detect API key
(real/stub mode), 2x2 matrix smoke (4/4 ok), executed outputs committed.
# Conflicts:
#	LLM4AD/benchmark_tasks/science_discovery_bactgrow/__init__.py
#	LLM4AD/benchmark_tasks/science_discovery_ode_1d/__init__.py
#	LLM4AD/benchmark_tasks/science_discovery_oscillator1/__init__.py
#	LLM4AD/benchmark_tasks/science_discovery_oscillator2/__init__.py
#	LLM4AD/benchmark_tasks/science_discovery_stresstrain/__init__.py
#	configs/m2_coverage.yaml
#	notebooks/02_m2_coverage.ipynb
#	notebooks/04_m2_full_coverage.ipynb
#	trace_bench/runner.py
guru-code-expert and others added 16 commits on February 27, 2026 at 16:32
- Runner passthrough for objective_config
- 3 benchmark tasks: convex (SixHumpCamel), BBEH (boolean_expressions), GSM8K
- Each task supports weighted/pareto mode switching via eval_kwargs
- 3 notebooks comparing BasicSearch/Beamsearch/PrioritySearch x weighted/pareto
- BBEH + GSM8K notebooks compare two models with auto-detection (OpenRouter or direct API keys)
- UsageTrackingLLM with __deepcopy__ support for ContextVar
- 23 pytest tests (15 unit, 4 feature, 2 non-regression)
- Config: m3_multiobjective.yaml (18-job matrix, max_workers=6)
- End-to-end validated: convex 6/6, BBEH 10/12, GSM8K 6/12
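Weighted/pareto mode switching boils down to two ways of comparing objective tuples: collapse them to one scalar, or compare by Pareto dominance. A minimal sketch, assuming all objectives are maximized (the `dominates`/`score` helpers are illustrative, not the tasks' actual eval code):

```python
def dominates(a, b):
    """True if candidate a Pareto-dominates b: at least as good on every
    objective and strictly better on at least one (maximization)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def score(objectives, mode="weighted", weights=None):
    """Collapse an objective tuple under the selected eval mode.

    "weighted" returns a single scalar; in "pareto" mode the tuple is kept
    as-is and callers compare candidates with dominates() instead.
    """
    if mode == "weighted":
        weights = weights or [1.0] * len(objectives)
        return sum(w * o for w, o in zip(weights, objectives))
    return tuple(objectives)
```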
m3: notebook with output