Refactor: CLI, results storage, model updates, and code quality fixes #1

Open · g-despot wants to merge 1 commit into main from refactoring
g-despot commented Mar 25, 2026

Summary

Major refactoring of the benchmark framework covering model updates, code quality fixes, results infrastructure, CI automation, and documentation.

Model updates

  • Added latest models: Claude Opus 4.6, Claude Sonnet 4.6, Claude Haiku 4.5, OpenAI o3, o4-mini
  • Updated default model references (judge, AnthropicModel default) from Claude 3.7 Sonnet to Claude Sonnet 4.6
  • Updated weaviate-client from 4.16.7 to 4.20.4 (pyproject.toml + Dockerfile)
  • Updated Dockerfile base image from Python 3.9 to 3.10 to match requires-python in pyproject.toml

Bug fixes

  • Fixed a crash in the missing-env-var warning: the code called .split() on None values instead of iterating over the variable names
  • Fixed GeminiModel: generation_config was built but never passed to the API call; now uses client.models.generate_content() with config=
  • Fixed batch_import canonical implementation: it referenced an undefined collection variable instead of products in two places; added the missing os and Auth imports
  • Fixed task ID grouping in summary output: multi-word task IDs like zero_shot_basic_semantic_search were split incorrectly; now uses prefix-based parsing
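The task-ID grouping fix can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the registry name, helper name, and key format (`<task_id>_<model_name>`) are assumptions. The idea is that a naive `key.split("_")` cannot tell where a multi-word task ID ends, so the parser instead matches against known task-ID prefixes, longest first:

```python
# Hypothetical sketch of prefix-based task-ID parsing; names and key
# format are assumptions, not the repository's actual implementation.

KNOWN_TASK_IDS = [  # assumed registry of task IDs
    "zero_shot_basic_semantic_search",
    "zero_shot_connect",
]

def split_result_key(key: str) -> tuple[str, str]:
    """Split '<task_id>_<model>' by matching the longest known task-ID prefix."""
    for task_id in sorted(KNOWN_TASK_IDS, key=len, reverse=True):
        if key.startswith(task_id + "_"):
            return task_id, key[len(task_id) + 1:]
    raise ValueError(f"no known task ID is a prefix of key: {key!r}")
```

With this approach, a key like `zero_shot_basic_semantic_search_claude_sonnet` resolves to the full task ID rather than being cut off at the first underscore.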

New features

  • CLI (weaviate_vibe_eval/cli.py): argparse-based entry point with subcommands: run, list, leaderboard, compare, trends, runs
  • Results storage (weaviate_vibe_eval/utils/results_store.py): Store results in a remote Weaviate BenchmarkRun collection with --store-in-weaviate flag
  • Leaderboard generation: Ranked model table from stored results
  • Run comparison: Diff two runs showing regressions/improvements per model+task
  • Trend tracking: weaviate-vibe-eval trends --model <id> shows pass rate over time with UP/DOWN indicators
  • Repetitions: --repetitions N runs each model+task N times with aggregated pass rate and duration stats
  • LLM judge failure analysis: When --use-judge is enabled, failed tasks get diagnosed with root cause, failure analysis, and suggested fix
  • Retry with backoff: All LLM API calls now retry 3 times with exponential backoff (2s/4s/8s)
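The retry-with-backoff behavior described above (three retries at 2s/4s/8s) can be sketched like this. The helper name and signature are hypothetical; the injectable `sleep` parameter is just a convenience for testing, not necessarily part of the PR:

```python
import time

def with_retry(call, retries=3, base_delay=2.0, sleep=time.sleep):
    """Hypothetical sketch: call a zero-arg callable, retrying up to
    `retries` times with exponential backoff (2s, 4s, 8s by default)."""
    for attempt in range(retries + 1):
        try:
            return call()
        except Exception:
            if attempt == retries:
                raise  # out of retries; surface the last error
            sleep(base_delay * (2 ** attempt))  # 2s, then 4s, then 8s
```

An API call that fails three times and succeeds on the fourth attempt would sleep 2s, 4s, and 8s between attempts and then return normally.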

CI/CD

  • Added .github/workflows/monthly-benchmark.yml: Monthly scheduled run (1st of each month) + manual workflow_dispatch with configurable models/tasks/repetitions
  • Results committed back to repo and stored in Weaviate

Code cleanup

  • Removed unused TaskType class, get_model_info() method, _is_api_based attribute, parallel_models parameter, ThreadPoolExecutor imports
  • Removed unused model_params, inputs, packages parameters from generate_and_execute()
  • Removed stale requirements.txt (pyproject.toml + uv.lock is the single source of truth)
  • Added missing weaviate_vibe_eval/utils/__init__.py
  • Added logging module usage throughout (replaces bare print() for internal messages)

Documentation

  • README.md: Complete rewrite with model/task tables, CLI usage, results storage docs, CI section
  • CLAUDE.md: New technical implementation guide covering architecture, file map, result formats, env vars, and docs dashboard integration
  • JSON output now includes metadata envelope (run_id, git_commit, timestamp, model list, repetitions)
  • Markdown reports include summary table with pass rate and duration columns
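The metadata envelope mentioned above might be assembled along these lines. This is a sketch under stated assumptions: the function name and exact field names are illustrative, not taken from the PR's code:

```python
# Hypothetical sketch of wrapping results in a metadata envelope;
# function and field names are assumptions, not the PR's actual code.
import subprocess
import uuid
from datetime import datetime, timezone

def build_envelope(results: dict, models: list[str], repetitions: int) -> dict:
    """Wrap raw benchmark results with run metadata for the JSON output."""
    try:
        commit = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True
        ).strip()
    except (OSError, subprocess.CalledProcessError):
        commit = "unknown"  # e.g. running outside a git checkout
    return {
        "run_id": uuid.uuid4().hex,
        "git_commit": commit,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "models": models,
        "repetitions": repetitions,
        "results": results,
    }
```

A consumer (such as the leaderboard or trend commands) can then identify and order runs by `run_id`, `git_commit`, and `timestamp` without parsing filenames.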
