Refactor: CLI, results storage, model updates, and code quality fixes #1

Open · g-despot wants to merge 1 commit into main from refactoring
g-despot commented Mar 25, 2026

Summary

Major refactoring of the benchmark framework covering model updates, code quality fixes, results infrastructure, CI automation, and documentation.

Model updates

  • Added latest models: Claude Opus 4.6, Claude Sonnet 4.6, Claude Haiku 4.5, OpenAI o3, o4-mini
  • Updated default model references (judge, AnthropicModel default) from Claude 3.7 Sonnet to Claude Sonnet 4.6
  • Updated weaviate-client from 4.16.7 to 4.20.4 (pyproject.toml + Dockerfile)
  • Updated Dockerfile base image from Python 3.9 to 3.10 to match requires-python in pyproject.toml

Bug fixes

  • Fixed a crash in the missing-env-var warning: the code called .split() on None values instead of iterating over the variable names
  • Fixed GeminiModel: generation_config was built but never passed to the API call; now uses client.models.generate_content() with config=
  • Fixed batch_import canonical implementation: it referenced an undefined collection variable instead of products in two places; added the missing os and Auth imports
  • Fixed task ID grouping in summary output: multi-word task IDs like zero_shot_basic_semantic_search were split incorrectly; now uses prefix-based parsing
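The task-ID grouping fix can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the registry name, helper name, and key format (`<task_id>_<model_name>`) are assumptions. The idea is that a naive `key.split("_")` cannot tell where a multi-word task ID ends, so the parser instead matches against known task-ID prefixes, longest first:

```python
# Hypothetical sketch of prefix-based task-ID parsing; names and key
# format are assumptions, not the repository's actual implementation.

KNOWN_TASK_IDS = [  # assumed registry of task IDs
    "zero_shot_basic_semantic_search",
    "zero_shot_connect",
]

def split_result_key(key: str) -> tuple[str, str]:
    """Split '<task_id>_<model>' by matching the longest known task-ID prefix."""
    for task_id in sorted(KNOWN_TASK_IDS, key=len, reverse=True):
        if key.startswith(task_id + "_"):
            return task_id, key[len(task_id) + 1:]
    raise ValueError(f"no known task ID is a prefix of key: {key!r}")
```

With this approach, a key like `zero_shot_basic_semantic_search_claude_sonnet` resolves to the full task ID rather than being cut off at the first underscore.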

New features

  • CLI (weaviate_vibe_eval/cli.py): argparse-based entry point with subcommands: run, list, leaderboard, compare, trends, runs
  • Results storage (weaviate_vibe_eval/utils/results_store.py): Store results in a remote Weaviate BenchmarkRun collection with --store-in-weaviate flag
  • Leaderboard generation: Ranked model table from stored results
  • Run comparison: Diff two runs showing regressions/improvements per model+task
  • Trend tracking: weaviate-vibe-eval trends --model <id> shows pass rate over time with UP/DOWN indicators
  • Repetitions: --repetitions N runs each model+task N times with aggregated pass rate and duration stats
  • LLM judge failure analysis: When --use-judge is enabled, failed tasks get diagnosed with root cause, failure analysis, and suggested fix
  • Retry with backoff: All LLM API calls now retry 3 times with exponential backoff (2s/4s/8s)
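The retry-with-backoff behavior described above (three retries at 2s/4s/8s) can be sketched like this. The helper name and signature are hypothetical; the injectable `sleep` parameter is just a convenience for testing, not necessarily part of the PR:

```python
import time

def with_retry(call, retries=3, base_delay=2.0, sleep=time.sleep):
    """Hypothetical sketch: call a zero-arg callable, retrying up to
    `retries` times with exponential backoff (2s, 4s, 8s by default)."""
    for attempt in range(retries + 1):
        try:
            return call()
        except Exception:
            if attempt == retries:
                raise  # out of retries; surface the last error
            sleep(base_delay * (2 ** attempt))  # 2s, then 4s, then 8s
```

An API call that fails three times and succeeds on the fourth attempt would sleep 2s, 4s, and 8s between attempts and then return normally.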

CI/CD

  • Added .github/workflows/monthly-benchmark.yml: Monthly scheduled run (1st of each month) + manual workflow_dispatch with configurable models/tasks/repetitions
  • Results committed back to repo and stored in Weaviate

Code cleanup

  • Removed unused TaskType class, get_model_info() method, _is_api_based attribute, parallel_models parameter, ThreadPoolExecutor imports
  • Removed unused model_params, inputs, packages parameters from generate_and_execute()
  • Removed stale requirements.txt (pyproject.toml + uv.lock is the single source of truth)
  • Added missing weaviate_vibe_eval/utils/__init__.py
  • Added logging module usage throughout (replaces bare print() for internal messages)

Documentation

  • README.md: Complete rewrite with model/task tables, CLI usage, results storage docs, CI section
  • CLAUDE.md: New technical implementation guide covering architecture, file map, result formats, env vars, and docs dashboard integration
  • JSON output now includes metadata envelope (run_id, git_commit, timestamp, model list, repetitions)
  • Markdown reports include summary table with pass rate and duration columns
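The metadata envelope mentioned above might be assembled along these lines. This is a sketch under stated assumptions: the function name and exact field names are illustrative, not taken from the PR's code:

```python
# Hypothetical sketch of wrapping results in a metadata envelope;
# function and field names are assumptions, not the PR's actual code.
import subprocess
import uuid
from datetime import datetime, timezone

def build_envelope(results: dict, models: list[str], repetitions: int) -> dict:
    """Wrap raw benchmark results with run metadata for the JSON output."""
    try:
        commit = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True
        ).strip()
    except (OSError, subprocess.CalledProcessError):
        commit = "unknown"  # e.g. running outside a git checkout
    return {
        "run_id": uuid.uuid4().hex,
        "git_commit": commit,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "models": models,
        "repetitions": repetitions,
        "results": results,
    }
```

A consumer (such as the leaderboard or trend commands) can then identify and order runs by `run_id`, `git_commit`, and `timestamp` without parsing filenames.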
