Refactor: CLI, results storage, model updates, and code quality fixes#1
Summary
Major refactoring of the benchmark framework covering model updates, code quality fixes, results infrastructure, CI automation, and documentation.
Model updates
Bug fixes
- Fixed code that called `.split()` on `None` values instead of iterating over variable names
- `generation_config` was built but never passed to the API call; now uses `client.models.generate_content()` with `config=`
- `batch_import` canonical implementation: referenced undefined `collection` variable instead of `products` (2 places); added missing `import os` and `Auth` import
- Names like `zero_shot_basic_semantic_search` were split incorrectly; now uses prefix-based parsing

New features

- CLI (`weaviate_vibe_eval/cli.py`): argparse-based entry point with subcommands `run`, `list`, `leaderboard`, `compare`, `trends`, `runs`
- Results store (`weaviate_vibe_eval/utils/results_store.py`): store results in a remote Weaviate `BenchmarkRun` collection with the `--store-in-weaviate` flag
- `weaviate-vibe-eval trends --model <id>` shows pass rate over time with UP/DOWN indicators
- `--repetitions N` runs each model+task N times with aggregated pass rate and duration stats
- When `--use-judge` is enabled, failed tasks get diagnosed with root cause, failure analysis, and a suggested fix

CI/CD

- `.github/workflows/monthly-benchmark.yml`: monthly scheduled run (1st of each month) plus manual `workflow_dispatch` with configurable models/tasks/repetitions

Code cleanup

- Removed unused `TaskType` class, `get_model_info()` method, `_is_api_based` attribute, `parallel_models` parameter, and `ThreadPoolExecutor` imports
- Removed `model_params`, `inputs`, `packages` parameters from `generate_and_execute()`
- Deleted `requirements.txt` (pyproject.toml + uv.lock is the single source of truth)
- Cleaned up `weaviate_vibe_eval/utils/__init__.py`
- `logging` module usage throughout (replaces bare `print()` for internal messages)

Documentation
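A monthly-scheduled workflow with a manual trigger typically combines `on.schedule` with `workflow_dispatch`. The sketch below shows the shape; the cron time, input names, and steps are assumptions based on this description (configurable models/tasks/repetitions), not the actual contents of `monthly-benchmark.yml`:

```
name: Monthly benchmark
on:
  schedule:
    - cron: "0 6 1 * *"   # 06:00 UTC on the 1st of each month (assumed time)
  workflow_dispatch:
    inputs:
      models:
        description: "Comma-separated model IDs"
        required: false
      tasks:
        description: "Comma-separated task names"
        required: false
      repetitions:
        description: "Runs per model+task"
        default: "1"
jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Remaining steps (install dependencies, invoke the CLI) omitted
```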
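The argparse-based CLI could be wired up roughly as below. The subcommand and flag names come from this PR description, but the argument plumbing is an illustrative sketch, not the actual `weaviate_vibe_eval/cli.py` implementation:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Top-level parser for the weaviate-vibe-eval entry point
    parser = argparse.ArgumentParser(prog="weaviate-vibe-eval")
    sub = parser.add_subparsers(dest="command", required=True)

    # `run` executes the benchmark; the flags mirror features named in this PR
    run = sub.add_parser("run", help="Run the benchmark")
    run.add_argument("--model", action="append", help="Model ID (repeatable)")
    run.add_argument("--repetitions", type=int, default=1,
                     help="Run each model+task N times")
    run.add_argument("--store-in-weaviate", action="store_true",
                     help="Store results in a remote BenchmarkRun collection")
    run.add_argument("--use-judge", action="store_true",
                     help="Diagnose failed tasks with a judge model")

    # `trends` shows pass rate over time for one model
    trends = sub.add_parser("trends", help="Pass rate over time")
    trends.add_argument("--model", required=True)

    # Remaining subcommands take no extra arguments in this sketch
    for name in ("list", "leaderboard", "compare", "runs"):
        sub.add_parser(name)
    return parser
```

A caller would then dispatch on `args.command`, e.g. `args = build_parser().parse_args(["run", "--repetitions", "3"])`.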
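The `--repetitions N` aggregation could be sketched as follows. The function name and result fields here are hypothetical; the sketch only illustrates pass rate plus duration statistics over N runs of one model+task pair:

```python
from statistics import mean, stdev


def aggregate_runs(results):
    """Aggregate repeated runs of one model+task pair.

    `results` is a list of (passed: bool, duration_s: float) tuples --
    an assumed shape, not the PR's actual data model.
    """
    durations = [d for _, d in results]
    passes = sum(1 for ok, _ in results if ok)
    return {
        "runs": len(results),
        "pass_rate": passes / len(results),
        "mean_duration_s": mean(durations),
        # stdev is undefined for fewer than two samples
        "stdev_duration_s": stdev(durations) if len(durations) > 1 else 0.0,
    }
```

For example, `aggregate_runs([(True, 1.0), (False, 3.0), (True, 2.0)])` yields a pass rate of 2/3 with a mean duration of 2.0 s.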
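The switch from bare `print()` to the `logging` module can be as small as the sketch below; the logger name, format, and helper functions are illustrative, not taken from the codebase:

```python
import logging

# Module-level logger, typically configured once at CLI startup
logger = logging.getLogger("weaviate_vibe_eval")


def configure_logging(verbose: bool = False) -> None:
    # basicConfig is a no-op if handlers are already installed
    logging.basicConfig(
        level=logging.DEBUG if verbose else logging.INFO,
        format="%(asctime)s %(name)s %(levelname)s %(message)s",
    )


def run_task(task_name: str) -> None:
    # Instead of: print(f"Running {task_name}...")
    logger.info("Running task %s", task_name)
```

Unlike `print()`, this lets callers route, filter, or silence internal messages by level without touching the call sites.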