[quantization] Support Evaluation of Qwen3-VL With llava-bench-in-the-wild Dataset #718
dvsav wants to merge 2 commits into
…-wild Dataset This change adds support for llava-bench-in-the-wild benchmark to quantize_qwen3_vl_with_gptq.py. TICO-DCO-1.0-Signed-off-by: d.savchenkov <d.savchenkov@partner.samsung.com>
Just out of curiosity, why is the name
This commit fixes the benchmark name in the title and relaxes the assert that fired when all samples were skipped and no evaluation results were collected. Additionally, the exception string was reworded to drop the word "error" when a prompt is too long. TICO-DCO-1.0-Signed-off-by: Evgenii Maltsev <e.maltsev@samsung.com>
Fixed it in an additional commit.
| print(f"{metric:<10} {value:.3f}") | ||
|
|
||
| if "llava_bench" in args.eval_tasks: | ||
| print("\n=== Llama Bench Evaluation (Original Model) ===") |
| print("\n=== Llama Bench Evaluation (Original Model) ===") | |
| print("\n=== Llava Bench Evaluation (Quantized Model) ===") |
| results = evaluate_model_coco( | ||
| model=model, | ||
| processor=processor, | ||
| device=args.device, | ||
| dataset_name="llava_bench", |
| results = evaluate_model_coco( | |
| model=model, | |
| processor=processor, | |
| device=args.device, | |
| dataset_name="llava_bench", | |
| results = evaluate_model_llava_bench( | |
| model=model, | |
| processor=processor, | |
| device=args.device, |
| results = evaluate_model_coco( | ||
| model=model, | ||
| processor=processor, | ||
| device=args.device, | ||
| dataset_name="llava_bench", |
| results = evaluate_model_coco( | |
| model=model, | |
| processor=processor, | |
| device=args.device, | |
| dataset_name="llava_bench", | |
| results = evaluate_model_llava_bench( | |
| model=model, | |
| processor=processor, | |
| device=args.device, |
| def get_item_llava_bench_in_the_wild(ex: dict[str, Any]) -> dict[str, Any]: | ||
| return { | ||
| "image": ex["image"], | ||
| "question": ex["question"], | ||
| "id": ex["question_id"], | ||
| "image_id": ex["image_id"], | ||
| "file_name": ex["image_id"], | ||
| "golds": [ex["gpt_answer"]], |
According to the dataset card, I think this currently might compute llava_bench scores at the wrong granularity.
get_item_llava_bench_in_the_wild() maps image_id to ex["image_id"], which is the image filename such as 001.jpg. However, LLaVA Bench can have multiple questions for the same image, each with a different question_id. Later, get_coco_scores_on_dataset() rebuilds predictions as a dict keyed by image_id, so multiple predictions for the same image overwrite each other:
    res[img_id] = [caption]
As a result, only the last prediction for an image is kept, while the references may contain multiple GPT answers for different questions on that same image. That means the metric compares one question’s prediction against answers from multiple different questions, which can distort the score.
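The overwrite described above can be reproduced in isolation. The following is a minimal sketch, not the code from vlm_eval_utils.py: the sample dictionaries, the `build_tables` helper, and the way references are accumulated are all assumptions made for illustration. Only the `res[key] = [caption]` pattern is taken from the comment.

```python
from collections import defaultdict

# Hypothetical samples: two different questions about the same image file.
samples = [
    {"question_id": 0, "image_id": "001.jpg",
     "pred": "answer to question 0", "gold": "reference for question 0"},
    {"question_id": 1, "image_id": "001.jpg",
     "pred": "answer to question 1", "gold": "reference for question 1"},
]


def build_tables(samples, key_field):
    """Build prediction and reference tables keyed by `key_field` (illustrative helper)."""
    res = {}                 # predictions: key -> [caption]
    gts = defaultdict(list)  # references:  key -> [gold, ...] (assumed accumulation)
    for sample in samples:
        key = sample[key_field]
        res[key] = [sample["pred"]]  # same pattern as res[img_id] = [caption]
        gts[key].append(sample["gold"])
    return res, dict(gts)


res_by_image, gts_by_image = build_tables(samples, "image_id")
res_by_question, gts_by_question = build_tables(samples, "question_id")

print(res_by_image)     # one entry: the prediction for question 0 was overwritten
print(gts_by_image)     # but both references are still attached to the image
print(res_by_question)  # two entries: one prediction per QA sample
```

Keyed by image, the single surviving prediction is scored against references that belong to two different questions. Keyed by question, each prediction is paired only with its own reference.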
I think the evaluation key should be unique per QA sample, not per image file. For example:
def get_item_llava_bench_in_the_wild(ex: dict[str, Any]) -> dict[str, Any]:
    return {
        "image": ex["image"],
        "question": ex["question"],
        "id": ex["question_id"],
        "image_id": ex["question_id"],  # unique evaluation key
        "file_name": ex["image_id"],  # original image filename, if needed
        "golds": [ex["gpt_answer"]],
    }
| except (ValueError, RuntimeError) as error: | ||
| print(f"[WARNING] The prompt was too long. Skipping.") | ||
| print(f"{error}") | ||
| continue |
Current error handling catches both ValueError and RuntimeError and treats all of them as "prompt too long". This can silently hide real evaluation failures such as CUDA OOM, tensor/device mismatch, shape mismatch, or model/processor incompatibility. Could we make the skip path conservative?
except (ValueError, RuntimeError) as error:
    message = str(error).lower()
    if not any(
        marker in message
        for marker in (
            "too long",
            "max_position_embeddings",
            "maximum context length",
            "sequence length",
        )
    ):
        raise
    print("[WARNING] The prompt was too long. Skipping.")
    print(f"{type(error).__name__}: {error}")
    continue
Also, if every sample is skipped, returning {} makes the evaluation look successful but metric-less; raising RuntimeError would be safer.
if not results:
- print(
- "[WARNING] No evaluation results were collected (all samples were skipped)."
- )
- return {}
+ raise RuntimeError(
+ "No evaluation results were collected. "
+ "All samples may have been skipped due to prompt length errors."
+ )
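Put together, the two suggestions above might look like the following self-contained sketch. The names `is_prompt_too_long`, `evaluate_samples`, and `fake_generate_answer` are hypothetical and exist only for this illustration; the real loop in vlm_eval_utils.py calls the model and may be structured differently. The marker strings are the ones proposed in the comment, and whether they match the messages actually raised by the model or processor is an assumption.

```python
PROMPT_TOO_LONG_MARKERS = (
    "too long",
    "max_position_embeddings",
    "maximum context length",
    "sequence length",
)


def is_prompt_too_long(error: Exception) -> bool:
    """Return True only if the error message looks like a prompt-length failure."""
    message = str(error).lower()
    return any(marker in message for marker in PROMPT_TOO_LONG_MARKERS)


def evaluate_samples(samples, generate_answer):
    """Collect answers, skipping only prompt-length errors (illustrative loop)."""
    results = []
    for sample in samples:
        try:
            results.append(generate_answer(sample))
        except (ValueError, RuntimeError) as error:
            if not is_prompt_too_long(error):
                raise  # real failures (OOM, shape mismatch, ...) must surface
            print("[WARNING] The prompt was too long. Skipping.")
            print(f"{type(error).__name__}: {error}")
            continue
    if not results:
        raise RuntimeError(
            "No evaluation results were collected. "
            "All samples may have been skipped due to prompt length errors."
        )
    return results


def fake_generate_answer(sample: str) -> str:
    """Stand-in for the model call, used only to exercise the loop."""
    if sample == "long":
        raise ValueError("Prompt is too long for the model")
    if sample == "oom":
        raise RuntimeError("CUDA out of memory")
    return sample.upper()


print(evaluate_samples(["a", "long", "b"], fake_generate_answer))
```

With this shape, a run in which every prompt is skipped fails loudly instead of reporting an empty metric table, and an unrelated RuntimeError is re-raised unchanged.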
What
This change adds support for evaluating Qwen3-VL models on the lmms-lab/llava-bench-in-the-wild benchmark, reusing the existing COCO evaluation infrastructure in vlm_eval_utils.py.
Why
COCO captioning is listed as a required benchmark for Qwen3-VL in PTQ Evaluation — Qwen3-VL. The llava-bench-in-the-wild dataset is a complementary benchmark to COCO as described in LLaVA Bench documentation. While COCO provides multiple reference captions per image, llava-bench-in-the-wild provides a single GPT-generated ground-truth answer per question, making it a valuable additional signal for VLM evaluation quality.
Dataset Format
Implementation Details
Two files were modified:
tico/quantization/evaluation/vlm_eval_utils.py: extended the evaluation utilities to support the new dataset:
- Added the get_item_llava_bench_in_the_wild() adapter, which maps llava-bench-in-the-wild fields (question_id → id, image_id → image_id/file_name, gpt_answer → golds) to the common format consumed by get_coco_scores_on_dataset().
- Updated get_item_coco() to return all required fields explicitly (id, image_id, file_name, golds) instead of using .get() with defaults.
- Added a "llava_bench" dataset entry to the DATASETS registry with the new adapter and lmms-lab/llava-bench-in-the-wild as the candidate dataset.
- Added a dataset_name parameter to get_coco_scores_on_dataset() to dispatch to the correct adapter (get_item_coco or get_item_llava_bench_in_the_wild) based on the dataset name.
- Added error handling (try/except) around generate_answer() to skip prompts that are too long rather than crashing the evaluation.

tico/quantization/wrapq/examples/quantize_qwen3_vl_with_gptq.py: integrated llava_bench evaluation into the quantization script:
- Updated evaluate_model_coco() to accept a dataset_name parameter, making it reusable for both COCO and llava_bench evaluations.
- Added an evaluate_model_llava_bench() convenience wrapper for llava_bench evaluation.
- Added "llava_bench" as a supported --eval_tasks option in both evaluate_original_model() and evaluate_quantized_model(), with section headers and metric reporting.
Example
python tico/quantization/wrapq/examples/quantize_qwen3_vl_with_gptq.py \
    --model=Qwen/Qwen3-VL-4B-Instruct \
    --cache_dir=/home/d.savchenkov/models/qwen3-vl-4b \
    --trust-remote-code \
    --no_GPTQ \
    --eval_tasks=llava_bench \
    --nsamples_for_evaluation=50 \
    --embedding_weight_bits=16 \
    --vision_patch_embed_weight_bits=16 \
    --linear_weight_bits=16 \
    --lm_head_weight_bits=16 \
    --nsamples_for_qcalibration=10 \
    --verbose
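
The adapter dispatch described under Implementation Details can be pictured roughly as follows. This is a sketch under stated assumptions, not the contents of vlm_eval_utils.py: the layout of the DATASETS registry (the "adapter" and "candidates" keys), the `get_adapter` helper, and the raw COCO field names are hypothetical. The llava_bench adapter follows the variant suggested in review, with question_id as the evaluation key.

```python
from typing import Any, Callable

Item = dict[str, Any]


def get_item_coco(ex: Item) -> Item:
    # Hypothetical raw field names; the real COCO adapter may read different keys.
    return {
        "image": ex["image"],
        "id": ex["id"],
        "image_id": ex["image_id"],
        "file_name": ex["file_name"],
        "golds": list(ex["captions"]),
    }


def get_item_llava_bench_in_the_wild(ex: Item) -> Item:
    return {
        "image": ex["image"],
        "question": ex["question"],
        "id": ex["question_id"],
        "image_id": ex["question_id"],  # unique per QA sample, as suggested in review
        "file_name": ex["image_id"],    # original image filename
        "golds": [ex["gpt_answer"]],
    }


# Assumed registry layout: dataset name -> adapter plus candidate dataset ids.
DATASETS: dict[str, dict[str, Any]] = {
    "coco": {"adapter": get_item_coco, "candidates": []},
    "llava_bench": {
        "adapter": get_item_llava_bench_in_the_wild,
        "candidates": ["lmms-lab/llava-bench-in-the-wild"],
    },
}


def get_adapter(dataset_name: str) -> Callable[[Item], Item]:
    """Look up the adapter for a dataset name, failing clearly on unknown names."""
    try:
        return DATASETS[dataset_name]["adapter"]
    except KeyError:
        raise ValueError(
            f"Unknown dataset {dataset_name!r}; expected one of {sorted(DATASETS)}"
        ) from None


example = {
    "image": None,  # a PIL image in the real dataset
    "question": "What is unusual about this image?",
    "question_id": 7,
    "image_id": "001.jpg",
    "gpt_answer": "reference answer",
}
item = get_adapter("llava_bench")(example)
print(item["id"], item["image_id"], item["file_name"], item["golds"])
```

Keeping the adapters in one registry means that a new benchmark only needs a new entry, while the scoring loop stays unchanged.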