[quantization] Support Evaluation of Qwen3-VL With llava-bench-in-the-wild Dataset by dvsav · Pull Request #718 · Samsung/TICO

dvsav · 2026-05-19T12:52:32Z

What

This change adds support for evaluating Qwen3-VL models on the lmms-lab/llava-bench-in-the-wild benchmark, reusing the existing COCO evaluation infrastructure in vlm_eval_utils.py.

Why

COCO captioning is listed as a required benchmark for Qwen3-VL in PTQ Evaluation — Qwen3-VL. The llava-bench-in-the-wild dataset is a complementary benchmark to COCO as described in LLaVA Bench documentation. While COCO provides multiple reference captions per image, llava-bench-in-the-wild provides a single GPT-generated ground-truth answer per question, making it a valuable additional signal for VLM evaluation quality.

Dataset Format

question_id (int)
question (string) : question (prompt, request)
image (PIL.Image) : image
caption (string)
gpt_answer (string) : ground truth answer
category (string)
image_id (string) : image file name
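
For orientation, the record layout above can be written down as a typed dictionary. This is only a sketch: the class name `LlavaBenchRecord` and all sample values are made up for illustration, and `image` is typed as `Any` so that the snippet runs without Pillow installed.

```python
from typing import Any, TypedDict


class LlavaBenchRecord(TypedDict):
    """Hypothetical type mirroring the fields listed above."""

    question_id: int
    question: str    # prompt sent to the model
    image: Any       # a PIL.Image in the real dataset
    caption: str
    gpt_answer: str  # single ground-truth answer
    category: str
    image_id: str    # image file name, e.g. "002.jpg"


# Placeholder values only; nothing here is copied from the dataset.
sample: LlavaBenchRecord = {
    "question_id": 0,
    "question": "placeholder question",
    "image": None,
    "caption": "placeholder caption",
    "gpt_answer": "placeholder answer",
    "category": "placeholder",
    "image_id": "000.jpg",
}

print(sorted(sample))
```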

Implementation Details

Two files were modified:

tico/quantization/evaluation/vlm_eval_utils.py — Extended the evaluation utilities to support the new dataset:
- Added get_item_llama_bench_in_the_wild() adapter that maps llava-bench-in-the-wild fields (question_id → id, image_id → image_id/file_name, gpt_answer → golds) to the common format consumed by get_coco_scores_on_dataset().
- Updated get_item_coco() to return all required fields explicitly (id, image_id, file_name, golds) instead of using .get() with defaults.
- Registered "llama_bench" dataset entry in the DATASETS registry with the new adapter and lmms-lab/llava-bench-in-the-wild as the candidate dataset.
- Added dataset_name parameter to get_coco_scores_on_dataset() to dispatch to the correct adapter (get_item_coco or get_item_llama_bench_in_the_wild) based on dataset name.
- Added error handling (try/except) around generate_answer() to gracefully skip prompts that are too long rather than crashing the evaluation.
tico/quantization/wrapq/examples/quantize_qwen3_vl_with_gptq.py — Integrated llama_bench evaluation into the quantization script:
- Refactored evaluate_model_coco() to accept a dataset_name parameter, making it reusable for both COCO and llama_bench evaluations.
- Added evaluate_model_llama_bench() convenience wrapper for llama_bench evaluation.
- Added "llama_bench" as a supported --eval_tasks option in both evaluate_original_model() and evaluate_quantized_model(), with proper section headers and metric reporting.
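
The adapter-plus-registry idea described above can be sketched roughly as follows. This is not the code from vlm_eval_utils.py: `ADAPTERS` and `adapt()` are hypothetical names, the COCO input field `captions` is an assumption, and the key is spelled `llava_bench` here, the name adopted later in the review.

```python
from typing import Any, Callable

Item = dict[str, Any]


def get_item_coco(ex: Item) -> Item:
    # Assumed COCO-style input: several reference captions per image.
    return {
        "id": ex["id"],
        "image_id": ex["image_id"],
        "file_name": ex["file_name"],
        "golds": list(ex["captions"]),  # "captions" is an assumed field name
    }


def get_item_llava_bench(ex: Item) -> Item:
    # Field mapping as described in the PR text: the single GPT answer
    # becomes a one-element list of references.
    return {
        "id": ex["question_id"],
        "image_id": ex["image_id"],
        "file_name": ex["image_id"],
        "golds": [ex["gpt_answer"]],
    }


# Hypothetical registry: dataset name -> adapter function.
ADAPTERS: dict[str, Callable[[Item], Item]] = {
    "coco": get_item_coco,
    "llava_bench": get_item_llava_bench,
}


def adapt(ex: Item, dataset_name: str) -> Item:
    """Convert a raw record to the common format, failing loudly on unknown names."""
    try:
        adapter = ADAPTERS[dataset_name]
    except KeyError:
        raise ValueError(f"Unknown dataset: {dataset_name!r}") from None
    return adapter(ex)


raw = {"question_id": 7, "image_id": "002.jpg", "gpt_answer": "placeholder answer"}
print(adapt(raw, "llava_bench"))
```

Dispatching through a dictionary keeps the scoring loop unaware of dataset-specific field names; supporting another dataset then only requires one more adapter and one more registry entry.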

Example

python tico/quantization/wrapq/examples/quantize_qwen3_vl_with_gptq.py \
    --model=Qwen/Qwen3-VL-4B-Instruct \
    --cache_dir=/home/d.savchenkov/models/qwen3-vl-4b \
    --trust-remote-code \
    --no_GPTQ \
    --eval_tasks=llama_bench \
    --nsamples_for_evaluation=50 \
    --embedding_weight_bits=16 \
    --vision_patch_embed_weight_bits=16 \
    --linear_weight_bits=16 \
    --lm_head_weight_bits=16 \
    --nsamples_for_qcalibration=10 \
    --verbose

=== Llama Bench Evaluation (Original Model) ===
...
id: 7
image_id: 002.jpg
Q: Imagine the fragrance of the fruits in the image. How would you describe this to someone who has never had this fruit before?
pred: 'Sweet, floral, and slightly tangy with a hint of tropical fruitiness.'
pred_norm: 'sweet floral and slightly tangy with hint of tropical fruitiness'
golds[:10]: ["'The fragrance of the mangosteens in the image can be described as sweet and slightly floral, with a hint of citrus aroma. It is a delicate and pleasant smell that entices you to try the fruit.'"]
...

CIDEr      0.011
Bleu_1     0.063
Bleu_2     0.038
Bleu_3     0.023
Bleu_4     0.016

=== Llama Bench Evaluation (Original Model) ===
...
id: 7
image_id: 002.jpg
Q: Imagine the fragrance of the fruits in the image. How would you describe this to someone who has never had this fruit before?
pred: 'Rich, earthy, slightly sweet with hints of tropical fruit and a faint floral undertone, like a blend of dark berries and green leaves.'
pred_norm: 'rich earthy slightly sweet with hints of tropical fruit and faint floral undertone like blend of dark berries and green leaves'
golds[:10]: ["'The fragrance of the mangosteens in the image can be described as sweet and slightly floral, with a hint of citrus aroma. It is a delicate and pleasant smell that entices you to try the fruit.'"]
...

CIDEr      0.036
Bleu_1     0.092
Bleu_2     0.053
Bleu_3     0.030
Bleu_4     0.016
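
The `pred_norm` lines in the log suggest that predictions are lowercased, stripped of punctuation, and have the article "a" removed before scoring. The actual normalizer is not shown in this PR, so the following is only a guess that reproduces the two logged examples; the function name `normalize_caption` and the removal of "an" and "the" are assumptions.

```python
import string

# Only "a" is visibly dropped in the log; "an" and "the" are assumed.
ARTICLES = {"a", "an", "the"}


def normalize_caption(text: str) -> str:
    """Lowercase, drop ASCII punctuation, drop articles, collapse whitespace."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    tokens = [tok for tok in no_punct.split() if tok not in ARTICLES]
    return " ".join(tokens)


pred = "Sweet, floral, and slightly tangy with a hint of tropical fruitiness."
print(normalize_caption(pred))
# sweet floral and slightly tangy with hint of tropical fruitiness
```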

…-wild Dataset This change adds support for llava-bench-in-the-wild benchmark to quantize_qwen3_vl_with_gptq.py. TICO-DCO-1.0-Signed-off-by: d.savchenkov <d.savchenkov@partner.samsung.com>

mhs4670go · 2026-05-20T05:39:05Z

Just out of curiosity, why is the name llama_bench instead of llava bench?

This commit fixes the benchmark name and relaxes the assert for the case where all samples were skipped and no evaluation results were collected. Additionally, the exception message was reworded to omit the word "error" when a prompt is too long. TICO-DCO-1.0-Signed-off-by: Evgenii Maltsev <e.maltsev@samsung.com>

Torrero

LGTM

Torrero · 2026-05-21T11:21:16Z

> Just out of curiosity, why is the name llama_bench instead of llava bench?

Fixed it in an additional commit.

mhs4670go · 2026-05-22T05:23:45Z

+                print(f"{metric:<10} {value:.3f}")
+
+        if "llava_bench" in args.eval_tasks:
+            print("\n=== Llama Bench Evaluation (Original Model) ===")


Suggested change

-            print("\n=== Llama Bench Evaluation (Original Model) ===")
+            print("\n=== Llava Bench Evaluation (Quantized Model) ===")

mhs4670go · 2026-05-22T05:29:07Z

+            results = evaluate_model_coco(
+                model=model,
+                processor=processor,
+                device=args.device,
+                dataset_name="llava_bench",


Suggested change

-            results = evaluate_model_coco(
-                model=model,
-                processor=processor,
-                device=args.device,
-                dataset_name="llava_bench",
+            results = evaluate_model_llava_bench(
+                model=model,
+                processor=processor,
+                device=args.device,

mhs4670go · 2026-05-22T05:29:20Z

+            results = evaluate_model_coco(
+                model=model,
+                processor=processor,
+                device=args.device,
+                dataset_name="llava_bench",


Suggested change

-            results = evaluate_model_coco(
-                model=model,
-                processor=processor,
-                device=args.device,
-                dataset_name="llava_bench",
+            results = evaluate_model_llava_bench(
+                model=model,
+                processor=processor,
+                device=args.device,

mhs4670go · 2026-05-22T05:40:36Z

+def get_item_llava_bench_in_the_wild(ex: dict[str, Any]) -> dict[str, Any]:
+    return {
+        "image": ex["image"],
+        "question": ex["question"],
+        "id": ex["question_id"],
+        "image_id": ex["image_id"],
+        "file_name": ex["image_id"],
+        "golds": [ex["gpt_answer"]],


According to the dataset card, I think this currently might compute llava_bench scores at the wrong granularity.

get_item_llava_bench_in_the_wild() maps image_id to ex["image_id"], which is the image filename such as 001.jpg. However, LLaVA Bench can have multiple questions for the same image, each with a different question_id. Later, get_coco_scores_on_dataset() rebuilds predictions as a dict keyed by image_id, so multiple predictions for the same image overwrite each other:

res[img_id] = [caption]

As a result, only the last prediction for an image is kept, while the references may contain multiple GPT answers for different questions on that same image. That means the metric compares one question’s prediction against answers from multiple different questions, which can distort the score.

I think the evaluation key should be unique per QA sample, not per image file. For example:

def get_item_llava_bench_in_the_wild(ex: dict[str, Any]) -> dict[str, Any]:
    return {
        "image": ex["image"],
        "question": ex["question"],
        "id": ex["question_id"],
        "image_id": ex["question_id"],  # unique evaluation key
        "file_name": ex["image_id"],  # original image filename, if needed
        "golds": [ex["gpt_answer"]],
    }
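
The overwrite effect described in this comment can be reproduced in isolation. The snippet below is a toy illustration, not the code of get_coco_scores_on_dataset(): the helper `collect_predictions` and the sample records are invented, and only the `res[key] = [caption]` assignment pattern is taken from the comment.

```python
from typing import Any

# Toy records: two questions share the image "001.jpg".
samples = [
    {"question_id": 1, "image_id": "001.jpg", "pred": "answer to question 1"},
    {"question_id": 2, "image_id": "001.jpg", "pred": "answer to question 2"},
    {"question_id": 3, "image_id": "002.jpg", "pred": "answer to question 3"},
]


def collect_predictions(records: list[dict[str, Any]], key: str) -> dict[Any, list[str]]:
    """Build a predictions dict keyed by the given field."""
    res: dict[Any, list[str]] = {}
    for rec in records:
        res[rec[key]] = [rec["pred"]]  # a repeated key silently overwrites
    return res


by_image = collect_predictions(samples, "image_id")
by_question = collect_predictions(samples, "question_id")

print(len(by_image), len(by_question))  # 2 3
print(by_image["001.jpg"])              # ['answer to question 2']
```

Keying by `question_id` keeps all three predictions, whereas keying by the file name drops the first answer for `001.jpg`.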

mhs4670go · 2026-05-22T05:44:33Z

+        except (ValueError, RuntimeError) as error:
+            print(f"[WARNING] The prompt was too long. Skipping.")
+            print(f"{error}")
+            continue


Current error handling catches both ValueError and RuntimeError and treats all of them as "prompt too long". This can silently hide real evaluation failures such as CUDA OOM, tensor/device mismatch, shape mismatch, or model/processor incompatibility. Could we make the skip path conservative?

except (ValueError, RuntimeError) as error:
    message = str(error).lower()
    if not any(
        marker in message
        for marker in (
            "too long",
            "max_position_embeddings",
            "maximum context length",
            "sequence length",
        )
    ):
        raise
    print("[WARNING] The prompt was too long. Skipping.")
    print(f"{type(error).__name__}: {error}")
    continue

Also, if every sample is skipped, returning {} makes the evaluation look successful but metric-less; raising RuntimeError would be safer.

 if not results:
-    print(
-        "[WARNING] No evaluation results were collected (all samples were skipped)."
-    )
-    return {}
+    raise RuntimeError(
+        "No evaluation results were collected. "
+        "All samples may have been skipped due to prompt length errors."
+    )
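
Both suggestions in this comment can be exercised together in a small self-contained loop. This is a sketch under assumptions: `run_eval`, `is_prompt_too_long`, and `fake_generate_answer` are hypothetical names, the fake model call merely simulates failures, and the marker substrings are the ones proposed above, which may not match the messages a real model or processor raises.

```python
from typing import Callable, Iterable

# Substrings taken from the review suggestion; real error texts may differ.
TOO_LONG_MARKERS = (
    "too long",
    "max_position_embeddings",
    "maximum context length",
    "sequence length",
)


def is_prompt_too_long(error: Exception) -> bool:
    """Return True only if the message looks like a prompt-length failure."""
    message = str(error).lower()
    return any(marker in message for marker in TOO_LONG_MARKERS)


def run_eval(prompts: Iterable[str], generate_answer: Callable[[str], str]) -> list[str]:
    """Hypothetical loop: skip over-long prompts, re-raise everything else."""
    results: list[str] = []
    for prompt in prompts:
        try:
            results.append(generate_answer(prompt))
        except (ValueError, RuntimeError) as error:
            if not is_prompt_too_long(error):
                raise  # e.g. CUDA OOM or a shape mismatch must stay visible
            print(f"[WARNING] Skipping prompt: {type(error).__name__}: {error}")
    if not results:
        raise RuntimeError("No evaluation results were collected.")
    return results


def fake_generate_answer(prompt: str) -> str:
    """Stand-in for the model call, used only to exercise the loop."""
    if len(prompt) > 20:
        raise ValueError("Prompt is too long for the model")
    if prompt == "boom":
        raise RuntimeError("CUDA out of memory")
    return prompt.upper()


print(run_eval(["short prompt", "x" * 50], fake_generate_answer))
```

With this shape, an over-long prompt is skipped with a warning, an unrelated `RuntimeError` propagates unchanged, and a run in which every sample was skipped fails instead of returning an empty result.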

[quantization] Support Evaluation of Qwen3-VL With llava-bench-in-the…

326456e


dvsav marked this pull request as ready for review May 19, 2026 15:30

dvsav requested a review from Torrero May 19, 2026 15:31

Torrero approved these changes May 21, 2026


mhs4670go reviewed May 22, 2026

dvsav wants to merge 2 commits into Samsung:main from dvsav:llama_bench
