
[quantization] Support Evaluation of Qwen3-VL With llava-bench-in-the-wild Dataset#718

Open
dvsav wants to merge 2 commits into Samsung:main from dvsav:llama_bench

@dvsav dvsav commented May 19, 2026

What

This change adds support for evaluating Qwen3-VL models on the lmms-lab/llava-bench-in-the-wild benchmark, reusing the existing COCO evaluation infrastructure in vlm_eval_utils.py.

Why

COCO captioning is listed as a required benchmark for Qwen3-VL in PTQ Evaluation — Qwen3-VL. The llava-bench-in-the-wild dataset is a complementary benchmark to COCO as described in LLaVA Bench documentation. While COCO provides multiple reference captions per image, llava-bench-in-the-wild provides a single GPT-generated ground-truth answer per question, making it a valuable additional signal for VLM evaluation quality.

Dataset Format

  • question_id (int)
  • question (string) : question (prompt, request)
  • image (PIL.Image) : image
  • caption (string)
  • gpt_answer (string) : ground truth answer
  • category (string)
  • image_id (string) : image file name

Implementation Details

Two files were modified:

  1. tico/quantization/evaluation/vlm_eval_utils.py — Extended the evaluation utilities to support the new dataset:

    • Added get_item_llama_bench_in_the_wild() adapter that maps llava-bench-in-the-wild fields (question_id → id, image_id → image_id/file_name, gpt_answer → golds) to the common format consumed by get_coco_scores_on_dataset().
    • Updated get_item_coco() to return all required fields explicitly (id, image_id, file_name, golds) instead of using .get() with defaults.
    • Registered "llama_bench" dataset entry in the DATASETS registry with the new adapter and lmms-lab/llava-bench-in-the-wild as the candidate dataset.
    • Added dataset_name parameter to get_coco_scores_on_dataset() to dispatch to the correct adapter (get_item_coco or get_item_llama_bench_in_the_wild) based on dataset name.
    • Added error handling (try/except) around generate_answer() to gracefully skip prompts that are too long rather than crashing the evaluation.
  2. tico/quantization/wrapq/examples/quantize_qwen3_vl_with_gptq.py — Integrated llama_bench evaluation into the quantization script:

    • Refactored evaluate_model_coco() to accept a dataset_name parameter, making it reusable for both COCO and llama_bench evaluations.
    • Added evaluate_model_llama_bench() convenience wrapper for llama_bench evaluation.
    • Added "llama_bench" as a supported --eval_tasks option in both evaluate_original_model() and evaluate_quantized_model(), with proper section headers and metric reporting.
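The adapter-plus-registry dispatch described above can be sketched roughly as follows. This is a minimal illustration, not the code from vlm_eval_utils.py: the ADAPTERS dictionary, the adapt() helper, and the raw COCO field names (including "captions") are assumptions made for the example. Only the llava-bench-in-the-wild field mapping is taken from the description.

```python
from typing import Any, Callable

Example = dict[str, Any]


def get_item_coco(ex: Example) -> Example:
    # Hypothetical raw COCO layout; the real field names may differ.
    return {
        "id": ex["id"],
        "image_id": ex["image_id"],
        "file_name": ex["file_name"],
        "golds": list(ex["captions"]),
    }


def get_item_llama_bench_in_the_wild(ex: Example) -> Example:
    # Mapping from the PR description:
    # question_id -> id, image_id -> image_id/file_name, gpt_answer -> golds.
    return {
        "id": ex["question_id"],
        "image_id": ex["image_id"],
        "file_name": ex["image_id"],
        "golds": [ex["gpt_answer"]],
    }


# Hypothetical registry shape: dataset name -> adapter function.
ADAPTERS: dict[str, Callable[[Example], Example]] = {
    "coco": get_item_coco,
    "llama_bench": get_item_llama_bench_in_the_wild,
}


def adapt(dataset_name: str, ex: Example) -> Example:
    """Convert a raw dataset record to the common evaluation format."""
    try:
        adapter = ADAPTERS[dataset_name]
    except KeyError:
        raise ValueError(f"Unknown dataset: {dataset_name!r}") from None
    return adapter(ex)


if __name__ == "__main__":
    record = {"question_id": 7, "image_id": "002.jpg", "gpt_answer": "sweet"}
    print(adapt("llama_bench", record))
```

Looking the adapter up by name keeps the scoring loop independent of any one dataset's schema, and an unknown name fails immediately instead of silently falling back to the COCO path.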

Example

python tico/quantization/wrapq/examples/quantize_qwen3_vl_with_gptq.py \
    --model=Qwen/Qwen3-VL-4B-Instruct \
    --cache_dir=/home/d.savchenkov/models/qwen3-vl-4b \
    --trust-remote-code \
    --no_GPTQ \
    --eval_tasks=llama_bench \
    --nsamples_for_evaluation=50 \
    --embedding_weight_bits=16 \
    --vision_patch_embed_weight_bits=16 \
    --linear_weight_bits=16 \
    --lm_head_weight_bits=16 \
    --nsamples_for_qcalibration=10 \
    --verbose
=== Llama Bench Evaluation (Original Model) ===
...
id: 7
image_id: 002.jpg
Q: Imagine the fragrance of the fruits in the image. How would you describe this to someone who has never had this fruit before?
pred: 'Sweet, floral, and slightly tangy with a hint of tropical fruitiness.'
pred_norm: 'sweet floral and slightly tangy with hint of tropical fruitiness'
golds[:10]: ["'The fragrance of the mangosteens in the image can be described as sweet and slightly floral, with a hint of citrus aroma. It is a delicate and pleasant smell that entices you to try the fruit.'"]
...

CIDEr      0.011
Bleu_1     0.063
Bleu_2     0.038
Bleu_3     0.023
Bleu_4     0.016

=== Llama Bench Evaluation (Original Model) ===
...
id: 7
image_id: 002.jpg
Q: Imagine the fragrance of the fruits in the image. How would you describe this to someone who has never had this fruit before?
pred: 'Rich, earthy, slightly sweet with hints of tropical fruit and a faint floral undertone, like a blend of dark berries and green leaves.'
pred_norm: 'rich earthy slightly sweet with hints of tropical fruit and faint floral undertone like blend of dark berries and green leaves'
golds[:10]: ["'The fragrance of the mangosteens in the image can be described as sweet and slightly floral, with a hint of citrus aroma. It is a delicate and pleasant smell that entices you to try the fruit.'"]
...

CIDEr      0.036
Bleu_1     0.092
Bleu_2     0.053
Bleu_3     0.030
Bleu_4     0.016
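
The pred_norm lines in the logs above look like the result of lowercasing, stripping punctuation, and dropping articles. The sketch below reproduces that behaviour for the two logged predictions. It is a guess at the normalization, not the function from vlm_eval_utils.py: the name normalize() is made up, and only the removal of "a" is evidenced by the logs, so dropping "an" and "the" is an assumption.

```python
import string

# Assumption: only "a" is visibly removed in the logs; "an"/"the" are a guess.
ARTICLES = {"a", "an", "the"}

# Translation table that deletes all ASCII punctuation characters.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and drop articles."""
    words = text.lower().translate(_PUNCT_TABLE).split()
    return " ".join(word for word in words if word not in ARTICLES)


if __name__ == "__main__":
    pred = "Sweet, floral, and slightly tangy with a hint of tropical fruitiness."
    print(normalize(pred))
```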

…-wild Dataset

This change adds support for the llava-bench-in-the-wild benchmark to quantize_qwen3_vl_with_gptq.py.

TICO-DCO-1.0-Signed-off-by: d.savchenkov <d.savchenkov@partner.samsung.com>
@dvsav dvsav marked this pull request as ready for review May 19, 2026 15:30
@dvsav dvsav requested a review from Torrero May 19, 2026 15:31
@mhs4670go

Just out of curiosity, why is the name llama_bench instead of llava_bench?

This commit fixes the benchmark name and relaxes the assert for the case where all samples were skipped and no evaluation results were collected.
Additionally, the message printed when a prompt is too long was reworded to drop the word "error".

TICO-DCO-1.0-Signed-off-by:  Evgenii Maltsev <e.maltsev@samsung.com>

@Torrero Torrero left a comment


LGTM

Torrero commented May 21, 2026

Just out of curiosity, why is the name llama_bench instead of llava_bench?

fixed it in additional commit

print(f"{metric:<10} {value:.3f}")

if "llava_bench" in args.eval_tasks:
    print("\n=== Llama Bench Evaluation (Original Model) ===")

Suggested change
-    print("\n=== Llama Bench Evaluation (Original Model) ===")
+    print("\n=== Llava Bench Evaluation (Quantized Model) ===")

Comment on lines +1206 to +1210
results = evaluate_model_coco(
    model=model,
    processor=processor,
    device=args.device,
    dataset_name="llava_bench",

Suggested change
-results = evaluate_model_coco(
-    model=model,
-    processor=processor,
-    device=args.device,
-    dataset_name="llava_bench",
+results = evaluate_model_llava_bench(
+    model=model,
+    processor=processor,
+    device=args.device,

Comment on lines +1313 to +1317
results = evaluate_model_coco(
    model=model,
    processor=processor,
    device=args.device,
    dataset_name="llava_bench",

Suggested change
-results = evaluate_model_coco(
-    model=model,
-    processor=processor,
-    device=args.device,
-    dataset_name="llava_bench",
+results = evaluate_model_llava_bench(
+    model=model,
+    processor=processor,
+    device=args.device,

Comment on lines +190 to +197
def get_item_llava_bench_in_the_wild(ex: dict[str, Any]) -> dict[str, Any]:
    return {
        "image": ex["image"],
        "question": ex["question"],
        "id": ex["question_id"],
        "image_id": ex["image_id"],
        "file_name": ex["image_id"],
        "golds": [ex["gpt_answer"]],

@mhs4670go mhs4670go May 22, 2026


According to the dataset card, I think this currently might compute llava_bench scores at the wrong granularity.

get_item_llava_bench_in_the_wild() maps image_id to ex["image_id"], which is the image filename such as 001.jpg. However, LLaVA Bench can have multiple questions for the same image, each with a different question_id. Later, get_coco_scores_on_dataset() rebuilds predictions as a dict keyed by image_id, so multiple predictions for the same image overwrite each other:

res[img_id] = [caption]

As a result, only the last prediction for an image is kept, while the references may contain multiple GPT answers for different questions on that same image. That means the metric compares one question’s prediction against answers from multiple different questions, which can distort the score.

I think the evaluation key should be unique per QA sample, not per image file. For example:

def get_item_llava_bench_in_the_wild(ex: dict[str, Any]) -> dict[str, Any]:
    return {
        "image": ex["image"],
        "question": ex["question"],
        "id": ex["question_id"],
        "image_id": ex["question_id"],  # unique evaluation key
        "file_name": ex["image_id"],    # original image filename, if needed
        "golds": [ex["gpt_answer"]],
    }
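
To make the overwrite concrete, here is a small self-contained sketch. The sample records and the collect() helper are invented for illustration. Only the assignment pattern res[key] = [caption] mirrors what is described above.

```python
# Three QA samples; the first two share one image file (invented data).
samples = [
    {"question_id": 0, "image_id": "001.jpg", "pred": "answer to question 0"},
    {"question_id": 1, "image_id": "001.jpg", "pred": "answer to question 1"},
    {"question_id": 2, "image_id": "002.jpg", "pred": "answer to question 2"},
]


def collect(records, key):
    """Build a predictions dict keyed by the given field."""
    res = {}
    for record in records:
        # Same pattern as res[img_id] = [caption]: a repeated key overwrites.
        res[record[key]] = [record["pred"]]
    return res


by_image = collect(samples, "image_id")
by_question = collect(samples, "question_id")

print(len(by_image))        # 2: the prediction for question 0 was overwritten
print(len(by_question))     # 3: one entry per QA sample
print(by_image["001.jpg"])  # ['answer to question 1']
```

Keying by question_id keeps exactly one prediction and one reference per QA sample, which is the granularity the metric needs here.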

Comment on lines +541 to +544
except (ValueError, RuntimeError) as error:
    print(f"[WARNING] The prompt was too long. Skipping.")
    print(f"{error}")
    continue

Current error handling catches both ValueError and RuntimeError and treats all of them as "prompt too long". This can silently hide real evaluation failures such as CUDA OOM, tensor/device mismatch, shape mismatch, or model/processor incompatibility. Could we make the skip path conservative?

        except (ValueError, RuntimeError) as error:
            message = str(error).lower()
            if not any(
                marker in message
                for marker in (
                    "too long",
                    "max_position_embeddings",
                    "maximum context length",
                    "sequence length",
                )
            ):
                raise

            print("[WARNING] The prompt was too long. Skipping.")
            print(f"{type(error).__name__}: {error}")
            continue

Also, if every sample is skipped, returning {} makes the evaluation look successful but metric-less; raising RuntimeError would be safer.

     if not results:
-        print(
-            "[WARNING] No evaluation results were collected (all samples were skipped)."
-        )
-        return {}
+        raise RuntimeError(
+            "No evaluation results were collected. "
+            "All samples may have been skipped due to prompt length errors."
+        )
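
Both suggestions can be exercised together in a standalone sketch. The run_eval() loop and the fake_generate() stand-in below are hypothetical and only imitate the shape of the real evaluation loop around generate_answer(). The marker strings are the ones proposed above, and whether they match the messages the model actually raises is an assumption.

```python
from typing import Callable, Iterable

LENGTH_MARKERS = (
    "too long",
    "max_position_embeddings",
    "maximum context length",
    "sequence length",
)


def is_prompt_too_long(error: Exception) -> bool:
    """Return True only if the message looks like a prompt-length failure."""
    message = str(error).lower()
    return any(marker in message for marker in LENGTH_MARKERS)


def run_eval(prompts: Iterable[str], generate: Callable[[str], str]) -> list[str]:
    results = []
    for prompt in prompts:
        try:
            results.append(generate(prompt))
        except (ValueError, RuntimeError) as error:
            if not is_prompt_too_long(error):
                raise  # real failures (OOM, shape mismatch, ...) must surface
            print("[WARNING] The prompt was too long. Skipping.")
            print(f"{type(error).__name__}: {error}")
            continue
    if not results:
        raise RuntimeError(
            "No evaluation results were collected. "
            "All samples may have been skipped due to prompt length errors."
        )
    return results


def fake_generate(prompt: str) -> str:
    # Hypothetical stand-in for generate_answer().
    if prompt == "boom":
        raise RuntimeError("CUDA out of memory")
    if len(prompt) > 20:
        raise ValueError("Prompt is too long for the model")
    return prompt.upper()


if __name__ == "__main__":
    print(run_eval(["hi", "x" * 30], fake_generate))
```

With this shape, an over-long prompt is skipped with a warning, an unrelated RuntimeError propagates unchanged, and a run in which every sample is skipped fails loudly instead of returning an empty result.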
