[quantization] Support Evaluation of Qwen3-VL With MMMU-Pro Dataset #714
         dataset=dataset,
         subject=subject,
-        split="dev",
+        split="test",
I think you changed this because MMMU-Pro doesn't have a dev split. Please correct me if I'm wrong.
That being said, we should not construct few-shot examples from the MMMU-Pro test split. The MMMU-Pro benchmark is designed as a stricter multimodal evaluation setting, including vision-only tasks. In such benchmarks, using labeled test samples as prompt exemplars can significantly distort the reported accuracy.
My recommendation is to not allow few-shot for MMMU-Pro and force/switch it to zero-shot.
The best structure is:
- For MMMU/MMMU, keep the original behavior: few-shot from `dev`, evaluation on `validation`.
- For MMMU/MMMU_Pro, since there is no separate exemplar split, evaluate with `n_shots=0`.
- Even if the user runs `--mmmu_dataset=MMMU/MMMU_Pro --mmmu_n_shots=5`, the code should not crash. It should print a warning and continue in zero-shot mode.
This is the least inconvenient option for users and the safest option for evaluation reliability.
Recommended code
- load_few_shot_examples()
def load_few_shot_examples(
    dataset: str,
    subject: str,
    n_shots: int = 5,
) -> list[dict[str, Any]]:
    """
    Load few-shot examples for a given MMMU subject.

    For MMMU/MMMU, examples are loaded from the `dev` split.
    For MMMU/MMMU_Pro, few-shot examples are not loaded because there is no
    separate exemplar split in the current setup.
    """
    if n_shots <= 0:
        return []

    if dataset == "MMMU/MMMU_Pro":
        return []

    ds = load_data(
        dataset=dataset,
        subject=subject,
        split="dev",
        n_samples=n_shots,
        streaming=True,
    )
    return [get_item_mmmu(ex) for ex in ds]

I would not use `raise ValueError` here. This is a low-level utility function, and it is not a great place to explain CLI usage to the user.
- evaluate_mmmu()
def evaluate_mmmu(
    model,
    processor,
    dataset: str,
    subjects: list[str] | None = None,
    device: str | torch.device = "cuda",
    n_shots: int = 5,
    n_samples: int = -1,
    max_new_tokens: int = 16,
    max_seq_len: int | None = None,
    temperature: float = 0.0,
    verbose: bool = True,
) -> dict[str, tuple[int, int, int]]:
    if dataset not in MMMU_DATASETS:
        raise ValueError(f"Invalid dataset '{dataset}'")

    if dataset == "MMMU/MMMU_Pro" and n_shots > 0:
        if verbose:
            print(
                "[WARNING] MMMU-Pro few-shot evaluation is disabled because "
                "no separate few-shot split is available. Running zero-shot "
                "with n_shots=0."
            )
        n_shots = 0

    if subjects is None or (len(subjects) == 1 and "mmmu" in subjects[0]):
        subjects = MMMU_SUBJECTS[dataset]
    # ...
- evaluate_subject()
In the current PR, few-shot examples are loaded from test, and evaluation is also performed on test. Changing the original MMMU evaluation to test as well is likely an unintended regression.
eval_split = "validation" if dataset == "MMMU/MMMU" else "test"
test_data = load_data(
    dataset=dataset,
    subject=subject,
    split=eval_split,
    n_samples=n_samples,
    streaming=True,
)
👍 Done.
What I've changed:
- Added explicit argument `split` to `load_data` and `load_few_shot_examples` functions.
- Added `eval_split` and `few_shot_split` arguments to `evaluate_subject` function.
- Added logic to determine the above two "split" arguments in `evaluate_mmmu` function.
- Forced zero-shot inference for "MMMU/MMMU_Pro" according to your 2nd suggestion.
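For reference, the split and shot selection described in this thread could be collected in one place roughly as follows. This is only a sketch: `resolve_eval_config` is a hypothetical helper name, not a function in the PR, and it uses `warnings.warn` where the suggested code prints a `[WARNING]` line.

```python
from __future__ import annotations

import warnings

MMMU = "MMMU/MMMU"
MMMU_PRO = "MMMU/MMMU_Pro"


def resolve_eval_config(dataset: str, n_shots: int) -> tuple[str, str | None, int]:
    """Return (eval_split, few_shot_split, effective_n_shots).

    Hypothetical helper illustrating the policy discussed in the review.
    """
    if dataset == MMMU:
        # Original behavior: exemplars from `dev`, scoring on `validation`.
        n_shots = max(n_shots, 0)
        return "validation", ("dev" if n_shots > 0 else None), n_shots

    if dataset == MMMU_PRO:
        # MMMU-Pro only ships a `test` split, so there is no exemplar split.
        if n_shots > 0:
            warnings.warn(
                "MMMU-Pro has no separate few-shot split; "
                f"ignoring n_shots={n_shots} and running zero-shot."
            )
        return "test", None, 0

    raise ValueError(f"Invalid dataset '{dataset}'")
```

With this shape, `--mmmu_dataset=MMMU/MMMU_Pro --mmmu_n_shots=5` does not crash; it emits one warning and proceeds with `n_shots=0`.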
    "MMMU/MMMU_Pro": [
        "standard (10 options)",
        "standard (4 options)",
        "vision",
Regarding the vision subset, here is its description on Hugging Face:
In this subset, questions are embedded within screenshots or photos, and models must integrate visual and textual information to answer correctly. No separate text is fed into the model.
Therefore, for MMMU-Pro vision, please avoid reusing build_few_shot_prompt(). The question and answer options are already embedded in the image, so injecting textual choices/few-shot examples changes the benchmark setting.
Suggested fix: add an image-only generation path for dataset == "MMMU/MMMU_Pro" and subject == "vision", ignore n_shots for this subset, and do not pass question/options text into the processor.
diff --git a/tico/quantization/evaluation/mmmu_eval_utils.py b/tico/quantization/evaluation/mmmu_eval_utils.py
index 2760e30..xxxxxxx 100644
--- a/tico/quantization/evaluation/mmmu_eval_utils.py
+++ b/tico/quantization/evaluation/mmmu_eval_utils.py
@@ -19,7 +19,10 @@ import torch
 from datasets import load_dataset
-from tico.quantization.evaluation.vlm_eval_utils import generate_answer
+from tico.quantization.evaluation.vlm_eval_utils import (
+    generate_answer,
+    move_inputs_to_device,
+)
 MMMU_DATASETS = ["MMMU/MMMU", "MMMU/MMMU_Pro"]
@@ -68,6 +71,56 @@ MMMU_SPLITS: dict[str, list[str]] = {
 }
+MMMU_PRO_VISION_SUBJECT = "vision"
+
+
+def is_mmmu_pro_vision(dataset: str, subject: str) -> bool:
+    return dataset == "MMMU/MMMU_Pro" and subject == MMMU_PRO_VISION_SUBJECT
+
+
+@torch.no_grad()
+def generate_image_only_answer(
+    model,
+    processor,
+    image,
+    device: str | torch.device,
+    max_new_tokens: int = 16,
+    temperature: float = 0.0,
+    max_seq_len: int | None = None,
+) -> str:
+    """
+    Generate an answer from the image only.
+
+    This is used for MMMU-Pro's vision subset, where the question and answer
+    options are embedded in the image. Do not inject question/options/few-shot
+    text into the prompt for this subset.
+    """
+    messages = [
+        {
+            "role": "user",
+            "content": [
+                {"type": "image"},
+            ],
+        }
+    ]
+    prompt = processor.apply_chat_template(
+        messages,
+        tokenize=False,
+        add_generation_prompt=True,
+    )
+
+    processor_kwargs: dict[str, Any] = {
+        "text": prompt,
+        "images": image,
+        "return_tensors": "pt",
+    }
+    if max_seq_len is not None and max_seq_len > 0:
+        processor_kwargs["truncation"] = True
+        processor_kwargs["max_length"] = max_seq_len
+
+    inputs = processor(**processor_kwargs)
+    inputs = move_inputs_to_device(inputs, device)
+
+    do_sample = temperature > 0.0
+    gen_kwargs: dict[str, Any] = {
+        "max_new_tokens": max_new_tokens,
+        "do_sample": do_sample,
+    }
+    if do_sample:
+        gen_kwargs["temperature"] = temperature
+
+    out_ids = model.generate(**inputs, **gen_kwargs)
+    input_len = inputs["input_ids"].shape[1]
+    gen_ids = out_ids[0, input_len:]
+
+    return processor.tokenizer.decode(gen_ids, skip_special_tokens=True).strip()
+
+
 def take_from_dataset(ds, start: int, n: int) -> Iterable[dict[str, Any]]:
     assert start >= 0
     i = 0
@@ -112,9 +165,11 @@ def load_data(
 def get_item_mmmu(ex: dict[str, Any]) -> dict[str, Any]:
-    choices = ex["options"]
+    choices = ex.get("options", [])
     if isinstance(choices, str):
         # Convert string "['choice1', 'choice2']" to a list ['choice1', 'choice2']
-        choices = ast.literal_eval(choices)
+        choices = ast.literal_eval(choices) if choices else []
     return {
         "id": ex["id"],
         "image": ex["image_1"] if "image_1" in ex else ex["image"],
         "question": ex["question"] if "question" in ex else "",
         "choices": choices,
-        "answer": ex["answer"],
+        "answer": str(ex["answer"]),
     }
@@ -200,15 +255,23 @@ def extract_answer(generated_text: str) -> str | None:
     """
     text = generated_text.strip()
-    # Look for standalone letter [A-H] at the beginning, e.g. "A", "a", "A.", "a.", "A. Answer", "A Answer"
-    first_char_match = re.match(r"^([A-J])([.\s]+[^\s]+)?\.?$", text, re.IGNORECASE)
+    # Look for a letter at the beginning, e.g. "A", "A.", "(A)", "A Answer".
+    first_char_match = re.match(
+        r"^\s*\(?([A-J])\)?(?:[.)\s]|$)",
+        text,
+        re.IGNORECASE,
+    )
     if first_char_match:
         return first_char_match.group(1).upper()
+    # Common verbose outputs, e.g. "The answer is C", "Answer: C", "Option C".
+    answer_match = re.search(
+        r"\b(?:answer|option|choice)\s*(?:is|:)?\s*\(?([A-J])\)?\b",
+        text,
+        re.IGNORECASE,
+    )
+    if answer_match:
+        return answer_match.group(1).upper()
+
     return text
@@ -286,9 +349,18 @@ def evaluate_subject(
         A tuple of (correct_count, total_count, skipped_count).
     """
-    few_shot_examples = load_few_shot_examples(
-        dataset=dataset, subject=subject, n_shots=n_shots
-    )
+    vision_only = is_mmmu_pro_vision(dataset, subject)
+
+    if vision_only:
+        if n_shots > 0 and verbose:
+            print(
+                "\n[WARNING] MMMU-Pro vision subset is evaluated image-only; "
+                f"ignoring n_shots={n_shots}."
+            )
+        few_shot_examples: list[dict[str, Any]] = []
+    else:
+        few_shot_examples = load_few_shot_examples(
+            dataset=dataset, subject=subject, n_shots=n_shots
+        )
     test_data = load_data(
         dataset=dataset,
@@ -320,25 +392,37 @@ def evaluate_subject(
         item = get_item_mmmu(ex)
-        prompt = build_few_shot_prompt(
-            question=item["question"],
-            choices=item["choices"],
-            subject=subject,
-            few_shot_examples=few_shot_examples,
-        )
+        if vision_only:
+            prompt = "<image-only>"
+        else:
+            prompt = build_few_shot_prompt(
+                question=item["question"],
+                choices=item["choices"],
+                subject=subject,
+                few_shot_examples=few_shot_examples,
+            )
         try:
-            generated = generate_answer(
-                model=model,
-                processor=processor,
-                question=prompt,
-                image=item["image"],
-                device=device,
-                max_new_tokens=max_new_tokens,
-                max_seq_len=max_seq_len,
-                temperature=temperature,
-            )
+            if vision_only:
+                generated = generate_image_only_answer(
+                    model=model,
+                    processor=processor,
+                    image=item["image"],
+                    device=device,
+                    max_new_tokens=max_new_tokens,
+                    max_seq_len=max_seq_len,
+                    temperature=temperature,
+                )
+            else:
+                generated = generate_answer(
+                    model=model,
+                    processor=processor,
+                    question=prompt,
+                    image=item["image"],
+                    device=device,
+                    max_new_tokens=max_new_tokens,
+                    max_seq_len=max_seq_len,
+                    temperature=temperature,
+                )
         except ValueError as error:
             if "Mismatch in `image` token count between text and `input_ids`." in str(
                 error
@@ -365,8 +449,12 @@ def evaluate_subject(
         if verbose:
             print(f"\n[Sample {total}] Subject: {subject}")
-            print(f"Q: {item['question'][:100]}...")
-            print(f"Choices: {item['choices']}")
+            if vision_only:
+                print("Q: <embedded in image>")
+                print("Choices: <embedded in image>")
+            else:
+                print(f"Q: {item['question'][:100]}...")
+                print(f"Choices: {item['choices']}")
             print(
                 f"Generated: {generated}, Predicted: {predicted}, Gold: {gold}, Correct: {is_correct}"
             )
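To see what the relaxed answer extraction in the diff above accepts, the two regexes can be exercised in isolation. The sketch below copies the patterns from the suggested patch into a standalone function; the sample strings are made up for illustration and are not model outputs from this PR.

```python
from __future__ import annotations

import re


def extract_answer(generated_text: str) -> str | None:
    """Standalone copy of the extraction logic from the suggested diff."""
    text = generated_text.strip()

    # Leading letter: "A", "A.", "(A)", "A Answer".
    first_char_match = re.match(
        r"^\s*\(?([A-J])\)?(?:[.)\s]|$)",
        text,
        re.IGNORECASE,
    )
    if first_char_match:
        return first_char_match.group(1).upper()

    # Verbose forms: "The answer is C", "Answer: C", "Option (C)".
    answer_match = re.search(
        r"\b(?:answer|option|choice)\s*(?:is|:)?\s*\(?([A-J])\)?\b",
        text,
        re.IGNORECASE,
    )
    if answer_match:
        return answer_match.group(1).upper()

    # Fall back to the raw text so the caller can count it as incorrect.
    return text


# Illustrative inputs only.
for sample in ["b.", "(C)", "The answer is E", "Answer: F", "Option (G)", "no idea"]:
    print(repr(sample), "->", repr(extract_answer(sample)))
```

One thing worth keeping in mind: because the first pattern is case-insensitive and only requires whitespace after the letter, a caption-style output such as "A photo of a dog" is also read as answer "A". That matters for the image-only path, where the model may describe the image instead of answering.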
I've implemented your suggestion regarding the "vision" subset with slight corrections:
- I've put `generate_image_only_answer` into `vlm_eval_utils.py`.
- Didn't touch the `get_item_mmmu` function because I think `choices = ex.get("options", [])` may conceal errors.
Also, it looks like giving the model only an image is too challenging: it generates an image description instead of answering the question, which leads to 0 correct answers. Do you think we should provide at least some textual context along with the image, something like "This image contains a picture and a multichoice question. Answer the question with a single letter."?
I agree that a completely empty text prompt may make chat-style VLMs default to image captioning, which would not really measure the intended multiple-choice answering behavior.
I think adding a fixed, sample-agnostic instruction is acceptable.
For example, I would prefer something like:
Answer the multiple-choice question shown in the image. Return only one letter from A to J.
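As a rough illustration of how that fixed instruction could sit next to the image in the chat message, here is a minimal sketch. `build_image_only_messages` and `VISION_INSTRUCTION` are hypothetical names; the message layout follows the `{"type": "image"}` content format already used in the suggested `generate_image_only_answer`.

```python
from typing import Any

# Fixed, sample-agnostic instruction proposed in the review.
VISION_INSTRUCTION = (
    "Answer the multiple-choice question shown in the image. "
    "Return only one letter from A to J."
)


def build_image_only_messages(
    instruction: str = VISION_INSTRUCTION,
) -> list[dict[str, Any]]:
    """Build a single user turn: one image placeholder plus the fixed instruction.

    No question text, options, or few-shot exemplars are added, so the
    benchmark content still comes from the image alone.
    """
    return [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": instruction},
            ],
        }
    ]


messages = build_image_only_messages()
```

The resulting `messages` would then go through `processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)` exactly as in the image-only path above; only the extra text entry differs.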
👍 Done.
Accuracy rose from 0 to 33%:
| subject | correct | total | skipped | accuracy |
|---|---|---|---|---|
| vision | 3 | 9 | 1 | 0.3333 |
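For clarity on how the accuracy column relates to the other columns, a table like the one above can be derived from the `(correct, total, skipped)` tuples that `evaluate_mmmu` returns per subject. This is a sketch with a hypothetical `format_results` helper; it assumes, as the numbers above suggest (3 / 9), that skipped samples are not counted in `total`.

```python
def format_results(results: dict[str, tuple[int, int, int]]) -> str:
    """Render per-subject (correct, total, skipped) tuples as a Markdown table."""
    lines = [
        "| subject | correct | total | skipped | accuracy |",
        "|---|---|---|---|---|",
    ]
    for subject, (correct, total, skipped) in results.items():
        # Assumption: skipped samples are excluded from `total`.
        accuracy = correct / total if total else 0.0
        lines.append(
            f"| {subject} | {correct} | {total} | {skipped} | {accuracy:.4f} |"
        )
    return "\n".join(lines)


print(format_results({"vision": (3, 9, 1)}))
```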
n_shots = 0

eval_split = "validation" if dataset == "MMMU/MMMU" else "test"
few_shot_split = "test"
Maybe?
-few_shot_split = "test"
+few_shot_split = "dev"
Well, "test" split is present in both MMMU and MMMU_Pro while "dev" is present in MMMU only...
If we need to choose a different few-shot split, we'll need to add some logic:
few_shot_split = "dev" if dataset == "MMMU/MMMU" else "test"
But do we really need that?
Ah, I thought that you applied forced zero shot in #714 (comment).
Is there any reason to use few-shot evaluation? The official homepage also says they use zero-shot evaluation. We might as well use zero-shot on both MMMU and MMMU-Pro.
> Ah, I thought that you applied forced zero shot in #714 (comment).

Yes, but I reverted it after you suggested forcing zero-shot for the "vision" subset (I took that to mean that other subsets could still use few-shot).
Anyway, we can always pass the `--mmmu_n_shots=0` option, can't we?
I'd preserve this flexibility 😃
I got your point! Thank you.
This change refactors mmmu_eval_utils.py to support MMMU-Pro benchmark in addition to MMMU.
TICO-DCO-1.0-Signed-off-by: d.savchenkov <d.savchenkov@partner.samsung.com>
What
This PR extends the MMMU evaluation module to support the MMMU-Pro benchmark (`MMMU/MMMU_Pro`) in addition to the existing MMMU dataset (`MMMU/MMMU`). The `mmmu_eval_utils.py` module is refactored to be dataset-aware, allowing the same evaluation pipeline to handle both benchmarks through a `dataset` parameter.
Why
MMMU is listed as a required benchmark in PTQ Evaluation — Qwen3-VL. MMMU-Pro differs from MMMU in that it includes vision-only questions and supports up to 10 answer options (vs. 4 in MMMU), requiring adjustments to the answer extraction and evaluation logic.
Implementation Details
**`tico/quantization/evaluation/mmmu_eval_utils.py`** (refactored)
- `MMMU_SUBJECTS` and `MMMU_SPLITS` changed from flat lists to dictionaries keyed by dataset name, adding MMMU-Pro subjects (`standard (10 options)`, `standard (4 options)`, `vision`) and splits (`test` only)
- `load_data()`, `load_few_shot_examples()`, `evaluate_subject()`, `evaluate_mmmu()`: all gained a `dataset` parameter to select between `MMMU/MMMU` and `MMMU/MMMU_Pro`
- `get_item_mmmu()`: handles MMMU-Pro's different field names (`ex["image"]` instead of `ex["image_1"]`, missing `question` field for vision-only items)
- `extract_answer()`: regex expanded from `[A-H]` to `[A-J]` to support up to 10 options in MMMU-Pro
- `evaluate_subject()`: gracefully skips samples where the prompt exceeds `max_seq_len` (catches `ValueError` for image token count mismatch), and handles missing `image_2` field
- `validation` to `test` (MMMU-Pro only has `test`)

**`tico/quantization/wrapq/examples/quantize_qwen3_vl_with_gptq.py`** (updated)
- `--mmmu_dataset` CLI argument with choices `MMMU/MMMU` and `MMMU/MMMU_Pro`
- `--mmmu_dataset` instead of `--mmmu_subjects` alone
- `dataset` parameter passed through to `evaluate_mmmu()` in both original and quantized model evaluation paths
Example
python tico/quantization/wrapq/examples/quantize_qwen3_vl_with_gptq.py \
    --model=Qwen/Qwen3-VL-4B-Instruct \
    --cache_dir=/home/d.savchenkov/models/qwen3-vl-4b \
    --trust-remote-code \
    --no_GPTQ \
    --mmmu_dataset=MMMU/MMMU_Pro \
    --mmmu_subjects=vision \
    --mmmu_n_shots=5 \
    --mmmu_n_samples=10 \
    --embedding_weight_bits=16 \
    --vision_patch_embed_weight_bits=16 \
    --linear_weight_bits=16 \
    --lm_head_weight_bits=16 \
    --nsamples_for_qcalibration=10 \
    --verbose
Note
Now one needs to specify the `--mmmu_dataset` command-line option to choose between `MMMU/MMMU` and `MMMU/MMMU_Pro`.
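A restricted-choice option of this kind is typically declared with `argparse` along the following lines. This is a sketch, not the script's actual parser: the defaults and help strings are assumptions, and only the flag names and the two dataset choices come from the PR description.

```python
import argparse

MMMU_DATASETS = ["MMMU/MMMU", "MMMU/MMMU_Pro"]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="MMMU evaluation options (sketch)")
    parser.add_argument(
        "--mmmu_dataset",
        choices=MMMU_DATASETS,
        default=None,  # Assumption: no dataset means MMMU evaluation is skipped.
        help="Benchmark to evaluate on.",
    )
    parser.add_argument(
        "--mmmu_n_shots",
        type=int,
        default=5,  # Assumption: mirrors the n_shots default of evaluate_mmmu().
        help="Number of few-shot examples; 0 selects zero-shot evaluation.",
    )
    return parser


args = build_parser().parse_args(["--mmmu_dataset=MMMU/MMMU_Pro", "--mmmu_n_shots=0"])
print(args.mmmu_dataset, args.mmmu_n_shots)
```

With `choices` set, a misspelled dataset name is rejected by the parser before any model is loaded, which keeps the `ValueError` check in `evaluate_mmmu()` as a second line of defense rather than the first.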