[quantization] Support Evaluation of Qwen3-VL With MMMU-Pro Dataset by dvsav · Pull Request #714 · Samsung/TICO

dvsav · 2026-05-18T15:17:48Z

What

This PR extends the MMMU evaluation module to support the MMMU-Pro benchmark (MMMU/MMMU_Pro) in addition to the existing MMMU dataset (MMMU/MMMU). The mmmu_eval_utils.py module is refactored to be dataset-aware, allowing the same evaluation pipeline to handle both benchmarks through a dataset parameter.

Why

MMMU is listed as a required benchmark in PTQ Evaluation — Qwen3-VL. MMMU-Pro differs from MMMU in that it includes vision-only questions and supports up to 10 answer options (vs. 4 in MMMU), requiring adjustments to the answer extraction and evaluation logic.

Implementation Details

`tico/quantization/evaluation/mmmu_eval_utils.py` (refactored)

MMMU_SUBJECTS and MMMU_SPLITS changed from flat lists to dictionaries keyed by dataset name, adding MMMU-Pro subjects (standard (10 options), standard (4 options), vision) and splits (test only)
load_data(), load_few_shot_examples(), evaluate_subject(), evaluate_mmmu() — all gained a dataset parameter to select between MMMU/MMMU and MMMU/MMMU_Pro
get_item_mmmu() — handles MMMU-Pro's different field names (ex["image"] instead of ex["image_1"], missing question field for vision-only items)
extract_answer() — regex expanded from [A-H] to [A-J] to support up to 10 options in MMMU-Pro
evaluate_subject() — gracefully skips samples where the prompt exceeds max_seq_len (catches ValueError for image token count mismatch), and handles missing image_2 field
Evaluation split changed from validation to test (MMMU-Pro only has test)
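As a rough illustration of the dataset-keyed layout described above, the sketch below validates a (dataset, subject, split) request against two dictionaries. The MMMU subject list is a truncated placeholder, and `check_request` is a hypothetical helper, not a function from the PR.

```python
# Hypothetical sketch of dataset-keyed lookups; not the module's actual contents.
MMMU_DATASETS = ["MMMU/MMMU", "MMMU/MMMU_Pro"]

MMMU_SUBJECTS: dict[str, list[str]] = {
    # Truncated placeholder: the real MMMU subject list is much longer.
    "MMMU/MMMU": ["Accounting", "Art", "Math"],
    "MMMU/MMMU_Pro": ["standard (10 options)", "standard (4 options)", "vision"],
}

MMMU_SPLITS: dict[str, list[str]] = {
    "MMMU/MMMU": ["dev", "validation", "test"],
    "MMMU/MMMU_Pro": ["test"],
}


def check_request(dataset: str, subject: str, split: str) -> None:
    """Raise ValueError if the combination is not available (hypothetical helper)."""
    if dataset not in MMMU_DATASETS:
        raise ValueError(f"Invalid dataset '{dataset}'")
    if subject not in MMMU_SUBJECTS[dataset]:
        raise ValueError(f"Invalid subject '{subject}' for {dataset}")
    if split not in MMMU_SPLITS[dataset]:
        raise ValueError(f"Split '{split}' is not available for {dataset}")


check_request("MMMU/MMMU_Pro", "vision", "test")  # passes silently
```

Keying both dictionaries by dataset name keeps a request such as the `dev` split of MMMU-Pro from reaching the loader at all.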

`tico/quantization/wrapq/examples/quantize_qwen3_vl_with_gptq.py` (updated)

Added --mmmu_dataset CLI argument with choices MMMU/MMMU and MMMU/MMMU_Pro
MMMU evaluation now triggered by --mmmu_dataset instead of --mmmu_subjects alone
dataset parameter passed through to evaluate_mmmu() in both original and quantized model evaluation paths
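A minimal sketch of how such a CLI option could be declared with `argparse`. The option names come from this PR, but the defaults, the `nargs` choice, and the helpers `build_parser` and `should_run_mmmu` are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser covering only the MMMU-related options."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--mmmu_dataset",
        choices=["MMMU/MMMU", "MMMU/MMMU_Pro"],
        default=None,  # assumed: evaluation is skipped when the option is absent
    )
    parser.add_argument("--mmmu_subjects", nargs="*", default=None)
    parser.add_argument("--mmmu_n_shots", type=int, default=5)
    return parser


def should_run_mmmu(args: argparse.Namespace) -> bool:
    # MMMU evaluation is triggered by --mmmu_dataset rather than --mmmu_subjects.
    return args.mmmu_dataset is not None


args = build_parser().parse_args(
    ["--mmmu_dataset=MMMU/MMMU_Pro", "--mmmu_subjects=vision", "--mmmu_n_shots=0"]
)
```

Using `choices` lets `argparse` reject an unknown dataset name before any model is loaded.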

Example

python tico/quantization/wrapq/examples/quantize_qwen3_vl_with_gptq.py \
    --model=Qwen/Qwen3-VL-4B-Instruct \
    --cache_dir=/home/d.savchenkov/models/qwen3-vl-4b \
    --trust-remote-code \
    --no_GPTQ \
    --mmmu_dataset=MMMU/MMMU_Pro \
    --mmmu_subjects=vision \
    --mmmu_n_shots=5 \
    --mmmu_n_samples=10 \
    --embedding_weight_bits=16 \
    --vision_patch_embed_weight_bits=16 \
    --linear_weight_bits=16 \
    --lm_head_weight_bits=16 \
    --nsamples_for_qcalibration=10 \
    --verbose

=== MMMU Evaluation (Original Model) ===
| subject                                            | correct    | total      | skipped    | accuracy   |
| -------------------------------------------------- | ---------- | ---------- | ---------- | ---------- |
| vision                                             | 5          | 9          | 1          | 0.5556     |

=== MMMU Evaluation (Quantized Model) ===
| subject                                            | correct    | total      | skipped    | accuracy   |
| -------------------------------------------------- | ---------- | ---------- | ---------- | ---------- |
| vision                                             | 5          | 9          | 1          | 0.5556     |
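The relationship between the columns can be sketched as follows, assuming that skipped samples are excluded from `total` (which matches 10 requested samples, 1 skipped, 9 counted). The helpers `tally`, `accuracy`, and `fake_predict` are hypothetical stand-ins for the real evaluation loop.

```python
from typing import Callable, Iterable


def tally(
    samples: Iterable[tuple[str, str]],
    predict: Callable[[str], str],
) -> tuple[int, int, int]:
    """Return (correct, total, skipped); a ValueError from predict counts as skipped."""
    correct = total = skipped = 0
    for gold, payload in samples:
        try:
            predicted = predict(payload)
        except ValueError:
            skipped += 1  # e.g. the prompt did not fit into max_seq_len
            continue
        total += 1
        correct += int(predicted == gold)
    return correct, total, skipped


def accuracy(correct: int, total: int) -> float:
    return correct / total if total else 0.0


def fake_predict(payload: str) -> str:
    # Stand-in for model generation: echoes the payload or fails for one sample.
    if payload == "too long":
        raise ValueError("prompt exceeds max_seq_len")
    return payload


samples = [("A", "A")] * 5 + [("A", "B")] * 4 + [("A", "too long")]
correct, total, skipped = tally(samples, fake_predict)
print(correct, total, skipped, round(accuracy(correct, total), 4))  # 5 9 1 0.5556
```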

Note

The --mmmu_dataset command-line option must now be specified to choose between MMMU/MMMU and MMMU/MMMU_Pro.

mhs4670go · 2026-05-19T01:08:44Z

+        dataset=dataset,
        subject=subject,
-        split="dev",
+        split="test",


I think you changed this because MMMU-Pro doesn't have a dev split. Please correct me if I'm wrong.

That being said, we should not construct few-shot examples from the MMMU-Pro test split. The MMMU-Pro benchmark is designed as a stricter multimodal evaluation setting, including vision-only tasks. In such benchmarks, using labeled test samples as prompt exemplars can significantly distort the reported accuracy.

My recommendation is to not allow few-shot for MMMU-Pro and force/switch it to zero-shot.

The best structure is:

For MMMU/MMMU, keep the original behavior: few-shot from dev, evaluation on validation.

For MMMU/MMMU_Pro, since there is no separate exemplar split, evaluate with n_shots=0.

Even if the user runs --mmmu_dataset=MMMU/MMMU_Pro --mmmu_n_shots=5, the code should not crash. It should print a warning and continue in zero-shot mode.

This is the least inconvenient option for users and the safest option for evaluation reliability.

Recommended code

load_few_shot_examples()

def load_few_shot_examples(
    dataset: str,
    subject: str,
    n_shots: int = 5,
) -> list[dict[str, Any]]:
    """
    Load few-shot examples for a given MMMU subject.

    For MMMU/MMMU, examples are loaded from the `dev` split.
    For MMMU/MMMU_Pro, few-shot examples are not loaded because there is no
    separate exemplar split in the current setup.
    """
    if n_shots <= 0:
        return []

    if dataset == "MMMU/MMMU_Pro":
        return []

    ds = load_data(
        dataset=dataset,
        subject=subject,
        split="dev",
        n_samples=n_shots,
        streaming=True,
    )
    return [get_item_mmmu(ex) for ex in ds]

I would not use raise ValueError here. This is a low-level utility function, and it is not a great place to explain CLI usage to the user.

evaluate_mmmu()

def evaluate_mmmu(
    model,
    processor,
    dataset: str,
    subjects: list[str] | None = None,
    device: str | torch.device = "cuda",
    n_shots: int = 5,
    n_samples: int = -1,
    max_new_tokens: int = 16,
    max_seq_len: int | None = None,
    temperature: float = 0.0,
    verbose: bool = True,
) -> dict[str, tuple[int, int, int]]:
    if dataset not in MMMU_DATASETS:
        raise ValueError(f"Invalid dataset '{dataset}'")

    if dataset == "MMMU/MMMU_Pro" and n_shots > 0:
        if verbose:
            print(
                "[WARNING] MMMU-Pro few-shot evaluation is disabled because "
                "no separate few-shot split is available. Running zero-shot "
                "with n_shots=0."
            )
        n_shots = 0

    if subjects is None or (len(subjects) == 1 and "mmmu" in subjects[0]):
        subjects = MMMU_SUBJECTS[dataset]

    # ..

evaluate_subject()

In the current PR, few-shot examples are loaded from test, and evaluation is also performed on test. Changing the original MMMU evaluation to test as well is likely an unintended regression.

eval_split = "validation" if dataset == "MMMU/MMMU" else "test"

test_data = load_data(
    dataset=dataset,
    subject=subject,
    split=eval_split,
    n_samples=n_samples,
    streaming=True,
)

👍 Done.
What I've changed:

Added explicit argument split to load_data and load_few_shot_examples functions.

Added eval_split and few_shot_split arguments to evaluate_subject function.

Added logic to determine the above two "split" arguments in evaluate_mmmu function.

Forced zero-shot inference for "MMMU/MMMU_Pro" according to your 2nd suggestion.
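A sketch of the split and shot selection following the reviewer's recommended structure at this point in the review. `resolve_eval_plan` and its return shape are hypothetical; the actual code spreads this logic across `evaluate_mmmu` and `evaluate_subject`.

```python
from __future__ import annotations

import warnings


def resolve_eval_plan(dataset: str, n_shots: int) -> tuple[str, str | None, int]:
    """Return (eval_split, few_shot_split, effective_n_shots) for a dataset."""
    if dataset == "MMMU/MMMU":
        # Few-shot exemplars come from dev; evaluation runs on validation.
        few_shot_split = "dev" if n_shots > 0 else None
        return "validation", few_shot_split, max(n_shots, 0)
    if dataset == "MMMU/MMMU_Pro":
        if n_shots > 0:
            # Warn and continue instead of crashing on an unsupported setting.
            warnings.warn(
                "MMMU-Pro has no separate few-shot split; running zero-shot."
            )
        return "test", None, 0
    raise ValueError(f"Invalid dataset '{dataset}'")


print(resolve_eval_plan("MMMU/MMMU", 5))  # ('validation', 'dev', 5)
```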

mhs4670go · 2026-05-19T01:39:05Z

+    "MMMU/MMMU_Pro": [
+        "standard (10 options)",
+        "standard (4 options)",
+        "vision",


Regarding the vision subset, here is its description on Hugging Face:

In this subset, questions are embedded within screenshots or photos, and models must integrate visual and textual information to answer correctly. No separate text is fed into the model.

Therefore, for MMMU-Pro vision, please avoid reusing build_few_shot_prompt(). The question and answer options are already embedded in the image, so injecting textual choices/few-shot examples changes the benchmark setting.

Suggested fix: add an image-only generation path for dataset == "MMMU/MMMU_Pro" and subject == "vision", ignore n_shots for this subset, and do not pass question/options text into the processor.

diff --git a/tico/quantization/evaluation/mmmu_eval_utils.py b/tico/quantization/evaluation/mmmu_eval_utils.py
index 2760e30..xxxxxxx 100644
--- a/tico/quantization/evaluation/mmmu_eval_utils.py
+++ b/tico/quantization/evaluation/mmmu_eval_utils.py
@@ -19,7 +19,10 @@ import torch
 from datasets import load_dataset

-from tico.quantization.evaluation.vlm_eval_utils import generate_answer
+from tico.quantization.evaluation.vlm_eval_utils import (
+    generate_answer,
+    move_inputs_to_device,
+)

 MMMU_DATASETS = ["MMMU/MMMU", "MMMU/MMMU_Pro"]

@@ -68,6 +71,56 @@ MMMU_SPLITS: dict[str, list[str]] = {
 }


+MMMU_PRO_VISION_SUBJECT = "vision"
+
+
+def is_mmmu_pro_vision(dataset: str, subject: str) -> bool:
+    return dataset == "MMMU/MMMU_Pro" and subject == MMMU_PRO_VISION_SUBJECT
+
+
+@torch.no_grad()
+def generate_image_only_answer(
+    model,
+    processor,
+    image,
+    device: str | torch.device,
+    max_new_tokens: int = 16,
+    temperature: float = 0.0,
+    max_seq_len: int | None = None,
+) -> str:
+    """
+    Generate an answer from the image only.
+
+    This is used for MMMU-Pro's vision subset, where the question and answer
+    options are embedded in the image. Do not inject question/options/few-shot
+    text into the prompt for this subset.
+    """
+    messages = [
+        {
+            "role": "user",
+            "content": [
+                {"type": "image"},
+            ],
+        }
+    ]
+    prompt = processor.apply_chat_template(
+        messages,
+        tokenize=False,
+        add_generation_prompt=True,
+    )
+
+    processor_kwargs: dict[str, Any] = {
+        "text": prompt,
+        "images": image,
+        "return_tensors": "pt",
+    }
+    if max_seq_len is not None and max_seq_len > 0:
+        processor_kwargs["truncation"] = True
+        processor_kwargs["max_length"] = max_seq_len
+
+    inputs = processor(**processor_kwargs)
+    inputs = move_inputs_to_device(inputs, device)
+
+    do_sample = temperature > 0.0
+    gen_kwargs: dict[str, Any] = {
+        "max_new_tokens": max_new_tokens,
+        "do_sample": do_sample,
+    }
+    if do_sample:
+        gen_kwargs["temperature"] = temperature
+
+    out_ids = model.generate(**inputs, **gen_kwargs)
+    input_len = inputs["input_ids"].shape[1]
+    gen_ids = out_ids[0, input_len:]
+
+    return processor.tokenizer.decode(gen_ids, skip_special_tokens=True).strip()
+
+
 def take_from_dataset(ds, start: int, n: int) -> Iterable[dict[str, Any]]:
     assert start >= 0
     i = 0
@@ -112,9 +165,11 @@ def load_data(


 def get_item_mmmu(ex: dict[str, Any]) -> dict[str, Any]:
-    choices = ex["options"]
+    choices = ex.get("options", [])
     if isinstance(choices, str):
         # Convert string "['choice1', 'choice2']" to a list ['choice1', 'choice2']
-        choices = ast.literal_eval(choices)
+        choices = ast.literal_eval(choices) if choices else []

     return {
         "id": ex["id"],
         "image": ex["image_1"] if "image_1" in ex else ex["image"],
         "question": ex["question"] if "question" in ex else "",
         "choices": choices,
-        "answer": ex["answer"],
+        "answer": str(ex["answer"]),
     }

@@ -200,15 +255,23 @@ def extract_answer(generated_text: str) -> str | None:
     """
     text = generated_text.strip()

-    # Look for standalone letter [A-H] at the beginning, e.g. "A", "a", "A.", "a.", "A. Answer", "A Answer"
-    first_char_match = re.match(r"^([A-J])([.\s]+[^\s]+)?\.?$", text, re.IGNORECASE)
+    # Look for a letter at the beginning, e.g. "A", "A.", "(A)", "A Answer".
+    first_char_match = re.match(
+        r"^\s*\(?([A-J])\)?(?:[.)\s]|$)",
+        text,
+        re.IGNORECASE,
+    )
     if first_char_match:
         return first_char_match.group(1).upper()

+    # Common verbose outputs, e.g. "The answer is C", "Answer: C", "Option C".
+    answer_match = re.search(
+        r"\b(?:answer|option|choice)\s*(?:is|:)?\s*\(?([A-J])\)?\b",
+        text,
+        re.IGNORECASE,
+    )
+    if answer_match:
+        return answer_match.group(1).upper()
+
     return text

@@ -286,9 +349,18 @@ def evaluate_subject(
         A tuple of (correct_count, total_count, skipped_count).
     """

-    few_shot_examples = load_few_shot_examples(
-        dataset=dataset, subject=subject, n_shots=n_shots
-    )
+    vision_only = is_mmmu_pro_vision(dataset, subject)
+
+    if vision_only:
+        if n_shots > 0 and verbose:
+            print(
+                "\n[WARNING] MMMU-Pro vision subset is evaluated image-only; "
+                f"ignoring n_shots={n_shots}."
+            )
+        few_shot_examples: list[dict[str, Any]] = []
+    else:
+        few_shot_examples = load_few_shot_examples(
+            dataset=dataset, subject=subject, n_shots=n_shots
+        )

     test_data = load_data(
         dataset=dataset,
@@ -320,25 +392,37 @@ def evaluate_subject(

         item = get_item_mmmu(ex)

-        prompt = build_few_shot_prompt(
-            question=item["question"],
-            choices=item["choices"],
-            subject=subject,
-            few_shot_examples=few_shot_examples,
-        )
+        if vision_only:
+            prompt = "<image-only>"
+        else:
+            prompt = build_few_shot_prompt(
+                question=item["question"],
+                choices=item["choices"],
+                subject=subject,
+                few_shot_examples=few_shot_examples,
+            )

         try:
-            generated = generate_answer(
-                model=model,
-                processor=processor,
-                question=prompt,
-                image=item["image"],
-                device=device,
-                max_new_tokens=max_new_tokens,
-                max_seq_len=max_seq_len,
-                temperature=temperature,
-            )
+            if vision_only:
+                generated = generate_image_only_answer(
+                    model=model,
+                    processor=processor,
+                    image=item["image"],
+                    device=device,
+                    max_new_tokens=max_new_tokens,
+                    max_seq_len=max_seq_len,
+                    temperature=temperature,
+                )
+            else:
+                generated = generate_answer(
+                    model=model,
+                    processor=processor,
+                    question=prompt,
+                    image=item["image"],
+                    device=device,
+                    max_new_tokens=max_new_tokens,
+                    max_seq_len=max_seq_len,
+                    temperature=temperature,
+                )
         except ValueError as error:
             if "Mismatch in `image` token count between text and `input_ids`." in str(
                 error
@@ -365,8 +449,12 @@ def evaluate_subject(

         if verbose:
             print(f"\n[Sample {total}] Subject: {subject}")
-            print(f"Q: {item['question'][:100]}...")
-            print(f"Choices: {item['choices']}")
+            if vision_only:
+                print("Q: <embedded in image>")
+                print("Choices: <embedded in image>")
+            else:
+                print(f"Q: {item['question'][:100]}...")
+                print(f"Choices: {item['choices']}")
             print(
                 f"Generated: {generated}, Predicted: {predicted}, Gold: {gold}, Correct: {is_correct}"
             )
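The answer-extraction idea in this suggestion can be exercised in isolation. The sketch below assumes the patterns accept an optionally parenthesized letter from A to J; `extract_letter` is a hypothetical name, and unlike the suggested `extract_answer`, it returns `None` instead of the raw text when nothing matches.

```python
from __future__ import annotations

import re

# A letter at the start of the output, e.g. "A", "A.", "(A)", "A Answer".
_LEADING = re.compile(r"^\s*\(?([A-J])\)?(?:[.)\s]|$)", re.IGNORECASE)
# Verbose forms, e.g. "The answer is C", "Answer: C", "Option (D)".
_VERBOSE = re.compile(
    r"\b(?:answer|option|choice)\s*(?:is|:)?\s*\(?([A-J])\)?\b",
    re.IGNORECASE,
)


def extract_letter(generated_text: str) -> str | None:
    """Return the predicted option letter in upper case, or None if not found."""
    text = generated_text.strip()
    match = _LEADING.match(text) or _VERBOSE.search(text)
    return match.group(1).upper() if match else None


for sample in ["A", "(b)", "J. Ten", "The answer is C", "Option (D)", "Kiwi"]:
    print(repr(sample), "->", extract_letter(sample))
```

Because matching is case-insensitive, free-form output that begins with a standalone word such as "I" would also be read as an option letter, so a constrained prompt remains important.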

I've implemented your suggestion regarding the "vision" subset with slight corrections:

I've put generate_image_only_answer into vlm_eval_utils.py.

I didn't touch the get_item_mmmu function because I think choices = ex.get("options", []) may conceal errors.

Also, giving the model only an image appears to be too challenging: it generates an image description rather than answering the question, which leads to 0 correct answers. Do you think we should provide at least some textual context along with the image, something like "This image contains a picture and a multichoice question. Answer the question with a single letter."?

I agree that a completely empty text prompt may make chat-style VLMs default to image captioning, which would not really measure the intended multiple-choice answering behavior.

I think adding a fixed, sample-agnostic instruction is acceptable.

For example, I would prefer something like:

Answer the multiple-choice question shown in the image. Return only one letter from A to J.
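A sketch of how this fixed instruction might be attached to the image in a chat-style message list, mirroring the `messages` structure from the suggested `generate_image_only_answer`. The names `VISION_INSTRUCTION` and `build_vision_messages` are hypothetical, and the merged code may be organized differently.

```python
from typing import Any

# Sample-agnostic instruction: identical for every item, so it leaks nothing
# about the specific question or its options.
VISION_INSTRUCTION = (
    "Answer the multiple-choice question shown in the image. "
    "Return only one letter from A to J."
)


def build_vision_messages(
    instruction: str = VISION_INSTRUCTION,
) -> list[dict[str, Any]]:
    """Build a single user turn holding an image slot plus the fixed instruction."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": instruction},
            ],
        }
    ]


messages = build_vision_messages()
```

The resulting list would then be rendered with `processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)`, as in the image-only path suggested earlier.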

👍 Done.
The accuracy has grown from 0 to 33%

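| subject                                            | correct    | total      | skipped    | accuracy   |
| -------------------------------------------------- | ---------- | ---------- | ---------- | ---------- |
| vision                                             | 3          | 9          | 1          | 0.3333     |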

mhs4670go · 2026-05-19T08:39:53Z

+        n_shots = 0
+
+    eval_split = "validation" if dataset == "MMMU/MMMU" else "test"
+    few_shot_split = "test"


Maybe?

Suggested change

-    few_shot_split = "test"
+    few_shot_split = "dev"

Well, the "test" split is present in both MMMU and MMMU_Pro, while "dev" is present in MMMU only...
If we need to choose a different few-shot split per dataset, we'll need to add some logic:

few_shot_split = "dev" if dataset == "MMMU/MMMU" else "test"

But do we really need that?

Ah, I thought that you applied forced zero shot in #714 (comment).

Is there any reason to use few-shot evaluation? The official homepage also says they use zero-shot evaluation. We might as well use zero-shot on both MMMU and MMMU-Pro.

Ah, I thought that you applied forced zero shot in #714 (comment).

Yes, but I reverted that after you suggested forcing zero-shot for the "vision" subset (I assumed that meant other subsets could still use few-shot).

Anyway, we can always pass the --mmmu_n_shots=0 option, can't we?
I'd preserve this flexibility 😃

I got your point! Thank you.

This change refactors mmmu_eval_utils.py to support MMMU-Pro benchmark in addition to MMMU. TICO-DCO-1.0-Signed-off-by: d.savchenkov <d.savchenkov@partner.samsung.com>

mhs4670go

LGTM

Torrero

LGTM

dvsav requested a review from Torrero May 18, 2026 15:22

dvsav marked this pull request as ready for review May 18, 2026 15:29

mhs4670go reviewed May 19, 2026


dvsav force-pushed the mmmu_pro branch from 9afaa35 to e18f35e Compare May 19, 2026 07:44

mhs4670go reviewed May 19, 2026


dvsav force-pushed the mmmu_pro branch from e18f35e to bd8156c Compare May 19, 2026 09:13

dvsav requested a review from mhs4670go May 19, 2026 10:33

[quantization] Support Evaluation of Qwen3-VL With MMMU-Pro Dataset

31f61d0


dvsav force-pushed the mmmu_pro branch from bd8156c to 31f61d0 Compare May 19, 2026 13:33

mhs4670go approved these changes May 20, 2026


Torrero approved these changes May 20, 2026


mhs4670go merged commit 6c0e023 into Samsung:main May 21, 2026
7 checks passed
