[quantization] Support Evaluation of Qwen3-VL With MMMU-Pro Dataset #714

Merged: mhs4670go merged 1 commit into Samsung:main from dvsav:mmmu_pro on May 21, 2026

@dvsav dvsav commented May 18, 2026

What

This PR extends the MMMU evaluation module to support the MMMU-Pro benchmark (MMMU/MMMU_Pro) in addition to the existing MMMU dataset (MMMU/MMMU). The mmmu_eval_utils.py module is refactored to be dataset-aware, allowing the same evaluation pipeline to handle both benchmarks through a dataset parameter.

Why

MMMU is listed as a required benchmark in PTQ Evaluation — Qwen3-VL. MMMU-Pro differs from MMMU in that it includes vision-only questions and supports up to 10 answer options (vs. 4 in MMMU), requiring adjustments to the answer extraction and evaluation logic.

Implementation Details

tico/quantization/evaluation/mmmu_eval_utils.py (refactored)

  • MMMU_SUBJECTS and MMMU_SPLITS changed from flat lists to dictionaries keyed by dataset name, adding MMMU-Pro subjects (standard (10 options), standard (4 options), vision) and splits (test only)
  • load_data(), load_few_shot_examples(), evaluate_subject(), evaluate_mmmu() — all gained a dataset parameter to select between MMMU/MMMU and MMMU/MMMU_Pro
  • get_item_mmmu() — handles MMMU-Pro's different field names (ex["image"] instead of ex["image_1"], missing question field for vision-only items)
  • extract_answer() — regex expanded from [A-H] to [A-J] to support up to 10 options in MMMU-Pro
  • evaluate_subject() — gracefully skips samples where the prompt exceeds max_seq_len (catches ValueError for image token count mismatch), and handles missing image_2 field
  • Evaluation split changed from validation to test (MMMU-Pro only has test)
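
As a rough illustration of the dataset-keyed layout described in the list above, the sketch below shows how subjects and splits might be looked up per dataset. The MMMU/MMMU subject list is a shortened, illustrative subset, and `check_request` is a hypothetical helper that is not part of the PR.

```python
# Sketch only: the dictionary names follow the PR description, the contents are illustrative.
MMMU_DATASETS = ["MMMU/MMMU", "MMMU/MMMU_Pro"]

MMMU_SUBJECTS: dict[str, list[str]] = {
    # Illustrative subset; the real MMMU subject list is much longer.
    "MMMU/MMMU": ["Accounting", "Art"],
    "MMMU/MMMU_Pro": ["standard (10 options)", "standard (4 options)", "vision"],
}

MMMU_SPLITS: dict[str, list[str]] = {
    "MMMU/MMMU": ["dev", "validation", "test"],
    "MMMU/MMMU_Pro": ["test"],  # MMMU-Pro only has a test split
}


def check_request(dataset: str, subject: str, split: str) -> None:
    """Hypothetical helper: raise ValueError for an unknown dataset/subject/split."""
    if dataset not in MMMU_DATASETS:
        raise ValueError(f"Invalid dataset '{dataset}'")
    if subject not in MMMU_SUBJECTS[dataset]:
        raise ValueError(f"Invalid subject '{subject}' for dataset '{dataset}'")
    if split not in MMMU_SPLITS[dataset]:
        raise ValueError(f"Invalid split '{split}' for dataset '{dataset}'")
```

Keying both tables by dataset name keeps the per-benchmark differences in data rather than in scattered conditionals.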

tico/quantization/wrapq/examples/quantize_qwen3_vl_with_gptq.py (updated)

  • Added --mmmu_dataset CLI argument with choices MMMU/MMMU and MMMU/MMMU_Pro
  • MMMU evaluation now triggered by --mmmu_dataset instead of --mmmu_subjects alone
  • dataset parameter passed through to evaluate_mmmu() in both original and quantized model evaluation paths
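
A minimal sketch of how such a CLI option could be declared with `argparse`. The defaults, help text, `nargs` setting, and the `should_run_mmmu` helper are assumptions for illustration; the actual script may define its arguments differently.

```python
import argparse

MMMU_DATASETS = ["MMMU/MMMU", "MMMU/MMMU_Pro"]


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with only the MMMU-related options (illustrative subset)."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--mmmu_dataset",
        type=str,
        default=None,  # assumption: MMMU evaluation is skipped when unset
        choices=MMMU_DATASETS,
        help="Benchmark to evaluate on.",
    )
    parser.add_argument("--mmmu_subjects", type=str, nargs="*", default=None)
    parser.add_argument("--mmmu_n_shots", type=int, default=5)
    return parser


def should_run_mmmu(args: argparse.Namespace) -> bool:
    """Hypothetical helper: evaluation is triggered by --mmmu_dataset."""
    return args.mmmu_dataset is not None


args = build_parser().parse_args(
    ["--mmmu_dataset=MMMU/MMMU_Pro", "--mmmu_subjects", "vision"]
)
print(should_run_mmmu(args), args.mmmu_subjects)
```

Restricting the option with `choices` lets `argparse` reject an unknown dataset name before any model is loaded.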

Example

python tico/quantization/wrapq/examples/quantize_qwen3_vl_with_gptq.py \
    --model=Qwen/Qwen3-VL-4B-Instruct \
    --cache_dir=/home/d.savchenkov/models/qwen3-vl-4b \
    --trust-remote-code \
    --no_GPTQ \
    --mmmu_dataset=MMMU/MMMU_Pro \
    --mmmu_subjects=vision \
    --mmmu_n_shots=5 \
    --mmmu_n_samples=10 \
    --embedding_weight_bits=16 \
    --vision_patch_embed_weight_bits=16 \
    --linear_weight_bits=16 \
    --lm_head_weight_bits=16 \
    --nsamples_for_qcalibration=10 \
    --verbose
=== MMMU Evaluation (Original Model) ===
| subject                                            | correct    | total      | skipped    | accuracy   |
| -------------------------------------------------- | ---------- | ---------- | ---------- | ---------- |
| vision                                             | 5          | 9          | 1          | 0.5556     |

=== MMMU Evaluation (Quantized Model) ===
| subject                                            | correct    | total      | skipped    | accuracy   |
| -------------------------------------------------- | ---------- | ---------- | ---------- | ---------- |
| vision                                             | 5          | 9          | 1          | 0.5556     |
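
The numbers in the tables are consistent with skipped samples being excluded from `total`, so that accuracy is `correct / total` (5 / 9 = 0.5556). The sketch below reproduces that bookkeeping under this assumption; `tally`, `accuracy`, and `fake_predict` are hypothetical names, and only the error message is taken from the PR's diff.

```python
from typing import Callable, Iterable

# Error text that the PR treats as "prompt too long, skip this sample".
SKIP_MARKER = "Mismatch in `image` token count between text and `input_ids`."


def tally(
    samples: Iterable[dict], predict: Callable[[dict], str]
) -> tuple[int, int, int]:
    """Return (correct, total, skipped); skipped samples do not count toward total."""
    correct = total = skipped = 0
    for sample in samples:
        try:
            predicted = predict(sample)
        except ValueError as error:
            if SKIP_MARKER in str(error):
                skipped += 1
                continue
            raise  # unrelated errors are not swallowed
        total += 1
        correct += int(predicted == sample["answer"])
    return correct, total, skipped


def accuracy(correct: int, total: int) -> float:
    return correct / total if total else 0.0


samples = [{"id": i, "answer": "A"} for i in range(10)]


def fake_predict(sample: dict) -> str:
    """Stand-in for the model: one skipped sample, five right, four wrong."""
    if sample["id"] == 0:
        raise ValueError(SKIP_MARKER)
    return "A" if sample["id"] <= 5 else "B"


correct, total, skipped = tally(samples, fake_predict)
print(correct, total, skipped, f"{accuracy(correct, total):.4f}")  # 5 9 1 0.5556
```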

Note

One must now specify the --mmmu_dataset command-line option to choose between MMMU/MMMU and MMMU/MMMU_Pro; MMMU evaluation is triggered by this option rather than by --mmmu_subjects alone.

@dvsav dvsav requested a review from Torrero May 18, 2026 15:22
@dvsav dvsav marked this pull request as ready for review May 18, 2026 15:29
     dataset=dataset,
     subject=subject,
-    split="dev",
+    split="test",
Contributor

I think you changed this because MMMU-Pro doesn't have a dev split. Please correct me if I'm wrong.

That being said, we should not construct few-shot examples from the MMMU-Pro test split. The MMMU-Pro benchmark is designed as a stricter multimodal evaluation setting, including vision-only tasks. In such benchmarks, using labeled test samples as prompt exemplars can significantly distort the reported accuracy.


My recommendation is to not allow few-shot for MMMU-Pro and force/switch it to zero-shot.

The best structure is:

  1. For MMMU/MMMU, keep the original behavior: few-shot from dev, evaluation on validation.
  2. For MMMU/MMMU_Pro, since there is no separate exemplar split, evaluate with n_shots=0.
  3. Even if the user runs --mmmu_dataset=MMMU/MMMU_Pro --mmmu_n_shots=5, the code should not crash. It should print a warning and continue in zero-shot mode.

This is the least inconvenient option for users and the safest option for evaluation reliability.

Recommended code

  1. load_few_shot_examples()
def load_few_shot_examples(
    dataset: str,
    subject: str,
    n_shots: int = 5,
) -> list[dict[str, Any]]:
    """
    Load few-shot examples for a given MMMU subject.

    For MMMU/MMMU, examples are loaded from the `dev` split.
    For MMMU/MMMU_Pro, few-shot examples are not loaded because there is no
    separate exemplar split in the current setup.
    """
    if n_shots <= 0:
        return []

    if dataset == "MMMU/MMMU_Pro":
        return []

    ds = load_data(
        dataset=dataset,
        subject=subject,
        split="dev",
        n_samples=n_shots,
        streaming=True,
    )

    return [get_item_mmmu(ex) for ex in ds]

I would not use raise ValueError here. This is a low-level utility function, and it is not a great place to explain CLI usage to the user.

  2. evaluate_mmmu()
def evaluate_mmmu(
    model,
    processor,
    dataset: str,
    subjects: list[str] | None = None,
    device: str | torch.device = "cuda",
    n_shots: int = 5,
    n_samples: int = -1,
    max_new_tokens: int = 16,
    max_seq_len: int | None = None,
    temperature: float = 0.0,
    verbose: bool = True,
) -> dict[str, tuple[int, int, int]]:
    if dataset not in MMMU_DATASETS:
        raise ValueError(f"Invalid dataset '{dataset}'")

    if dataset == "MMMU/MMMU_Pro" and n_shots > 0:
        if verbose:
            print(
                "[WARNING] MMMU-Pro few-shot evaluation is disabled because "
                "no separate few-shot split is available. Running zero-shot "
                "with n_shots=0."
            )
        n_shots = 0

    if subjects is None or (len(subjects) == 1 and "mmmu" in subjects[0]):
        subjects = MMMU_SUBJECTS[dataset]

    # ..
  3. evaluate_subject()

In the current PR, few-shot examples are loaded from test, and evaluation is also performed on test. Changing the original MMMU evaluation to test as well is likely an unintended regression.

eval_split = "validation" if dataset == "MMMU/MMMU" else "test"

test_data = load_data(
    dataset=dataset,
    subject=subject,
    split=eval_split,
    n_samples=n_samples,
    streaming=True,
)

Contributor Author

👍 Done.
What I've changed:

  1. Added explicit argument split to load_data and load_few_shot_examples functions.
  2. Added eval_split and few_shot_split arguments to evaluate_subject function.
  3. Added logic to determine the above two "split" arguments in evaluate_mmmu function.
  4. Forced zero-shot inference for "MMMU/MMMU_Pro" according to your 2nd suggestion.
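
The split-selection logic described in this list might look roughly like the sketch below. It follows the reviewer's recommended structure (few-shot from dev and evaluation on validation for MMMU/MMMU, forced zero-shot on test for MMMU/MMMU_Pro); `resolve_eval_setup` is a hypothetical name, the use of `warnings.warn` is an assumption, and the thread later revisits whether zero-shot should stay forced.

```python
import warnings
from typing import Optional


def resolve_eval_setup(dataset: str, n_shots: int) -> tuple[str, Optional[str], int]:
    """Hypothetical helper: return (eval_split, few_shot_split, n_shots) for a dataset."""
    if dataset == "MMMU/MMMU":
        # Original behavior: exemplars from dev, evaluation on validation.
        return "validation", "dev", n_shots
    if dataset == "MMMU/MMMU_Pro":
        if n_shots > 0:
            # Warn instead of raising, so a stray --mmmu_n_shots does not crash the run.
            warnings.warn(
                "MMMU-Pro has no separate few-shot split; running zero-shot."
            )
        return "test", None, 0
    raise ValueError(f"Invalid dataset '{dataset}'")


print(resolve_eval_setup("MMMU/MMMU", 5))
```

Resolving both splits in one place keeps evaluate_subject free of dataset-specific branching.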

"MMMU/MMMU_Pro": [
"standard (10 options)",
"standard (4 options)",
"vision",
Contributor

As for the vision subset, here is its description on Hugging Face:

In this subset, questions are embedded within screenshots or photos, and models must integrate visual and textual information to answer correctly. No separate text is fed into the model.

Therefore, for MMMU-Pro vision, please avoid reusing build_few_shot_prompt(). The question and answer options are already embedded in the image, so injecting textual choices/few-shot examples changes the benchmark setting.

Suggested fix: add an image-only generation path for dataset == "MMMU/MMMU_Pro" and subject == "vision", ignore n_shots for this subset, and do not pass question/options text into the processor.

diff --git a/tico/quantization/evaluation/mmmu_eval_utils.py b/tico/quantization/evaluation/mmmu_eval_utils.py
index 2760e30..xxxxxxx 100644
--- a/tico/quantization/evaluation/mmmu_eval_utils.py
+++ b/tico/quantization/evaluation/mmmu_eval_utils.py
@@ -19,7 +19,10 @@ import torch
 from datasets import load_dataset
 
-from tico.quantization.evaluation.vlm_eval_utils import generate_answer
+from tico.quantization.evaluation.vlm_eval_utils import (
+    generate_answer,
+    move_inputs_to_device,
+)
 
 
 MMMU_DATASETS = ["MMMU/MMMU", "MMMU/MMMU_Pro"]
@@ -68,6 +71,56 @@ MMMU_SPLITS: dict[str, list[str]] = {
 }
 
 
+MMMU_PRO_VISION_SUBJECT = "vision"
+
+
+def is_mmmu_pro_vision(dataset: str, subject: str) -> bool:
+    return dataset == "MMMU/MMMU_Pro" and subject == MMMU_PRO_VISION_SUBJECT
+
+
+@torch.no_grad()
+def generate_image_only_answer(
+    model,
+    processor,
+    image,
+    device: str | torch.device,
+    max_new_tokens: int = 16,
+    temperature: float = 0.0,
+    max_seq_len: int | None = None,
+) -> str:
+    """
+    Generate an answer from the image only.
+
+    This is used for MMMU-Pro's vision subset, where the question and answer
+    options are embedded in the image. Do not inject question/options/few-shot
+    text into the prompt for this subset.
+    """
+    messages = [
+        {
+            "role": "user",
+            "content": [
+                {"type": "image"},
+            ],
+        }
+    ]
+    prompt = processor.apply_chat_template(
+        messages,
+        tokenize=False,
+        add_generation_prompt=True,
+    )
+
+    processor_kwargs: dict[str, Any] = {
+        "text": prompt,
+        "images": image,
+        "return_tensors": "pt",
+    }
+    if max_seq_len is not None and max_seq_len > 0:
+        processor_kwargs["truncation"] = True
+        processor_kwargs["max_length"] = max_seq_len
+
+    inputs = processor(**processor_kwargs)
+    inputs = move_inputs_to_device(inputs, device)
+
+    do_sample = temperature > 0.0
+    gen_kwargs: dict[str, Any] = {
+        "max_new_tokens": max_new_tokens,
+        "do_sample": do_sample,
+    }
+    if do_sample:
+        gen_kwargs["temperature"] = temperature
+
+    out_ids = model.generate(**inputs, **gen_kwargs)
+    input_len = inputs["input_ids"].shape[1]
+    gen_ids = out_ids[0, input_len:]
+
+    return processor.tokenizer.decode(gen_ids, skip_special_tokens=True).strip()
+
+
 def take_from_dataset(ds, start: int, n: int) -> Iterable[dict[str, Any]]:
     assert start >= 0
     i = 0
@@ -112,9 +165,11 @@ def load_data(
 
 
 def get_item_mmmu(ex: dict[str, Any]) -> dict[str, Any]:
-    choices = ex["options"]
+    choices = ex.get("options", [])
     if isinstance(choices, str):
         # Convert string "['choice1', 'choice2']" to a list ['choice1', 'choice2']
-        choices = ast.literal_eval(choices)
+        choices = ast.literal_eval(choices) if choices else []
 
     return {
         "id": ex["id"],
         "image": ex["image_1"] if "image_1" in ex else ex["image"],
         "question": ex["question"] if "question" in ex else "",
         "choices": choices,
-        "answer": ex["answer"],
+        "answer": str(ex["answer"]),
     }
@@ -200,15 +255,23 @@ def extract_answer(generated_text: str) -> str | None:
     """
     text = generated_text.strip()
 
-    # Look for standalone letter [A-H] at the beginning, e.g. "A", "a", "A.", "a.", "A. Answer", "A Answer"
-    first_char_match = re.match(r"^([A-J])([.\s]+[^\s]+)?\.?$", text, re.IGNORECASE)
+    # Look for a letter at the beginning, e.g. "A", "A.", "(A)", "A Answer".
+    first_char_match = re.match(
+        r"^\s*\(?([A-J])\)?(?:[.)\s]|$)",
+        text,
+        re.IGNORECASE,
+    )
     if first_char_match:
         return first_char_match.group(1).upper()
 
+    # Common verbose outputs, e.g. "The answer is C", "Answer: C", "Option C".
+    answer_match = re.search(
+        r"\b(?:answer|option|choice)\s*(?:is|:)?\s*\(?([A-J])\)?\b",
+        text,
+        re.IGNORECASE,
+    )
+    if answer_match:
+        return answer_match.group(1).upper()
+
     return text
 
 
@@ -286,9 +349,18 @@ def evaluate_subject(
         A tuple of (correct_count, total_count, skipped_count).
     """
-    few_shot_examples = load_few_shot_examples(
-        dataset=dataset, subject=subject, n_shots=n_shots
-    )
+    vision_only = is_mmmu_pro_vision(dataset, subject)
+
+    if vision_only:
+        if n_shots > 0 and verbose:
+            print(
+                "\n[WARNING] MMMU-Pro vision subset is evaluated image-only; "
+                f"ignoring n_shots={n_shots}."
+            )
+        few_shot_examples: list[dict[str, Any]] = []
+    else:
+        few_shot_examples = load_few_shot_examples(
+            dataset=dataset, subject=subject, n_shots=n_shots
+        )
 
     test_data = load_data(
         dataset=dataset,
@@ -320,25 +392,37 @@ def evaluate_subject(
 
         item = get_item_mmmu(ex)
 
-        prompt = build_few_shot_prompt(
-            question=item["question"],
-            choices=item["choices"],
-            subject=subject,
-            few_shot_examples=few_shot_examples,
-        )
+        if vision_only:
+            prompt = "<image-only>"
+        else:
+            prompt = build_few_shot_prompt(
+                question=item["question"],
+                choices=item["choices"],
+                subject=subject,
+                few_shot_examples=few_shot_examples,
+            )
 
         try:
-            generated = generate_answer(
-                model=model,
-                processor=processor,
-                question=prompt,
-                image=item["image"],
-                device=device,
-                max_new_tokens=max_new_tokens,
-                max_seq_len=max_seq_len,
-                temperature=temperature,
-            )
+            if vision_only:
+                generated = generate_image_only_answer(
+                    model=model,
+                    processor=processor,
+                    image=item["image"],
+                    device=device,
+                    max_new_tokens=max_new_tokens,
+                    max_seq_len=max_seq_len,
+                    temperature=temperature,
+                )
+            else:
+                generated = generate_answer(
+                    model=model,
+                    processor=processor,
+                    question=prompt,
+                    image=item["image"],
+                    device=device,
+                    max_new_tokens=max_new_tokens,
+                    max_seq_len=max_seq_len,
+                    temperature=temperature,
+                )
         except ValueError as error:
             if "Mismatch in `image` token count between text and `input_ids`." in str(
                 error
@@ -365,8 +449,12 @@ def evaluate_subject(
 
         if verbose:
             print(f"\n[Sample {total}] Subject: {subject}")
-            print(f"Q: {item['question'][:100]}...")
-            print(f"Choices: {item['choices']}")
+            if vision_only:
+                print("Q: <embedded in image>")
+                print("Choices: <embedded in image>")
+            else:
+                print(f"Q: {item['question'][:100]}...")
+                print(f"Choices: {item['choices']}")
             print(
                 f"Generated: {generated}, Predicted: {predicted}, Gold: {gold}, Correct: {is_correct}"
             )
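
To make the behavior of the two patterns in the extract_answer() hunk easier to check, here is a standalone sketch that applies them to a few sample outputs. The regular expressions are copied from the diff above; `extract_letter` and the sample strings are hypothetical and only serve as an illustration.

```python
import re

# Patterns copied from the suggested diff.
_LEADING = re.compile(r"^\s*\(?([A-J])\)?(?:[.)\s]|$)", re.IGNORECASE)
_VERBOSE = re.compile(
    r"\b(?:answer|option|choice)\s*(?:is|:)?\s*\(?([A-J])\)?\b", re.IGNORECASE
)


def extract_letter(generated_text: str) -> str:
    """Return the answer letter if one is found, else the stripped text."""
    text = generated_text.strip()
    match = _LEADING.match(text)  # "A", "A.", "(A)", "A Answer"
    if match:
        return match.group(1).upper()
    match = _VERBOSE.search(text)  # "The answer is C", "Answer: C"
    if match:
        return match.group(1).upper()
    return text


for sample in ["A", "(b)", "c. Because", "The answer is C", "Answer: (D)"]:
    print(repr(sample), "->", extract_letter(sample))
```

One caveat worth testing against real model outputs: because "I" and "A" are themselves valid option letters in the A to J range, a reply such as "I think it is B" matches the leading-letter pattern first and yields "I".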

Contributor Author

@dvsav dvsav May 19, 2026

I've implemented your suggestion regarding the "vision" subset with slight corrections:

  • I've put generate_image_only_answer into vlm_eval_utils.py.
  • I didn't touch the get_item_mmmu function because I think choices = ex.get("options", []) may conceal errors.

Also, it looks like giving the model only an image is too challenging: it generates an image description rather than answering the question, leading to 0 correct answers. Do you think we should provide at least some textual context along with the image, something like "This image contains a picture and a multichoice question. Answer the question with a single letter."?

Contributor

I agree that a completely empty text prompt may make chat-style VLMs default to image captioning, which would not really measure the intended multiple-choice answering behavior.

I think adding a fixed, sample-agnostic instruction is acceptable.

For example, I would prefer something like:

Answer the multiple-choice question shown in the image. Return only one letter from A to J.
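
A minimal sketch of how this fixed instruction could be attached to the image-only chat message. The message layout mirrors the one in the suggested generate_image_only_answer() diff, with one added text entry; `VISION_INSTRUCTION` and `build_vision_messages` are hypothetical names, not taken from the merged code.

```python
VISION_INSTRUCTION = (
    "Answer the multiple-choice question shown in the image. "
    "Return only one letter from A to J."
)


def build_vision_messages(instruction: str = VISION_INSTRUCTION) -> list[dict]:
    """Build a single user turn: the image placeholder plus a fixed instruction.

    The instruction is sample-agnostic: it never contains the question text
    or the answer options, which stay embedded in the image.
    """
    return [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": instruction},
            ],
        }
    ]


messages = build_vision_messages()
print(messages[0]["content"][1]["text"])
```

The resulting list would then be passed to processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True), as in the diff earlier in this thread.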

Contributor Author

@dvsav dvsav May 19, 2026

👍 Done.
The accuracy has grown from 0 to 33%

| subject | correct | total | skipped | accuracy |
| ------- | ------- | ----- | ------- | -------- |
| vision  | 3       | 9     | 1       | 0.3333   |

n_shots = 0

eval_split = "validation" if dataset == "MMMU/MMMU" else "test"
few_shot_split = "test"
Contributor

Maybe?

Suggested change
- few_shot_split = "test"
+ few_shot_split = "dev"

Contributor Author

Well, "test" split is present in both MMMU and MMMU_Pro while "dev" is present in MMMU only...
If we need to choose a different few-shot split, we'll need to add some logic:

few_shot_split = "dev" if dataset == "MMMU/MMMU" else "test"

But do we really need that?

Contributor

@mhs4670go mhs4670go May 19, 2026

Ah, I thought that you applied forced zero shot in #714 (comment).

Is there any reason to use few-shot evaluation? The official homepage also says they use zero-shot evaluation. We might as well use zero-shot on both MMMU and MMMU-Pro.

Contributor Author

> Ah, I thought that you applied forced zero shot in #714 (comment).

Yes, but I reverted it after you suggested forcing zero-shot for the "vision" subset (I took that to mean that the other subsets could still use few-shot).

Anyway, we can always pass the --mmmu_n_shots=0 option, can't we?
I'd preserve this flexibility 😃

Contributor

I got your point! Thank you.

@dvsav dvsav requested a review from mhs4670go May 19, 2026 10:33
This change refactors mmmu_eval_utils.py to support MMMU-Pro benchmark in addition to MMMU.

TICO-DCO-1.0-Signed-off-by: d.savchenkov <d.savchenkov@partner.samsung.com>
Contributor

@mhs4670go mhs4670go left a comment

LGTM

Contributor

@Torrero Torrero left a comment

LGTM

@mhs4670go mhs4670go merged commit 6c0e023 into Samsung:main May 21, 2026
7 checks passed
3 participants