layout-1: 15 ground-truth entries are .mp4 files, not images (NIMA can't score) #6

@mohitgargai

Summary

layout-1 (IntentToLayoutGeneration, "Generate a flattened layout image from intent text") has 15 samples whose ground_truth.image points to an .mp4 video file instead of a .png image. The primary metric for layout-1 is nima_score, which runs a PyTorch image-aesthetics model on a file loaded via PIL.Image.open(...). PIL cannot decode video files and raises UnidentifiedImageError, so these samples cannot be scored at all.

The benchmark's own evaluate() produces nan for every image metric on these samples when given an empty ModelOutput, and any oracle or agent run that calls Image.open(...) on the ground truth crashes outright.

Scope

  • Directory: benchmarks/layout/layout2-intention-to-layout-generation/images/
  • Counts: 85 .png (valid) + 15 .mp4 (broken) = 100 samples total
  • Affected samples (source IDs + filenames):
source_id      file
layout-1:1     04zQ50DpwzhXfznJC0fK.mp4
layout-1:9     0av452xBKWsVWrPWZVM5.mp4
layout-1:13    0ixLHb8kLtuAVQavdRT6.mp4
layout-1:19    0wIPzODxoCDEcrkAmHyf.mp4
layout-1:27    1Okni6tFj315PiOVADBx.mp4
layout-1:29    1YrP2nlMDasJFMoFLUKS.mp4
layout-1:35    1kLkygM7pgrGcfMkvfjV.mp4
layout-1:50    2ALsFATNayguZJNbRMkA.mp4
layout-1:53    2GNqKi6AAlq3pRufiANh.mp4
layout-1:74    3f0qiFLXCUS8M72ySAO8.mp4
layout-1:75    3gIgFKADrYr3v2uN0n5W.mp4
layout-1:80    3pX1YupN1ulLamAZtVhg.mp4
layout-1:87    44QT524vSyhpljWrNEWr.mp4
layout-1:88    44ZMntKp3FqnLNiFoxdV.mp4
layout-1:91    4QXfdSFrS50zUtllJKos.mp4
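
For anyone who wants to re-check the counts above, here is a minimal sketch that tallies files in the images directory by extension. The helper names (count_by_extension, non_png_files) are made up for illustration and are not part of the benchmark.

```python
from collections import Counter
from pathlib import Path


def count_by_extension(image_dir):
    """Return a Counter mapping lowercase file suffix to number of files."""
    return Counter(
        p.suffix.lower() for p in Path(image_dir).iterdir() if p.is_file()
    )


def non_png_files(image_dir):
    """Return the sorted names of files whose suffix is not .png."""
    return sorted(
        p.name
        for p in Path(image_dir).iterdir()
        if p.is_file() and p.suffix.lower() != ".png"
    )
```

Calling count_by_extension("benchmarks/layout/layout2-intention-to-layout-generation/images") from the repo root should, if the counts in this issue are right, report 85 for ".png" and 15 for ".mp4".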

Reproduction

from PIL import Image
Image.open("benchmarks/layout/layout2-intention-to-layout-generation/images/04zQ50DpwzhXfznJC0fK.mp4")
# -> PIL.UnidentifiedImageError: cannot identify image file '...mp4'

A full-source oracle sweep across layout-1's 100 samples hits this error on exactly these 15 and passes on the other 85.
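
If Pillow is not available, a rough stdlib-only alternative is to look at the first bytes of each file instead of calling Image.open. This is a swapped-in technique, not what the oracle sweep does: it checks for the PNG signature and for the 'ftyp' box type that MP4-family files normally carry at byte offset 4. The function names are hypothetical.

```python
from pathlib import Path

# Fixed 8-byte signature that starts every PNG file.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def sniff_kind(path):
    """Classify a file as 'png', 'mp4-like', or 'unknown' from its first bytes."""
    with open(path, "rb") as fh:
        head = fh.read(12)
    if head.startswith(PNG_SIGNATURE):
        return "png"
    # ISO base media files (MP4, MOV, ...) usually start with a box whose
    # four-character type 'ftyp' sits at byte offset 4.
    if head[4:8] == b"ftyp":
        return "mp4-like"
    return "unknown"


def find_non_png(image_dir):
    """Map file name -> sniffed kind for every file that is not a PNG."""
    result = {}
    for p in sorted(Path(image_dir).iterdir()):
        if not p.is_file():
            continue
        kind = sniff_kind(p)
        if kind != "png":
            result[p.name] = kind
    return result
```

This only inspects headers, so it can flag a mislabeled file but cannot confirm that a PNG decodes cleanly.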

Suggested fix (pick one)

  1. Replace each .mp4 with the intended .png — ideal if the original flat layout renders are available. Zero downstream breakage.
  2. Drop the 15 entries from the layout-1 sample manifest — simplest; total goes 100 → 85. Any consumer caching by source_id would need to re-sync.
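
Option 2 could look roughly like the sketch below. It assumes each manifest entry is a dict with a "source_id" key and a nested "ground_truth" dict holding an "image" path; that shape is a guess based on the ground_truth.image field mentioned above, not the benchmark's actual manifest format, and the function name is hypothetical.

```python
from pathlib import PurePosixPath


def split_by_ground_truth_type(samples, allowed_suffixes=(".png",)):
    """Split samples into (kept, dropped) by the ground-truth file suffix.

    Assumes each sample looks like:
        {"source_id": "...", "ground_truth": {"image": "path/to/file"}}
    """
    kept, dropped = [], []
    for sample in samples:
        suffix = PurePosixPath(sample["ground_truth"]["image"]).suffix.lower()
        if suffix in allowed_suffixes:
            kept.append(sample)
        else:
            dropped.append(sample)
    return kept, dropped
```

Returning the dropped entries alongside the kept ones makes it easy to log which source_ids disappeared, which matters for consumers that cache by source_id.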

Impact

  • Oracle verification can't reach 100% on full layout-1 in any harness (upstream or adapter).
  • Surfaced while verifying the Harbor adapter (harbor-framework/harbor#1433): the current README documents a 99.96% pass rate on the 33,786-task full source because of these 15.
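
As a sanity check on the 99.96% figure, the arithmetic works out if these 15 samples are the only failures in the 33,786-task source (which is what the README number implies):

```python
total_tasks = 33_786   # full-source task count quoted above
failing = 15           # the .mp4 ground-truth samples in layout-1

# Fraction of tasks that pass, assuming nothing else fails.
pass_rate = (total_tasks - failing) / total_tasks
print(f"{pass_rate:.2%}")  # 99.96%
```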
