Summary
layout-1 (IntentToLayoutGeneration — "Generate a flattened layout image from intent text") has 15 samples whose ground_truth.image points to an .mp4 video file instead of a .png image. layout-1's primary metric is nima_score, which runs a PyTorch image-aesthetics model on the file via PIL.Image.open(...) — PIL rejects video files (UnidentifiedImageError), so these samples can't be scored at all.
The benchmark's own evaluate() produces nan for every image metric on these samples when given an empty ModelOutput, and any oracle / agent run that tries to Image.open(...) the ground truth crashes outright.
Scope
- Directory: benchmarks/layout/layout2-intention-to-layout-generation/images/
- Counts: 85 .png (valid) + 15 .mp4 (broken) = 100 samples total
- Affected samples (source IDs + filenames):
| source_id | file |
|---|---|
| layout-1:1 | 04zQ50DpwzhXfznJC0fK.mp4 |
| layout-1:9 | 0av452xBKWsVWrPWZVM5.mp4 |
| layout-1:13 | 0ixLHb8kLtuAVQavdRT6.mp4 |
| layout-1:19 | 0wIPzODxoCDEcrkAmHyf.mp4 |
| layout-1:27 | 1Okni6tFj315PiOVADBx.mp4 |
| layout-1:29 | 1YrP2nlMDasJFMoFLUKS.mp4 |
| layout-1:35 | 1kLkygM7pgrGcfMkvfjV.mp4 |
| layout-1:50 | 2ALsFATNayguZJNbRMkA.mp4 |
| layout-1:53 | 2GNqKi6AAlq3pRufiANh.mp4 |
| layout-1:74 | 3f0qiFLXCUS8M72ySAO8.mp4 |
| layout-1:75 | 3gIgFKADrYr3v2uN0n5W.mp4 |
| layout-1:80 | 3pX1YupN1ulLamAZtVhg.mp4 |
| layout-1:87 | 44QT524vSyhpljWrNEWr.mp4 |
| layout-1:88 | 44ZMntKp3FqnLNiFoxdV.mp4 |
| layout-1:91 | 4QXfdSFrS50zUtllJKos.mp4 |
Reproduction
from PIL import Image
Image.open("benchmarks/layout/layout2-intention-to-layout-generation/images/04zQ50DpwzhXfznJC0fK.mp4")
# -> PIL.UnidentifiedImageError: cannot identify image file '...mp4'
Full-source oracle sweep across layout-1's 100 samples hits this on exactly these 15, passes on the other 85.
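A dependency-free cross-check of the sweep is to look at file signatures rather than extensions. The sketch below assumes the images sit in one flat directory. `find_non_png` is a hypothetical helper name, and it only compares the 8-byte PNG signature, so it is a triage aid rather than a substitute for the PIL-based oracle run.

```python
from pathlib import Path

# The fixed 8-byte signature that starts every PNG file.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def find_non_png(image_dir):
    """Return sorted names of files in image_dir that lack the PNG signature.

    Hypothetical helper for illustration; not part of the benchmark.
    """
    bad = []
    for path in sorted(Path(image_dir).iterdir()):
        if not path.is_file():
            continue
        with path.open("rb") as fh:
            header = fh.read(len(PNG_SIGNATURE))
        if header != PNG_SIGNATURE:
            bad.append(path.name)
    return bad
```

If the report above holds, calling `find_non_png("benchmarks/layout/layout2-intention-to-layout-generation/images/")` should list the 15 .mp4 filenames and nothing else.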
Suggested fix (pick one)
- Replace each .mp4 with the intended .png. Ideal if the original flat layout renders are available; zero downstream breakage.
- Drop the 15 entries from the layout-1 sample manifest. Simplest option; the total goes 100 → 85, and any consumer caching by source_id would need to re-sync.
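If the second option is chosen, the filtering step could look roughly like the sketch below. The manifest layout here is an assumption: each sample is taken to be a dict with a `source_id` and a nested `ground_truth["image"]` path, which may not match the benchmark's actual schema. `drop_non_png_samples` is a hypothetical helper.

```python
from pathlib import PurePosixPath


def drop_non_png_samples(samples):
    """Split samples into (kept, dropped_ids) by ground-truth image suffix.

    Assumed (hypothetical) sample shape:
        {"source_id": "layout-1:1", "ground_truth": {"image": "images/x.mp4"}}
    """
    kept, dropped_ids = [], []
    for sample in samples:
        suffix = PurePosixPath(sample["ground_truth"]["image"]).suffix.lower()
        if suffix == ".png":
            kept.append(sample)
        else:
            # Record the ID so downstream caches keyed by source_id can re-sync.
            dropped_ids.append(sample["source_id"])
    return kept, dropped_ids
```

Returning the dropped IDs alongside the kept samples gives consumers that cache by source_id an explicit list to invalidate.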
Impact
- Oracle verification can't reach 100% on full layout-1 in any harness (upstream or adapter).
- Surfaced while verifying the Harbor adapter (harbor-framework/harbor#1433) — current README documents 99.96% pass on the 33,786-task full source due to these 15.
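The 99.96% figure is consistent with 15 failures out of 33,786 tasks, as this quick check shows (`pass_rate` is only an illustrative helper):

```python
def pass_rate(total, failed):
    """Percentage of tasks that pass, given the total and the failure count."""
    return 100.0 * (total - failed) / total


# 15 unscoreable samples out of the 33,786-task full source.
print(f"{pass_rate(33_786, 15):.2f}%")  # -> 99.96%
```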