chore(eval): make the reproduction package self-contained and public-safe by andylizf · Pull Request #90 · StarTrail-org/PixelRAG

andylizf · 2026-06-24T05:18:40Z

Summary

Make the eval/ reproduction package self-contained and safe to read as a public artifact.

Reproduction

Single entry doc: rename eval/REPRODUCE.md → eval/README.md (so it auto-renders on
the eval/ directory) and update all references.
Three documented ways to supply tile images to the reader: a self-hosted serve with
materialized tiles, the public search API, or a self-hosted serve that renders tiles on
demand from a kiwix ZIM. TILES_DIR is now optional — the reader can consume the
serve-returned base64 tiles instead of staging a local corpus.
All tile rendering is serve-side now. Removed the reader-side "local-wiki" rendering
path entirely: drops LocalWikiTiledScreenshotRetriever, the lookup_reference_url
machinery, and the --local-wiki / --local-wiki-screenshot-dir / --lookup-reference-url
flags (they relied on an out-of-tree module and hardcoded placeholder paths).
Benchmarks run on their full (filtered) sets. The 1000-example subsample is kept only
for nq and sqa, matching what the paper reports.

Hygiene

Read the Jina API key from JINA_API_KEY instead of a hardcoded default.
Update stale notes now that the FAISS indexes and the tile corpus are published on HF.
Drop internal working-notes docs from the repo and scrub leftover references to old
internal repo/module names.
Tidy .gitignore to use generic patterns, plus a scoped eval/.gitignore.

Verification

python -c "from lib import retrieval, retrievers, benchmarks, grader, simpleqa_data, simpleqa_filter" imports cleanly
run_bench.py --help works; --local-api / --text-api intact
bash -n reproduce.sh passes
no residual references to the removed feature

…safe Reproduction: - Consolidate to a single entry doc: rename eval/REPRODUCE.md -> eval/README.md (auto-renders on the eval/ dir) and update all references. - Document the three ways to supply tile images to the reader: self-hosted serve with materialized tiles, the public search API, and a self-hosted serve that renders tiles on demand from a kiwix ZIM. Make TILES_DIR optional so the reader can use serve-returned base64 tiles instead of a local corpus. - Remove the reader-side "local-wiki" rendering path entirely so all tile rendering happens serve-side: drop LocalWikiTiledScreenshotRetriever, the lookup_reference_url machinery, and the --local-wiki / --local-wiki-screenshot-dir / --lookup-reference-url flags (they relied on an out-of-tree module and hardcoded placeholder paths). - Run benchmarks on their full (filtered) sets; keep the 1000-example subsample only for nq and sqa, which is what the paper reports. Hygiene: - Read the Jina API key from JINA_API_KEY instead of a hardcoded default. - Update stale notes now that the indexes and tile corpus are published on HF. - Drop internal working-notes docs from the repo and scrub leftover references to old internal repo/module names. - Tidy .gitignore to use generic patterns and add a scoped eval/.gitignore.

vercel · 2026-06-24T05:18:46Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
web	Ready	Preview, Comment	Jun 26, 2026 6:22pm

… the reader Two pre-existing bugs that silently broke cells, surfaced by a from-scratch reproduction of NQ/base via the public API: - grader: NQ / NQ-Tables store gold answers under `gold_answers`, which `_golds_for` did not read, so every exact-match cell graded 0. Add the key. - reader: retrieved tiles were attached only when `os.path.exists(path)` is true, but in the public-API and on-demand-render modes the serve returns each tile inline as base64 (the "path" is the base64 string itself). Those tiles were silently dropped and the reader answered from parametric memory (effectively naive). Add a `_tile_image_b64` helper that accepts both a local file path and inline base64, and use it at all four tile-attachment sites (build_messages's three branches + `_encode_images_to_content`). Verified: NQ/base smoke (20 examples, public API, base64 tiles) now grades 10/20 = 50% with tiles reaching the reader, vs 0 before.

…he public-API path - Loosen eval/pyproject deps from exact `==` pins to `>=` floors. uv.lock is the reproducibility contract (`uv sync --frozen`); the blanket `==` pins were a freeze artifact, inconsistent with the rest of the workspace (which floors), and they conflicted with vLLM (numpy 2.4.6 vs vLLM's numpy<2.3). - Add an optional `reader` extra (`uv sync --extra reader`) that installs vLLM 0.19.0, so the Qwen3.5-4B reader can be self-hosted from this package. vLLM needs numpy<2.3, hence the cap (nothing in the base requires >=2.3). - README: document installing the reader via the extra; add the command to run a cell against the public API (reproduce.sh hardcodes localhost, so it can't do the public-API path); note that public :30001 serves the un-normed base index, which does not match the paper's base/lora cells (those use search_index_normed_v2).

The paper's published NQ/NQT numbers use the gpt-4.1 LLM judge (semantic match), not strict exact-match. The reader answers short but paraphrases the gold span, so strict exact-match scores ~20pp lower (≈ the naive number) even when the answer is correct — measured NQ base = 29% exact-match vs 53% LLM-judge (paper 57.9). - grader: add a --llm-judge flag; for nq/nq_tables/triviaqa it grades with the existing gpt-4.1 judge (ground truth = "Any of: <gold aliases>") instead of exact-match. The default stays strict exact-match (no API key needed). - reproduce.sh: pass --llm-judge for nq/nqt so the turnkey reproduction matches the paper. - README: note NQ/NQT paper numbers are LLM-judge; update the cell table + grader section.

…elf-hosting the search serve - The `serve` extra was missing torchvision, which transformers' Qwen3-VL processor imports — self-hosting the search serve failed until it was added by hand. Add it (cu129 via the existing tool.uv.sources, matching the embed extra). - eval/README: add a "self-hosting the search serve" note — the [serve] install, the ~217G index + ~220G RAM, where articles.json lives (the pixelrag-tiles dataset), and the tiles options (full corpus vs --render-on-demand from a kiwix ZIM).

Two real bugs that made the serve's --render-on-demand path (Mode 3 in the eval README) return empty tiles, surfaced by a from-scratch reproduction: - serve: the async /search handler called _ondemand_chunk_b64 synchronously, which runs render_url -> asyncio.run() inside the running event loop -> RuntimeError, swallowed, so image_base64 was always empty (the reader fell back to closed-book). Offload it with asyncio.to_thread so render_url's asyncio.run() runs in a thread with no running loop. - render: the cdp/fast_cdp backends force GPU rasterization, which needs GPU device access (the `render` group on lab machines: `sudo usermod -aG render $USER`). On a headless box without it the renderer crashes and CDP capture hangs forever. Add PIXELSHOT_DISABLE_GPU=1 to fall back to CPU rasterization; the default (GPU) path is unchanged.

…o GPU) Our render/serve machines are GPU-less (the serve runs --device cpu; nvidia-smi has no driver), and the stable production capture config is GPU-rasterization-incompatible anyway. Forcing `--enable-gpu-rasterization` by default makes Chrome crash / CDP capture hang on a box without real GPU device access. Flip the default to `--disable-gpu` (CPU rasterization — works on GPU-less and headless boxes, the common case). On a properly configured graphics-GPU render box, set PIXELSHOT_ENABLE_GPU=1 for ~2x throughput (GPU rasterization can also produce blank captures, so verify output there).

…es render fine) A previous commit flipped the cdp/fast_cdp default to --disable-gpu on the theory that --enable-gpu-rasterization crashes on GPU-less / headless machines. That was wrong: verified on a no-GPU box, the original flags render correctly and fast (Chrome falls back to software rasterization, no crash). The crash seen on one borrowed cluster box was specific to that host (patched headless_shell + datacenter GPU), not a general problem — not a reason to change the default. Restore cdp.py / fast_cdp.py to the original BROWSER_ARGS. The genuine Mode-3 bug (the serve's async handler calling render_url -> asyncio.run inside a running event loop) stays fixed via asyncio.to_thread in serve/api.py — that is unrelated to the GPU flags.

…ently go closed-book LocalAPIRetriever/TextAPIRetriever POST to the search serve with a hardcoded 600s timeout and batch_size 32. Against a slow serve (e.g. --render-on-demand, which renders a page per tile), a batch exceeds 600s, the request times out, the batch is cached EMPTY, and the reader answers closed-book — a silently invalid run (looks like a bad score, not an error). Make the timeout configurable via PIXELRAG_RETRIEVAL_TIMEOUT (default 600, unchanged for fast serves); document raising it for mode 3 in the README.

…'t wedge The on-demand renderer called render_url() in-process. render_url uses asyncio.run() + multiprocessing.Pool (fork); the SECOND in-process call deadlocks (fork-in-threaded-process), so a long-lived serve renders the first retrieved page fine and then wedges on the next request. Verified on a no-GPU box: same-process 2nd render hangs (SIGKILL after 40s), a fresh subprocess per render runs 3/3 clean (~1.85s each). Run each render in a subprocess (timeout via PIXELRAG_RENDER_TIMEOUT, default 120s). Pairs with the to_thread offload in api.py that keeps the blocking subprocess off the event loop.

…line python -c The pixelshot CLI already exposes --viewport-width / --tile-height (defaults 875/8192), so use the standard module entry (python -m pixelrag_render.render --output ... --viewport-width ... --tile-height ... --workers 1) instead of an inline python -c string. Same subprocess isolation that fixes the wedge, cleaner invocation. Verified: 2 consecutive renders 1.67s/1.57s, correct *.png.tiles output, no hang.

… that crashes some boxes) git shows --enable-gpu-rasterization/--force-gpu-rasterization have been in the cdp/fast_cdp backends since the initial release (82c5794, then carried through PR #38) — inherited on the assumption GPU rasterization speeds up capture. It doesn't: headless Chrome falls back to the software renderer and ignores the flags (verified on a no-GPU box: enable == disable == 1.7s; the bottleneck is capture IPC, not rasterization — see docs/screenshot-throughput-optimization.md). On a box that has a GPU device but no access (e.g. /dev/dri without the render group), Chrome tries the GPU, the GPU process crashes on init, and capture hangs. Default to --disable-gpu; opt in with PIXELSHOT_ENABLE_GPU=1 only on a real graphics-GPU box with device access. Default render verified working (1.62s/tile).

… harness Clarify that the documented t/s excludes Chrome startup (the bench timer starts after strategy.setup()). A hand-rolled end-to-end loop that counts the ~49s 48-worker startup reports ~13 t/s, which is the startup tax not the capture rate. Re-measured on the reference EPYC 7763 box with the harness: 130 t/s capture-only (200 maxi-ZIM pages, 48w, raw).

…l), ~3x faster On-demand render started a fresh Chrome per page; the cold start dominates (seconds locally, tens of seconds on a cold/NFS box — that's why a Mode-3 NQ run was ~2.3min/example, ~95% of it in rendering). Keep one headless Chrome alive in OnDemandTiles and render each page in a fresh tab via the pixelshot --cdp-url attach mode (PR #76). Still a separate subprocess per render (render_url's asyncio.run + multiprocessing.Pool would deadlock on a 2nd in-process call), but it attaches to the live browser instead of launching one. Verified on a no-GPU box: 6 consecutive renders 0.47s each (vs ~1.6s/page launching fresh) with no hang; Chrome is auto-restarted if a render fails, killed at exit.

…r-instant startup read_index reads the whole index into RAM — a multi-100G index over NFS takes ~1-2h to load. With PIXELRAG_INDEX_MMAP=1, faiss.read_index uses IO_FLAG_MMAP: startup is near-instant (no full read), inverted lists are paged in on demand at query time, and the OS page cache keeps hot lists resident. Ideal for a few-hundred-query eval that touches only a fraction of the index.

vercel Bot deployed to Preview June 24, 2026 05:19 View deployment

vercel Bot deployed to Preview June 24, 2026 08:11 View deployment

vercel Bot deployed to Preview June 24, 2026 09:01 View deployment

vercel Bot deployed to Preview June 26, 2026 03:46 View deployment

vercel Bot deployed to Preview June 26, 2026 03:53 View deployment

vercel Bot deployed to Preview June 26, 2026 10:36 View deployment

vercel Bot deployed to Preview June 26, 2026 11:01 View deployment

andylizf added 2 commits June 26, 2026 05:39

vercel Bot deployed to Preview June 26, 2026 12:42 View deployment

vercel Bot deployed to Preview June 26, 2026 15:31 View deployment

vercel Bot deployed to Preview June 26, 2026 15:50 View deployment

vercel Bot deployed to Preview June 26, 2026 15:57 View deployment

vercel Bot deployed to Preview June 26, 2026 17:04 View deployment

vercel Bot deployed to Preview June 26, 2026 17:42 View deployment

vercel Bot deployed to Preview June 26, 2026 18:22 View deployment

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

chore(eval): make the reproduction package self-contained and public-safe#90

chore(eval): make the reproduction package self-contained and public-safe#90
andylizf wants to merge 15 commits into
mainfrom
chore/eval-repro-cleanup

andylizf commented Jun 24, 2026

Uh oh!

vercel Bot commented Jun 24, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

andylizf commented Jun 24, 2026

Summary

Reproduction

Hygiene

Verification

Uh oh!

vercel Bot commented Jun 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

vercel Bot commented Jun 24, 2026 •

edited

Loading