chore(eval): make the reproduction package self-contained and public-safe#90
Open
andylizf wants to merge 15 commits into
Open
chore(eval): make the reproduction package self-contained and public-safe#90andylizf wants to merge 15 commits into
andylizf wants to merge 15 commits into
Conversation
…safe Reproduction: - Consolidate to a single entry doc: rename eval/REPRODUCE.md -> eval/README.md (auto-renders on the eval/ dir) and update all references. - Document the three ways to supply tile images to the reader: self-hosted serve with materialized tiles, the public search API, and a self-hosted serve that renders tiles on demand from a kiwix ZIM. Make TILES_DIR optional so the reader can use serve-returned base64 tiles instead of a local corpus. - Remove the reader-side "local-wiki" rendering path entirely so all tile rendering happens serve-side: drop LocalWikiTiledScreenshotRetriever, the lookup_reference_url machinery, and the --local-wiki / --local-wiki-screenshot-dir / --lookup-reference-url flags (they relied on an out-of-tree module and hardcoded placeholder paths). - Run benchmarks on their full (filtered) sets; keep the 1000-example subsample only for nq and sqa, which is what the paper reports. Hygiene: - Read the Jina API key from JINA_API_KEY instead of a hardcoded default. - Update stale notes now that the indexes and tile corpus are published on HF. - Drop internal working-notes docs from the repo and scrub leftover references to old internal repo/module names. - Tidy .gitignore to use generic patterns and add a scoped eval/.gitignore.
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
… the reader Two pre-existing bugs that silently broke cells, surfaced by a from-scratch reproduction of NQ/base via the public API: - grader: NQ / NQ-Tables store gold answers under `gold_answers`, which `_golds_for` did not read, so every exact-match cell graded 0. Add the key. - reader: retrieved tiles were attached only when `os.path.exists(path)` is true, but in the public-API and on-demand-render modes the serve returns each tile inline as base64 (the "path" is the base64 string itself). Those tiles were silently dropped and the reader answered from parametric memory (effectively naive). Add a `_tile_image_b64` helper that accepts both a local file path and inline base64, and use it at all four tile-attachment sites (build_messages's three branches + `_encode_images_to_content`). Verified: NQ/base smoke (20 examples, public API, base64 tiles) now grades 10/20 = 50% with tiles reaching the reader, vs 0 before.
…he public-API path - Loosen eval/pyproject deps from exact `==` pins to `>=` floors. uv.lock is the reproducibility contract (`uv sync --frozen`); the blanket `==` pins were a freeze artifact, inconsistent with the rest of the workspace (which floors), and they conflicted with vLLM (numpy 2.4.6 vs vLLM's numpy<2.3). - Add an optional `reader` extra (`uv sync --extra reader`) that installs vLLM 0.19.0, so the Qwen3.5-4B reader can be self-hosted from this package. vLLM needs numpy<2.3, hence the cap (nothing in the base requires >=2.3). - README: document installing the reader via the extra; add the command to run a cell against the public API (reproduce.sh hardcodes localhost, so it can't do the public-API path); note that public :30001 serves the un-normed base index, which does not match the paper's base/lora cells (those use search_index_normed_v2).
The paper's published NQ/NQT numbers use the gpt-4.1 LLM judge (semantic match), not strict exact-match. The reader answers short but paraphrases the gold span, so strict exact-match scores ~20pp lower (≈ the naive number) even when the answer is correct — measured NQ base = 29% exact-match vs 53% LLM-judge (paper 57.9). - grader: add a --llm-judge flag; for nq/nq_tables/triviaqa it grades with the existing gpt-4.1 judge (ground truth = "Any of: <gold aliases>") instead of exact-match. The default stays strict exact-match (no API key needed). - reproduce.sh: pass --llm-judge for nq/nqt so the turnkey reproduction matches the paper. - README: note NQ/NQT paper numbers are LLM-judge; update the cell table + grader section.
…elf-hosting the search serve - The `serve` extra was missing torchvision, which transformers' Qwen3-VL processor imports — self-hosting the search serve failed until it was added by hand. Add it (cu129 via the existing tool.uv.sources, matching the embed extra). - eval/README: add a "self-hosting the search serve" note — the [serve] install, the ~217G index + ~220G RAM, where articles.json lives (the pixelrag-tiles dataset), and the tiles options (full corpus vs --render-on-demand from a kiwix ZIM).
Two real bugs that made the serve's --render-on-demand path (Mode 3 in the eval README) return empty tiles, surfaced by a from-scratch reproduction: - serve: the async /search handler called _ondemand_chunk_b64 synchronously, which runs render_url -> asyncio.run() inside the running event loop -> RuntimeError, swallowed, so image_base64 was always empty (the reader fell back to closed-book). Offload it with asyncio.to_thread so render_url's asyncio.run() runs in a thread with no running loop. - render: the cdp/fast_cdp backends force GPU rasterization, which needs GPU device access (the `render` group on lab machines: `sudo usermod -aG render $USER`). On a headless box without it the renderer crashes and CDP capture hangs forever. Add PIXELSHOT_DISABLE_GPU=1 to fall back to CPU rasterization; the default (GPU) path is unchanged.
…o GPU) Our render/serve machines are GPU-less (the serve runs --device cpu; nvidia-smi has no driver), and the stable production capture config is GPU-rasterization-incompatible anyway. Forcing `--enable-gpu-rasterization` by default makes Chrome crash / CDP capture hang on a box without real GPU device access. Flip the default to `--disable-gpu` (CPU rasterization — works on GPU-less and headless boxes, the common case). On a properly configured graphics-GPU render box, set PIXELSHOT_ENABLE_GPU=1 for ~2x throughput (GPU rasterization can also produce blank captures, so verify output there).
…es render fine) A previous commit flipped the cdp/fast_cdp default to --disable-gpu on the theory that --enable-gpu-rasterization crashes on GPU-less / headless machines. That was wrong: verified on a no-GPU box, the original flags render correctly and fast (Chrome falls back to software rasterization, no crash). The crash seen on one borrowed cluster box was specific to that host (patched headless_shell + datacenter GPU), not a general problem — not a reason to change the default. Restore cdp.py / fast_cdp.py to the original BROWSER_ARGS. The genuine Mode-3 bug (the serve's async handler calling render_url -> asyncio.run inside a running event loop) stays fixed via asyncio.to_thread in serve/api.py — that is unrelated to the GPU flags.
…ently go closed-book LocalAPIRetriever/TextAPIRetriever POST to the search serve with a hardcoded 600s timeout and batch_size 32. Against a slow serve (e.g. --render-on-demand, which renders a page per tile), a batch exceeds 600s, the request times out, the batch is cached EMPTY, and the reader answers closed-book — a silently invalid run (looks like a bad score, not an error). Make the timeout configurable via PIXELRAG_RETRIEVAL_TIMEOUT (default 600, unchanged for fast serves); document raising it for mode 3 in the README.
…'t wedge The on-demand renderer called render_url() in-process. render_url uses asyncio.run() + multiprocessing.Pool (fork); the SECOND in-process call deadlocks (fork-in-threaded-process), so a long-lived serve renders the first retrieved page fine and then wedges on the next request. Verified on a no-GPU box: same-process 2nd render hangs (SIGKILL after 40s), a fresh subprocess per render runs 3/3 clean (~1.85s each). Run each render in a subprocess (timeout via PIXELRAG_RENDER_TIMEOUT, default 120s). Pairs with the to_thread offload in api.py that keeps the blocking subprocess off the event loop.
…line python -c The pixelshot CLI already exposes --viewport-width / --tile-height (defaults 875/8192), so use the standard module entry (python -m pixelrag_render.render --output ... --viewport-width ... --tile-height ... --workers 1) instead of an inline python -c string. Same subprocess isolation that fixes the wedge, cleaner invocation. Verified: 2 consecutive renders 1.67s/1.57s, correct *.png.tiles output, no hang.
… that crashes some boxes) git shows --enable-gpu-rasterization/--force-gpu-rasterization have been in the cdp/fast_cdp backends since the initial release (82c5794, then carried through PR #38) — inherited on the assumption GPU rasterization speeds up capture. It doesn't: headless Chrome falls back to the software renderer and ignores the flags (verified on a no-GPU box: enable == disable == 1.7s; the bottleneck is capture IPC, not rasterization — see docs/screenshot-throughput-optimization.md). On a box that has a GPU device but no access (e.g. /dev/dri without the render group), Chrome tries the GPU, the GPU process crashes on init, and capture hangs. Default to --disable-gpu; opt in with PIXELSHOT_ENABLE_GPU=1 only on a real graphics-GPU box with device access. Default render verified working (1.62s/tile).
… harness Clarify that the documented t/s excludes Chrome startup (the bench timer starts after strategy.setup()). A hand-rolled end-to-end loop that counts the ~49s 48-worker startup reports ~13 t/s, which is the startup tax not the capture rate. Re-measured on the reference EPYC 7763 box with the harness: 130 t/s capture-only (200 maxi-ZIM pages, 48w, raw).
…l), ~3x faster On-demand render started a fresh Chrome per page; the cold start dominates (seconds locally, tens of seconds on a cold/NFS box — that's why a Mode-3 NQ run was ~2.3min/example, ~95% of it in rendering). Keep one headless Chrome alive in OnDemandTiles and render each page in a fresh tab via the pixelshot --cdp-url attach mode (PR #76). Still a separate subprocess per render (render_url's asyncio.run + multiprocessing.Pool would deadlock on a 2nd in-process call), but it attaches to the live browser instead of launching one. Verified on a no-GPU box: 6 consecutive renders 0.47s each (vs ~1.6s/page launching fresh) with no hang; Chrome is auto-restarted if a render fails, killed at exit.
…r-instant startup read_index reads the whole index into RAM — a multi-100G index over NFS takes ~1-2h to load. With PIXELRAG_INDEX_MMAP=1, faiss.read_index uses IO_FLAG_MMAP: startup is near-instant (no full read), inverted lists are paged in on demand at query time, and the OS page cache keeps hot lists resident. Ideal for a few-hundred-query eval that touches only a fraction of the index.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Make the
eval/reproduction package self-contained and safe to read as a public artifact.Reproduction
eval/REPRODUCE.md→eval/README.md(so it auto-renders onthe
eval/directory) and update all references.materialized tiles, the public search API, or a self-hosted serve that renders tiles on
demand from a kiwix ZIM.
TILES_DIRis now optional — the reader can consume theserve-returned base64 tiles instead of staging a local corpus.
path entirely: drops
LocalWikiTiledScreenshotRetriever, thelookup_reference_urlmachinery, and the
--local-wiki/--local-wiki-screenshot-dir/--lookup-reference-urlflags (they relied on an out-of-tree module and hardcoded placeholder paths).
for
nqandsqa, matching what the paper reports.Hygiene
JINA_API_KEYinstead of a hardcoded default.internal repo/module names.
.gitignoreto use generic patterns, plus a scopedeval/.gitignore.Verification
python -c "from lib import retrieval, retrievers, benchmarks, grader, simpleqa_data, simpleqa_filter"imports cleanlyrun_bench.py --helpworks;--local-api/--text-apiintactbash -n reproduce.shpasses