Skip to content

chore(eval): make the reproduction package self-contained and public-safe#90

Open
andylizf wants to merge 15 commits into
mainfrom
chore/eval-repro-cleanup
Open

chore(eval): make the reproduction package self-contained and public-safe#90
andylizf wants to merge 15 commits into
mainfrom
chore/eval-repro-cleanup

Conversation

@andylizf

Copy link
Copy Markdown
Contributor

Summary

Make the eval/ reproduction package self-contained and safe to read as a public artifact.

Reproduction

  • Single entry doc: rename eval/REPRODUCE.mdeval/README.md (so it auto-renders on
    the eval/ directory) and update all references.
  • Three documented ways to supply tile images to the reader: a self-hosted serve with
    materialized tiles, the public search API, or a self-hosted serve that renders tiles on
    demand from a kiwix ZIM. TILES_DIR is now optional — the reader can consume the
    serve-returned base64 tiles instead of staging a local corpus.
  • All tile rendering is serve-side now. Removed the reader-side "local-wiki" rendering
    path entirely: drops LocalWikiTiledScreenshotRetriever, the lookup_reference_url
    machinery, and the --local-wiki / --local-wiki-screenshot-dir / --lookup-reference-url
    flags (they relied on an out-of-tree module and hardcoded placeholder paths).
  • Benchmarks run on their full (filtered) sets. The 1000-example subsample is kept only
    for nq and sqa, matching what the paper reports.

Hygiene

  • Read the Jina API key from JINA_API_KEY instead of a hardcoded default.
  • Update stale notes now that the FAISS indexes and the tile corpus are published on HF.
  • Drop internal working-notes docs from the repo and scrub leftover references to old
    internal repo/module names.
  • Tidy .gitignore to use generic patterns, plus a scoped eval/.gitignore.

Verification

  • python -c "from lib import retrieval, retrievers, benchmarks, grader, simpleqa_data, simpleqa_filter" imports cleanly
  • run_bench.py --help works; --local-api / --text-api intact
  • bash -n reproduce.sh passes
  • no residual references to the removed feature

…safe

Reproduction:
- Consolidate to a single entry doc: rename eval/REPRODUCE.md -> eval/README.md
  (auto-renders on the eval/ dir) and update all references.
- Document the three ways to supply tile images to the reader: self-hosted serve
  with materialized tiles, the public search API, and a self-hosted serve that
  renders tiles on demand from a kiwix ZIM. Make TILES_DIR optional so the reader
  can use serve-returned base64 tiles instead of a local corpus.
- Remove the reader-side "local-wiki" rendering path entirely so all tile
  rendering happens serve-side: drop LocalWikiTiledScreenshotRetriever, the
  lookup_reference_url machinery, and the --local-wiki / --local-wiki-screenshot-dir
  / --lookup-reference-url flags (they relied on an out-of-tree module and
  hardcoded placeholder paths).
- Run benchmarks on their full (filtered) sets; keep the 1000-example subsample
  only for nq and sqa, which is what the paper reports.

Hygiene:
- Read the Jina API key from JINA_API_KEY instead of a hardcoded default.
- Update stale notes now that the indexes and tile corpus are published on HF.
- Drop internal working-notes docs from the repo and scrub leftover references to
  old internal repo/module names.
- Tidy .gitignore to use generic patterns and add a scoped eval/.gitignore.
@vercel

vercel Bot commented Jun 24, 2026

Copy link
Copy Markdown

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
web Ready Ready Preview, Comment Jun 26, 2026 6:22pm

… the reader

Two pre-existing bugs that silently broke cells, surfaced by a from-scratch
reproduction of NQ/base via the public API:

- grader: NQ / NQ-Tables store gold answers under `gold_answers`, which
  `_golds_for` did not read, so every exact-match cell graded 0. Add the key.
- reader: retrieved tiles were attached only when `os.path.exists(path)` is true,
  but in the public-API and on-demand-render modes the serve returns each tile
  inline as base64 (the "path" is the base64 string itself). Those tiles were
  silently dropped and the reader answered from parametric memory (effectively
  naive). Add a `_tile_image_b64` helper that accepts both a local file path and
  inline base64, and use it at all four tile-attachment sites (build_messages's
  three branches + `_encode_images_to_content`).

Verified: NQ/base smoke (20 examples, public API, base64 tiles) now grades
10/20 = 50% with tiles reaching the reader, vs 0 before.
…he public-API path

- Loosen eval/pyproject deps from exact `==` pins to `>=` floors. uv.lock is the
  reproducibility contract (`uv sync --frozen`); the blanket `==` pins were a freeze
  artifact, inconsistent with the rest of the workspace (which floors), and they conflicted
  with vLLM (numpy 2.4.6 vs vLLM's numpy<2.3).
- Add an optional `reader` extra (`uv sync --extra reader`) that installs vLLM 0.19.0, so the
  Qwen3.5-4B reader can be self-hosted from this package. vLLM needs numpy<2.3, hence the cap
  (nothing in the base requires >=2.3).
- README: document installing the reader via the extra; add the command to run a cell against
  the public API (reproduce.sh hardcodes localhost, so it can't do the public-API path); note
  that public :30001 serves the un-normed base index, which does not match the paper's
  base/lora cells (those use search_index_normed_v2).
The paper's published NQ/NQT numbers use the gpt-4.1 LLM judge (semantic match), not strict
exact-match. The reader answers short but paraphrases the gold span, so strict exact-match
scores ~20pp lower (≈ the naive number) even when the answer is correct — measured NQ base
= 29% exact-match vs 53% LLM-judge (paper 57.9).

- grader: add a --llm-judge flag; for nq/nq_tables/triviaqa it grades with the existing
  gpt-4.1 judge (ground truth = "Any of: <gold aliases>") instead of exact-match. The
  default stays strict exact-match (no API key needed).
- reproduce.sh: pass --llm-judge for nq/nqt so the turnkey reproduction matches the paper.
- README: note NQ/NQT paper numbers are LLM-judge; update the cell table + grader section.
…elf-hosting the search serve

- The `serve` extra was missing torchvision, which transformers' Qwen3-VL processor imports —
  self-hosting the search serve failed until it was added by hand. Add it (cu129 via the
  existing tool.uv.sources, matching the embed extra).
- eval/README: add a "self-hosting the search serve" note — the [serve] install, the
  ~217G index + ~220G RAM, where articles.json lives (the pixelrag-tiles dataset), and the
  tiles options (full corpus vs --render-on-demand from a kiwix ZIM).
Two real bugs that made the serve's --render-on-demand path (Mode 3 in the eval README)
return empty tiles, surfaced by a from-scratch reproduction:

- serve: the async /search handler called _ondemand_chunk_b64 synchronously, which runs
  render_url -> asyncio.run() inside the running event loop -> RuntimeError, swallowed, so
  image_base64 was always empty (the reader fell back to closed-book). Offload it with
  asyncio.to_thread so render_url's asyncio.run() runs in a thread with no running loop.
- render: the cdp/fast_cdp backends force GPU rasterization, which needs GPU device access
  (the `render` group on lab machines: `sudo usermod -aG render $USER`). On a headless box
  without it the renderer crashes and CDP capture hangs forever. Add PIXELSHOT_DISABLE_GPU=1
  to fall back to CPU rasterization; the default (GPU) path is unchanged.
…o GPU)

Our render/serve machines are GPU-less (the serve runs --device cpu; nvidia-smi has no
driver), and the stable production capture config is GPU-rasterization-incompatible anyway.
Forcing `--enable-gpu-rasterization` by default makes Chrome crash / CDP capture hang on a
box without real GPU device access.

Flip the default to `--disable-gpu` (CPU rasterization — works on GPU-less and headless
boxes, the common case). On a properly configured graphics-GPU render box, set
PIXELSHOT_ENABLE_GPU=1 for ~2x throughput (GPU rasterization can also produce blank
captures, so verify output there).
andylizf added 2 commits June 26, 2026 05:39
…es render fine)

A previous commit flipped the cdp/fast_cdp default to --disable-gpu on the theory that
--enable-gpu-rasterization crashes on GPU-less / headless machines. That was wrong: verified
on a no-GPU box, the original flags render correctly and fast (Chrome falls back to software
rasterization, no crash). The crash seen on one borrowed cluster box was specific to that
host (patched headless_shell + datacenter GPU), not a general problem — not a reason to change
the default. Restore cdp.py / fast_cdp.py to the original BROWSER_ARGS.

The genuine Mode-3 bug (the serve's async handler calling render_url -> asyncio.run inside a
running event loop) stays fixed via asyncio.to_thread in serve/api.py — that is unrelated to
the GPU flags.
…ently go closed-book

LocalAPIRetriever/TextAPIRetriever POST to the search serve with a hardcoded 600s timeout and
batch_size 32. Against a slow serve (e.g. --render-on-demand, which renders a page per tile),
a batch exceeds 600s, the request times out, the batch is cached EMPTY, and the reader answers
closed-book — a silently invalid run (looks like a bad score, not an error). Make the timeout
configurable via PIXELRAG_RETRIEVAL_TIMEOUT (default 600, unchanged for fast serves); document
raising it for mode 3 in the README.
…'t wedge

The on-demand renderer called render_url() in-process. render_url uses asyncio.run() + multiprocessing.Pool (fork); the SECOND in-process call deadlocks (fork-in-threaded-process), so a long-lived serve renders the first retrieved page fine and then wedges on the next request. Verified on a no-GPU box: same-process 2nd render hangs (SIGKILL after 40s), a fresh subprocess per render runs 3/3 clean (~1.85s each). Run each render in a subprocess (timeout via PIXELRAG_RENDER_TIMEOUT, default 120s). Pairs with the to_thread offload in api.py that keeps the blocking subprocess off the event loop.
…line python -c

The pixelshot CLI already exposes --viewport-width / --tile-height (defaults 875/8192), so use the standard module entry (python -m pixelrag_render.render --output ... --viewport-width ... --tile-height ... --workers 1) instead of an inline python -c string. Same subprocess isolation that fixes the wedge, cleaner invocation. Verified: 2 consecutive renders 1.67s/1.57s, correct *.png.tiles output, no hang.
… that crashes some boxes)

git shows --enable-gpu-rasterization/--force-gpu-rasterization have been in the cdp/fast_cdp backends since the initial release (82c5794, then carried through PR #38) — inherited on the assumption GPU rasterization speeds up capture. It doesn't: headless Chrome falls back to the software renderer and ignores the flags (verified on a no-GPU box: enable == disable == 1.7s; the bottleneck is capture IPC, not rasterization — see docs/screenshot-throughput-optimization.md). On a box that has a GPU device but no access (e.g. /dev/dri without the render group), Chrome tries the GPU, the GPU process crashes on init, and capture hangs. Default to --disable-gpu; opt in with PIXELSHOT_ENABLE_GPU=1 only on a real graphics-GPU box with device access. Default render verified working (1.62s/tile).
… harness

Clarify that the documented t/s excludes Chrome startup (the bench timer starts after strategy.setup()). A hand-rolled end-to-end loop that counts the ~49s 48-worker startup reports ~13 t/s, which is the startup tax not the capture rate. Re-measured on the reference EPYC 7763 box with the harness: 130 t/s capture-only (200 maxi-ZIM pages, 48w, raw).
…l), ~3x faster

On-demand render started a fresh Chrome per page; the cold start dominates (seconds locally, tens of seconds on a cold/NFS box — that's why a Mode-3 NQ run was ~2.3min/example, ~95% of it in rendering). Keep one headless Chrome alive in OnDemandTiles and render each page in a fresh tab via the pixelshot --cdp-url attach mode (PR #76). Still a separate subprocess per render (render_url's asyncio.run + multiprocessing.Pool would deadlock on a 2nd in-process call), but it attaches to the live browser instead of launching one. Verified on a no-GPU box: 6 consecutive renders 0.47s each (vs ~1.6s/page launching fresh) with no hang; Chrome is auto-restarted if a render fails, killed at exit.
…r-instant startup

read_index reads the whole index into RAM — a multi-100G index over NFS takes ~1-2h to load. With PIXELRAG_INDEX_MMAP=1, faiss.read_index uses IO_FLAG_MMAP: startup is near-instant (no full read), inverted lists are paged in on demand at query time, and the OS page cache keeps hot lists resident. Ideal for a few-hundred-query eval that touches only a fraction of the index.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant