Fix #433: Prevent per-page hangs & avoid killing job on max backoff #438

akshan-main wants to merge 1 commit into allenai:main
Conversation
Thanks for this suggestion, let me think on it for a day or two. The reason the job exits now is that in these giant runs we do with hundreds of millions of documents, I found it easier to have the job die and show up as an obvious error right away, rather than having half-complete or empty files get generated when some consistent backend issue occurred. It happened to us that there were weird cluster issues where jobs worked fine, then produced empty or partially complete jsonl result files, then went back to working, and that wasn't fun. Can you explain more about the cases you ran into?
Hey, I get why you’d rather crash early in giant runs. In my case it wasn’t bad output; there was no output at all because of a hang. apost() waits on socket reads without a timeout, so if the server stalls mid-response, the coroutine blocks forever (there is no per-request deadline). With concurrency effectively at 1, it looks like it’s stuck on the last page, but it’s really just whichever page hit the wedged request first. That’s why I think the timeout is important. For the max backoff, I changed sys.exit(1) because there is already fallback handling, and I wanted a one-off failure to not kill the entire PDF. But let me know if it’s better to make that behavior opt-in (using a flag) or to add a threshold so repeated failures still stop the job loudly. I can align my solution based on that and create a PR for it as well.
Hey @jakep-allenai, just curious to hear your thoughts on this now.
I've thought about it, and I don't think I can make the timeout default on most paths because, for example, on our runs in a big cluster we might have 1600 concurrent requests fired off in parallel, and the server might respond to the last one only after a while. However, I see that it would be more important to have a timeout in the inference-provider external server case. And yes, sometimes servers do crash, but I imagine that would just close those sockets and return a ConnectionClosed error, which would then hit the exponential backoff case. How exactly did VLLM get wedged for you? Was it running via the subprocess inside the pipeline, or were you running an external VLLM? Any VLLM logs you can share?
Closes #433
Changes proposed in this pull request:
- `apost()` now takes a `timeout_s` param and wraps the entire network path in `asyncio.timeout()`, so a stalled server can't block forever
- On hitting max backoff, it returns `None` instead of calling `sys.exit(1)`; the existing fallback path (`make_fallback_result`) handles it from there, so the rest of the PDF still gets processed
- New `--request_timeout_s` CLI flag (default 120s) to control the per-request timeout