Demo extracts populated table, self-hosted run returns "No document text" for same page #443

@luukschipperheyn

Description

Hi, I’m seeing a reproducible difference between the public demo and a self-hosted run on the same source page.

Input: table.pdf

table.pdf is an isolated single-page PDF containing page 11 from a source document.

Demo result

The public demo extracts a populated table:

<table>
  <tr>
    <th>Item No.<br>照合番号</th>
    <th>Title<br>部品名称</th>
    <th>No.off<br>数/台</th>
    <th>Remarks<br>摘要</th>
  </tr>
  <tr>
    <td>09001 - 1</td>
    <td>CRANK SHAFT クランクシフト</td>
    <td>1</td>
    <td></td>
  </tr>
  <tr>
    <td>09001 - 2</td>
    <td>GEAR * CRANK SHAFT ギヤ*クランクシフト</td>
    <td>1</td>
    <td></td>
  </tr>
  <tr>
    <td>09001 - 3</td>
    <td>BALANCE WEIGHT バランスウェイト</td>
    <td>12</td>
    <td></td>
  </tr>
  ...
</table>

Self-hosted result

Processing that same page on my self-hosted setup produces:

  • empty output JSONL
  • no markdown output
  • the pipeline log reports "No document text"

Relevant log excerpt:

2026-03-18 08:07:45,094 - __main__ - INFO - Worker 0 processing work item bcf50651623799f21f81f99e45f6df300dfbbc9b
2026-03-18 08:07:45,094 - __main__ - INFO - Created all tasks for bcf50651623799f21f81f99e45f6df300dfbbc9b
2026-03-18 08:07:45,612 - __main__ - INFO - No document text for /input/table.pdf
2026-03-18 08:07:45,612 - __main__ - INFO - Finished TaskGroup for worker on bcf50651623799f21f81f99e45f6df300dfbbc9b
2026-03-18 08:07:45,612 - __main__ - INFO - Got 0 docs for bcf50651623799f21f81f99e45f6df300dfbbc9b
2026-03-18 08:07:45,613 - __main__ - INFO - Writing 0 markdown files for bcf50651623799f21f81f99e45f6df300dfbbc9b

Final summary:

Completed pages: 1
Failed pages: 0
Finished input tokens: 0
Finished output tokens: 0
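Since the summary shows zero input tokens, I wanted to rule out a volume-mount or file-visibility problem before suspecting the pipeline itself. Here is the minimal stdlib check I can run inside the olmocr container (the helper name `looks_like_pdf` is just for illustration, not part of olmocr):

```python
# Illustrative diagnostic (not part of olmocr): confirm the input file is
# visible inside the container, non-empty, and starts with the PDF magic
# header before blaming the pipeline.
from pathlib import Path

def looks_like_pdf(path: str) -> bool:
    """True if `path` exists, is non-empty, and begins with the %PDF- magic."""
    p = Path(path)
    return p.is_file() and p.stat().st_size > 0 and p.read_bytes()[:5] == b"%PDF-"

# Inside the olmocr container:
# looks_like_pdf("/input/table.pdf")
```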

Environment

  • olmocr version: 0.4.25
  • model: allenai/olmOCR-2-7B-1025-FP8
  • Docker-based setup on AWS EC2 (g6e.xlarge)
  • The setup runs vLLM and olmOCR in separate Docker containers on the same network
  • vLLM image: vllm/vllm-openai:latest
  • olmocr image: alleninstituteforai/olmocr:latest-with-model

vLLM command:

vllm serve allenai/olmOCR-2-7B-1025-FP8 \
  --host 0.0.0.0 \
  --port 8000 \
  --served-model-name olmocr \
  --gpu-memory-utilization 0.9 \
  --max-model-len 16384

pipeline command:

python -m olmocr.pipeline /output \
  --server http://vllm:8000/v1 \
  --model olmocr \
  --pdfs "/input/table.pdf"
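One thing worth double-checking on setups like this is whether the olmocr container can actually reach the server and whether the advertised model id matches the `--model` flag. A small stdlib sketch (the helper name is mine; it assumes the OpenAI-style `/v1/models` response that vLLM serves):

```python
# Illustrative check: parse an OpenAI-style /v1/models response and list the
# model ids the server advertises. The pipeline's --model value ("olmocr",
# set via --served-model-name above) should appear in this list.
import json
from urllib.request import urlopen

def served_model_ids(models_json: str) -> list[str]:
    """Extract model ids from an OpenAI-compatible /v1/models JSON payload."""
    return [m["id"] for m in json.loads(models_json)["data"]]

# Against the live server, from inside the olmocr container:
# with urlopen("http://vllm:8000/v1/models") as resp:
#     print(served_model_ids(resp.read().decode()))
```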

Question

Is the public demo running the same model and pipeline path as the self-hosted setup above? If not, could you clarify what causes this discrepancy?
