Demo extracts populated table, self-hosted run returns "No document text" for same page #443

@luukschipperheyn

Description

Hi, I’m seeing a reproducible difference between the public demo and a self-hosted run on the same source page.

Input: table.pdf

table.pdf is an isolated single-page PDF containing page 11 from a source document.

Demo result

The public demo extracts a populated table:

<table>
  <tr>
    <th>Item No.<br>照合番号</th>
    <th>Title<br>部品名称</th>
    <th>No.off<br>数/台</th>
    <th>Remarks<br>摘要</th>
  </tr>
  <tr>
    <td>09001 - 1</td>
    <td>CRANK SHAFT クランクシフト</td>
    <td>1</td>
    <td></td>
  </tr>
  <tr>
    <td>09001 - 2</td>
    <td>GEAR * CRANK SHAFT ギヤ*クランクシフト</td>
    <td>1</td>
    <td></td>
  </tr>
  <tr>
    <td>09001 - 3</td>
    <td>BALANCE WEIGHT バランスウェイト</td>
    <td>12</td>
    <td></td>
  </tr>
  ...
</table>

Self-hosted result

Processing that same page on my self-hosted setup produces:

  • empty output JSONL
  • no markdown output
  • the pipeline log reports "No document text"

Relevant log excerpt:

2026-03-18 08:07:45,094 - __main__ - INFO - Worker 0 processing work item bcf50651623799f21f81f99e45f6df300dfbbc9b
2026-03-18 08:07:45,094 - __main__ - INFO - Created all tasks for bcf50651623799f21f81f99e45f6df300dfbbc9b
2026-03-18 08:07:45,612 - __main__ - INFO - No document text for /input/table.pdf
2026-03-18 08:07:45,612 - __main__ - INFO - Finished TaskGroup for worker on bcf50651623799f21f81f99e45f6df300dfbbc9b
2026-03-18 08:07:45,612 - __main__ - INFO - Got 0 docs for bcf50651623799f21f81f99e45f6df300dfbbc9b
2026-03-18 08:07:45,613 - __main__ - INFO - Writing 0 markdown files for bcf50651623799f21f81f99e45f6df300dfbbc9b

Final summary:

Completed pages: 1
Failed pages: 0
Finished input tokens: 0
Finished output tokens: 0
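Since the summary shows zero input tokens, I wanted to rule out a volume-mount or file-visibility problem before suspecting the pipeline itself. Here is the minimal stdlib check I can run inside the olmocr container (the helper name `looks_like_pdf` is just for illustration, not part of olmocr):

```python
# Illustrative diagnostic (not part of olmocr): confirm the input file is
# visible inside the container, non-empty, and starts with the PDF magic
# header before blaming the pipeline.
from pathlib import Path

def looks_like_pdf(path: str) -> bool:
    """True if `path` exists, is non-empty, and begins with the %PDF- magic."""
    p = Path(path)
    return p.is_file() and p.stat().st_size > 0 and p.read_bytes()[:5] == b"%PDF-"

# Inside the olmocr container:
# looks_like_pdf("/input/table.pdf")
```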

Environment

  • olmocr version: 0.4.25
  • model: allenai/olmOCR-2-7B-1025-FP8
  • Docker-based setup on AWS EC2 (g6e.xlarge)
  • The setup runs vLLM and olmOCR in separate Docker containers on the same network
  • vLLM image: vllm/vllm-openai:latest
  • olmocr image: alleninstituteforai/olmocr:latest-with-model

vLLM command:

vllm serve allenai/olmOCR-2-7B-1025-FP8 \
  --host 0.0.0.0 \
  --port 8000 \
  --served-model-name olmocr \
  --gpu-memory-utilization 0.9 \
  --max-model-len 16384

pipeline command:

python -m olmocr.pipeline /output \
  --server http://vllm:8000/v1 \
  --model olmocr \
  --pdfs "/input/table.pdf"
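One thing worth double-checking on setups like this is whether the olmocr container can actually reach the server and whether the advertised model id matches the `--model` flag. A small stdlib sketch (the helper name is mine; it assumes the OpenAI-style `/v1/models` response that vLLM serves):

```python
# Illustrative check: parse an OpenAI-style /v1/models response and list the
# model ids the server advertises. The pipeline's --model value ("olmocr",
# set via --served-model-name above) should appear in this list.
import json
from urllib.request import urlopen

def served_model_ids(models_json: str) -> list[str]:
    """Extract model ids from an OpenAI-compatible /v1/models JSON payload."""
    return [m["id"] for m in json.loads(models_json)["data"]]

# Against the live server, from inside the olmocr container:
# with urlopen("http://vllm:8000/v1/models") as resp:
#     print(served_model_ids(resp.read().decode()))
```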

Question

Is the public demo running the same model and pipeline path as the self-hosted setup above? If not, could you clarify what causes this discrepancy?
