[Feature] Share PDF page extraction across short PDFs and long PDF page JSON #135

## Problem

OpenKB currently has separate PDF parsing paths for short and long PDFs.

- Short PDFs are converted directly to Markdown with the local PDF parser.
- Long PDFs are routed to PageIndex for tree and summary generation.
- Long PDF `wiki/sources/<doc>.json` page content is populated through a separate path:
  - PageIndex Cloud `get_page_content()` when `PAGEINDEX_API_KEY` is set
  - local PyMuPDF fallback via `convert_pdf_to_pages()` otherwise

As a result, improvements to the short-PDF parsing path do not automatically improve long PDF page content. For example, the pluggable parser work in #81 can improve the file-to-Markdown path, but long PDF `wiki/sources/<doc>.json` remains on its own extraction path.

This is a gap for users ingesting research papers and technical PDFs. Upgrading the PDF extraction engine should improve both short PDF Markdown and long PDF per-page JSON, without requiring separate integrations for each path.

## Goal

Introduce a shared PDF page extraction layer used by both:

1. Short PDF Markdown conversion
2. Long PDF `wiki/sources/<doc>.json` generation

PageIndex should continue to own long-document tree and summary generation. This issue is about sharing the page-content extraction layer, not replacing PageIndex.

## Proposed Design

Add a PDF page extractor interface, for example:

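```python
from dataclasses import dataclass
from pathlib import Path
from typing import Protocol


@dataclass
class PageContent:
    page: int           # page number within the PDF
    content: str        # extracted content for that page
    images: list[dict]  # metadata for images extracted from the page


class PdfExtractor(Protocol):
    def parse_pages(
        self,
        pdf_path: Path,
        doc_name: str,
        images_dir: Path,
    ) -> list[PageContent]:
        """Extract per-page content from the PDF at pdf_path."""
        ...
```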

Then use the same extractor in both paths:

```text
Short PDF:
  pages = pdf_extractor.parse_pages(...)
  markdown = pages_to_markdown(pages)
  write wiki/sources/<doc>.md

Long PDF:
  PageIndex still generates tree/summary
  pages = pdf_extractor.parse_pages(...)
  write wiki/sources/<doc>.json
```
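The flow above could be sketched in Python roughly as follows. This is a minimal illustration, not the OpenKB implementation: `FakeExtractor`, `pages_to_json`, `ingest`, the `is_long` flag, the images directory layout, and the exact JSON shape are all assumptions made for the example. Only `PageContent`, `PdfExtractor`, `parse_pages`, and `pages_to_markdown` are names taken from this proposal.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Protocol


@dataclass
class PageContent:
    page: int
    content: str
    images: list[dict] = field(default_factory=list)


class PdfExtractor(Protocol):
    def parse_pages(
        self, pdf_path: Path, doc_name: str, images_dir: Path
    ) -> list[PageContent]: ...


class FakeExtractor:
    """Hypothetical stand-in backend that returns canned pages."""

    def parse_pages(self, pdf_path, doc_name, images_dir):
        return [
            PageContent(page=1, content="Intro text"),
            PageContent(page=2, content="Method text"),
        ]


def pages_to_markdown(pages: list[PageContent]) -> str:
    # Join page bodies with blank lines; a real converter may add page markers.
    return "\n\n".join(p.content.strip() for p in pages) + "\n"


def pages_to_json(pages: list[PageContent]) -> str:
    # Assumed schema: one {"page", "content", "images"} object per page.
    return json.dumps([asdict(p) for p in pages], ensure_ascii=False, indent=2)


def ingest(
    extractor: PdfExtractor, pdf_path: Path, sources_dir: Path, is_long: bool
) -> Path:
    """Run the shared extractor once, then write the path-specific output."""
    doc_name = pdf_path.stem
    # The images directory layout is an assumption for this sketch.
    pages = extractor.parse_pages(pdf_path, doc_name, sources_dir / "images" / doc_name)
    sources_dir.mkdir(parents=True, exist_ok=True)
    if is_long:
        # PageIndex tree/summary generation would still run separately.
        target = sources_dir / f"{doc_name}.json"
        target.write_text(pages_to_json(pages), encoding="utf-8")
    else:
        target = sources_dir / f"{doc_name}.md"
        target.write_text(pages_to_markdown(pages), encoding="utf-8")
    return target
```

Because both branches consume the same `pages` list, swapping `FakeExtractor` for a better backend changes the Markdown and the page JSON together.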

## Candidate Backends

Initial backend:

- `local`: current PyMuPDF behavior, preserving existing output

Future backends:

- `mineru`: adapt MinerU `content_list.json` / `content_list_v2.json` into OpenKB page JSON
- `unlimited_ocr`: OCR fallback for scanned or low-text pages
- other OCR/document parsers
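To give a sense of what a `mineru` adapter might involve, here is a minimal sketch that groups a flat content list into per-page records. It assumes each entry carries a zero-based `page_idx`, a `type`, and either `text` or `img_path`; those field names should be checked against the MinerU version in use, and `content_list_v2.json` is not covered. The function name `content_list_to_pages` and the `{"path": ...}` image shape are hypothetical.

```python
from collections import defaultdict


def content_list_to_pages(items: list[dict]) -> list[dict]:
    """Group flat content-list entries into one record per page."""
    text_by_page = defaultdict(list)
    images_by_page = defaultdict(list)

    for item in items:
        # Assumption: page_idx is zero-based, while page JSON is one-based.
        page = item.get("page_idx", 0) + 1
        if item.get("type") == "image" and item.get("img_path"):
            images_by_page[page].append({"path": item["img_path"]})
        elif item.get("text"):
            text_by_page[page].append(item["text"].strip())

    page_numbers = sorted(set(text_by_page) | set(images_by_page))
    return [
        {
            "page": number,
            "content": "\n\n".join(text_by_page[number]),
            "images": images_by_page[number],
        }
        for number in page_numbers
    ]
```

Pages with no text and no images are simply absent from the result in this sketch; a real adapter would likely need to emit empty records for them so page numbering stays aligned with the PDF.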

## Suggested Config

Possible config shape:

```yaml
pdf_parser: local        # local | mineru | ...
long_pdf_page_parser: same
```

or:

```yaml
pdf_parser:
  provider: local
  apply_to_long_pdf_pages: true
```

Either way, the key requirement is that the configured PDF extraction engine applies to both short PDF Markdown and long PDF page JSON.
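Both config shapes could be normalized into the same pair of settings. The sketch below works on an already-parsed config dict; `resolve_pdf_parsers`, the `local` default, treating a missing `long_pdf_page_parser` as `same`, and returning `None` to mean "keep the existing long-PDF page path" are all assumptions for illustration, not decided behavior.

```python
from typing import Optional, Tuple

DEFAULT_PROVIDER = "local"  # assumed default, matching the initial backend


def resolve_pdf_parsers(config: dict) -> Tuple[str, Optional[str]]:
    """Return (short_pdf_provider, long_pdf_page_provider).

    A long-PDF provider of None means the existing long-PDF page
    path is left untouched (an assumption of this sketch).
    """
    raw = config.get("pdf_parser", DEFAULT_PROVIDER)

    if isinstance(raw, dict):
        # Nested shape: pdf_parser: {provider, apply_to_long_pdf_pages}
        short = raw.get("provider", DEFAULT_PROVIDER)
        long_pages = short if raw.get("apply_to_long_pdf_pages", True) else None
        return short, long_pages

    # Flat shape: pdf_parser plus long_pdf_page_parser ("same" or a provider)
    short = raw
    long_raw = config.get("long_pdf_page_parser", "same")
    long_pages = short if long_raw == "same" else long_raw
    return short, long_pages
```

For example, `resolve_pdf_parsers({"pdf_parser": "mineru", "long_pdf_page_parser": "same"})` yields the same provider for both paths, which is the behavior this issue asks for.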

## Non-goals

- Do not replace PageIndex tree/summary generation in this issue.
- Do not change long PDF retrieval semantics.
- Do not require heavyweight parser dependencies by default.
- Do not make MinerU or OCR engines mandatory dependencies.
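One way to keep heavyweight parsers optional is a small registry that imports each backend only when it is selected. The following is a sketch under assumptions: `get_extractor`, `LocalExtractor`, the factory functions, and the module name `hypothetical_mineru_backend` are invented for the example and do not exist in OpenKB.

```python
import importlib
from typing import Callable, Dict


class LocalExtractor:
    """Placeholder for the existing PyMuPDF-based behavior."""

    def parse_pages(self, pdf_path, doc_name, images_dir):
        return []


def _load_local():
    return LocalExtractor()


def _load_mineru():
    # Imported lazily so the dependency is only needed when selected.
    # The module and class names here are hypothetical.
    module = importlib.import_module("hypothetical_mineru_backend")
    return module.MinerUExtractor()


_FACTORIES: Dict[str, Callable[[], object]] = {
    "local": _load_local,
    "mineru": _load_mineru,
}


def get_extractor(provider: str):
    """Build the extractor for a provider name, failing with a clear message."""
    try:
        factory = _FACTORIES[provider]
    except KeyError:
        known = ", ".join(sorted(_FACTORIES))
        raise ValueError(f"Unknown PDF parser '{provider}'; expected one of: {known}") from None
    try:
        return factory()
    except ImportError as exc:
        raise RuntimeError(
            f"PDF parser '{provider}' needs an optional dependency that is not installed"
        ) from exc
```

With this shape, a default install only ever touches the `local` factory, and selecting an uninstalled backend produces an actionable error instead of an import failure at startup.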

## Benefits

- Parser quality improvements propagate consistently across short and long PDFs.
- Users get future PDF parser upgrades sooner, with no separate long-PDF integration work.
- MinerU integration can improve long PDF page content without replacing PageIndex.
- The PDF extraction layer becomes easier to test and benchmark.
- Future OCR engines can be added in one place instead of patching separate code paths.
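As an illustration of adding OCR in one place, a wrapper extractor could fall back to a second backend for pages where the primary one returns little or no text. This is a sketch only: `FallbackExtractor`, `StubExtractor`, and the `min_chars` threshold are hypothetical, and the fallback here re-parses the whole document rather than individual pages.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class PageContent:
    page: int
    content: str
    images: list = field(default_factory=list)


class StubExtractor:
    """Hypothetical backend that returns fixed pages and counts its calls."""

    def __init__(self, pages):
        self.pages = pages
        self.calls = 0

    def parse_pages(self, pdf_path, doc_name, images_dir):
        self.calls += 1
        return list(self.pages)


class FallbackExtractor:
    """Use `fallback` only for pages where `primary` yields too little text."""

    def __init__(self, primary, fallback, min_chars: int = 20):
        self.primary = primary
        self.fallback = fallback
        self.min_chars = min_chars

    def _is_sparse(self, page: PageContent) -> bool:
        return len(page.content.strip()) < self.min_chars

    def parse_pages(self, pdf_path: Path, doc_name: str, images_dir: Path):
        pages = self.primary.parse_pages(pdf_path, doc_name, images_dir)
        if not any(self._is_sparse(p) for p in pages):
            return pages  # no low-text pages, so skip the costly fallback

        # Simplification: run the fallback once for the whole document,
        # then pick its result only for the sparse pages.
        recovered = {
            p.page: p
            for p in self.fallback.parse_pages(pdf_path, doc_name, images_dir)
        }
        return [
            recovered.get(p.page, p) if self._is_sparse(p) else p
            for p in pages
        ]
```

Since the wrapper itself exposes `parse_pages`, both the short-PDF and long-PDF paths would pick up the OCR fallback without any path-specific changes.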

## Related

- #77 requested better PDF parsing, especially for research papers.
- #81 adds pluggable document parsers for the file-to-Markdown path, but explicitly leaves long PDFs/PageIndex untouched.
