[Feature] Share PDF page extraction across short PDFs and long PDF page JSON #135

Description

@gwokhou

Problem

OpenKB currently has separate PDF parsing paths for short and long PDFs.

  • Short PDFs are converted directly to Markdown with the local PDF parser.
  • Long PDFs are routed to PageIndex for tree and summary generation.
  • Long PDF wiki/sources/<doc>.json page content is populated through a separate path:
    • PageIndex Cloud get_page_content() when PAGEINDEX_API_KEY is set
    • local PyMuPDF fallback via convert_pdf_to_pages() otherwise

As a result, improvements to the short-PDF parsing path do not automatically improve long PDF page content. For example, the pluggable parser work in #81 can improve the file-to-Markdown path, but long PDF wiki/sources/<doc>.json remains on its own extraction path.

This is a gap for users ingesting research papers and technical PDFs. Upgrading the PDF extraction engine should improve both short PDF Markdown and long PDF per-page JSON, without requiring separate integrations for each path.

Goal

Introduce a shared PDF page extraction layer used by both:

  1. Short PDF Markdown conversion
  2. Long PDF wiki/sources/<doc>.json generation

PageIndex should continue to own long-document tree and summary generation. This issue is about sharing the page-content extraction layer, not replacing PageIndex.

Proposed Design

Add a PDF page extractor interface, for example:

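from dataclasses import dataclass
from pathlib import Path
from typing import Protocol

@dataclass
class PageContent:
    page: int           # page number within the PDF
    content: str        # extracted content for this page
    images: list[dict]  # one dict per image extracted from this page

class PdfExtractor(Protocol):
    def parse_pages(
        self,
        pdf_path: Path,
        doc_name: str,
        images_dir: Path,
    ) -> list[PageContent]:
        ...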

Then use the same extractor in both paths:

Short PDF:
  pages = pdf_extractor.parse_pages(...)
  markdown = pages_to_markdown(pages)
  write wiki/sources/<doc>.md

Long PDF:
  PageIndex still generates tree/summary
  pages = pdf_extractor.parse_pages(...)
  write wiki/sources/<doc>.json
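
To make the two flows above concrete, here is a minimal sketch of how both outputs could be derived from one `parse_pages()` call. It is an illustration under assumptions, not the existing OpenKB code: `pages_to_json`, `ingest_pdf`, the `is_long` flag, the `images/<doc>` directory layout, and the JSON shape (a plain array of page objects) are all hypothetical, and `pages_to_markdown` is only one possible way to join pages. PageIndex tree and summary generation is deliberately left out.

```python
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Protocol


@dataclass
class PageContent:
    page: int
    content: str
    images: list[dict] = field(default_factory=list)


class PdfExtractor(Protocol):
    def parse_pages(
        self, pdf_path: Path, doc_name: str, images_dir: Path
    ) -> list[PageContent]: ...


def pages_to_markdown(pages: list[PageContent]) -> str:
    """Join non-empty page text in page order, one blank line between pages."""
    ordered = sorted(pages, key=lambda p: p.page)
    bodies = [p.content.strip() for p in ordered if p.content.strip()]
    return "\n\n".join(bodies) + "\n"


def pages_to_json(pages: list[PageContent]) -> str:
    """Serialize pages as a JSON array of page objects (assumed schema)."""
    ordered = sorted(pages, key=lambda p: p.page)
    return json.dumps([asdict(p) for p in ordered], ensure_ascii=False, indent=2)


def ingest_pdf(
    extractor: PdfExtractor,
    pdf_path: Path,
    doc_name: str,
    sources_dir: Path,
    is_long: bool,
) -> Path:
    """Run the shared extractor once, then write the format the path needs."""
    images_dir = sources_dir / "images" / doc_name  # hypothetical layout
    pages = extractor.parse_pages(pdf_path, doc_name, images_dir)
    sources_dir.mkdir(parents=True, exist_ok=True)
    if is_long:
        # Long PDF: per-page JSON; tree/summary generation stays with PageIndex.
        out = sources_dir / f"{doc_name}.json"
        out.write_text(pages_to_json(pages), encoding="utf-8")
    else:
        # Short PDF: a single Markdown file.
        out = sources_dir / f"{doc_name}.md"
        out.write_text(pages_to_markdown(pages), encoding="utf-8")
    return out
```

Because the extractor is the only PDF-aware piece, swapping it changes both outputs at once, and a fake extractor that returns canned pages is enough to test the writers.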

Candidate Backends

Initial backend:

  • local: current PyMuPDF behavior, preserving existing output

Future backends:

  • mineru: adapt MinerU content_list.json / content_list_v2.json into OpenKB page JSON
  • unlimited_ocr: OCR fallback for scanned or low-text pages
  • other OCR/document parsers

Suggested Config

Possible config shape:

pdf_parser: local        # local | mineru | ...
long_pdf_page_parser: same

or:

pdf_parser:
  provider: local
  apply_to_long_pdf_pages: true

The important behavior is that a configured PDF extraction engine can affect both short PDF Markdown and long PDF page JSON.
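
Either config shape can be normalized to the same pair of decisions: which provider handles short PDFs, and which one (if any) handles long PDF page JSON. The following is a rough sketch assuming the config has already been loaded into a dict; `resolve_pdf_parsers` is a hypothetical helper, and the defaults (`local`, and applying the parser to long PDFs unless disabled) are assumptions rather than decided behavior.

```python
from typing import Optional


def resolve_pdf_parsers(config: dict) -> tuple[str, Optional[str]]:
    """Return (short_pdf_provider, long_pdf_page_provider).

    A long-PDF provider of None means "keep the existing long-PDF page path".
    """
    raw = config.get("pdf_parser", "local")  # default is an assumption
    if isinstance(raw, dict):
        # Nested shape: pdf_parser: {provider: ..., apply_to_long_pdf_pages: ...}
        provider = raw.get("provider", "local")
        apply_to_long = raw.get("apply_to_long_pdf_pages", True)
        return provider, (provider if apply_to_long else None)
    # Flat shape: pdf_parser: <name>, long_pdf_page_parser: same | <name>
    long_raw = config.get("long_pdf_page_parser", "same")
    return raw, (raw if long_raw == "same" else long_raw)
```

For example, `{"pdf_parser": "mineru", "long_pdf_page_parser": "same"}` and `{"pdf_parser": {"provider": "mineru", "apply_to_long_pdf_pages": True}}` both resolve to `("mineru", "mineru")`, so the rest of the pipeline does not need to know which shape the user wrote.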

Non-goals

  • Do not replace PageIndex tree/summary generation in this issue.
  • Do not change long PDF retrieval semantics.
  • Do not require heavyweight parser dependencies by default.
  • Do not make MinerU or OCR engines mandatory dependencies.

Benefits

  • Parser quality improvements propagate consistently across short and long PDFs.
  • Users can benefit from future PDF parser upgrades faster, without separate long-PDF integration work.
  • MinerU integration can improve long PDF page content without replacing PageIndex.
  • The PDF extraction layer becomes easier to test and benchmark.
  • Future OCR engines can be added in one place instead of patching separate code paths.
