Problem
OpenKB currently has separate PDF parsing paths for short and long PDFs.
- Short PDFs are converted directly to Markdown with the local PDF parser.
- Long PDFs are routed to PageIndex for tree and summary generation.
- Long PDF wiki/sources/<doc>.json page content is populated through a separate path:
  - PageIndex Cloud get_page_content() when PAGEINDEX_API_KEY is set
  - local PyMuPDF fallback via convert_pdf_to_pages() otherwise
As a result, improvements to the short-PDF parsing path do not automatically improve long PDF page content. For example, the pluggable parser work in #81 can improve the file-to-Markdown path, but long PDF wiki/sources/<doc>.json remains on its own extraction path.
This is a gap for users ingesting research papers and technical PDFs. Upgrading the PDF extraction engine should improve both short PDF Markdown and long PDF per-page JSON, without requiring separate integrations for each path.
Goal
Introduce a shared PDF page extraction layer used by both:
- Short PDF Markdown conversion
- Long PDF wiki/sources/<doc>.json generation
PageIndex should continue to own long-document tree and summary generation. This issue is about sharing the page-content extraction layer, not replacing PageIndex.
Proposed Design
Add a PDF page extractor interface, for example:
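from dataclasses import dataclass
from pathlib import Path
from typing import Protocol

@dataclass
class PageContent:
    page: int          # page number within the PDF
    content: str       # extracted text or Markdown for this page
    images: list[dict] # per-image metadata for this page

class PdfExtractor(Protocol):
    def parse_pages(
        self,
        pdf_path: Path,
        doc_name: str,
        images_dir: Path,
    ) -> list[PageContent]:
        ...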
Then use the same extractor in both paths:
Short PDF:
    pages = pdf_extractor.parse_pages(...)
    markdown = pages_to_markdown(pages)
    write wiki/sources/<doc>.md
Long PDF:
    PageIndex still generates tree/summary
    pages = pdf_extractor.parse_pages(...)
    write wiki/sources/<doc>.json
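To make the two flows above concrete, here is a minimal sketch of one extractor feeding both outputs. It is illustrative only: StubExtractor is a hypothetical stand-in for a real backend, the body of pages_to_markdown is an assumed implementation, pages_to_json is a hypothetical helper name, and the JSON field names simply mirror PageContent rather than the actual wiki/sources/<doc>.json schema.

```python
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Protocol


@dataclass
class PageContent:
    page: int
    content: str
    images: list[dict] = field(default_factory=list)


class PdfExtractor(Protocol):
    def parse_pages(
        self, pdf_path: Path, doc_name: str, images_dir: Path
    ) -> list[PageContent]: ...


class StubExtractor:
    """Hypothetical backend returning canned pages; a real one would read pdf_path."""

    def parse_pages(self, pdf_path, doc_name, images_dir):
        return [
            PageContent(page=1, content="# Title\n\nIntro text."),
            PageContent(page=2, content="Second page text."),
        ]


def pages_to_markdown(pages: list[PageContent]) -> str:
    # Short-PDF path: join page bodies in page order into one Markdown document.
    ordered = sorted(pages, key=lambda p: p.page)
    return "\n\n".join(p.content.strip() for p in ordered) + "\n"


def pages_to_json(pages: list[PageContent]) -> str:
    # Long-PDF path: keep one record per page (field names are an assumption).
    return json.dumps([asdict(p) for p in pages], indent=2, ensure_ascii=False)


extractor: PdfExtractor = StubExtractor()
pages = extractor.parse_pages(Path("paper.pdf"), "paper", Path("images/paper"))
markdown = pages_to_markdown(pages)   # would be written to wiki/sources/<doc>.md
page_json = pages_to_json(pages)      # would be written to wiki/sources/<doc>.json
```

Because both outputs derive from the same list of PageContent objects, swapping StubExtractor for a better backend would change the Markdown and the per-page JSON together.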
Candidate Backends
Initial backend:
- local: current PyMuPDF behavior, preserving existing output
Future backends:
- mineru: adapt MinerU content_list.json / content_list_v2.json into OpenKB page JSON
- unlimited_ocr: OCR fallback for scanned or low-text pages
- other OCR/document parsers
Suggested Config
Possible config shape:
pdf_parser: local # local | mineru | ...
long_pdf_page_parser: same
or:
pdf_parser:
  provider: local
  apply_to_long_pdf_pages: true
The important behavior is that a configured PDF extraction engine can affect both short PDF Markdown and long PDF page JSON.
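Either config shape could be normalized into the same pair of decisions before any parsing happens. The following sketch assumes the config has already been loaded into a dict; resolve_parsers is a hypothetical helper, and the defaults (local, and long PDF pages following the short-PDF parser) are assumptions rather than settled behavior.

```python
from typing import Optional, Tuple

DEFAULT_PARSER = "local"  # assumed default, matching the initial backend


def resolve_parsers(config: dict) -> Tuple[str, Optional[str]]:
    """Return (short_pdf_parser, long_pdf_page_parser).

    None as the second value means "leave the existing long-PDF page path alone".
    """
    raw = config.get("pdf_parser", DEFAULT_PARSER)
    if isinstance(raw, dict):
        # Nested shape: provider + apply_to_long_pdf_pages
        provider = raw.get("provider", DEFAULT_PARSER)
        apply_long = raw.get("apply_to_long_pdf_pages", True)
        return provider, (provider if apply_long else None)
    # Flat shape: pdf_parser + long_pdf_page_parser, where "same" follows pdf_parser
    long_parser = config.get("long_pdf_page_parser", "same")
    return raw, (raw if long_parser == "same" else long_parser)


flat = resolve_parsers({"pdf_parser": "local", "long_pdf_page_parser": "same"})
nested = resolve_parsers(
    {"pdf_parser": {"provider": "mineru", "apply_to_long_pdf_pages": False}}
)
```

Normalizing early keeps the rest of the pipeline independent of which config shape is eventually chosen.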
Non-goals
- Do not replace PageIndex tree/summary generation in this issue.
- Do not change long PDF retrieval semantics.
- Do not require heavyweight parser dependencies by default.
- Do not make MinerU or OCR engines mandatory dependencies.
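Keeping heavyweight parsers optional could be handled with a small backend registry whose factories import their dependencies only when selected. This is a sketch under assumptions: register_backend, get_extractor, LocalExtractor, and the module name openkb_mineru_adapter are all hypothetical and do not refer to existing OpenKB or MinerU code.

```python
import importlib
from typing import Callable, Dict

# name -> zero-argument factory; each factory imports its own dependencies lazily
_BACKENDS: Dict[str, Callable[[], object]] = {}


def register_backend(name: str):
    def decorator(factory):
        _BACKENDS[name] = factory
        return factory
    return decorator


class LocalExtractor:
    """Placeholder for the existing PyMuPDF-based behavior."""


@register_backend("local")
def _make_local():
    return LocalExtractor()


@register_backend("mineru")
def _make_mineru():
    # Imported only when this backend is selected.
    # "openkb_mineru_adapter" is a hypothetical module name.
    module = importlib.import_module("openkb_mineru_adapter")
    return module.MineruExtractor()


def get_extractor(name: str = "local"):
    if name not in _BACKENDS:
        raise ValueError(f"unknown pdf_parser {name!r}; choose from {sorted(_BACKENDS)}")
    try:
        return _BACKENDS[name]()
    except ImportError as exc:
        # Surface a clear message instead of failing at import time for everyone.
        raise RuntimeError(
            f"pdf_parser {name!r} needs an optional dependency: {exc}"
        ) from exc
```

With this layout, a default install only ever touches the local factory, and a missing optional package is reported only to users who explicitly select that backend.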
Benefits
- Parser quality improvements propagate consistently across short and long PDFs.
- Users can benefit from future PDF parser upgrades faster, without separate long-PDF integration work.
- MinerU integration can improve long PDF page content without replacing PageIndex.
- The PDF extraction layer becomes easier to test and benchmark.
- Future OCR engines can be added in one place instead of patching separate code paths.
Related