feat: Share PDF page extraction across PDF ingest paths by gwokhou · Pull Request #136 · VectifyAI/OpenKB

gwokhou · 2026-06-23T13:56:27Z

Summary

Refs #135.

This PR introduces a shared PDF page extraction contract and routes PDF ingest paths through it where doing so preserves existing behavior:

- adds PageContent, PdfExtractor, LocalPdfExtractor, pages_to_markdown, and pages_to_json
- routes short PDF Markdown conversion through the configured shared extractor
- routes long PDF local fallback page JSON through the same extractor
- preserves PageIndex Cloud get_page_content() as the first choice when PAGEINDEX_API_KEY is set, then falls back to the configured extractor
- updates tests around PDF extraction, short PDF conversion, long PDF source JSON, and mocked async compile coroutines
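
For readers without the diff open, here is a minimal sketch of how a contract like this could fit together. Only the names PageContent, PdfExtractor, pages_to_markdown, and pages_to_json come from this PR; the field names, the extract_pages method, the JSON keys, and the FakeExtractor stand-in are assumptions made for illustration and may not match the actual code.

```python
import json
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class PageContent:
    # Field names are assumed for illustration, not taken from the PR.
    page: int  # 1-based page number
    text: str  # extracted text for that page


class PdfExtractor(Protocol):
    # Hypothetical method name; the real contract may differ.
    def extract_pages(self, pdf_path: str) -> list[PageContent]:
        ...


class FakeExtractor:
    """Hypothetical stand-in for a real extractor; returns canned pages."""

    def __init__(self, texts: list[str]) -> None:
        self._texts = texts

    def extract_pages(self, pdf_path: str) -> list[PageContent]:
        # Number pages from 1 in the order the texts were given.
        return [PageContent(page=i, text=t) for i, t in enumerate(self._texts, start=1)]


def pages_to_markdown(pages: list[PageContent]) -> str:
    # Short-PDF path: join non-empty pages into one Markdown document.
    return "\n\n".join(p.text.strip() for p in pages if p.text.strip())


def pages_to_json(pages: list[PageContent]) -> str:
    # Long-PDF path: one JSON object per page (key names are assumed).
    return json.dumps(
        [{"page": p.page, "content": p.text} for p in pages],
        ensure_ascii=False,
    )
```

The point of the shape is that both output formats are derived from the same list of pages, so swapping the extractor changes short PDF Markdown and long PDF page JSON together.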

Reviewer Note

Issue #135 asks for a configured PDF extraction engine to affect both short PDF Markdown and long PDF page JSON. This PR preserves the existing PageIndex Cloud page-content path when PAGEINDEX_API_KEY is set because that path provides cloud OCR behavior for scanned or complex PDFs.

That means pdf_parser controls short PDFs and long-PDF local fallback JSON, but does not override cloud-mode get_page_content() output. Please confirm whether preserving cloud OCR priority is the desired compatibility behavior, or whether #135 should require pdf_parser to control cloud-mode long PDF page JSON too.
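
To make the question concrete, the priority described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: long_pdf_pages, the callable parameters, and the page dict shape are hypothetical, and treating an exception or an empty result as the trigger for fallback is also an assumption.

```python
import os
from typing import Callable, Mapping

Pages = list[dict]


def long_pdf_pages(
    pdf_path: str,
    cloud_get_page_content: Callable[[str], Pages],
    local_extract: Callable[[str], Pages],
    env: Mapping[str, str] = os.environ,
) -> Pages:
    """Prefer cloud page content when an API key is set, else use the local extractor."""
    if env.get("PAGEINDEX_API_KEY"):
        try:
            pages = cloud_get_page_content(pdf_path)
            if pages:
                return pages
        except Exception:
            # Assumed fallback trigger: any cloud failure drops to the local path.
            pass
    # Reached when no key is set, or the cloud call failed or returned nothing.
    return local_extract(pdf_path)
```

Under this shape, pdf_parser only influences the result through local_extract, which is why it cannot change cloud-mode output.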

Validation

UV_CACHE_DIR=/tmp/uv-cache UV_PYTHON=3.13 uv run --extra dev pytest tests/test_pdf_extractor.py tests/test_converter.py tests/test_indexer.py tests/test_config.py tests/test_add_command.py tests/test_remove.py tests/test_url_ingest.py

Result: 161 passed, 6 warnings in 1.35s.

gwokhou added 5 commits June 23, 2026 21:49

feat(pdf): add shared extraction contract

4be55f7

feat(pdf): route short PDFs through shared extractor

b472a41

feat(pdf): route long PDF page JSON through shared extractor

9ab571c

test(cli): close mocked async compile coroutines

8930a88

Merge branch 'VectifyAI:main' into pr/pdf-shared-extractor

5aeb496

gwokhou marked this pull request as ready for review June 23, 2026 13:58

gwokhou changed the title ~~Share PDF page extraction across PDF ingest paths~~ feat: Share PDF page extraction across PDF ingest paths Jun 23, 2026

gwokhou wants to merge 5 commits into VectifyAI:main from gwokhou:pr/pdf-shared-extractor
