feat: Share PDF page extraction across PDF ingest paths #136

Open
gwokhou wants to merge 5 commits into VectifyAI:main from gwokhou:pr/pdf-shared-extractor

gwokhou (Contributor) commented Jun 23, 2026

Summary

Refs #135.

This PR introduces a shared PDF page extraction contract and routes PDF ingest paths through it where doing so preserves existing behavior:

  • adds PageContent, PdfExtractor, LocalPdfExtractor, pages_to_markdown, and pages_to_json
  • routes short PDF Markdown conversion through the configured shared extractor
  • routes long PDF local fallback page JSON through the same extractor
  • preserves PageIndex Cloud get_page_content() as the first choice when PAGEINDEX_API_KEY is set, then falls back to the configured extractor
  • updates tests around PDF extraction, short PDF conversion, long PDF source JSON, and mocked async compile coroutines
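A minimal sketch of how such a shared contract might fit together follows. The type and function names come from the list above, but the fields of `PageContent`, the `extract_pages` method name, the output shapes, and the `StubExtractor` stand-in are all assumptions made for illustration, not the PR's actual code.

```python
import json
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class PageContent:
    """One extracted page. Field names are assumed, not taken from the PR."""
    page: int  # 1-based page number
    text: str


class PdfExtractor(Protocol):
    """Anything that can turn a PDF path into ordered pages (assumed method name)."""

    def extract_pages(self, pdf_path: str) -> list[PageContent]: ...


class StubExtractor:
    """Hypothetical stand-in for a real extractor such as LocalPdfExtractor."""

    def __init__(self, texts: list[str]) -> None:
        self._texts = texts

    def extract_pages(self, pdf_path: str) -> list[PageContent]:
        # Ignores pdf_path and returns canned pages so the sketch runs without a PDF.
        return [PageContent(page=i, text=t) for i, t in enumerate(self._texts, start=1)]


def pages_to_markdown(pages: list[PageContent]) -> str:
    """Short-PDF path: join non-empty page texts with blank lines (assumed format)."""
    return "\n\n".join(p.text.strip() for p in pages if p.text.strip())


def pages_to_json(pages: list[PageContent]) -> str:
    """Long-PDF path: serialize per-page records (assumed record shape)."""
    return json.dumps([{"page": p.page, "content": p.text} for p in pages], ensure_ascii=False)


# Both ingest paths consume the same extractor output.
extractor: PdfExtractor = StubExtractor(["First page.", "Second page."])
pages = extractor.extract_pages("example.pdf")
short_md = pages_to_markdown(pages)
long_json = pages_to_json(pages)
```

The point of the design is that swapping the extractor changes both outputs at once, since neither path parses the PDF on its own.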

Reviewer Note

Issue #135 asks for a configured PDF extraction engine to affect both short PDF Markdown and long PDF page JSON. This PR preserves the existing PageIndex Cloud page-content path when PAGEINDEX_API_KEY is set because that path provides cloud OCR behavior for scanned or complex PDFs.

That means pdf_parser controls short PDFs and long-PDF local fallback JSON, but does not override cloud-mode get_page_content() output. Please confirm whether preserving cloud OCR priority is the desired compatibility behavior, or whether #135 should require pdf_parser to control cloud-mode long PDF page JSON too.
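The priority order in question can be sketched roughly as below. Only `PAGEINDEX_API_KEY` and the cloud-first ordering come from this description; the function `long_pdf_pages`, its callable parameters, and the choice to fall back when the cloud call raises or returns nothing are hypothetical and may differ from the real fallback conditions.

```python
import os
from typing import Callable, Mapping, Optional

Pages = list[dict]


def long_pdf_pages(
    pdf_path: str,
    cloud_fetch: Callable[[str], Pages],    # hypothetical wrapper around cloud get_page_content()
    local_extract: Callable[[str], Pages],  # hypothetical wrapper around the configured extractor
    env: Optional[Mapping[str, str]] = None,
) -> tuple[str, Pages]:
    """Return (source, pages), preferring the cloud path when an API key is set."""
    env = os.environ if env is None else env
    if env.get("PAGEINDEX_API_KEY"):
        try:
            pages = cloud_fetch(pdf_path)
            if pages:
                return "cloud", pages
        except Exception:
            # Assumed behavior: any cloud failure falls through to the local extractor.
            pass
    return "local", local_extract(pdf_path)


def fake_cloud(path: str) -> Pages:
    return [{"page": 1, "content": "cloud OCR text"}]


def fake_local(path: str) -> Pages:
    return [{"page": 1, "content": "local text"}]


def broken_cloud(path: str) -> Pages:
    raise RuntimeError("cloud unavailable")


with_key = long_pdf_pages("doc.pdf", fake_cloud, fake_local, env={"PAGEINDEX_API_KEY": "k"})
without_key = long_pdf_pages("doc.pdf", fake_cloud, fake_local, env={})
cloud_failed = long_pdf_pages("doc.pdf", broken_cloud, fake_local, env={"PAGEINDEX_API_KEY": "k"})
```

Under this ordering the configured extractor only shapes long-PDF page JSON when the key is absent or the cloud path yields nothing, which is the compatibility trade-off the note asks reviewers to confirm.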

Validation

UV_CACHE_DIR=/tmp/uv-cache UV_PYTHON=3.13 uv run --extra dev pytest tests/test_pdf_extractor.py tests/test_converter.py tests/test_indexer.py tests/test_config.py tests/test_add_command.py tests/test_remove.py tests/test_url_ingest.py

Result: 161 passed, 6 warnings in 1.35s.

gwokhou marked this pull request as ready for review June 23, 2026 13:58
gwokhou changed the title from "Share PDF page extraction across PDF ingest paths" to "feat: Share PDF page extraction across PDF ingest paths" Jun 23, 2026