Add PDFParserTool by lpayne-dev · Pull Request #51 · NASA-IMPACT/akd-ext

lpayne-dev · 2026-03-30T20:10:19Z

Description:

Summary

Adds a new PDFParserTool to akd-ext that parses PDF files from URLs or local paths using AKD core scraper backends, with mode-based routing (fast, accurate, ocr). This includes MCP registration/export wiring and focused tests for backend selection, error handling, and path normalization.

What it does

Introduces 1 new tool under akd_ext/tools/:

PDF Parser Tool (akd_ext/tools/pdf_parser.py): Parses a PDF into LLM-ready text plus metadata.

Adds typed schemas for the tool:

PDFParserToolInputSchema: supports url_or_path, mode, optional backend_hint, and return_format hint (markdown/html/json).
PDFParserToolOutputSchema: returns normalized content and metadata.

Implements backend behavior:

fast defaults to akd_simple (SimplePDFScraper).
accurate and ocr default to akd_docling (DoclingScraper) with config tuned per mode.
Optional backend_hint can explicitly select backend.
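
The routing rules above can be sketched as a small pure function. This is a minimal illustration, not the PR's code: the backend names and modes come from the description, while `select_backend` and `SUPPORTED_BACKENDS` are hypothetical names introduced here.

```python
from typing import Literal, Optional

Mode = Literal["fast", "accurate", "ocr"]

# Backend names as listed in the PR description.
SUPPORTED_BACKENDS = ("akd_simple", "akd_docling")


def select_backend(mode: Mode, backend_hint: Optional[str] = None) -> str:
    """Return the backend to use: an explicit hint wins, otherwise route by mode."""
    if backend_hint is not None:
        backend = backend_hint
    else:
        # "fast" uses the simple scraper; "accurate" and "ocr" use Docling.
        backend = "akd_simple" if mode == "fast" else "akd_docling"
    if backend not in SUPPORTED_BACKENDS:
        raise ValueError(f"Unsupported backend: {backend!r}")
    return backend
```

Keeping the decision in one function makes the default routing and the unsupported-backend error path easy to test without instantiating any scraper.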

Adds local-path normalization logic:

Handles URL/file URI inputs directly.
Normalizes local paths; preserves Windows local path behavior for scraper compatibility.
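
A rough sketch of what such normalization could look like follows. The PR's actual `_normalize_url_or_path` body is not shown on this page, so the rules below are assumptions: URLs and file URIs pass through, Windows drive paths are returned unchanged, and other local paths become `file://` URIs. The function name and regex are hypothetical.

```python
import re
from pathlib import Path
from urllib.parse import urlparse

# Matches Windows drive paths such as C:\docs\a.pdf or C:/docs/a.pdf.
_WINDOWS_DRIVE = re.compile(r"^[A-Za-z]:[\\/]")


def normalize_url_or_path(url_or_path: str) -> str:
    """Assumed behavior: pass URLs through, keep Windows paths, URI-encode the rest."""
    value = url_or_path.strip()
    # Check drive letters first: urlparse would read "C:" as a URL scheme.
    if _WINDOWS_DRIVE.match(value):
        return value
    if urlparse(value).scheme in {"http", "https", "file"}:
        return value
    # Remaining inputs are treated as local paths on the machine running the tool.
    return Path(value).expanduser().resolve().as_uri()
```

The drive-letter check has to come before `urlparse`, since a single-letter drive prefix would otherwise be misread as a scheme.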

Registers and exports tool APIs:

Adds PDFParserTool, PDFParserToolInputSchema, and PDFParserToolOutputSchema to akd_ext/tools/__init__.py.
Tool is discoverable via MCP registry through @mcp_tool.
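
For readers unfamiliar with decorator-based registration, the general pattern can be sketched as below. This is not the `@mcp_tool` implementation, which is not shown in this PR; `register_tool`, `TOOL_REGISTRY`, and `DemoPDFParserTool` are hypothetical stand-ins.

```python
from typing import Dict, TypeVar

T = TypeVar("T", bound=type)

# Hypothetical registry mapping tool class names to classes.
TOOL_REGISTRY: Dict[str, type] = {}


def register_tool(cls: T) -> T:
    """Record the class in the registry and return it unchanged."""
    TOOL_REGISTRY[cls.__name__] = cls
    return cls


@register_tool
class DemoPDFParserTool:
    """Stand-in tool class; registration happens when the class is defined."""
```

Because registration runs at class definition time, importing the module that defines the tool is enough to make it discoverable, which is why the export in `__init__.py` matters.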

Adds tests at tests/tools/test_pdf_parser.py covering:

Default routing for fast -> akd_simple
Default routing for non-fast -> akd_docling
Unsupported backend error path
MCP registry inclusion
Windows local path normalization

How it does this

Follows existing BaseTool + InputSchema/OutputSchema conventions used in akd-ext.
Wraps AKD core scraper implementations (SimplePDFScraper, DoclingScraper) behind one consistent tool interface.
Applies mode-to-backend routing with optional explicit backend override.
Normalizes scraper output into a stable response contract: content + enriched metadata (backend, return_format, parser metadata).

Testing

uv run pytest tests/tools/test_pdf_parser.py
uv run pre-commit run --all-files (ruff/format/lint checks)

NISH1001 · 2026-03-30T20:56:10Z

@lpayne-dev Could you update the body of the PR to match the format we use for other PRs? Thanks.

sanzog03 · 2026-03-30T21:12:00Z

+class PDFParserToolInputSchema(InputSchema):
+    """Input schema for PDF parsing."""
+
+    url_or_path: str = Field(..., description="HTTP(S) URL or local filesystem path to a PDF")


Could you clarify what you mean by a local filesystem path?

My initial impression was that the server runs on the user's own machine. Is that incorrect?

If incorrect, we can remove the local file path option, as it is just a holdover from my benchmarking process.

By local filesystem path, I mean a path that is resolvable by the AKD server process at runtime (e.g., /data/docs/a.pdf, or C:\docs\a.pdf when the server runs there). If the server is remote, a path on the client machine is not accessible; in that case an HTTP(S) URL or an accessible file:// URI should be used.

Agreed! Since the MCP server will be remote from the local path, it would be good to reflect that.
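
One way the remote-server constraint could be reflected in code is a check that accepts only HTTP(S) URLs. This is a sketch of a possible guard, not something the PR is confirmed to implement; `require_remote_url` is a hypothetical name.

```python
from urllib.parse import urlparse


def require_remote_url(url_or_path: str) -> str:
    """Return the input unchanged if it is an HTTP(S) URL, otherwise raise."""
    parsed = urlparse(url_or_path)
    # Local paths have no scheme (or a drive letter), and file:// is not remote.
    if parsed.scheme not in {"http", "https"} or not parsed.netloc:
        raise ValueError(
            f"Expected an HTTP(S) URL reachable by the server, got: {url_or_path!r}"
        )
    return url_or_path
```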

NISH1001 · 2026-03-31T15:47:53Z

+    return as_uri
+
+async def _run_akd_simple(url_or_path: str) -> ScraperToolOutputSchema:
+    scraper = SimplePDFScraper()


Provide an optional config and pass it through as well, since each scraper is also driven by its config.

scraper = SimplePDFScraper(config=config)

I’ll update both scraper paths to accept optional config and pass it through during initialization so behavior is fully config-driven.
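
The passthrough described here might look roughly like the following. The real `SimplePDFScraper` and its config class are not reproduced; `StubScraperConfig`, `StubPDFScraper`, and `run_simple` are hypothetical stand-ins that only demonstrate forwarding an optional config to the constructor.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class StubScraperConfig:
    """Hypothetical stand-in for a scraper config."""
    debug: bool = False


class StubPDFScraper:
    """Hypothetical stand-in for SimplePDFScraper; only the config kwarg matters."""

    def __init__(self, config: Optional[StubScraperConfig] = None) -> None:
        # Fall back to defaults when the caller supplies no config.
        self.config = config or StubScraperConfig()

    async def arun(self, url: str) -> Dict[str, Any]:
        return {"content": f"parsed {url}", "metadata": {"debug": self.config.debug}}


async def run_simple(url: str, config: Optional[StubScraperConfig] = None) -> Dict[str, Any]:
    """Build the scraper with the caller's config (if any) and run it."""
    scraper = StubPDFScraper(config=config)
    return await scraper.arun(url)
```

Calling `asyncio.run(run_simple("a.pdf", StubScraperConfig(debug=True)))` exercises the configured path, while omitting the second argument exercises the defaults.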

NISH1001 · 2026-03-31T15:48:17Z

+async def _run_akd_simple(url_or_path: str) -> ScraperToolOutputSchema:
+    scraper = SimplePDFScraper()
+    params = scraper.input_schema(url=_normalize_url_or_path(url_or_path))
+    return await scraper._arun(params)


Always use arun(...), which is the public-facing interface, not _arun(...).
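
The reasoning behind this kind of guidance is commonly a template-method design: the public method wraps the private hook with shared behavior. The sketch below illustrates that pattern under the assumption that `arun` adds such steps around `_arun`; what the AKD base class actually does is not shown in this PR, and `SketchTool` and `EchoScraper` are hypothetical.

```python
import asyncio
from typing import Any, Dict, List


class SketchTool:
    """Hypothetical base class; not the real AKD BaseTool."""

    def __init__(self) -> None:
        self.events: List[str] = []

    async def arun(self, params: Dict[str, Any]) -> Dict[str, Any]:
        # Public entry point: shared checks run around the core logic.
        if "url" not in params:
            raise ValueError("params must include 'url'")
        self.events.append("validated")
        result = await self._arun(params)
        self.events.append("finished")
        return result

    async def _arun(self, params: Dict[str, Any]) -> Dict[str, Any]:
        # Subclass hook: core logic only, with no shared checks.
        raise NotImplementedError


class EchoScraper(SketchTool):
    async def _arun(self, params: Dict[str, Any]) -> Dict[str, Any]:
        return {"content": f"parsed {params['url']}"}
```

A caller that invokes `_arun` directly would skip the validation and bookkeeping in `arun`, which is the risk the review comment points at.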

NISH1001 · 2026-03-31T15:50:28Z

+class PDFParserTool(BaseTool[PDFParserToolInputSchema, PDFParserToolOutputSchema]):
+    """Parse PDFs into LLM-ready content using AKD core backends."""
+
+    input_schema = PDFParserToolInputSchema
+    output_schema = PDFParserToolOutputSchema
+
+    async def _arun(self, params: PDFParserToolInputSchema) -> PDFParserToolOutputSchema:
+        backend = params.backend_hint
+        if backend is None:
+            backend = "akd_simple" if params.mode == "fast" else "akd_docling"
+
+        if backend == "akd_simple":
+            result = _scraper_to_result(await _run_akd_simple(params.url_or_path))
+        elif backend == "akd_docling":
+            result = _scraper_to_result(await _run_akd_docling(params.url_or_path, params.mode))
+        else:
+            raise ValueError(f"Unsupported backend: {backend!r}")
+
+        metadata = result.get("metadata", {})
+        if not isinstance(metadata, dict):
+            metadata = {"raw_metadata": metadata}
+        metadata["backend"] = backend
+        metadata["return_format"] = params.return_format
+
+        return PDFParserToolOutputSchema(
+            content=str(result.get("content", "") or ""),
+            metadata=metadata,
+        )


Should we also consider adding a general scraper tool? AKD core already has a composite scraper to which we can pass scraper objects.

Something like GeneralScraperTool would be a good addition as well; the composite scraper would handle webpage URLs, PDFs, or anything else directly.
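
As a loose illustration of the composite idea, the sketch below tries a list of scrapers in order and returns the first success. It does not reproduce AKD core's composite scraper, whose API and fallback semantics are not shown here; every name in the block is hypothetical.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence

# A scraper is modeled as an async function from URL to extracted text.
Scraper = Callable[[str], Awaitable[str]]


async def pdf_scraper(url: str) -> str:
    """Stand-in PDF scraper that rejects anything not ending in .pdf."""
    if not url.lower().endswith(".pdf"):
        raise ValueError("not a PDF")
    return f"pdf:{url}"


async def web_scraper(url: str) -> str:
    """Stand-in web scraper that accepts any URL."""
    return f"web:{url}"


async def composite_scrape(url: str, scrapers: Sequence[Scraper]) -> str:
    """Try each scraper in order; return the first result, or fail with all errors."""
    errors: List[Exception] = []
    for scraper in scrapers:
        try:
            return await scraper(url)
        except Exception as exc:  # collect and fall through to the next scraper
            errors.append(exc)
    raise RuntimeError(f"All scrapers failed for {url!r}: {errors}")
```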

sanzog03 · 2026-04-03T16:53:34Z

+            backend = "akd_simple" if params.mode == "fast" else "akd_docling"
+
+        if backend == "akd_simple":
+            result = _scraper_to_result(await _run_akd_simple(params.url_or_path))


The configs are not passed to _run_akd_simple and _run_akd_docling, even though their signatures allow them.
It might be a good idea to define the config via a BaseToolConfig extension and pass the scraper configs when they are available in the tool config.
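
A tool-level config that carries optional per-backend scraper configs could be sketched as follows. The real `BaseToolConfig` and the fields of the PR's `PDFParserToolConfig` are not shown on this page, so the classes and the field names `simple_config` and `docling_config` below are hypothetical, and dataclasses stand in for whatever base class AKD uses.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class SketchToolConfig:
    """Hypothetical stand-in for BaseToolConfig."""
    debug: bool = False


@dataclass
class SketchPDFParserToolConfig(SketchToolConfig):
    """Tool config extended with optional per-backend scraper configs."""
    simple_config: Optional[Dict[str, Any]] = None   # hypothetical field name
    docling_config: Optional[Dict[str, Any]] = None  # hypothetical field name


def config_for_backend(
    config: SketchPDFParserToolConfig, backend: str
) -> Optional[Dict[str, Any]]:
    """Pick the scraper config that matches the selected backend, if one is set."""
    if backend == "akd_simple":
        return config.simple_config
    if backend == "akd_docling":
        return config.docling_config
    raise ValueError(f"Unsupported backend: {backend!r}")
```

Returning `None` when no scraper config is set lets the caller fall back to each scraper's own defaults.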

Add PDFParserTool with tests and tool exports

23969df

lpayne-dev requested a review from sanzog03 March 30, 2026 20:10

sanzog03 reviewed Mar 30, 2026

NISH1001 requested changes Mar 31, 2026

Use public arun for PDF scrapers

5174f23

Switch PDF parser scraper calls from private _arun to public arun and add optional config passthrough for SimplePDFScraper and DoclingScraper paths to align with reviewer guidance.

sanzog03 reviewed Apr 3, 2026

lpayne-dev added 3 commits April 6, 2026 12:23

Adds PDFParserToolConfig for enhanced configuration options

0e9f505

PDFParserToolConfig allows optional configuration for SimplePDFScraper and DoclingScraper within the PDFParserTool. Updates tests to validate the new configuration handling.

Add docling dependency and update package configurations

eb328f0

Revert "Add docling dependency and update package configurations"

976045d

This reverts commit eb328f0.

lpayne-dev wants to merge 5 commits into develop from feature/pdf-parser
