Add PDFParserTool #51

Open
lpayne-dev wants to merge 5 commits into develop from feature/pdf-parser

Conversation

@lpayne-dev
Collaborator

@lpayne-dev lpayne-dev commented Mar 30, 2026

Description:

Summary

Adds a new PDFParserTool to akd-ext that parses PDF files from URLs or local paths using AKD core scraper backends, with mode-based routing (fast, accurate, ocr). This includes MCP registration/export wiring and focused tests for backend selection, error handling, and path normalization.

What it does

Introduces 1 new tool under akd_ext/tools/:

  • PDF Parser Tool (akd_ext/tools/pdf_parser.py): Parses a PDF into LLM-ready text plus metadata.

Adds typed schemas for the tool:

  • PDFParserToolInputSchema: supports url_or_path, mode, optional backend_hint, and return_format hint (markdown/html/json).
  • PDFParserToolOutputSchema: returns normalized content and metadata.

Implements backend behavior:

  • fast defaults to akd_simple (SimplePDFScraper).
  • accurate and ocr default to akd_docling (DoclingScraper) with config tuned per mode.
  • Optional backend_hint can explicitly select backend.

Adds local-path normalization logic:

  • Handles URL/file URI inputs directly.
  • Normalizes local paths; preserves Windows local path behavior for scraper compatibility.
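
A minimal sketch of the kind of normalization described above, using only the standard library. The function name `normalize_url_or_path` and its exact rules are assumptions for illustration; the PR's own helper is `_normalize_url_or_path`, whose body is not shown in this conversation and may behave differently.

```python
import re
from pathlib import Path
from urllib.parse import urlparse

# Matches Windows drive-letter paths such as C:\docs\a.pdf or C:/docs/a.pdf.
_WINDOWS_DRIVE = re.compile(r"^[A-Za-z]:[\\/]")


def normalize_url_or_path(value: str) -> str:
    """Hypothetical helper: pass URLs through, make other local paths absolute."""
    value = value.strip()
    # Check drive letters first: urlparse would otherwise read "C:" as a scheme.
    if _WINDOWS_DRIVE.match(value):
        return value  # assumption: Windows paths are handed to the scraper as-is
    if urlparse(value).scheme in {"http", "https", "file"}:
        return value  # URLs and file URIs are used directly
    # Anything else is treated as a path on the server's own filesystem.
    return str(Path(value).expanduser().resolve())
```

Checking the drive-letter pattern before calling `urlparse` is the design choice that matters here, since a bare `C:` prefix would otherwise be parsed as a URL scheme.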

Registers and exports tool APIs:

  • Adds PDFParserTool, PDFParserToolInputSchema, and PDFParserToolOutputSchema to akd_ext/tools/__init__.py.
  • Tool is discoverable via MCP registry through @mcp_tool.

Adds tests at tests/tools/test_pdf_parser.py covering:

  • Default routing for fast -> akd_simple
  • Default routing for non-fast -> akd_docling
  • Unsupported backend error path
  • MCP registry inclusion
  • Windows local path normalization

How it does this

  • Follows existing BaseTool + InputSchema/OutputSchema conventions used in akd-ext.
  • Wraps AKD core scraper implementations (SimplePDFScraper, DoclingScraper) behind one consistent tool interface.
  • Applies mode-to-backend routing with optional explicit backend override.
  • Normalizes scraper output into a stable response contract: content + enriched metadata (backend, return_format, parser metadata).

Testing

  • uv run pytest tests/tools/test_pdf_parser.py
  • uv run pre-commit run --all-files (ruff/format/lint checks)

@lpayne-dev lpayne-dev requested a review from sanzog03 March 30, 2026 20:10
@NISH1001
Collaborator

@lpayne-dev Could you update the body of the PR to match the format we use for other PRs? Thanks.

class PDFParserToolInputSchema(InputSchema):
    """Input schema for PDF parsing."""

    url_or_path: str = Field(..., description="HTTP(S) URL or local filesystem path to a PDF")
Collaborator

Please clarify what you mean by "local filesystem path".

Collaborator Author

I was under the initial impression that the server runs on the user's own machine, is that incorrect?

If incorrect, we can remove the local file path option, as it is just a holdover from my benchmarking process.

Collaborator Author

By local filesystem path, I mean a path that the AKD server process can resolve at runtime (e.g., /data/docs/a.pdf, or C:\docs\a.pdf when the server runs on that machine). If the server is remote, a path on the client machine is not accessible; in that case an HTTP(S) URL or a file:// URI the server can reach should be used.

Collaborator

Agreed! Since the MCP server will not work with local paths, it would be good to reflect that in the schema.
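
One way this could be reflected, sketched under the assumption that a remote MCP deployment should reject anything that is not an HTTP(S) URL. The function name `require_remote_url` is hypothetical and not part of the PR.

```python
from urllib.parse import urlparse


def require_remote_url(value: str) -> str:
    """Hypothetical validator: accept only HTTP(S) URLs that include a host."""
    cleaned = value.strip()
    parsed = urlparse(cleaned)
    if parsed.scheme not in {"http", "https"} or not parsed.netloc:
        raise ValueError(
            f"Expected an HTTP(S) URL reachable by the server, got: {value!r}"
        )
    return cleaned
```

If the input schema is a Pydantic model, as the `Field(...)` usage suggests, the same check could plausibly run as a field validator on `url_or_path` so that bad inputs fail at schema validation rather than inside the scraper.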

Comment thread akd_ext/tools/pdf_parser.py Outdated
    return as_uri

async def _run_akd_simple(url_or_path: str) -> ScraperToolOutputSchema:
    scraper = SimplePDFScraper()
Collaborator

Provide an optional config and pass that through as well, since each scraper is also driven by config.

scraper = SimplePDFScraper(config=config)

Collaborator Author

I’ll update both scraper paths to accept optional config and pass it through during initialization so behavior is fully config-driven.

Comment thread akd_ext/tools/pdf_parser.py Outdated
async def _run_akd_simple(url_or_path: str) -> ScraperToolOutputSchema:
    scraper = SimplePDFScraper()
    params = scraper.input_schema(url=_normalize_url_or_path(url_or_path))
    return await scraper._arun(params)
Collaborator

Always use arun(...), which is the public-facing interface, not _arun(...).
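
A rough sketch of how the two review points on this helper (optional config passthrough, and calling the public `arun`) might look together. `StubScraper` and `StubInput` are stand-ins invented for this example so that it runs without AKD installed; they are not the real SimplePDFScraper, and the real base class may do more inside `arun`.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class StubInput:
    url: str


class StubScraper:
    """Stand-in for an AKD scraper; NOT the real SimplePDFScraper."""

    input_schema = StubInput

    def __init__(self, config: Optional[Any] = None) -> None:
        self.config = config

    async def _arun(self, params: StubInput) -> dict:
        # Private implementation hook; callers should not invoke this directly.
        return {"content": f"parsed {params.url}", "config": self.config}

    async def arun(self, params: StubInput) -> dict:
        # Public entry point; a real base class may add validation or hooks here.
        return await self._arun(params)


async def run_simple(url: str, config: Optional[Any] = None) -> dict:
    scraper = StubScraper(config=config)  # config is passed at construction
    params = scraper.input_schema(url=url)
    return await scraper.arun(params)  # public arun, not _arun


result = asyncio.run(run_simple("https://example.com/a.pdf", config={"timeout": 30}))
```

Going through `arun` keeps the caller insulated from whatever the public wrapper adds around `_arun`, which is the usual reason for preferring the public method.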

Comment on lines +86 to +113
class PDFParserTool(BaseTool[PDFParserToolInputSchema, PDFParserToolOutputSchema]):
    """Parse PDFs into LLM-ready content using AKD core backends."""

    input_schema = PDFParserToolInputSchema
    output_schema = PDFParserToolOutputSchema

    async def _arun(self, params: PDFParserToolInputSchema) -> PDFParserToolOutputSchema:
        backend = params.backend_hint
        if backend is None:
            backend = "akd_simple" if params.mode == "fast" else "akd_docling"

        if backend == "akd_simple":
            result = _scraper_to_result(await _run_akd_simple(params.url_or_path))
        elif backend == "akd_docling":
            result = _scraper_to_result(await _run_akd_docling(params.url_or_path, params.mode))
        else:
            raise ValueError(f"Unsupported backend: {backend!r}")

        metadata = result.get("metadata", {})
        if not isinstance(metadata, dict):
            metadata = {"raw_metadata": metadata}
        metadata["backend"] = backend
        metadata["return_format"] = params.return_format

        return PDFParserToolOutputSchema(
            content=str(result.get("content", "") or ""),
            metadata=metadata,
        )
Collaborator

Should we also consider adding a general scraper tool? AKD core already has a composite scraper to which we can pass scraper objects.

Something like a GeneralScraperTool would be a good addition as well; the composite scraper would then handle a web page URL, a PDF, or anything else directly.

Switch PDF parser scraper calls from private _arun to public arun and add optional config passthrough for SimplePDFScraper and DoclingScraper paths to align with reviewer guidance.
Comment thread akd_ext/tools/pdf_parser.py Outdated
            backend = "akd_simple" if params.mode == "fast" else "akd_docling"

        if backend == "akd_simple":
            result = _scraper_to_result(await _run_akd_simple(params.url_or_path))
Collaborator

@sanzog03 sanzog03 Apr 3, 2026

Configs are not passed to _run_akd_simple and _run_akd_docling even though their signatures allow them.
It might be a good idea to define the config via a BaseToolConfig extension and pass it through when it is available in the tool config.
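
As a rough illustration of that suggestion, the sketch below uses a plain dataclass in place of a BaseToolConfig subclass so that it runs without AKD installed. The class name `PDFParserToolConfigSketch`, the field names `simple_config` and `docling_config`, and the helper `config_for_backend` are all assumptions, not the PR's actual definitions.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class PDFParserToolConfigSketch:
    """Hypothetical stand-in for a BaseToolConfig extension."""

    simple_config: Optional[Any] = None   # forwarded to the SimplePDFScraper path
    docling_config: Optional[Any] = None  # forwarded to the DoclingScraper path


def config_for_backend(
    tool_config: PDFParserToolConfigSketch, backend: str
) -> Optional[Any]:
    """Pick the per-backend scraper config; None means use scraper defaults."""
    if backend == "akd_simple":
        return tool_config.simple_config
    if backend == "akd_docling":
        return tool_config.docling_config
    raise ValueError(f"Unsupported backend: {backend!r}")
```

The tool's `_arun` could then look up the config for the selected backend once and hand it to the matching `_run_akd_*` helper, leaving both fields unset to keep today's default behavior.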

PDFParserToolConfig allows optional configuration for SimplePDFScraper and DoclingScraper within the PDFParserTool.

Updates tests to validate the new configuration handling.


3 participants