Add PDFParserTool #51
@lpayne-dev Could you update the body of the PR like we have for other PRs? Thanks.
class PDFParserToolInputSchema(InputSchema):
    """Input schema for PDF parsing."""

    url_or_path: str = Field(..., description="HTTP(S) URL or local filesystem path to a PDF")
Clarify what you mean by local filesystem path.
I was under the initial impression that the server runs on the user's own machine. Is that incorrect?
If so, we can remove the local file path option, as it is just a holdover from my benchmarking process.
By local filesystem path, I mean a path that is resolvable by the AKD server process at runtime (e.g., /data/docs/a.pdf, or C:\docs\a.pdf when the server runs there). If the server is remote, a path on the client machine is not accessible; in that case an HTTP(S) URL or an accessible file:// URI should be used.
Agreed! As the MCP server will stay away from the local path, it'll be good to reflect that.
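The path semantics described in this thread can be illustrated with a minimal sketch. It is not the code under review: `normalize_url_or_path` is a hypothetical name, and the choice to pass `http`, `https`, and `file` URLs through unchanged while resolving everything else against the server's own filesystem is an assumption based on the explanation above.

```python
from pathlib import Path
from urllib.parse import urlparse


def normalize_url_or_path(url_or_path: str) -> str:
    """Return a fetchable URL for a URL or a server-local path (illustrative only)."""
    value = url_or_path.strip()
    scheme = urlparse(value).scheme.lower()
    # Pass through schemes that can be fetched directly.
    if scheme in {"http", "https", "file"}:
        return value
    # Anything else is treated as a path on the machine running the server.
    # urlparse reports a one-letter scheme for inputs such as "C:\\docs\\a.pdf",
    # so Windows drive paths also take this branch.
    # resolve() makes the path absolute, which as_uri() requires.
    return Path(value).expanduser().resolve().as_uri()
```

Because the path is resolved inside the server process, a path that exists only on a remote client's machine would resolve to a location the server cannot read, which is the concern raised above.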
    return as_uri


async def _run_akd_simple(url_or_path: str) -> ScraperToolOutputSchema:
    scraper = SimplePDFScraper()
Provide an optional config and pass it as well, since each scraper is also driven by config:
scraper = SimplePDFScraper(config=config)
I’ll update both scraper paths to accept optional config and pass it through during initialization so behavior is fully config-driven.
async def _run_akd_simple(url_or_path: str) -> ScraperToolOutputSchema:
    scraper = SimplePDFScraper()
    params = scraper.input_schema(url=_normalize_url_or_path(url_or_path))
    return await scraper._arun(params)
Always use arun(...), which is the public-facing interface, not _arun(...).
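To illustrate why the public method is the safer call site, here is a minimal sketch of the public/private split. `SketchScraper` and `run_simple` are hypothetical stand-ins, not AKD core classes, and the assumption that the public `arun` performs validation before delegating to `_arun` is typical of this pattern but is not confirmed by this thread.

```python
import asyncio


class SketchScraper:
    """Hypothetical stand-in; not the AKD core scraper base class."""

    async def arun(self, params: dict) -> dict:
        # Public entry point: validate first, then delegate to the private hook.
        if not params.get("url"):
            raise ValueError("url is required")
        return await self._arun(params)

    async def _arun(self, params: dict) -> dict:
        # Private hook: assumes params were already validated by arun().
        return {"content": f"parsed {params['url']}", "metadata": {}}


async def run_simple(url: str) -> dict:
    scraper = SketchScraper()
    # Calling arun() keeps whatever checks the public interface adds.
    return await scraper.arun({"url": url})
```

Calling `_arun` directly in this sketch would skip the empty-URL check, which is the kind of behavior a caller loses by bypassing the public interface.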
class PDFParserTool(BaseTool[PDFParserToolInputSchema, PDFParserToolOutputSchema]):
    """Parse PDFs into LLM-ready content using AKD core backends."""

    input_schema = PDFParserToolInputSchema
    output_schema = PDFParserToolOutputSchema

    async def _arun(self, params: PDFParserToolInputSchema) -> PDFParserToolOutputSchema:
        backend = params.backend_hint
        if backend is None:
            backend = "akd_simple" if params.mode == "fast" else "akd_docling"

        if backend == "akd_simple":
            result = _scraper_to_result(await _run_akd_simple(params.url_or_path))
        elif backend == "akd_docling":
            result = _scraper_to_result(await _run_akd_docling(params.url_or_path, params.mode))
        else:
            raise ValueError(f"Unsupported backend: {backend!r}")

        metadata = result.get("metadata", {})
        if not isinstance(metadata, dict):
            metadata = {"raw_metadata": metadata}
        metadata["backend"] = backend
        metadata["return_format"] = params.return_format

        return PDFParserToolOutputSchema(
            content=str(result.get("content", "") or ""),
            metadata=metadata,
        )
Should we also consider adding a general scraper tool? AKD core already has a composite scraper to which we can pass scraper objects.
Something like GeneralScraperTool would be a good addition as well, and the composite scraper would handle webpage URLs, PDFs, or anything else directly.
Switch PDF parser scraper calls from private _arun to public arun and add optional config passthrough for SimplePDFScraper and DoclingScraper paths to align with reviewer guidance.
            backend = "akd_simple" if params.mode == "fast" else "akd_docling"

        if backend == "akd_simple":
            result = _scraper_to_result(await _run_akd_simple(params.url_or_path))
The config is not passed to _run_akd_simple and _run_akd_docling even though their signatures allow it.
It might be a good idea to define the config via a BaseToolConfig extension and pass it through when it is available in the tool config.
PDFParserToolConfig allows optional configuration for SimplePDFScraper and DoclingScraper within the PDFParserTool. Updates tests to validate the new configuration handling.
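A minimal sketch of how a tool-level config could carry per-backend settings and hand the right one to the selected scraper. It uses a plain dataclass rather than the real BaseToolConfig, and `SketchPDFParserToolConfig`, `simple_config`, `docling_config`, `resolve_backend`, and `select_scraper_config` are hypothetical names, not the fields or helpers in the actual commit. The routing rule mirrors the diff in this PR: an explicit hint wins, otherwise fast maps to akd_simple and every other mode to akd_docling.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class SketchPDFParserToolConfig:
    """Hypothetical stand-in for a BaseToolConfig extension."""

    simple_config: Optional[dict[str, Any]] = None
    docling_config: Optional[dict[str, Any]] = None


def resolve_backend(mode: str, backend_hint: Optional[str] = None) -> str:
    # An explicit hint wins; otherwise "fast" uses akd_simple and
    # every other mode falls back to akd_docling.
    if backend_hint is not None:
        backend = backend_hint
    else:
        backend = "akd_simple" if mode == "fast" else "akd_docling"
    if backend not in {"akd_simple", "akd_docling"}:
        raise ValueError(f"Unsupported backend: {backend!r}")
    return backend


def select_scraper_config(
    config: SketchPDFParserToolConfig, backend: str
) -> Optional[dict[str, Any]]:
    # Return the per-backend config, or None so the scraper uses its defaults.
    if backend == "akd_simple":
        return config.simple_config
    return config.docling_config
```

Returning None when no per-backend config is set keeps the configuration optional: the scraper can be constructed with its defaults, while a caller who supplies settings gets them passed through.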
This reverts commit eb328f0.
Description:
Summary
Adds a new PDFParserTool to akd-ext that parses PDF files from URLs or local paths using AKD core scraper backends, with mode-based routing (fast, accurate, ocr). This includes MCP registration/export wiring and focused tests for backend selection, error handling, and path normalization.

What it does
Introduces 1 new tool under akd_ext/tools/:
- akd_ext/tools/pdf_parser.py: Parses a PDF into LLM-ready text plus metadata.

Adds typed schemas for the tool:
- PDFParserToolInputSchema: supports url_or_path, mode, optional backend_hint, and return_format hint (markdown/html/json).
- PDFParserToolOutputSchema: returns normalized content and metadata.

Implements backend behavior:
- fast defaults to akd_simple (SimplePDFScraper).
- accurate and ocr default to akd_docling (DoclingScraper) with config tuned per mode.
- backend_hint can explicitly select backend.

Adds local-path normalization logic:

Registers and exports tool APIs:
- PDFParserTool, PDFParserToolInputSchema, and PDFParserToolOutputSchema to akd_ext/tools/__init__.py.
- @mcp_tool.

Adds tests at tests/tools/test_pdf_parser.py covering:
- fast -> akd_simple
- fast -> akd_docling

How it does this
- BaseTool + InputSchema/OutputSchema conventions used in akd-ext.
- (SimplePDFScraper, DoclingScraper) behind one consistent tool interface.
- content + enriched metadata (backend, return_format, parser metadata).

Testing
- uv run pytest tests/tools/test_pdf_parser.py
- uv run pre-commit run --all-files (ruff/format/lint checks)