Integrate opendataloader-pdf into OCR chain; document unsloth, OpenViking, openrag fit #4

Draft
Copilot wants to merge 2 commits into main from copilot/evaluate-project-fit

Copilot AI commented Mar 20, 2026

Evaluates four external OSS tools for fit with Project Aether and acts on the ones that integrate cleanly.

Integrated: @opendataloader/pdf

Slots into the PDF extraction fallback chain as step 2 (after Docling Serve, before pdf-parse). No server is required; it runs locally via the Java CLI wrapper. It is currently reported as the top-ranked open-source tool on PDF extraction benchmarks, and it handles multi-column layouts, tables, and formulas that pdf-parse misses.

Docling Serve → OpenDataLoader PDF → pdf-parse → DeepSeek VL2 OCR → Tesseract.js
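
The ordering above can be sketched as a generic fallback loop. This is a minimal illustration, not the code in lib/ocr.ts: the `Extractor` type, `extractWithFallback`, and the rule that an empty result counts as a miss are all assumptions made for the example.

```typescript
// Hypothetical shape for one step in the chain (not taken from lib/ocr.ts).
type Extractor = {
  name: string;
  enabled: () => boolean;
  extract: (pdf: Uint8Array) => Promise<string>;
};

type ExtractionResult = { text: string; source: string };

// Try each enabled extractor in order and return the first non-empty result.
// A thrown error or an empty string moves on to the next step.
async function extractWithFallback(
  pdf: Uint8Array,
  chain: Extractor[],
): Promise<ExtractionResult> {
  const failures: string[] = [];
  for (const step of chain) {
    if (!step.enabled()) {
      continue; // step switched off, e.g. by an environment flag
    }
    try {
      const text = await step.extract(pdf);
      if (text.trim().length > 0) {
        return { text, source: step.name };
      }
      failures.push(`${step.name}: empty result`);
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      failures.push(`${step.name}: ${message}`);
    }
  }
  throw new Error(`All extractors failed: ${failures.join("; ")}`);
}
```

With this shape, the chain is just an ordered array, so placing OpenDataLoader PDF between Docling Serve and pdf-parse amounts to inserting one element at index 1.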

Key implementation details in lib/ocr.ts:

  • Filename sanitized via basename() before writing to tmpdir (path traversal prevention)
  • Isolated per-call tmpdir; cleanup errors surfaced in non-production environments
  • Skip entirely via OPENDATALOADER_ENABLED=false
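
A rough sketch of how these three points could fit together, assuming Node's standard `fs/promises`, `os`, and `path` modules. The helper names (`isOpenDataLoaderEnabled`, `withIsolatedTmpFile`), the `odl-` directory prefix, and the `input.pdf` fallback name are hypothetical, and logging via `console.warn` is only one possible reading of "surfaced".

```typescript
import { mkdtemp, rm, writeFile } from "node:fs/promises";
import { tmpdir } from "node:os";
import { basename, join } from "node:path";

// Enabled unless the flag is explicitly set to "false" (assumed semantics).
function isOpenDataLoaderEnabled(
  env: Record<string, string | undefined> = process.env,
): boolean {
  return env.OPENDATALOADER_ENABLED !== "false";
}

// Write `data` into a fresh temp directory under a sanitized name, run `fn`
// on the resulting path, then remove the directory whatever the outcome.
async function withIsolatedTmpFile<T>(
  filename: string,
  data: Uint8Array,
  fn: (filePath: string) => Promise<T>,
): Promise<T> {
  // basename() drops directory components such as "../../".
  const base = basename(filename);
  // "", "." and ".." would still resolve outside or onto the directory itself.
  const safeName = base === "" || base === "." || base === ".." ? "input.pdf" : base;

  // One directory per call, so concurrent extractions never share files.
  const dir = await mkdtemp(join(tmpdir(), "odl-"));
  try {
    const filePath = join(dir, safeName);
    await writeFile(filePath, data);
    return await fn(filePath);
  } finally {
    try {
      await rm(dir, { recursive: true, force: true });
    } catch (err) {
      // Surface cleanup problems during development; stay quiet in production.
      if (process.env.NODE_ENV !== "production") {
        console.warn("tmpdir cleanup failed:", err);
      }
    }
  }
}
```

Cleanup failures are logged rather than rethrown here so that an error inside `finally` cannot mask the extraction result or the original exception.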

Documented only

| Tool | Verdict | Reason |
| --- | --- | --- |
| unsloth | Companion tool | Python-only; use offline to fine-tune Ollama models, export GGUF, load via ollama create |
| OpenViking | Optional future layer | Filesystem-based context DB with HTTP API; complements LEANN/pgvector for long-running agent sessions |
| openrag | Reference only | Langflow + Docling + OpenSearch; mirrors existing stack, no additive value |

README "Ecosystem tools" section added for all four. CLAUDE.md updated to reflect actual repo state.



vercel Bot commented Mar 20, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| aether | Ready | Preview, Comment | Mar 20, 2026 7:14am |

…tem tool fit

Co-authored-by: HarryBMa <163435833+HarryBMa@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Assess project compatibility with current objectives" to "Integrate opendataloader-pdf into OCR chain; document unsloth, OpenViking, openrag fit" Mar 20, 2026
Copilot AI requested a review from HarryBMa March 20, 2026 07:16