feat: PDF ingestion via unpdf with page chunking#33
feat: PDF ingestion via unpdf with page chunking#33RaviTharuma wants to merge 2 commits intoAVIDS2:mainfrom
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b0e39c413c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
src/multimodal/pdf-loader.ts
Outdated
| */ | ||
| export async function extractPdfText(input: PdfInput): Promise<PdfExtractionResult> { | ||
| // Dynamic import — unpdf is optional | ||
| const unpdf = await import('unpdf').catch(() => null) as { |
There was a problem hiding this comment.
Declare unpdf as an optional dependency
This new loader hard-depends on resolving import('unpdf'), but the package is not declared in Memorix’s dependency metadata, so default installs won’t provide it. In the documented global-install flow (npm install -g memorix), telling users to run local npm install unpdf still won’t make this import resolvable from the global package tree, so memorix_ingest_pdf will consistently fail for those users.
Useful? React with 👍 / 👎.
| const limit = Math.min(rawPages.length, maxPages); | ||
|
|
||
| for (let i = 0; i < limit; i++) { |
There was a problem hiding this comment.
Enforce integer maxPages before page loop
maxPages is treated as an arbitrary number and used directly in i < limit, which causes fractional values to round up in practice (e.g., maxPages=1.1 processes pages 1 and 2). Negative values also silently result in zero ingestion. Because this tool parameter represents a page count, it should be validated/coerced to a positive integer before computing the loop bound.
Useful? React with 👍 / 👎.
tests/multimodal/pdf-loader.test.ts
Outdated
| @@ -0,0 +1,32 @@ | |||
| import { describe, it, expect } from 'bun:test'; | |||
There was a problem hiding this comment.
|
Thanks for working on PDF ingestion — page-level storage is a good direction, but this branch still has a few hard blockers.
So at the moment this is not just a dependency-install issue — it needs one more pass to make the optional dependency story and the repo's type/test expectations line up. |
|
All three are valid — I'll fix them:
The optional dependency story for global installs is the part that needs the most care — I'll make sure the error path when |
b0e39c4 to
cf9bd93
Compare
|
Addressed review feedback:
|
AVIDS2
left a comment
There was a problem hiding this comment.
This still has at least one real blocker before merge. The optional unpdf story is not finished yet: the branch still does not declare unpdf in package metadata, and the runtime error path tells users to run local npm install unpdf, which will not make import('unpdf') resolvable for the documented global-install flow. There is also still no integer validation on maxPages, so fractional or negative values can produce surprising page-count behavior. Please tighten the dependency/install contract and validate maxPages as a positive integer before we take this.
Summary
Adds an
memorix_ingest_pdfMCP tool that extracts text from PDF files usingunpdfand stores each page (or chunk) as a memorix observation.What's included
src/multimodal/pdf-loader.ts— PDF text extractor with:unpdfdependency (clear error if not installed)src/server.ts— MCP tool registration (memorix_ingest_pdf)tests/multimodal/pdf-loader.test.ts— 4 tests covering:Design decisions
unpdfis an optional peer dependency — tool returns clear error if not installed (npm install unpdf).how-it-worksobservation type with PDF metadata (page number, total pages, filename) in facts.unpdfis pure JS/WASM.Tests