Build a system that intelligently extracts and ranks relevant sections from a collection of PDFs based on:
- A given persona
- A specific job-to-be-done
Expected output JSON:

    {
      "metadata": {
        "input_documents": ["sample1.pdf", "sample2.pdf"],
        "persona": "Investment Analyst",
        "job_to_be_done": "Analyze market positioning strategies",
        "processing_timestamp": "2025-07-28T14:22:01"
      },
      "extracted_sections": [
        {
          "document": "sample1.pdf",
          "section_title": "Market Strategy",
          "importance_rank": 1,
          "page_number": 4
        }
      ],
      "sub_section_analysis": [
        {
          "document": "sample1.pdf",
          "refined_text": "The company is focusing on cloud-first verticals...",
          "page_number": 4
        }
      ]
    }

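As a rough illustration of how the fields in this example output fit together, here is a minimal sketch that assembles such a dictionary. The function name `build_output` and the tuple layouts of its arguments are assumptions made for this example, not the project's actual code.

```python
import json
from datetime import datetime


def build_output(documents, persona, job, ranked_sections, refined_chunks):
    """Assemble a result dict shaped like the example output.

    ranked_sections: (document, section_title, page_number) tuples,
        assumed to be sorted from most to least relevant already.
    refined_chunks: (document, refined_text, page_number) tuples.
    Both layouts are hypothetical and chosen only for this sketch.
    """
    return {
        "metadata": {
            "input_documents": list(documents),
            "persona": persona,
            "job_to_be_done": job,
            # ISO 8601 without microseconds, e.g. 2025-07-28T14:22:01
            "processing_timestamp": datetime.now().isoformat(timespec="seconds"),
        },
        "extracted_sections": [
            {
                "document": doc,
                "section_title": title,
                "importance_rank": rank,  # 1 = most relevant
                "page_number": page,
            }
            for rank, (doc, title, page) in enumerate(ranked_sections, start=1)
        ],
        "sub_section_analysis": [
            {"document": doc, "refined_text": text, "page_number": page}
            for doc, text, page in refined_chunks
        ],
    }


if __name__ == "__main__":
    result = build_output(
        ["sample1.pdf", "sample2.pdf"],
        "Investment Analyst",
        "Analyze market positioning strategies",
        [("sample1.pdf", "Market Strategy", 4)],
        [("sample1.pdf", "The company is focusing on cloud-first verticals...", 4)],
    )
    print(json.dumps(result, indent=2))
```

Deriving `importance_rank` from list position keeps the ranks contiguous and starting at 1, provided the caller passes the sections in relevance order.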
🧠 Methodology
Extraction: Parse all PDFs and extract text with page metadata.
Chunking: Split documents into title-content sections.
Embedding: Encode the persona and job-to-be-done into a semantic vector (via sentence-transformers).
Ranking: Score each section by cosine similarity to that vector (title and content scored separately).
Summarization: Condense top content chunks using keyword-based summarization.
Structured Output: Combine all into the expected JSON schema.
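The ranking step can be sketched as follows. This is a minimal, dependency-free illustration that assumes the embeddings have already been computed (in the real pipeline they would come from sentence-transformers). The dictionary keys `title_vec` and `content_vec`, the function names, and the `title_weight` blend of the two scores are all assumptions made for this example; the README only states that title and content are scored separately.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as unrelated
    return dot / (norm_a * norm_b)


def rank_sections(query_vec, sections, title_weight=0.3):
    """Sort sections by relevance to the persona/job vector.

    Each section is a dict carrying hypothetical "title_vec" and
    "content_vec" entries. Title and content are scored separately,
    then blended with title_weight (an assumed combination rule).
    """
    scored = []
    for section in sections:
        title_score = cosine_similarity(query_vec, section["title_vec"])
        content_score = cosine_similarity(query_vec, section["content_vec"])
        score = title_weight * title_score + (1.0 - title_weight) * content_score
        scored.append((score, section))

    # Highest score first; Python's sort is stable, so ties keep input order.
    scored.sort(key=lambda pair: pair[0], reverse=True)

    return [
        {**section, "score": score, "importance_rank": rank}
        for rank, (score, section) in enumerate(scored, start=1)
    ]
```

Because cosine similarity ignores vector length, a short title and a long content chunk are compared on direction alone, which is why the two scores can be blended on the same scale.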
🧱 Tech Stack
Python 3.11
PyMuPDF
spaCy
scikit-learn
sentence-transformers
🐳 Docker Instructions
🔨 Build

    docker build --platform linux/amd64 -t personaextractor:latest .

Run

    docker run \
      -v "$(pwd)/data/input_docs:/app/data/input_docs" \
      -v "$(pwd)/data:/app/data" \
      --network none \
      personaextractor:latest

Input: data/input_docs/ → PDF documents
Output: data/output.json
✅ Constraints Satisfied

| Constraint | Status |
|---|---|
| ≤ 60s for 3–5 PDFs | ✅ |
| CPU only | ✅ |
| Model size ≤ 1GB | ✅ |
| Offline / no internet | ✅ |

📁 Project Structure

    .
    ├── main.py
    ├── requirements.txt
    ├── Dockerfile
    ├── modules/
    │   ├── extractor.py
    │   ├── chunker.py
    │   ├── embedder.py
    │   ├── ranker.py
    │   └── summarizer.py
    └── data/
        ├── input_docs/
        └── output.json

🧩 Notes
Sentence-transformer embedding allows persona-job semantic intent to guide section relevance.
Summary generation is designed to be fast and simple (offline, rule-based).
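To make the rule-based idea concrete, here is one possible sketch of a keyword-overlap summarizer. It is an assumption about how such a step could work, not the code in `modules/summarizer.py`: the stopword list, the regex sentence splitter, and the names `keywords` and `summarize` are all invented for this example.

```python
import re

# Tiny illustrative stopword list (an assumption, not the project's list).
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "on", "for", "is", "are", "with"}


def keywords(text):
    """Lowercased words longer than two characters, minus stopwords."""
    return {
        word
        for word in re.findall(r"[a-z0-9]+", text.lower())
        if len(word) > 2 and word not in STOPWORDS
    }


def summarize(chunk, query, max_sentences=2):
    """Keep the sentences that share the most keywords with the query."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", chunk) if s.strip()]
    query_terms = keywords(query)

    # (overlap count, original position, sentence)
    scored = [(len(keywords(s) & query_terms), i, s) for i, s in enumerate(sentences)]

    # Best overlap first; earlier sentences win ties.
    top = sorted(scored, key=lambda item: (-item[0], item[1]))[:max_sentences]

    # Restore document order so the summary reads naturally.
    return " ".join(s for _, _, s in sorted(top, key=lambda item: item[1]))
```

This matches words exactly, so "market" and "markets" count as different terms. Lemmatizing both sides first (for instance with spaCy, which is already in the tech stack) would be one way to loosen that, at some cost in speed.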