Round 1B – Persona-Driven Document Intelligence

🎯 Objective

Build a system that intelligently extracts and ranks relevant sections from a collection of PDFs based on:

A given persona
A specific job-to-be-done

🧾 Output JSON Format

```json
{
  "metadata": {
    "input_documents": ["sample1.pdf", "sample2.pdf"],
    "persona": "Investment Analyst",
    "job_to_be_done": "Analyze market positioning strategies",
    "processing_timestamp": "2025-07-28T14:22:01"
  },
  "extracted_sections": [
    {
      "document": "sample1.pdf",
      "section_title": "Market Strategy",
      "importance_rank": 1,
      "page_number": 4
    }
  ],
  "sub_section_analysis": [
    {
      "document": "sample1.pdf",
      "refined_text": "The company is focusing on cloud-first verticals...",
      "page_number": 4
    }
  ]
}
```
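As a rough illustration of how this layout could be assembled in code, the sketch below builds the same three top-level keys from plain Python tuples. The function name `build_output` and its parameters are hypothetical and are not taken from `main.py`; it assumes the sections arrive already sorted by relevance.

```python
import json
from datetime import datetime


def build_output(documents, persona, job, ranked_sections, refined_chunks):
    """Assemble the result dictionary in the layout shown above.

    ranked_sections: list of (document, section_title, page_number) tuples,
                     assumed to be sorted from most to least relevant.
    refined_chunks:  list of (document, refined_text, page_number) tuples.
    """
    return {
        "metadata": {
            "input_documents": list(documents),
            "persona": persona,
            "job_to_be_done": job,
            # Second-level precision, e.g. 2025-07-28T14:22:01
            "processing_timestamp": datetime.now().isoformat(timespec="seconds"),
        },
        "extracted_sections": [
            {
                "document": doc,
                "section_title": title,
                "importance_rank": rank,  # 1 = most relevant
                "page_number": page,
            }
            for rank, (doc, title, page) in enumerate(ranked_sections, start=1)
        ],
        "sub_section_analysis": [
            {"document": doc, "refined_text": text, "page_number": page}
            for doc, text, page in refined_chunks
        ],
    }


if __name__ == "__main__":
    result = build_output(
        ["sample1.pdf", "sample2.pdf"],
        "Investment Analyst",
        "Analyze market positioning strategies",
        [("sample1.pdf", "Market Strategy", 4)],
        [("sample1.pdf", "The company is focusing on cloud-first verticals...", 4)],
    )
    print(json.dumps(result, indent=2, ensure_ascii=False))
```

Deriving `importance_rank` from list position keeps the ranks contiguous and avoids storing them twice.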

🧠 Methodology

Extraction: Parse all PDFs and extract text with page metadata.

Chunking: Split documents into title-content sections.

Embedding: Encode the persona and job into a semantic vector (via sentence-transformers).

Ranking: Score each section using cosine similarity (title and content separately).

Summarization: Condense top content chunks using keyword-based summarization.

Structured Output: Combine all into the expected JSON schema.
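The ranking step can be sketched roughly as follows. To keep the example dependency-free, `toy_embed` is a bag-of-words stand-in and is not the sentence-transformers model the project uses; the function names, the section dictionary keys (`title`, `content`), and the 0.3 / 0.7 title-to-content weighting are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Stand-in embedder: lower-cased word counts (NOT sentence-transformers)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dict-like counts."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_sections(sections, persona, job, title_weight=0.3):
    """Sort sections by a weighted mix of title and content similarity.

    Title and content are scored separately against the persona + job query.
    The default weighting is an illustrative assumption.
    """
    query = toy_embed(f"{persona}. {job}")
    scored = []
    for section in sections:
        title_score = cosine(query, toy_embed(section["title"]))
        content_score = cosine(query, toy_embed(section["content"]))
        score = title_weight * title_score + (1 - title_weight) * content_score
        scored.append((score, section))
    # Highest combined score first; sort is stable, so ties keep input order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [
        {**section, "importance_rank": rank, "score": round(score, 4)}
        for rank, (score, section) in enumerate(scored, start=1)
    ]


if __name__ == "__main__":
    demo = [
        {"title": "Office Locations", "content": "Our offices are in Berlin and Pune."},
        {"title": "Market Strategy", "content": "We analyze market positioning strategies."},
    ]
    for row in rank_sections(demo, "Investment Analyst", "Analyze market positioning strategies"):
        print(row["importance_rank"], row["title"], row["score"])
```

With a real sentence-transformers model, `toy_embed` would be replaced by the model's encoder and `cosine` by a dense-vector equivalent; the separate title and content scoring would stay the same.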

🧱 Tech Stack

Python 3.11

PyMuPDF

spaCy

scikit-learn

sentence-transformers

🐳 Docker Instructions

🔨 Build

```bash
docker build --platform linux/amd64 -t personaextractor:latest .
```

▶️ Run

```bash
docker run --rm \
  -v "$(pwd)/data/input_docs:/app/data/input_docs" \
  -v "$(pwd)/data:/app/data" \
  --network none \
  personaextractor:latest
```

Input: data/input_docs/ → PDF documents

Output: data/output.json

✅ Constraints Satisfied

| Constraint | Status |
|---|---|
| ≤ 60s for 3–5 PDFs | ✅ |
| CPU only | ✅ |
| Model size ≤ 1GB | ✅ |
| Offline / no internet | ✅ |

📁 Project Structure

```
.
├── main.py
├── requirements.txt
├── Dockerfile
├── modules/
│   ├── extractor.py
│   ├── chunker.py
│   ├── embedder.py
│   ├── ranker.py
│   └── summarizer.py
└── data/
    ├── input_docs/
    └── output.json
```

🧩 Notes

Sentence-transformer embedding allows persona-job semantic intent to guide section relevance.

Summary generation is designed to be fast and simple (offline, rule-based).
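One way such a rule-based summarizer might look is sketched below: it keeps the sentences that mention the most keywords and returns them in their original order. The function name `summarize`, the regex sentence splitter, and the idea of passing keywords in explicitly are assumptions; the actual `summarizer.py` may select keywords and sentences differently.

```python
import re


def summarize(text, keywords, max_sentences=2):
    """Keep the sentences that mention the most keywords, in original order."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    wanted = {k.lower() for k in keywords}

    def hits(sentence):
        # Number of distinct keywords that appear as whole words.
        words = set(re.findall(r"[a-z0-9]+", sentence.lower()))
        return len(words & wanted)

    # Rank sentence indices by keyword hits; ties favour the earlier sentence.
    ranked = sorted(range(len(sentences)), key=lambda i: (-hits(sentences[i]), i))
    # Drop sentences with no keyword at all, then restore document order.
    chosen = sorted(i for i in ranked[:max_sentences] if hits(sentences[i]) > 0)
    return " ".join(sentences[i] for i in chosen)


if __name__ == "__main__":
    sample = (
        "The company is focusing on cloud-first verticals. "
        "Lunch is served at noon. "
        "Its market positioning targets mid-size firms."
    )
    print(summarize(sample, ["market", "positioning", "cloud"]))
```

Because it relies only on the standard library, a summarizer of this kind runs offline and adds no model weight, which fits the constraints listed above.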
