siddharthayed/Round1B

Round 1B – Persona-Driven Document Intelligence

🎯 Objective

Build a system that intelligently extracts and ranks relevant sections from a collection of PDFs based on:

  • A given persona
  • A specific job-to-be-done

🧾 Output JSON Format

    {
      "metadata": {
        "input_documents": ["sample1.pdf", "sample2.pdf"],
        "persona": "Investment Analyst",
        "job_to_be_done": "Analyze market positioning strategies",
        "processing_timestamp": "2025-07-28T14:22:01"
      },
      "extracted_sections": [
        {
          "document": "sample1.pdf",
          "section_title": "Market Strategy",
          "importance_rank": 1,
          "page_number": 4
        }
      ],
      "sub_section_analysis": [
        {
          "document": "sample1.pdf",
          "refined_text": "The company is focusing on cloud-first verticals...",
          "page_number": 4
        }
      ]
    }
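As an illustration of how this schema could be assembled, here is a minimal sketch using only the standard library. The function name `build_output` and the input keys (`title`, `page`, `score`, `text`) are hypothetical and not taken from the repository; only the output keys follow the format above.

```python
import json
from datetime import datetime


def build_output(documents, persona, job, sections, analyses):
    """Assemble scored sections into the output schema shown above.

    `sections` and `analyses` are lists of dicts with hypothetical keys
    (document, title, page, score / document, text, page).
    """
    # Highest score first; importance_rank is the 1-based position.
    ranked = sorted(sections, key=lambda s: s["score"], reverse=True)
    return {
        "metadata": {
            "input_documents": list(documents),
            "persona": persona,
            "job_to_be_done": job,
            "processing_timestamp": datetime.now().isoformat(timespec="seconds"),
        },
        "extracted_sections": [
            {
                "document": s["document"],
                "section_title": s["title"],
                "importance_rank": rank,
                "page_number": s["page"],
            }
            for rank, s in enumerate(ranked, start=1)
        ],
        "sub_section_analysis": [
            {
                "document": a["document"],
                "refined_text": a["text"],
                "page_number": a["page"],
            }
            for a in analyses
        ],
    }


def to_json(result):
    """Serialize the result as indented JSON, keeping non-ASCII text readable."""
    return json.dumps(result, indent=2, ensure_ascii=False)
```

Passing the return value of `build_output` to `to_json` gives a string that could be written to `data/output.json`.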

🧠 Methodology

Extraction: Parse all PDFs and extract text with page metadata.

Chunking: Split documents into title-content sections.

Embedding: Encode the persona and job into a semantic vector (via sentence-transformers).

Ranking: Score each section by cosine similarity against the persona-job vector (title and content scored separately).

Summarization: Condense top content chunks using keyword-based summarization.

Structured Output: Combine all into the expected JSON schema.
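The Ranking step can be sketched as follows. This is a minimal, dependency-free illustration, not the repository's implementation: the vectors are assumed to come from the sentence-transformers encoder, and the weighted sum used to combine title and content scores (with a hypothetical `title_weight`) is an assumption, since the steps above only say that the two are scored separately.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # A zero vector has no direction; treat it as unrelated.
    return dot / (norm_a * norm_b)


def rank_sections(query_vec, sections, title_weight=0.3):
    """Score each section against the persona-job vector, best first.

    Each section is a dict with hypothetical keys `title_vec` and
    `content_vec`. The weighted sum below is an assumed way of
    combining the two separately computed scores.
    """
    scored = []
    for section in sections:
        title_score = cosine_similarity(query_vec, section["title_vec"])
        content_score = cosine_similarity(query_vec, section["content_vec"])
        score = title_weight * title_score + (1 - title_weight) * content_score
        scored.append({**section, "score": score})
    scored.sort(key=lambda s: s["score"], reverse=True)
    return scored
```

Keeping the two scores separate until the final combination makes it easy to tune how much a section heading counts relative to its body text.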

🧱 Tech Stack

Python 3.11

PyMuPDF

spaCy

scikit-learn

sentence-transformers

🐳 Docker Instructions

🔨 Build

    docker build --platform linux/amd64 -t personaextractor:latest .

▶️ Run

    docker run --rm \
      -v "$(pwd)/data/input_docs:/app/data/input_docs" \
      -v "$(pwd)/data:/app/data" \
      --network none \
      personaextractor:latest

Input: data/input_docs/ → PDF documents

Output: data/output.json

✅ Constraints Satisfied

| Constraint | Status |
| --- | --- |
| ≤ 60s for 3–5 PDFs | ✅ |
| CPU only | ✅ |
| Model size ≤ 1GB | ✅ |
| Offline / no internet | ✅ |

πŸ“ Project Structure css Copy Edit . β”œβ”€β”€ main.py β”œβ”€β”€ requirements.txt β”œβ”€β”€ Dockerfile β”œβ”€β”€ modules/ β”‚ β”œβ”€β”€ extractor.py β”‚ β”œβ”€β”€ chunker.py β”‚ β”œβ”€β”€ embedder.py β”‚ β”œβ”€β”€ ranker.py β”‚ └── summarizer.py └── data/ β”œβ”€β”€ input_docs/ └── output.json 🧩 Notes Sentence-transformer embedding allows persona-job semantic intent to guide section relevance.

Summary generation is designed to be fast and simple (offline, rule-based).
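One way an offline, rule-based summarizer of this kind could work is sketched below. This is an assumed approach, not necessarily what `summarizer.py` does: keywords are taken from the persona/job text, sentences are scored by keyword hits, and the top sentences are returned in their original order. The function name, the `max_sentences` parameter and the length-based keyword filter are all hypothetical.

```python
import re


def summarize(text, query, max_sentences=2):
    """Keep the sentences of `text` that share the most keywords with `query`."""
    # Keywords: lowercase words from the query, dropping very short ones
    # (a crude stand-in for stop-word removal).
    keywords = {w for w in re.findall(r"[a-z]+", query.lower()) if len(w) > 3}

    # Naive sentence split on terminal punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    scored = []
    for index, sentence in enumerate(sentences):
        words = re.findall(r"[a-z]+", sentence.lower())
        hits = sum(1 for w in words if w in keywords)
        scored.append((hits, index, sentence))

    # Most hits first; ties go to the earlier sentence.
    top = sorted(scored, key=lambda t: (-t[0], t[1]))[:max_sentences]

    # Restore document order so the summary reads naturally.
    return " ".join(s for _, _, s in sorted(top, key=lambda t: t[1]))
```

Because it relies only on regular expressions, this kind of summarizer needs no model download and runs in time linear in the text length, which fits the offline and time constraints listed above.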
