siddharthayed/Round1A

Round 1A – Structured Outline Extraction

🚀 Challenge Overview

Extract a structured outline from any PDF document. This includes:

  • Title
  • Headings (H1, H2, H3)
  • Page numbers

Output format:

    {
      "title": "Understanding AI",
      "outline": [
        { "level": "H1", "text": "Introduction", "page": 1 },
        { "level": "H2", "text": "What is AI?", "page": 2 },
        { "level": "H3", "text": "History of AI", "page": 3 }
      ]
    }

🧠 Approach

We developed a robust pipeline that uses both visual and semantic cues to extract headings from PDFs. The process includes:

Text Extraction: Extract raw text and layout metadata using PyMuPDF (fitz).
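As a rough illustration of this step, the sketch below flattens the nested dictionary that PyMuPDF's `page.get_text("dict")` returns (blocks, then lines, then spans) into a flat list of spans with their layout metadata. The function name `flatten_spans`, the output keys, and the sample path in the comment are assumptions made for this example, not names taken from the repository.

```python
from typing import Any

BOLD_FLAG = 1 << 4  # PyMuPDF sets bit 4 (value 16) of span["flags"] for bold text


def flatten_spans(page_dict: dict[str, Any], page_number: int) -> list[dict[str, Any]]:
    """Flatten one page's `page.get_text("dict")` result into a list of spans."""
    spans = []
    for block in page_dict.get("blocks", []):
        if block.get("type") != 0:  # type 0 = text block, type 1 = image block
            continue
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                text = span["text"].strip()
                if not text:
                    continue  # skip whitespace-only spans
                spans.append({
                    "text": text,
                    "size": round(span["size"], 1),
                    "bold": bool(span["flags"] & BOLD_FLAG),
                    "bbox": tuple(span["bbox"]),
                    "page": page_number,
                })
    return spans


# With PyMuPDF installed, the input would come from something like
# (the file path is a placeholder):
#   import fitz
#   with fitz.open("input/sample.pdf") as doc:
#       for number, page in enumerate(doc, start=1):
#           spans = flatten_spans(page.get_text("dict"), number)
```

Keeping the flattening separate from the PDF I/O means the later steps can be exercised on hand-built dictionaries without a PDF at hand.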

Candidate Identification: Use heuristics (title casing, font size, numbering) and NER (spaCy) to propose heading candidates.
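The heuristic part of this step might look like the sketch below. It covers only title casing, relative font size, and section numbering; the spaCy NER pass is left out. The regular expression, the small-word list, and the 12-word limit are illustrative assumptions, not values from the repository.

```python
import re

# Matches leading section numbers such as "1", "2.3" or "4.1.2." followed by text.
NUMBERED = re.compile(r"^\d+(\.\d+)*\.?\s+\S")
# Short function words that title case conventionally leaves in lowercase.
SMALL_WORDS = {"a", "an", "and", "as", "at", "for", "in", "is", "of", "on", "or", "the", "to", "with"}


def is_title_case(text: str) -> bool:
    """True when every significant word starts with an uppercase letter."""
    words = re.findall(r"[A-Za-z][A-Za-z'-]*", text)
    significant = [w for w in words if w.lower() not in SMALL_WORDS]
    return bool(significant) and all(w[0].isupper() for w in significant)


def is_heading_candidate(text: str, size: float, body_size: float, max_words: int = 12) -> bool:
    """Cheap pre-filter: short, not sentence-like, and showing at least one heading cue."""
    text = text.strip()
    if not text or len(text.split()) > max_words or text.endswith((".", ",", ";")):
        return False
    larger = size > body_size  # bigger than the document's body font
    return larger or bool(NUMBERED.match(text)) or is_title_case(text)
```

A filter like this is deliberately permissive, since the later matching, filtering, and scoring steps can still discard false positives.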

Matching: Fuzzy-match candidates against the extracted spans, which carry the font/style metadata.
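The README does not say which fuzzy matcher is used. As one possible sketch, the standard library's `difflib.SequenceMatcher` can pair a candidate string with the most similar extracted span; the helper names and the 0.85 threshold are assumptions for illustration.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so layout differences do not hurt the match."""
    return " ".join(text.lower().split())


def best_span_match(candidate: str, spans: list[dict], threshold: float = 0.85):
    """Return the span whose text is most similar to `candidate`, or None."""
    target = normalize(candidate)
    best, best_ratio = None, 0.0
    for span in spans:
        ratio = SequenceMatcher(None, target, normalize(span["text"])).ratio()
        if ratio > best_ratio:
            best, best_ratio = span, ratio
    return best if best_ratio >= threshold else None
```

The matched span brings its font size, weight, and position along, which is what the scoring step needs.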

Filtering: Remove repeated headers/footers using positional clustering.
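A simplified stand-in for positional clustering is to bucket spans by their normalized text and rounded vertical position, then drop any bucket that appears on a large share of pages. The bucket height, the 50% threshold, and the function name below are assumptions; a real clustering step could tolerate more positional jitter than fixed buckets do.

```python
from collections import defaultdict


def drop_repeated_headers(spans: list[dict], page_count: int,
                          min_fraction: float = 0.5, y_bucket: float = 10.0) -> list[dict]:
    """Drop spans whose text recurs at about the same height on many pages."""
    pages_seen = defaultdict(set)

    def key(span: dict) -> tuple:
        y_top = span["bbox"][1]  # bbox is (x0, y0, x1, y1)
        return (" ".join(span["text"].lower().split()), round(y_top / y_bucket))

    for span in spans:
        pages_seen[key(span)].add(span["page"])

    # A span counts as a running header/footer once it shows up on this many pages.
    limit = max(2, min_fraction * page_count)
    return [s for s in spans if len(pages_seen[key(s)]) < limit]
```

Footers whose text changes on every page, such as bare page numbers, would not be caught by exact text matching and would need a separate rule.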

Merging: Merge adjacent heading spans that share the same styling.
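One way to sketch the merge, assuming spans arrive in reading order and carry `page`, `size`, `bold`, and `bbox` keys (hypothetical names chosen for this example), is to fold each span into its predecessor when the styling matches and the vertical gap is small:

```python
def merge_adjacent(spans: list[dict], max_gap: float = 5.0) -> list[dict]:
    """Merge consecutive spans that share page, size and weight and sit close together."""
    merged: list[dict] = []
    for span in spans:
        prev = merged[-1] if merged else None
        if (
            prev is not None
            and prev["page"] == span["page"]
            and prev["size"] == span["size"]
            and prev["bold"] == span["bold"]
            and span["bbox"][1] - prev["bbox"][3] <= max_gap  # small vertical gap
        ):
            x0, y0, x1, y1 = prev["bbox"]
            a0, b0, a1, b1 = span["bbox"]
            prev["text"] = f'{prev["text"]} {span["text"]}'
            # Grow the bounding box so it covers both spans.
            prev["bbox"] = (min(x0, a0), min(y0, b0), max(x1, a1), max(y1, b1))
        else:
            merged.append(dict(span))  # copy so the input list is not mutated
    return merged
```

This keeps a heading that wraps over two lines, or is split into several spans on one line, as a single outline entry.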

Scoring & Ranking: Compute an importance score using:

  • Font size
  • Font weight
  • Box enclosure (e.g., whether the text sits inside a bounding box)
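The README lists the three signals but not how they are combined. A minimal sketch, assuming a simple weighted sum, could look like this; the weights and the `boxed` key are invented for illustration.

```python
def importance_score(span: dict, body_size: float,
                     w_size: float = 1.0, w_bold: float = 0.5, w_box: float = 0.5) -> float:
    """Combine relative font size, boldness and box enclosure into one score."""
    size_ratio = span["size"] / body_size  # 1.0 means "same size as body text"
    return (w_size * size_ratio
            + w_bold * float(span.get("bold", False))
            + w_box * float(span.get("boxed", False)))


def rank_spans(spans: list[dict], body_size: float) -> list[tuple[dict, float]]:
    """Return (span, score) pairs sorted from most to least important."""
    scored = [(span, importance_score(span, body_size)) for span in spans]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Dividing by the body font size makes the score comparable across documents that use different base sizes.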

Tagging: The top-scoring spans are tagged as H1, H2, or H3 according to their scores.
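One plausible reading of this step is to map the three highest distinct score tiers to H1, H2, and H3 and treat everything below them as body text. The sketch below assumes that interpretation; rounding scores to two decimals to form tiers is also an assumption.

```python
def assign_levels(scored: list[tuple[dict, float]], max_levels: int = 3) -> list[dict]:
    """Tag spans H1..H3 by the rank of their score among the distinct scores."""
    distinct = sorted({round(score, 2) for _, score in scored}, reverse=True)[:max_levels]
    level_of = {score: f"H{i}" for i, score in enumerate(distinct, start=1)}
    tagged = []
    for span, score in scored:
        level = level_of.get(round(score, 2))
        if level is not None:  # spans below the top score tiers are treated as body text
            tagged.append({"level": level, "text": span["text"], "page": span["page"]})
    return tagged
```

Because the input order is preserved, passing spans in reading order yields an outline in document order.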

Output: Structured JSON is generated with proper hierarchy.
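A small sketch of the serialization, assuming headings are already tagged dictionaries with `level`, `text`, and `page` keys as in the output format above (the function name `build_outline` is hypothetical):

```python
import json

VALID_LEVELS = ("H1", "H2", "H3")


def build_outline(title: str, headings: list[dict]) -> str:
    """Serialize a title and tagged headings into the documented JSON shape."""
    outline = []
    for h in headings:
        if h["level"] not in VALID_LEVELS:
            raise ValueError(f"unexpected heading level: {h['level']!r}")
        outline.append({"level": h["level"], "text": h["text"], "page": h["page"]})
    # ensure_ascii=False keeps non-English heading text readable in the file.
    return json.dumps({"title": title, "outline": outline}, indent=2, ensure_ascii=False)
```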

🧱 Tech Stack

  • Python 3.11
  • PyMuPDF
  • spaCy (en_core_web_sm)

🐳 Docker Instructions

🔨 Build

    docker build --platform linux/amd64 -t headingextractor:latest .

▶️ Run

    docker run --rm \
      -v "$(pwd)/input:/app/input" \
      -v "$(pwd)/output:/app/output" \
      --network none \
      headingextractor:latest

The container will:

  • Read all PDFs from /app/input
  • Write JSON outlines to /app/output
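The container's batch behaviour can be pictured with the sketch below. It takes the extraction function as a parameter, so it makes no claim about what `main.py` or `round1A.py` actually contain; writing each result to `<pdf name>.json` is an assumption about the naming scheme.

```python
import json
from pathlib import Path
from typing import Callable


def process_directory(input_dir: Path, output_dir: Path,
                      extract: Callable[[Path], dict]) -> list[Path]:
    """Run `extract` on every PDF in input_dir and write <name>.json to output_dir."""
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf_path in sorted(input_dir.glob("*.pdf")):
        result = extract(pdf_path)  # expected to return {"title": ..., "outline": [...]}
        out_path = output_dir / f"{pdf_path.stem}.json"
        out_path.write_text(json.dumps(result, indent=2, ensure_ascii=False), encoding="utf-8")
        written.append(out_path)
    return written
```

Inside the container this would be called with `Path("/app/input")` and `Path("/app/output")`, the two mount points from the run command.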

✅ Constraints Satisfied

| Constraint | Status |
|---|---|
| ≤ 10s for 50-page PDF | ✅ |
| CPU only | ✅ |
| Model size ≤ 200MB | ✅ |
| Offline (no network) | ✅ |
| Output JSON format | ✅ |

πŸ“ Directory Structure css Copy Edit . β”œβ”€β”€ round1A.py β”œβ”€β”€ main.py β”œβ”€β”€ requirements.txt β”œβ”€β”€ Dockerfile β”œβ”€β”€ input/ └── output/ πŸ“Œ Notes Works for a wide range of document layouts.

No hardcoded logic or file-specific assumptions.
