Vink0217/HireLens
HireLens

Project Overview

HireLens is an AI resume screening system built for Sprinto's AI Implementation Intern assignment. It parses PDF and DOCX resumes, extracts structured candidate fields from configurable schemas, scores candidate fit against job descriptions with evidence, flags duplicate submissions per role, supports re-parsing when extraction schemas change, and provides multi-role ranking and RAG-backed evidence snippets for recruiter review.

Live Demo

Features

Minimum Requirements

| Feature | Description |
| --- | --- |
| Multi-format ingestion | Accepts and parses `.pdf` and `.docx` resumes through a single upload endpoint and UI flow. |
| Dynamic extraction | Extracts configurable fields such as name, years of experience, primary skills, and last job title from resume text. |
| Configurability | Supports field schema editing from the dashboard config UI and persists extraction configuration in the database. |
| AI scoring with justification | Scores each resume against a JD on a 1-10 scale and returns summary, strengths, gaps, confidence, and confidence reasoning. |
| Duplicate detection | Prevents duplicate uploads for the same role using deterministic content hashing and DB-level uniqueness constraints. |
| Live deployment | Frontend and backend are deployed to public URLs for end-to-end evaluation. |

Enhanced Features

| Feature | Description |
| --- | --- |
| Batch re-parsing | Triggers background re-extraction for resumes tied to a config after schema updates. |
| Multi-role ranking | Scores one resume against multiple job descriptions in parallel and returns ranked fit output. |
| RAG evidence retrieval | Returns the most relevant resume chunks for a JD using embedding similarity, with a lexical fallback when embeddings are unavailable. |
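The RAG evidence retrieval described above can be sketched as follows. This is a minimal illustration of the ranking-with-fallback idea, not the production retriever: `embed` stands in for the Gemini embedding call, and the overlap metric is a simplified assumption.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def lexical_overlap(query: str, chunk: str) -> float:
    """Fraction of query tokens present in the chunk (lexical fallback)."""
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / len(q) if q else 0.0

def rank_chunks(jd_text: str, chunks: list[str], embed=None, top_k: int = 5):
    """Rank resume chunks against a JD by embedding similarity, falling
    back to lexical overlap when no embedder is available or it fails."""
    if embed is not None:
        try:
            jd_vec = embed(jd_text)
            scored = [(cosine_similarity(jd_vec, embed(c)), c) for c in chunks]
        except Exception:
            scored = [(lexical_overlap(jd_text, c), c) for c in chunks]
    else:
        scored = [(lexical_overlap(jd_text, c), c) for c in chunks]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [c for _, c in scored[:top_k]]
```

With no embedder supplied, the lexical path alone still surfaces the most relevant chunk, which is what makes the evidence view degrade gracefully.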

System Architecture

The backend is a FastAPI service that runs an explicit upload pipeline: validate file type and size, parse resume text with format-specific parsers, compute a duplicate hash scoped to the role, upload binary files to storage, run LLM extraction with the active schema, run JD scoring, then persist resume and screening records to PostgreSQL. Parsing uses a two-stage strategy per format to handle noisy documents: pdfplumber with a PyMuPDF fallback for PDF, and python-docx with a Mammoth fallback for DOCX.

The LLM integration uses Gemini Flash Lite through a single client wrapper and separates extraction and scoring into different prompt builders, so each step can enforce a clear output contract.

RAG evidence is generated by chunking parsed resume text into semantic windows, embedding the chunks and the job description text, and ranking chunks by cosine similarity, with a lexical-overlap fallback if embedding calls fail. Duplicate detection combines normalized content identity with job context and enforces uniqueness at the database layer to avoid race-condition duplicates.

The frontend is a Next.js App Router application that consumes backend REST endpoints via Axios and SWR; candidate views, role comparison, config editing, upload, and evidence retrieval all share the same API layer and cache-revalidation flow.
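The upload pipeline above can be sketched as one explicit sequence of steps. Function names and the injected collaborators are illustrative, not the actual module API; the exception-to-HTTP-status mapping mirrors the behavior described later in this README.

```python
import hashlib

def handle_upload(file_bytes: bytes, filename: str, job_id: str,
                  parse, store, extract, score, persist, find_duplicate):
    """Illustrative upload pipeline: validate, parse, dedup, store,
    extract, score, persist. Collaborators are injected so each stage
    stays visible and testable."""
    ext = filename.rsplit(".", 1)[-1].lower()
    if ext not in {"pdf", "docx"}:
        raise ValueError("unsupported file type")

    text = parse(file_bytes, ext)                # format-specific parser with fallback
    if not text.strip():
        raise ValueError("no extractable text")  # surfaced as HTTP 422 upstream

    # Duplicate hash scoped to the role: identical content for a
    # different job_id is allowed, for the same job_id it is rejected.
    dedup_hash = hashlib.sha256(
        (text.strip().lower() + "|" + job_id).encode()
    ).hexdigest()
    existing = find_duplicate(dedup_hash)
    if existing:
        raise FileExistsError(existing)          # surfaced as HTTP 409 upstream

    file_url = store(file_bytes, filename)       # Cloudinary or local fallback
    extracted = extract(text)                    # schema-driven LLM extraction
    screening = score(text, extracted, job_id)   # JD scoring with evidence
    return persist(dedup_hash, file_url, extracted, screening)
```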

Dataflow Diagram

```mermaid
flowchart TD
    U[Recruiter uploads resume] --> A[Frontend Dashboard<br/>Next.js + SWR]
    A --> B[POST /api/resumes/upload]

    subgraph ING[Ingestion and Validation]
        B --> C[Validate size and extension]
        C --> D{File type}
        D -->|PDF| E[PDF parser<br/>pdfplumber → PyMuPDF fallback]
        D -->|DOCX| F[DOCX parser<br/>python-docx → Mammoth fallback]
        E --> G[Cleaned raw text]
        F --> G
        G --> H[Dedup hash<br/>identifier + job_id]
    end

    H --> I{Duplicate for role?}
    I -->|Yes| J[409 Conflict<br/>Return existing match]
    I -->|No| K[Store file URL<br/>Cloudinary or local]

    subgraph AIP[AI Pipeline]
        K --> L[Load active extraction config]
        L --> M[Gemini extractor<br/>Schema-driven JSON]
        M --> N[Gemini scorer<br/>score + summary + evidence]
    end

    N --> O[(PostgreSQL)]
    O --> P[resumes.extracted_data]
    O --> Q[screenings score and summary]
    P --> R[Dashboard views]
    Q --> R

    S[Config updated<br/>/dashboard/config] --> T[POST /api/configs/config_id/rescan]
    T --> V[Background task<br/>batch_reparse]
    V --> W[Re-extract each resume]
    W --> X[Re-score linked screenings]
    X --> O
    O --> Y[GET /api/configs/config_id/rescan-status<br/>progress and outcome]
    Y --> R

    classDef frontend fill:#eef2ff,stroke:#4f46e5,color:#111827,stroke-width:1px;
    classDef backend fill:#ecfeff,stroke:#0891b2,color:#0f172a,stroke-width:1px;
    classDef ai fill:#ecfccb,stroke:#65a30d,color:#1f2937,stroke-width:1px;
    classDef db fill:#fff7ed,stroke:#ea580c,color:#1f2937,stroke-width:1px;
    classDef warning fill:#fef2f2,stroke:#dc2626,color:#7f1d1d,stroke-width:1px;

    class A,R frontend;
    class B,C,D,E,F,G,H,K,L,S,T,V,W,X,Y backend;
    class M,N ai;
    class O,P,Q db;
    class I,J warning;
```

Architecture Note

A brief architecture note written for evaluators is available at docs/ARCHITECTURE_NOTE.md. It explains LLM selection, prompting strategy, file parsing decisions, and implementation approach end to end.

Prompting Strategy

The system uses two distinct prompts because extraction and evaluation have different failure modes. The extraction prompt is schema-driven and generated from the active config fields. It asks the model to return strict JSON only, maps each requested field to a clear extraction instruction, and forces explicit nulls when a value is missing. This reduces hallucinated fields and keeps downstream UI rendering deterministic because every field key is present in a known shape. The extraction layer also avoids free-form narrative outputs and does not accept markdown wrappers by design.
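A minimal sketch of how such a schema-driven extraction prompt can be assembled from config fields. The field dicts follow the config shape shown later in the Configuration section; the exact prompt wording here is an illustrative assumption, not the production prompt.

```python
import json

def build_extraction_prompt(resume_text: str, fields: list[dict]) -> str:
    """Assemble a strict-JSON extraction prompt from active config fields.
    Every configured key appears in the required output shape, so missing
    values surface as explicit nulls instead of absent keys."""
    field_lines = "\n".join(
        f'- "{f["name"]}" ({f["type"]}): {f["description"]}' for f in fields
    )
    # Skeleton with every field key present, so the model's output shape
    # is fully determined by the config.
    keys = {f["name"]: None for f in fields}
    return (
        "Extract the following fields from the resume below.\n"
        f"{field_lines}\n"
        "Return STRICT JSON only, with no markdown wrapper, matching exactly "
        f"this shape (use null when a value is missing): {json.dumps(keys)}\n\n"
        f"RESUME:\n{resume_text}"
    )
```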

The scoring prompt takes JD text, parsed resume text, and extracted profile data as inputs and enforces a rubric that maps scores to defined fit bands. It requires evidence-backed strengths and gaps and includes confidence plus confidence reasoning in the response schema. This prompt structure limits generic scoring language by requiring concrete references from candidate content. Hallucination risk is reduced by grounding the scoring context in both full resume text and extracted profile values, and by forcing structured output fields that must be internally consistent. Confidence is treated as an explicit model output that indicates uncertainty due to sparse or ambiguous evidence; this allows recruiters to distinguish low-signal candidates from clear matches without relying only on the numeric score.
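Enforcing that output contract means validating the scorer's response before persisting it. The sketch below checks the fields named above; the exact key names and the confidence scale (`low`/`medium`/`high`) are illustrative assumptions about the response schema, not the definitive contract.

```python
REQUIRED_KEYS = {
    "score", "summary", "strengths", "gaps",
    "confidence", "confidence_reasoning",
}

def validate_screening(payload: dict) -> dict:
    """Reject scorer output that violates the structured contract:
    all keys present, score on the 1-10 rubric, evidence-backed
    strengths, and an explicit confidence level."""
    missing = REQUIRED_KEYS - payload.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if not isinstance(payload["score"], (int, float)) or not 1 <= payload["score"] <= 10:
        raise ValueError("score must be on the 1-10 rubric scale")
    if payload["confidence"] not in {"low", "medium", "high"}:
        raise ValueError("confidence must be low, medium, or high")
    if not payload["strengths"] or not all(isinstance(s, str) for s in payload["strengths"]):
        raise ValueError("strengths must be a non-empty list of evidence strings")
    return payload
```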

Tech Stack

| Layer | Technology | Reason for choice |
| --- | --- | --- |
| Frontend | Next.js 16, React 19, TypeScript | App Router patterns, typed UI components, predictable client routing and rendering. |
| Frontend data layer | SWR, Axios | Cache-aware API calls with straightforward mutation and revalidation behavior. |
| Backend API | FastAPI, Uvicorn | Async HTTP handling, typed request models, clear endpoint organization. |
| LLM | Google Gemini Flash Lite (google-genai) | Fast structured inference for extraction and scoring flows. |
| Resume parsing | pdfplumber, PyMuPDF, python-docx, Mammoth | Better coverage for mixed formatting and parser fallback behavior. |
| Database | PostgreSQL (Neon), asyncpg | Relational model for jobs, resumes, screenings, configs, and chunk metadata. |
| Embeddings | Gemini embedding model (text-embedding-004) | Unified provider for generation and embeddings. |
| File storage | Cloudinary (with local fallback) | Offloads binary storage from the DB and supports URL-based access. |
| Backend dependency management | uv + pyproject.toml + uv.lock | Reproducible Python environments and dependency locking. |
| Frontend package/runtime | Bun | Fast install and script execution for the Next.js app workflow. |

Getting Started

Prerequisites

- Python 3.11.13 or newer
- Bun 1.x
- A PostgreSQL connection string (Neon recommended)
- A Gemini API key
- A Cloudinary URL (optional; the local filesystem fallback works without it)

Environment Variables

Create backend/.env based on backend/.env.example:

```env
# === Required ===
GEMINI_API_KEY=

# === Database (required; Neon recommended) ===
DATABASE_URL=

# === File Storage (leave empty to use the local filesystem fallback) ===
CLOUDINARY_URL=

# === Optional ===
MAX_FILE_SIZE_MB=10
CORS_ORIGINS=http://localhost:3000
GEMINI_EMBEDDING_MODEL=models/text-embedding-004
```

Create frontend/.env.local:

```env
NEXT_PUBLIC_API_URL=http://localhost:8000/api
```

Installation

```bash
# from the repository root
cd backend
uv sync

# in a new terminal
cd frontend
bun install
```

Run Locally

```bash
# terminal 1
cd backend
uv run uvicorn main:app --host 0.0.0.0 --port 8000 --reload

# terminal 2
cd frontend
bun run dev
```

Backend: http://localhost:8000
Frontend: http://localhost:3000

Configuration

Extraction fields are configurable from the dashboard at /dashboard/config, where each field has a key, type, and extraction instruction. The same schema can also be managed through the config API. A typical custom config payload looks like this:

```json
{
  "name": "Engineering Default",
  "fields": [
    {
      "name": "full_name",
      "type": "string",
      "description": "Extract the candidate's full legal name from the top of the resume."
    },
    {
      "name": "years_of_experience",
      "type": "number",
      "description": "Estimate total professional years from work history dates."
    },
    {
      "name": "primary_skills",
      "type": "list",
      "description": "Extract core technical skills explicitly listed or evidenced by projects."
    }
  ],
  "is_default": true
}
```

API Reference

Health

| Method | Path | Request Body | Response |
| --- | --- | --- | --- |
| GET | `/api/health` | None | `{ "status": "ok", "service": "hirelens-backend" }` |

Jobs

| Method | Path | Request Body | Response Shape |
| --- | --- | --- | --- |
| POST | `/api/jobs` | `{ "title": "...", "description": "...", "company": "..." }` | Job object |
| GET | `/api/jobs` | None | `Job[]` with `resume_count` and `top_score` |
| GET | `/api/jobs/{job_id}` | None | Job object plus `candidates[]` |
| DELETE | `/api/jobs/{job_id}` | None | 204 No Content |

Resumes

| Method | Path | Request Body | Response Shape |
| --- | --- | --- | --- |
| POST | `/api/resumes/upload` | `multipart/form-data` with `file`, `job_id`, optional `config_id` | `{ resume, screening, extracted_data }` |
| GET | `/api/resumes` | Optional `job_id` query | `Resume[]` |
| GET | `/api/resumes/{resume_id}` | None | Full resume record including `raw_text` |
| GET | `/api/resumes/{resume_id}/download` | None | File stream with normalized filename and MIME type |
| GET | `/api/resumes/{resume_id}/rag-evidence?job_id=<id>&top_k=5` | None | `{ resume_id, job_id, chunks[] }` |
| POST | `/api/resumes/multi-role` | `{ "resume_id": "...", "job_ids": ["...", "..."] }` | Ranked multi-role scoring result array |
| DELETE | `/api/resumes/{resume_id}` | None | `{ "message": "Resume deleted successfully" }` |
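The multi-role endpoint's "score in parallel, return ranked" behavior can be sketched with `asyncio.gather`. Here `score_one` stands in for the Gemini scoring call; the response field names are illustrative assumptions.

```python
import asyncio

async def score_multi_role(resume_text: str, job_ids: list[str], score_one):
    """Score one resume against several JDs concurrently, then return
    the results sorted by score descending (best fit first)."""
    results = await asyncio.gather(
        *(score_one(resume_text, job_id) for job_id in job_ids)
    )
    ranked = sorted(zip(job_ids, results),
                    key=lambda pair: pair[1]["score"], reverse=True)
    return [{"job_id": job_id, **result} for job_id, result in ranked]
```

Because the per-job calls run concurrently, total latency is roughly that of the slowest single scoring call rather than the sum of all of them.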

Configs

| Method | Path | Request Body | Response Shape |
| --- | --- | --- | --- |
| POST | `/api/configs` | `{ "name": "...", "fields": [...], "is_default": false }` | Config object |
| GET | `/api/configs` | None | `Config[]` |
| GET | `/api/configs/{config_id}` | None | Config object |
| PUT | `/api/configs/{config_id}` | `{ "name": "...", "fields": [...] }` | Updated config object |
| POST | `/api/configs/{config_id}/rescan` | None | `{ "message": "Re-extraction started in background", "config_id": "..." }` |

Handling Edge Cases and Ambiguity

Unreadable files are handled at parsing time with explicit parser fallback paths and user-facing 422 errors when content cannot be extracted. Unsupported uploads are rejected early based on extension validation and size checks, which avoids unnecessary LLM calls. Resumes with weak date signals still pass through extraction and scoring; date-dependent fields can remain null or estimated based on prompt instructions, and confidence output is used to communicate uncertainty. Duplicate uploads for the same role are prevented by deterministic content hashing and DB-level uniqueness checks, then surfaced as 409 Conflict with reference details for the existing record. For ambiguities not fully defined in the assignment, decisions were made in favor of deterministic API contracts and recruiter visibility: strict JSON prompt outputs, explicit scoring rubric, and traceable strengths and gaps instead of free-form narrative evaluation.
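The parser-fallback-then-422 behavior can be sketched generically: try each format-specific parser in order and fail loudly only when the whole chain is exhausted. The parser callables stand in for the pdfplumber/PyMuPDF (or python-docx/Mammoth) integrations.

```python
class UnreadableFileError(Exception):
    """Raised when every parser in the chain fails or yields no text;
    mapped to an HTTP 422 response upstream."""

def parse_with_fallback(data: bytes, parsers) -> str:
    """Run each parser in order and return the first non-empty text.
    Parser exceptions and empty outputs both advance to the next parser,
    so a noisy document only fails after the whole chain is exhausted."""
    for parser in parsers:
        try:
            text = parser(data)
        except Exception:
            continue
        if text and text.strip():
            return text
    raise UnreadableFileError("no extractable text")
```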

Decision Log

| Ambiguous Requirement | Decision | Reasoning |
| --- | --- | --- |
| Resume identity for duplicates | Hash normalized resume identity and scope by `job_id` | Prevents duplicate submissions per role while allowing one candidate to apply to multiple roles |
| Handling missing dates for experience | Return `null` for extraction where date evidence is absent; keep scoring with a confidence annotation | Avoids fabricated tenure values and preserves reviewer trust in extracted data |
| Corrupted or unreadable PDFs | Use the parser fallback chain, then return an explicit 422 parse error | Keeps the pipeline deterministic and prevents silent low-quality outputs |
| Prompt shape for extraction vs scoring | Separate prompts and response schemas | Extraction and evaluation have different failure modes and validation needs |
| Multi-role ranking failures from model load | Retry transient 503/UNAVAILABLE responses before failing | Improves demo reliability without changing the scoring rubric |
| RAG evidence retrieval strategy | On-demand chunk ranking with embedding similarity and a lexical fallback | Enables evidence display even when the embedding service is temporarily unavailable |
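The retry-on-transient-503 decision amounts to a small wrapper like the one below: retry only errors that look transient, with exponential backoff, and propagate everything else immediately. This is an illustrative sketch, not the production client code.

```python
import time

def with_retries(call, attempts: int = 3, base_delay: float = 0.5,
                 retryable=("503", "UNAVAILABLE")):
    """Invoke `call` and retry on transient 503/UNAVAILABLE errors with
    exponential backoff; non-transient errors and the final failed
    attempt propagate to the caller unchanged."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception as exc:
            transient = any(token in str(exc) for token in retryable)
            if not transient or attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...
```

Keeping the retry policy outside the scoring logic means the rubric and prompts stay untouched; only reliability behavior changes.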

Known Limitations

This implementation does not include OCR for image-only resumes. Such files are rejected when text extraction yields insufficient content. LLM availability can affect scoring latency and may return transient 503 errors during provider-side load spikes, though retries are implemented for multi-role scoring. RAG evidence is currently computed on demand from parsed text and is not yet persisted as a full background indexing pipeline across all resumes. Authentication and role-based access control are not implemented, so the current API assumes a trusted environment.

Project Structure

```text
HireLens/
├── backend/
│   ├── main.py                     # FastAPI app entrypoint and router registration
│   ├── pyproject.toml              # Python dependencies and project metadata
│   ├── uv.lock                     # Locked Python dependency graph
│   ├── .env.example                # Backend environment template
│   ├── api/                        # REST endpoints (health, jobs, resumes, configs)
│   ├── core/                       # Runtime settings and app-level config
│   ├── db/                         # Database access layer and query helpers
│   ├── parsers/                    # PDF and DOCX parsing + text normalization
│   ├── services/
│   │   ├── llm/                    # Prompt builders, Gemini client, ranker, embedder
│   │   ├── storage/                # Cloudinary/local file storage adapter
│   │   └── tasks/                  # Background tasks such as config reparse
│   ├── tests/                      # Backend tests and smoke checks
│   └── uploads/                    # Local file storage fallback directory
├── frontend/
│   ├── package.json                # Frontend scripts and dependencies
│   ├── bun.lock                    # Bun lockfile
│   ├── src/
│   │   ├── app/                    # Next.js app routes (dashboard, jobs, candidates, config)
│   │   ├── components/             # Reusable UI components
│   │   └── lib/api.ts              # Axios API client and endpoint helpers
│   └── public/                     # Static assets
├── .gitignore
└── README.md                       # Project documentation
```

About

LLM-powered resume screener with configurable extraction, semantic matching (RAG), and automated candidate ranking.