HireLens is an AI resume screening system built for Sprinto's AI Implementation Intern assignment. It parses PDF and DOCX resumes, extracts structured candidate fields from configurable schemas, scores candidate fit against job descriptions with evidence, flags duplicate submissions per role, supports re-parsing when extraction schemas change, and provides multi-role ranking and RAG-backed evidence snippets for recruiter review.
- Application URL: https://hire-lens-tau.vercel.app/
| Feature | Description |
|---|---|
| Multi-format ingestion | Accepts and parses .pdf and .docx resumes through a single upload endpoint and UI flow. |
| Dynamic extraction | Extracts configurable fields such as name, years of experience, primary skills, and last job title from resume text. |
| Configurability | Supports field schema editing from the dashboard config UI and persists extraction configuration in the database. |
| AI scoring with justification | Scores each resume against a JD on a 1-10 scale and returns summary, strengths, gaps, confidence, and confidence reasoning. |
| Duplicate detection | Prevents duplicate uploads for the same role using deterministic content hashing and DB-level uniqueness constraints. |
| Live deployment | Frontend and backend are deployed to public URLs for end-to-end evaluation. |
| Feature | Description |
|---|---|
| Batch re-parsing | Triggers background re-extraction for resumes tied to a config after schema updates. |
| Multi-role ranking | Scores one resume against multiple job descriptions in parallel and returns ranked fit output. |
| RAG evidence retrieval | Returns the most relevant resume chunks for a JD using embeddings similarity, with lexical fallback when embeddings are unavailable. |
The backend is a FastAPI service that runs an explicit upload pipeline: validate file type and size, parse resume text with format-specific parsers, compute a duplicate hash scoped to the role, upload binary files to storage, run LLM extraction with the active schema, run JD scoring, then persist resume and screening records to PostgreSQL. Parsing uses a two-stage strategy per format to handle noisy documents: pdfplumber with a PyMuPDF fallback for PDF, and python-docx with a Mammoth fallback for DOCX.

The LLM integration uses Gemini Flash Lite through a single client wrapper and separates extraction and scoring into different prompt builders so each step can enforce a clear output contract. RAG evidence is generated by chunking parsed resume text into semantic windows, embedding the chunks and the job description text, and ranking chunks by cosine similarity, with a lexical-overlap fallback if embedding calls fail. Duplicate detection combines normalized content identity with job context and enforces uniqueness at the database layer to avoid race-condition duplicates.

The frontend is a Next.js App Router application that consumes backend REST endpoints via Axios and SWR; candidate views, role comparison, config editing, upload, and evidence retrieval all use the same API layer and cache revalidation flow.
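The RAG evidence step described above can be sketched as follows. The function names, window sizes, and Jaccard fallback formula here are illustrative assumptions, not the repository's actual implementation; the real Gemini embedder would be passed in as the optional `embed` callable:

```python
import math
import re


def chunk_text(text: str, window: int = 80, overlap: int = 20) -> list[str]:
    """Split resume text into overlapping word windows."""
    words = text.split()
    chunks = []
    step = window - overlap
    for start in range(0, max(len(words) - overlap, 1), step):
        chunk = " ".join(words[start:start + window])
        if chunk:
            chunks.append(chunk)
    return chunks


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def lexical_overlap(query: str, chunk: str) -> float:
    """Fallback ranking: Jaccard overlap of lowercase word sets."""
    q = set(re.findall(r"\w+", query.lower()))
    c = set(re.findall(r"\w+", chunk.lower()))
    return len(q & c) / len(q | c) if q | c else 0.0


def rank_chunks(jd_text: str, chunks: list[str], embed=None, top_k: int = 5):
    """Rank resume chunks against JD text: embeddings when available,
    lexical overlap when the embedding call fails or is absent."""
    if embed is not None:
        try:
            jd_vec = embed(jd_text)
            scored = [(cosine(jd_vec, embed(c)), c) for c in chunks]
        except Exception:
            scored = [(lexical_overlap(jd_text, c), c) for c in chunks]
    else:
        scored = [(lexical_overlap(jd_text, c), c) for c in chunks]
    scored.sort(key=lambda t: t[0], reverse=True)
    return scored[:top_k]
```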
flowchart TD
U[Recruiter uploads resume] --> A[Frontend Dashboard<br/>Next.js + SWR]
A --> B[POST /api/resumes/upload]
subgraph ING[Ingestion and Validation]
B --> C[Validate size and extension]
C --> D{File type}
D -->|PDF| E[PDF parser<br/>pdfplumber → PyMuPDF fallback]
D -->|DOCX| F[DOCX parser<br/>python-docx → Mammoth fallback]
E --> G[Cleaned raw text]
F --> G
G --> H[Dedup hash<br/>identifier + job_id]
end
H --> I{Duplicate for role?}
I -->|Yes| J[409 Conflict<br/>Return existing match]
I -->|No| K[Store file URL<br/>Cloudinary or local]
subgraph AIP[AI Pipeline]
K --> L[Load active extraction config]
L --> M[Gemini extractor<br/>Schema-driven JSON]
M --> N[Gemini scorer<br/>score + summary + evidence]
end
N --> O[(PostgreSQL)]
O --> P[resumes.extracted_data]
O --> Q[screenings score and summary]
P --> R[Dashboard views]
Q --> R
S[Config updated<br/>/dashboard/config] --> T[POST /api/configs/config_id/rescan]
T --> V[Background task<br/>batch_reparse]
V --> W[Re-extract each resume]
W --> X[Re-score linked screenings]
X --> O
O --> Y[GET /api/configs/config_id/rescan-status<br/>progress and outcome]
Y --> R
classDef frontend fill:#eef2ff,stroke:#4f46e5,color:#111827,stroke-width:1px;
classDef backend fill:#ecfeff,stroke:#0891b2,color:#0f172a,stroke-width:1px;
classDef ai fill:#ecfccb,stroke:#65a30d,color:#1f2937,stroke-width:1px;
classDef db fill:#fff7ed,stroke:#ea580c,color:#1f2937,stroke-width:1px;
classDef warning fill:#fef2f2,stroke:#dc2626,color:#7f1d1d,stroke-width:1px;
class A,R frontend;
class B,C,D,E,F,G,H,K,L,S,T,V,W,X,Y backend;
class M,N ai;
class O,P,Q db;
class I,J warning;
A brief architecture note written for evaluators is available at docs/ARCHITECTURE_NOTE.md. It explains LLM selection, prompting strategy, file parsing decisions, and implementation approach end to end.
The system uses two distinct prompts because extraction and evaluation have different failure modes. The extraction prompt is schema-driven and generated from the active config fields. It asks the model to return strict JSON only, maps each requested field to a clear extraction instruction, and forces explicit nulls when a value is missing. This reduces hallucinated fields and keeps downstream UI rendering deterministic because every field key is present in a known shape. The extraction layer also avoids free-form narrative outputs and does not accept markdown wrappers by design.
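A minimal sketch of what such a schema-driven extraction prompt builder can look like. The function name and prompt wording here are assumptions for illustration, not the repository's actual prompt:

```python
import json


def build_extraction_prompt(fields: list[dict], resume_text: str) -> str:
    """Turn the active config fields into a strict-JSON extraction prompt.

    Each field becomes an explicit instruction, and the model is told to
    return JSON only, with every key present and null for missing values.
    """
    instructions = "\n".join(
        f'- "{f["name"]}" ({f["type"]}): {f["description"]}' for f in fields
    )
    # Skeleton with explicit nulls keeps the output shape deterministic.
    skeleton = json.dumps({f["name"]: None for f in fields})
    return (
        "Extract the following fields from the resume below.\n"
        f"{instructions}\n"
        "Return ONLY a JSON object matching this shape, using null for any "
        f"field you cannot find: {skeleton}\n"
        "Do not wrap the output in markdown code fences.\n\n"
        f"RESUME:\n{resume_text}"
    )
```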
The scoring prompt takes JD text, parsed resume text, and extracted profile data as inputs and enforces a rubric that maps scores to defined fit bands. It requires evidence-backed strengths and gaps and includes confidence plus confidence reasoning in the response schema. This prompt structure limits generic scoring language by requiring concrete references from candidate content. Hallucination risk is reduced by grounding the scoring context in both full resume text and extracted profile values, and by forcing structured output fields that must be internally consistent. Confidence is treated as an explicit model output that indicates uncertainty due to sparse or ambiguous evidence; this allows recruiters to distinguish low-signal candidates from clear matches without relying only on the numeric score.
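The scoring contract can be enforced before persisting a screening record, for example as below. The field names follow the description above, but the exact schema and validation logic in the codebase may differ:

```python
REQUIRED_KEYS = {
    "score", "summary", "strengths", "gaps",
    "confidence", "confidence_reasoning",
}


def validate_scoring_output(payload: dict) -> dict:
    """Reject model output that violates the scoring contract."""
    missing = REQUIRED_KEYS - payload.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    score = payload["score"]
    if not isinstance(score, (int, float)) or not 1 <= score <= 10:
        raise ValueError("score must be a number between 1 and 10")
    if payload["confidence"] not in {"low", "medium", "high"}:
        raise ValueError("confidence must be low/medium/high")
    for key in ("strengths", "gaps"):
        if not isinstance(payload[key], list):
            raise ValueError(f"{key} must be a list of evidence-backed points")
    return payload
```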
| Layer | Technology | Reason for choice |
|---|---|---|
| Frontend | Next.js 16, React 19, TypeScript | App Router patterns, typed UI components, predictable client routing and rendering. |
| Frontend data layer | SWR, Axios | Cache-aware API calls with straightforward mutation and revalidation behavior. |
| Backend API | FastAPI, Uvicorn | Async HTTP handling, typed request models, clear endpoint organization. |
| LLM | Google Gemini Flash Lite (google-genai) | Fast structured inference for extraction and scoring flows. |
| Resume parsing | pdfplumber, PyMuPDF, python-docx, Mammoth | Better coverage for mixed formatting and parser fallback behavior. |
| Database | PostgreSQL (Neon), asyncpg | Relational model for jobs, resumes, screenings, configs, and chunk metadata. |
| Embeddings | Gemini embedding model (text-embedding-004) | Unified provider for generation and embeddings. |
| File storage | Cloudinary (with local fallback) | Offloads binary storage from DB and supports URL-based access. |
| Backend dependency management | uv + pyproject.toml + uv.lock | Reproducible Python environments and dependency locking. |
| Frontend package/runtime | Bun | Fast install and script execution for Next.js app workflow. |
- Python 3.11.13 or newer
- Bun 1.x
- PostgreSQL connection string (Neon recommended)
- Gemini API key
- Cloudinary URL (optional for cloud storage, local fallback works without it)
Create backend/.env based on backend/.env.example:
```env
# === Required ===
GEMINI_API_KEY=

# === Database (leave empty to use local fallback if configured) ===
DATABASE_URL=

# === File Storage (leave empty to use local filesystem fallback) ===
CLOUDINARY_URL=

# === Optional ===
MAX_FILE_SIZE_MB=10
CORS_ORIGINS=http://localhost:3000
GEMINI_EMBEDDING_MODEL=models/text-embedding-004
```

Create frontend/.env.local:

```env
NEXT_PUBLIC_API_URL=http://localhost:8000/api
```

```bash
# from repository root
cd backend
uv sync
```

```bash
# in a new terminal
cd frontend
bun install
```

```bash
# terminal 1
cd backend
uv run uvicorn main:app --host 0.0.0.0 --port 8000 --reload
```

```bash
# terminal 2
cd frontend
bun run dev
```

Backend: http://localhost:8000
Frontend: http://localhost:3000
Extraction fields are configurable from the dashboard at /dashboard/config, where each field has a key, type, and extraction instruction. The same schema can also be managed through the config API. A typical custom config payload looks like this:
```json
{
  "name": "Engineering Default",
  "fields": [
    {
      "name": "full_name",
      "type": "string",
      "description": "Extract the candidate's full legal name from the top of the resume."
    },
    {
      "name": "years_of_experience",
      "type": "number",
      "description": "Estimate total professional years from work history dates."
    },
    {
      "name": "primary_skills",
      "type": "list",
      "description": "Extract core technical skills explicitly listed or evidenced by projects."
    }
  ],
  "is_default": true
}
```

| Method | Path | Request Body | Response |
|---|---|---|---|
| GET | /api/health | None | { "status": "ok", "service": "hirelens-backend" } |
| Method | Path | Request Body | Response Shape |
|---|---|---|---|
| POST | /api/jobs | { "title": "...", "description": "...", "company": "..." } | Job object |
| GET | /api/jobs | None | Job[] with resume_count and top_score |
| GET | /api/jobs/{job_id} | None | Job object plus candidates[] |
| DELETE | /api/jobs/{job_id} | None | 204 No Content |
| Method | Path | Request Body | Response Shape |
|---|---|---|---|
| POST | /api/resumes/upload | multipart/form-data with file, job_id, optional config_id | { resume, screening, extracted_data } |
| GET | /api/resumes | Optional job_id query | Resume[] |
| GET | /api/resumes/{resume_id} | None | Full resume record including raw_text |
| GET | /api/resumes/{resume_id}/download | None | File stream with normalized filename and MIME type |
| GET | /api/resumes/{resume_id}/rag-evidence?job_id=<id>&top_k=5 | None | { resume_id, job_id, chunks[] } |
| POST | /api/resumes/multi-role | { "resume_id": "...", "job_ids": ["...", "..."] } | Ranked multi-role scoring result array |
| DELETE | /api/resumes/{resume_id} | None | { "message": "Resume deleted successfully" } |
| Method | Path | Request Body | Response Shape |
|---|---|---|---|
| POST | /api/configs | { "name": "...", "fields": [...], "is_default": false } | Config object |
| GET | /api/configs | None | Config[] |
| GET | /api/configs/{config_id} | None | Config object |
| PUT | /api/configs/{config_id} | { "name": "...", "fields": [...] } | Updated config object |
| POST | /api/configs/{config_id}/rescan | None | { "message": "Re-extraction started in background", "config_id": "..." } |
Unreadable files are handled at parsing time with explicit parser fallback paths and user-facing 422 errors when content cannot be extracted. Unsupported uploads are rejected early based on extension validation and size checks, which avoids unnecessary LLM calls. Resumes with weak date signals still pass through extraction and scoring; date-dependent fields can remain null or estimated based on prompt instructions, and confidence output is used to communicate uncertainty. Duplicate uploads for the same role are prevented by deterministic content hashing and DB-level uniqueness checks, then surfaced as 409 Conflict with reference details for the existing record. For ambiguities not fully defined in the assignment, decisions were made in favor of deterministic API contracts and recruiter visibility: strict JSON prompt outputs, explicit scoring rubric, and traceable strengths and gaps instead of free-form narrative evaluation.
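The deterministic duplicate key described above can be sketched like this. The normalization rules and key format are illustrative assumptions, not the repository's exact implementation:

```python
import hashlib
import re


def dedup_hash(raw_text: str, job_id: str) -> str:
    """Build a duplicate key from parsed resume text, scoped by role.

    Normalizing whitespace and case first means trivially reformatted
    copies of the same resume hash identically, while including job_id
    lets one candidate still apply to other roles.
    """
    normalized = re.sub(r"\s+", " ", raw_text).strip().lower()
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    return f"{job_id}:{digest}"
```

A database unique constraint on this column turns concurrent duplicate uploads into constraint violations, which the API can map to a 409 Conflict response.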
| Ambiguous Requirement | Decision | Reasoning |
|---|---|---|
| Resume identity for duplicates | Hash normalized resume identity and scope by job_id | Prevent duplicate submissions per role while allowing one candidate to apply to multiple roles |
| Handling missing dates for experience | Return null for extraction where date evidence is absent; keep scoring with confidence annotation | Avoid fabricated tenure values and preserve reviewer trust in extracted data |
| Corrupted or unreadable PDFs | Use parser fallback chain, then return explicit 422 parse error | Keeps pipeline deterministic and prevents silent low-quality outputs |
| Prompt shape for extraction vs scoring | Separate prompts and response schemas | Extraction and evaluation have different failure modes and validation needs |
| Multi-role ranking failures from model load | Retry transient 503/UNAVAILABLE responses before failing | Improves demo reliability without changing scoring rubric |
| RAG evidence retrieval strategy | On-demand chunk ranking with embedding similarity and lexical fallback | Enables evidence display even when embedding service is temporarily unavailable |
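The retry behavior for transient model overload can be sketched as below; the function name, attempt count, and backoff parameters are assumptions for illustration:

```python
import time

# Error-message markers treated as transient provider overload.
TRANSIENT_MARKERS = ("503", "UNAVAILABLE")


def call_with_retry(fn, attempts: int = 3, base_delay: float = 0.5):
    """Retry a model call on transient overload errors with exponential
    backoff; any other error (or the final failure) propagates."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as exc:
            transient = any(m in str(exc) for m in TRANSIENT_MARKERS)
            if not transient or attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
```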
This implementation does not include OCR for image-only resumes. Such files are rejected when text extraction yields insufficient content. LLM availability can affect scoring latency and may return transient 503 errors during provider-side load spikes, though retries are implemented for multi-role scoring. RAG evidence is currently computed on demand from parsed text and is not yet persisted as a full background indexing pipeline across all resumes. Authentication and role-based access control are not implemented, so the current API assumes a trusted environment.
```
HireLens/
├── backend/
│   ├── main.py              # FastAPI app entrypoint and router registration
│   ├── pyproject.toml       # Python dependencies and project metadata
│   ├── uv.lock              # Locked Python dependency graph
│   ├── .env.example         # Backend environment template
│   ├── api/                 # REST endpoints (health, jobs, resumes, configs)
│   ├── core/                # Runtime settings and app-level config
│   ├── db/                  # Database access layer and query helpers
│   ├── parsers/             # PDF and DOCX parsing + text normalization
│   ├── services/
│   │   ├── llm/             # Prompt builders, Gemini client, ranker, embedder
│   │   ├── storage/         # Cloudinary/local file storage adapter
│   │   └── tasks/           # Background tasks such as config reparse
│   ├── tests/               # Backend tests and smoke checks
│   └── uploads/             # Local file storage fallback directory
├── frontend/
│   ├── package.json         # Frontend scripts and dependencies
│   ├── bun.lock             # Bun lockfile
│   ├── src/
│   │   ├── app/             # Next.js app routes (dashboard, jobs, candidates, config)
│   │   ├── components/      # Reusable UI components
│   │   └── lib/api.ts       # Axios API client and endpoint helpers
│   └── public/              # Static assets
├── .gitignore
└── README.md                # Project documentation
```