Vink0217/HireLens
HireLens

Project Overview

HireLens is an AI resume screening system built for Sprinto's AI Implementation Intern assignment. It parses PDF and DOCX resumes, extracts structured candidate fields from configurable schemas, scores candidate fit against job descriptions with evidence, flags duplicate submissions per role, supports re-parsing when extraction schemas change, and provides multi-role ranking and RAG-backed evidence snippets for recruiter review.

Live Demo

Features

Minimum Requirements

| Feature | Description |
| --- | --- |
| Multi-format ingestion | Accepts and parses `.pdf` and `.docx` resumes through a single upload endpoint and UI flow. |
| Dynamic extraction | Extracts configurable fields such as name, years of experience, primary skills, and last job title from resume text. |
| Configurability | Supports field schema editing from the dashboard config UI and persists extraction configuration in the database. |
| AI scoring with justification | Scores each resume against a JD on a 1-10 scale and returns summary, strengths, gaps, confidence, and confidence reasoning. |
| Duplicate detection | Prevents duplicate uploads for the same role using deterministic content hashing and DB-level uniqueness constraints. |
| Live deployment | Frontend and backend are deployed to public URLs for end-to-end evaluation. |

Enhanced Features

| Feature | Description |
| --- | --- |
| Batch re-parsing | Triggers background re-extraction for resumes tied to a config after schema updates. |
| Multi-role ranking | Scores one resume against multiple job descriptions in parallel and returns ranked fit output. |
| RAG evidence retrieval | Returns the most relevant resume chunks for a JD using embedding similarity, with a lexical fallback when embeddings are unavailable. |
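The RAG evidence retrieval described above can be sketched as follows. This is a minimal illustration of the ranking-with-fallback idea, not the production retriever: `embed` stands in for the Gemini embedding call, and the overlap metric is a simplified assumption.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def lexical_overlap(query: str, chunk: str) -> float:
    """Fraction of query tokens present in the chunk (lexical fallback)."""
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / len(q) if q else 0.0

def rank_chunks(jd_text: str, chunks: list[str], embed=None, top_k: int = 5):
    """Rank resume chunks against a JD by embedding similarity, falling
    back to lexical overlap when no embedder is available or it fails."""
    if embed is not None:
        try:
            jd_vec = embed(jd_text)
            scored = [(cosine_similarity(jd_vec, embed(c)), c) for c in chunks]
        except Exception:
            scored = [(lexical_overlap(jd_text, c), c) for c in chunks]
    else:
        scored = [(lexical_overlap(jd_text, c), c) for c in chunks]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [c for _, c in scored[:top_k]]
```

With no embedder supplied, the lexical path alone still surfaces the most relevant chunk, which is what makes the evidence view degrade gracefully.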

System Architecture

The backend is a FastAPI service that runs an explicit upload pipeline: validate file type and size, parse resume text with format-specific parsers, compute a duplicate hash scoped to the role, upload binary files to storage, run LLM extraction with the active schema, run JD scoring, then persist resume and screening records to PostgreSQL. Parsing uses a two-stage strategy per format to handle noisy documents: pdfplumber with a PyMuPDF fallback for PDF, and python-docx with a Mammoth fallback for DOCX.

The LLM integration uses Gemini Flash Lite through a single client wrapper and separates extraction and scoring into different prompt builders, so each step can enforce a clear output contract.

RAG evidence is generated by chunking parsed resume text into semantic windows, embedding the chunks and the job description text, and ranking chunks by cosine similarity, with a lexical-overlap fallback if embedding calls fail. Duplicate detection combines normalized content identity with job context and enforces uniqueness at the database layer to avoid race-condition duplicates.

The frontend is a Next.js App Router application that consumes backend REST endpoints via Axios and SWR; candidate views, role comparison, config editing, upload, and evidence retrieval all share the same API layer and cache-revalidation flow.
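The upload pipeline above can be sketched as one explicit sequence of steps. Function names and the injected collaborators are illustrative, not the actual module API; the exception-to-HTTP-status mapping mirrors the behavior described later in this README.

```python
import hashlib

def handle_upload(file_bytes: bytes, filename: str, job_id: str,
                  parse, store, extract, score, persist, find_duplicate):
    """Illustrative upload pipeline: validate, parse, dedup, store,
    extract, score, persist. Collaborators are injected so each stage
    stays visible and testable."""
    ext = filename.rsplit(".", 1)[-1].lower()
    if ext not in {"pdf", "docx"}:
        raise ValueError("unsupported file type")

    text = parse(file_bytes, ext)                # format-specific parser with fallback
    if not text.strip():
        raise ValueError("no extractable text")  # surfaced as HTTP 422 upstream

    # Duplicate hash scoped to the role: identical content for a
    # different job_id is allowed, for the same job_id it is rejected.
    dedup_hash = hashlib.sha256(
        (text.strip().lower() + "|" + job_id).encode()
    ).hexdigest()
    existing = find_duplicate(dedup_hash)
    if existing:
        raise FileExistsError(existing)          # surfaced as HTTP 409 upstream

    file_url = store(file_bytes, filename)       # Cloudinary or local fallback
    extracted = extract(text)                    # schema-driven LLM extraction
    screening = score(text, extracted, job_id)   # JD scoring with evidence
    return persist(dedup_hash, file_url, extracted, screening)
```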

Dataflow Diagram

```mermaid
flowchart TD
    U[Recruiter uploads resume] --> A[Frontend Dashboard<br/>Next.js + SWR]
    A --> B[POST /api/resumes/upload]

    subgraph ING[Ingestion and Validation]
        B --> C[Validate size and extension]
        C --> D{File type}
        D -->|PDF| E[PDF parser<br/>pdfplumber → PyMuPDF fallback]
        D -->|DOCX| F[DOCX parser<br/>python-docx → Mammoth fallback]
        E --> G[Cleaned raw text]
        F --> G
        G --> H[Dedup hash<br/>identifier + job_id]
    end

    H --> I{Duplicate for role?}
    I -->|Yes| J[409 Conflict<br/>Return existing match]
    I -->|No| K[Store file URL<br/>Cloudinary or local]

    subgraph AIP[AI Pipeline]
        K --> L[Load active extraction config]
        L --> M[Gemini extractor<br/>Schema-driven JSON]
        M --> N[Gemini scorer<br/>score + summary + evidence]
    end

    N --> O[(PostgreSQL)]
    O --> P[resumes.extracted_data]
    O --> Q[screenings score and summary]
    P --> R[Dashboard views]
    Q --> R

    S[Config updated<br/>/dashboard/config] --> T[POST /api/configs/config_id/rescan]
    T --> V[Background task<br/>batch_reparse]
    V --> W[Re-extract each resume]
    W --> X[Re-score linked screenings]
    X --> O
    O --> Y[GET /api/configs/config_id/rescan-status<br/>progress and outcome]
    Y --> R

    classDef frontend fill:#eef2ff,stroke:#4f46e5,color:#111827,stroke-width:1px;
    classDef backend fill:#ecfeff,stroke:#0891b2,color:#0f172a,stroke-width:1px;
    classDef ai fill:#ecfccb,stroke:#65a30d,color:#1f2937,stroke-width:1px;
    classDef db fill:#fff7ed,stroke:#ea580c,color:#1f2937,stroke-width:1px;
    classDef warning fill:#fef2f2,stroke:#dc2626,color:#7f1d1d,stroke-width:1px;

    class A,R frontend;
    class B,C,D,E,F,G,H,K,L,S,T,V,W,X,Y backend;
    class M,N ai;
    class O,P,Q db;
    class I,J warning;
```

Architecture Note

A brief architecture note written for evaluators is available at docs/ARCHITECTURE_NOTE.md. It explains LLM selection, prompting strategy, file parsing decisions, and implementation approach end to end.

Prompting Strategy

The system uses two distinct prompts because extraction and evaluation have different failure modes. The extraction prompt is schema-driven and generated from the active config fields. It asks the model to return strict JSON only, maps each requested field to a clear extraction instruction, and forces explicit nulls when a value is missing. This reduces hallucinated fields and keeps downstream UI rendering deterministic because every field key is present in a known shape. The extraction layer also avoids free-form narrative outputs and does not accept markdown wrappers by design.
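A minimal sketch of how such a schema-driven extraction prompt can be assembled from config fields. The field dicts follow the config shape shown later in the Configuration section; the exact prompt wording here is an illustrative assumption, not the production prompt.

```python
import json

def build_extraction_prompt(resume_text: str, fields: list[dict]) -> str:
    """Assemble a strict-JSON extraction prompt from active config fields.
    Every configured key appears in the required output shape, so missing
    values surface as explicit nulls instead of absent keys."""
    field_lines = "\n".join(
        f'- "{f["name"]}" ({f["type"]}): {f["description"]}' for f in fields
    )
    # Skeleton with every field key present, so the model's output shape
    # is fully determined by the config.
    keys = {f["name"]: None for f in fields}
    return (
        "Extract the following fields from the resume below.\n"
        f"{field_lines}\n"
        "Return STRICT JSON only, with no markdown wrapper, matching exactly "
        f"this shape (use null when a value is missing): {json.dumps(keys)}\n\n"
        f"RESUME:\n{resume_text}"
    )
```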

The scoring prompt takes JD text, parsed resume text, and extracted profile data as inputs and enforces a rubric that maps scores to defined fit bands. It requires evidence-backed strengths and gaps and includes confidence plus confidence reasoning in the response schema. This prompt structure limits generic scoring language by requiring concrete references from candidate content. Hallucination risk is reduced by grounding the scoring context in both full resume text and extracted profile values, and by forcing structured output fields that must be internally consistent. Confidence is treated as an explicit model output that indicates uncertainty due to sparse or ambiguous evidence; this allows recruiters to distinguish low-signal candidates from clear matches without relying only on the numeric score.
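Enforcing that output contract means validating the scorer's response before persisting it. The sketch below checks the fields named above; the exact key names and the confidence scale (`low`/`medium`/`high`) are illustrative assumptions about the response schema, not the definitive contract.

```python
REQUIRED_KEYS = {
    "score", "summary", "strengths", "gaps",
    "confidence", "confidence_reasoning",
}

def validate_screening(payload: dict) -> dict:
    """Reject scorer output that violates the structured contract:
    all keys present, score on the 1-10 rubric, evidence-backed
    strengths, and an explicit confidence level."""
    missing = REQUIRED_KEYS - payload.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if not isinstance(payload["score"], (int, float)) or not 1 <= payload["score"] <= 10:
        raise ValueError("score must be on the 1-10 rubric scale")
    if payload["confidence"] not in {"low", "medium", "high"}:
        raise ValueError("confidence must be low, medium, or high")
    if not payload["strengths"] or not all(isinstance(s, str) for s in payload["strengths"]):
        raise ValueError("strengths must be a non-empty list of evidence strings")
    return payload
```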

Tech Stack

| Layer | Technology | Reason for choice |
| --- | --- | --- |
| Frontend | Next.js 16, React 19, TypeScript | App Router patterns, typed UI components, predictable client routing and rendering. |
| Frontend data layer | SWR, Axios | Cache-aware API calls with straightforward mutation and revalidation behavior. |
| Backend API | FastAPI, Uvicorn | Async HTTP handling, typed request models, clear endpoint organization. |
| LLM | Google Gemini Flash Lite (google-genai) | Fast structured inference for extraction and scoring flows. |
| Resume parsing | pdfplumber, PyMuPDF, python-docx, Mammoth | Better coverage for mixed formatting and parser fallback behavior. |
| Database | PostgreSQL (Neon), asyncpg | Relational model for jobs, resumes, screenings, configs, and chunk metadata. |
| Embeddings | Gemini embedding model (text-embedding-004) | Unified provider for generation and embeddings. |
| File storage | Cloudinary (with local fallback) | Offloads binary storage from the DB and supports URL-based access. |
| Backend dependency management | uv + pyproject.toml + uv.lock | Reproducible Python environments and dependency locking. |
| Frontend package/runtime | Bun | Fast install and script execution for the Next.js app workflow. |

Getting Started

Prerequisites

- Python 3.11.13 or newer
- Bun 1.x
- A PostgreSQL connection string (Neon recommended)
- A Gemini API key
- A Cloudinary URL (optional; the local filesystem fallback works without it)

Environment Variables

Create backend/.env based on backend/.env.example:

```env
# === Required ===
GEMINI_API_KEY=

# === Database (required; Neon recommended) ===
DATABASE_URL=

# === File Storage (leave empty to use the local filesystem fallback) ===
CLOUDINARY_URL=

# === Optional ===
MAX_FILE_SIZE_MB=10
CORS_ORIGINS=http://localhost:3000
GEMINI_EMBEDDING_MODEL=models/text-embedding-004
```

Create frontend/.env.local:

```env
NEXT_PUBLIC_API_URL=http://localhost:8000/api
```

Installation

```bash
# from the repository root
cd backend
uv sync

# in a new terminal
cd frontend
bun install
```

Run Locally

```bash
# terminal 1
cd backend
uv run uvicorn main:app --host 0.0.0.0 --port 8000 --reload

# terminal 2
cd frontend
bun run dev
```

Backend: http://localhost:8000
Frontend: http://localhost:3000

Configuration

Extraction fields are configurable from the dashboard at /dashboard/config, where each field has a key, type, and extraction instruction. The same schema can also be managed through the config API. A typical custom config payload looks like this:

```json
{
  "name": "Engineering Default",
  "fields": [
    {
      "name": "full_name",
      "type": "string",
      "description": "Extract the candidate's full legal name from the top of the resume."
    },
    {
      "name": "years_of_experience",
      "type": "number",
      "description": "Estimate total professional years from work history dates."
    },
    {
      "name": "primary_skills",
      "type": "list",
      "description": "Extract core technical skills explicitly listed or evidenced by projects."
    }
  ],
  "is_default": true
}
```

API Reference

Health

| Method | Path | Request Body | Response |
| --- | --- | --- | --- |
| GET | `/api/health` | None | `{ "status": "ok", "service": "hirelens-backend" }` |

Jobs

| Method | Path | Request Body | Response Shape |
| --- | --- | --- | --- |
| POST | `/api/jobs` | `{ "title": "...", "description": "...", "company": "..." }` | Job object |
| GET | `/api/jobs` | None | `Job[]` with `resume_count` and `top_score` |
| GET | `/api/jobs/{job_id}` | None | Job object plus `candidates[]` |
| DELETE | `/api/jobs/{job_id}` | None | 204 No Content |

Resumes

| Method | Path | Request Body | Response Shape |
| --- | --- | --- | --- |
| POST | `/api/resumes/upload` | `multipart/form-data` with `file`, `job_id`, optional `config_id` | `{ resume, screening, extracted_data }` |
| GET | `/api/resumes` | Optional `job_id` query | `Resume[]` |
| GET | `/api/resumes/{resume_id}` | None | Full resume record including `raw_text` |
| GET | `/api/resumes/{resume_id}/download` | None | File stream with normalized filename and MIME type |
| GET | `/api/resumes/{resume_id}/rag-evidence?job_id=<id>&top_k=5` | None | `{ resume_id, job_id, chunks[] }` |
| POST | `/api/resumes/multi-role` | `{ "resume_id": "...", "job_ids": ["...", "..."] }` | Ranked multi-role scoring result array |
| DELETE | `/api/resumes/{resume_id}` | None | `{ "message": "Resume deleted successfully" }` |
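The multi-role endpoint's "score in parallel, return ranked" behavior can be sketched with `asyncio.gather`. Here `score_one` stands in for the Gemini scoring call; the response field names are illustrative assumptions.

```python
import asyncio

async def score_multi_role(resume_text: str, job_ids: list[str], score_one):
    """Score one resume against several JDs concurrently, then return
    the results sorted by score descending (best fit first)."""
    results = await asyncio.gather(
        *(score_one(resume_text, job_id) for job_id in job_ids)
    )
    ranked = sorted(zip(job_ids, results),
                    key=lambda pair: pair[1]["score"], reverse=True)
    return [{"job_id": job_id, **result} for job_id, result in ranked]
```

Because the per-job calls run concurrently, total latency is roughly that of the slowest single scoring call rather than the sum of all of them.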

Configs

| Method | Path | Request Body | Response Shape |
| --- | --- | --- | --- |
| POST | `/api/configs` | `{ "name": "...", "fields": [...], "is_default": false }` | Config object |
| GET | `/api/configs` | None | `Config[]` |
| GET | `/api/configs/{config_id}` | None | Config object |
| PUT | `/api/configs/{config_id}` | `{ "name": "...", "fields": [...] }` | Updated config object |
| POST | `/api/configs/{config_id}/rescan` | None | `{ "message": "Re-extraction started in background", "config_id": "..." }` |

Handling Edge Cases and Ambiguity

Unreadable files are handled at parsing time with explicit parser fallback paths and user-facing 422 errors when content cannot be extracted. Unsupported uploads are rejected early based on extension validation and size checks, which avoids unnecessary LLM calls. Resumes with weak date signals still pass through extraction and scoring; date-dependent fields can remain null or estimated based on prompt instructions, and confidence output is used to communicate uncertainty. Duplicate uploads for the same role are prevented by deterministic content hashing and DB-level uniqueness checks, then surfaced as 409 Conflict with reference details for the existing record. For ambiguities not fully defined in the assignment, decisions were made in favor of deterministic API contracts and recruiter visibility: strict JSON prompt outputs, explicit scoring rubric, and traceable strengths and gaps instead of free-form narrative evaluation.
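The parser-fallback-then-422 behavior can be sketched generically: try each format-specific parser in order and fail loudly only when the whole chain is exhausted. The parser callables stand in for the pdfplumber/PyMuPDF (or python-docx/Mammoth) integrations.

```python
class UnreadableFileError(Exception):
    """Raised when every parser in the chain fails or yields no text;
    mapped to an HTTP 422 response upstream."""

def parse_with_fallback(data: bytes, parsers) -> str:
    """Run each parser in order and return the first non-empty text.
    Parser exceptions and empty outputs both advance to the next parser,
    so a noisy document only fails after the whole chain is exhausted."""
    for parser in parsers:
        try:
            text = parser(data)
        except Exception:
            continue
        if text and text.strip():
            return text
    raise UnreadableFileError("no extractable text")
```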

Decision Log

| Ambiguous Requirement | Decision | Reasoning |
| --- | --- | --- |
| Resume identity for duplicates | Hash normalized resume identity and scope by `job_id` | Prevents duplicate submissions per role while allowing one candidate to apply to multiple roles |
| Handling missing dates for experience | Return `null` for extraction where date evidence is absent; keep scoring with a confidence annotation | Avoids fabricated tenure values and preserves reviewer trust in extracted data |
| Corrupted or unreadable PDFs | Use the parser fallback chain, then return an explicit 422 parse error | Keeps the pipeline deterministic and prevents silent low-quality outputs |
| Prompt shape for extraction vs scoring | Separate prompts and response schemas | Extraction and evaluation have different failure modes and validation needs |
| Multi-role ranking failures from model load | Retry transient 503/UNAVAILABLE responses before failing | Improves demo reliability without changing the scoring rubric |
| RAG evidence retrieval strategy | On-demand chunk ranking with embedding similarity and a lexical fallback | Enables evidence display even when the embedding service is temporarily unavailable |
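The retry-on-transient-503 decision amounts to a small wrapper like the one below: retry only errors that look transient, with exponential backoff, and propagate everything else immediately. This is an illustrative sketch, not the production client code.

```python
import time

def with_retries(call, attempts: int = 3, base_delay: float = 0.5,
                 retryable=("503", "UNAVAILABLE")):
    """Invoke `call` and retry on transient 503/UNAVAILABLE errors with
    exponential backoff; non-transient errors and the final failed
    attempt propagate to the caller unchanged."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception as exc:
            transient = any(token in str(exc) for token in retryable)
            if not transient or attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...
```

Keeping the retry policy outside the scoring logic means the rubric and prompts stay untouched; only reliability behavior changes.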

Known Limitations

This implementation does not include OCR for image-only resumes. Such files are rejected when text extraction yields insufficient content. LLM availability can affect scoring latency and may return transient 503 errors during provider-side load spikes, though retries are implemented for multi-role scoring. RAG evidence is currently computed on demand from parsed text and is not yet persisted as a full background indexing pipeline across all resumes. Authentication and role-based access control are not implemented, so the current API assumes a trusted environment.

Project Structure

```text
HireLens/
├── backend/
│   ├── main.py                     # FastAPI app entrypoint and router registration
│   ├── pyproject.toml              # Python dependencies and project metadata
│   ├── uv.lock                     # Locked Python dependency graph
│   ├── .env.example                # Backend environment template
│   ├── api/                        # REST endpoints (health, jobs, resumes, configs)
│   ├── core/                       # Runtime settings and app-level config
│   ├── db/                         # Database access layer and query helpers
│   ├── parsers/                    # PDF and DOCX parsing + text normalization
│   ├── services/
│   │   ├── llm/                    # Prompt builders, Gemini client, ranker, embedder
│   │   ├── storage/                # Cloudinary/local file storage adapter
│   │   └── tasks/                  # Background tasks such as config reparse
│   ├── tests/                      # Backend tests and smoke checks
│   └── uploads/                    # Local file storage fallback directory
├── frontend/
│   ├── package.json                # Frontend scripts and dependencies
│   ├── bun.lock                    # Bun lockfile
│   ├── src/
│   │   ├── app/                    # Next.js app routes (dashboard, jobs, candidates, config)
│   │   ├── components/             # Reusable UI components
│   │   └── lib/api.ts              # Axios API client and endpoint helpers
│   └── public/                     # Static assets
├── .gitignore
└── README.md                       # Project documentation
```

About

LLM-powered resume screener with configurable extraction, semantic matching (RAG), and automated candidate ranking.