Skip to content

Czesare/CV-Generator

Repository files navigation

CV Generator — LLM-Powered CV Reformatter

NLP Assignment 3 | Master AI — Natural Language Processing 2026


Course Context

This project is built for Assignment 3: LLM Language Application & Evaluation (Option 1). The assignment asks students to design a language-based application using an LLM, with methodology informed by course theory (prompting strategies, model behaviour, human-LLM interaction), and to critically evaluate the system's reliability and consistency.


The Idea

Getting a CV into a clean, professional format is tedious. You have to parse your own writing, restructure it into sections, and then make it look good on paper.

CV Generator automates this with an LLM pipeline: upload your existing CV (or paste raw text), the model extracts all your information into a structured schema, you fill in any gaps or correct mistakes through a guided UI, and the app generates a polished PDF via LaTeX — with bullet points optionally rewritten to remove AI-sounding phrasing.


What We Built

Pipeline

Input (PDF / DOCX / raw text)
        ↓
  [1] Text extraction          extraction.py — pdfplumber / python-docx
        ↓
  [2] LLM structured extraction  GPT-4o-mini → schema.json
        ↓
  [3] Human QA & editing       Streamlit UI (app.py / qa.py)
        ↓
  [4] Bullet humanization      GPT-4o-mini rewrites AI phrasing
        ↓
  [5] LaTeX rendering          Jinja2 → cv_template.tex
        ↓
  [6] Faithfulness check       key facts verified in .tex source
        ↓
  [7] PDF compilation          pdflatex → downloadable PDF

Key Files

File Role
app.py Streamlit web UI — 4-stage flow (upload → QA → preview → download)
extraction.py Text extraction from PDF/DOCX + LLM-based structured parsing
qa.py CLI alternative for gap-filling interactively
generation.py Bullet humanization, LaTeX escaping, template rendering, PDF compilation, faithfulness check
schema.json CV data schema (personal, education, experience, projects, skills, languages)
templates/cv_template.tex Jinja2-annotated LaTeX CV template
eval/run_extraction_eval.py Automated extraction evaluation against ground-truth test cases
eval/results/ Eval report (JSON + text) from the last evaluation run
test_cases/ 5 synthetic test CVs (3 standard profiles + 2 challenging cases) with ground-truth JSON

Design Decisions

Model: GPT-4o-mini — chosen for cost-efficiency and speed. Structured extraction is a well-scoped task that does not require the reasoning depth of larger models. JSON mode (response_format={"type": "json_object"}) guarantees parseable output.

Temperature: 0 for extraction (determinism, reproducibility) and faithfulness judging; 0.4 for bullet rewriting (some creativity while staying factual).

Prompting strategy: The extraction prompt explicitly distinguishes skills.languages (programming languages) from languages (spoken languages) — a common LLM confusion point — and instructs the model to use null rather than hallucinate missing fields. The bullet rewriter is given a banned-phrase list ("leveraged", "spearheaded", "synergy" etc.) and a hard rule to not add or remove facts.

Faithfulness check: After rendering the LaTeX template, key facts (name, email, all institution/company names) are verified to be present in the source before PDF compilation. Discrepancies are surfaced as warnings in the UI.

Human-in-the-loop: The QA stage lets users correct extraction errors, add missing entries, or fill fields the LLM could not find. This hybrid design reduces dependence on perfect LLM output.


Evaluation

The extraction pipeline is evaluated with eval/run_extraction_eval.py against 5 synthetic test CVs:

  • cv_cs_graduate — standard CS profile
  • cv_ai_graduate — AI/ML researcher profile
  • cv_business_graduate — non-technical business profile
  • cv_challenging_1_formatting — poorly formatted, inconsistent structure
  • cv_challenging_2_content — mixed-language CV (Dutch/English), vague job descriptions

Each test case has a hand-crafted ground-truth JSON. Fields are evaluated with 4 labels:

Label Meaning
PASS Correct or semantically equivalent
FAIL Meaningfully wrong
MISS Ground truth has a value, model extracted null/empty
HALLUCINATION Ground truth is null/empty, model invented a value

Semantic fields (summaries, bullet points, institution names) use an LLM-as-judge (GPT-4o-mini, temperature=0). Dates and skill lists use exact set matching.

Results (last run)

CV Score
cv_cs_graduate 32/53 (60.4%)
cv_ai_graduate 33/54 (61.1%)
cv_business_graduate 34/57 (59.6%)
cv_challenging_1_formatting 22/59 (37.3%)
cv_challenging_2_content 31/49 (63.3%)
Overall 152/272 (55.9%)

By category:

Category Score
personal 40/40 (100%)
skills 17/20 (85%)
experience 53/110 (48%)
projects 22/52 (42%)
education 20/50 (40%)

Personal fields (name, email, phone, location) are extracted perfectly. The main failure mode is systematic MISS on dates, roles, and project names — the model extracts the surrounding text but not these structured sub-fields when the CV format deviates from expectations. The challenging formatting case (cv_challenging_1_formatting) additionally triggers hallucination on extra bullet points the model infers from vague text. Zero hallucinations on well-structured CVs.


Setup & Usage

Prerequisites

  • Python 3.11+
  • An OpenAI API key
  • pdflatex for PDF generation: brew install --cask basictex (macOS) or apt install texlive (Linux)

Install

cd CV_LLM_project
python -m venv venv
source venv/bin/activate        # Windows: venv\Scripts\activate
pip install -r requirements.txt

Configure

Create a .env file in CV_LLM_project/:

OPENAI_API_KEY=sk-...

Or enter the key directly in the app sidebar.

Run the web app

streamlit run app.py

The app opens at http://localhost:8501. Steps:

  1. Upload — drop a PDF/DOCX or paste raw text (or start from an empty form)
  2. Fill gaps — review extracted data, correct mistakes, add missing entries
  3. Preview & generate — optionally render a live preview, then generate the PDF
  4. Download — download your formatted CV

Run the evaluation

python eval/run_extraction_eval.py

Results are written to eval/results/eval_report.txt and eval/results/eval_report.json.

CLI usage (no Streamlit)

# Extract structured data from a CV file
python extraction.py path/to/cv.pdf -o extracted.json

# Fill missing fields interactively
python qa.py extracted.json -o complete.json

# Generate PDF
python generation.py complete.json -o output/

About

LLM-powered CV generation pipeline. Upload your CV, extract structured info, fill gaps via Q&A, and get a reformatted PDF out. Built for a VU Amsterdam NLP assignment.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors