OptiHire is an NLP-based resume screening and ranking system that automatically matches candidate resumes against a job description using semantic similarity and keyword alignment. It is designed to reduce manual recruiter workload and minimize unconscious bias by anonymizing personally identifiable information before any scoring takes place. The system processes resumes in batch, scores each one using a hybrid model combining Sentence-BERT embeddings and spaCy keyword extraction, and outputs a ranked leaderboard of the most qualified candidates.
The pipeline runs each resume through six stages:
- Raw resume text is loaded from the Kaggle Resume Dataset CSV
- PII redaction removes names, locations, and dates using spaCy NER
- Text normalization lowercases, lemmatizes, and strips stopwords
- SBERT encodes the cleaned text into a 384-dimensional embedding vector
- Keyword extraction pulls nouns and proper nouns using spaCy POS tagging
- A weighted final score is computed as 70% semantic similarity + 30% keyword overlap
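The weighting in the last stage can be illustrated with a few lines of Python. This is a minimal sketch rather than the notebook's code: `hybrid_score` and its `semantic_weight` parameter are hypothetical names, and both input scores are assumed to already lie in the range 0 to 1.

```python
def hybrid_score(semantic: float, keyword: float, semantic_weight: float = 0.7) -> float:
    """Blend a semantic score and a keyword score (hypothetical helper).

    The keyword weight is whatever remains after the semantic weight,
    so the default reproduces the 70% / 30% split described above.
    """
    if not 0.0 <= semantic_weight <= 1.0:
        raise ValueError("semantic_weight must be between 0 and 1")
    return semantic_weight * semantic + (1.0 - semantic_weight) * keyword


# A resume with semantic similarity 0.80 and keyword overlap 0.50:
print(round(hybrid_score(0.80, 0.50), 2))  # 0.71
```

Exposing the weight as a parameter makes it easy to compare rankings under different splits, for example `hybrid_score(0.80, 0.50, semantic_weight=0.5)`.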
Python 3.8 or higher is required.
A GPU is strongly recommended for SBERT encoding. The notebook is configured to run on Google Colab with a T4 GPU. CPU-only execution is supported but will be significantly slower on large batches.
Install all dependencies by running:

    pip install torch tensorflow transformers nltk spacy pymupdf scrapingbee kaggle sentence-transformers pandas

Then download the spaCy language model:

    python -m spacy download en_core_web_md

The following credentials must be configured as secrets in your Google Colab environment under the Secrets tab:
| Secret Name | Description |
|---|---|
| KAGGLE_USERNAME | Your Kaggle account username |
| KAGGLE_KEY | Your Kaggle API key, found at kaggle.com/account |
| SCRAPINGBEE_KEY | Your ScrapingBee API key for live job scraping (optional) |
To get your Kaggle credentials, go to kaggle.com, click on your profile, select Account, and scroll down to the API section to generate a new token.
Upload OptiHireDataScraper.ipynb to Google Colab. Before running any cells, go to Runtime > Change runtime type and select GPU as the hardware accelerator.
In the left sidebar, click the key icon to open the Secrets panel. Add your KAGGLE_USERNAME, KAGGLE_KEY, and optionally your SCRAPINGBEE_KEY as individual secrets.
Run the first cell to install all required packages. This only needs to be done once per Colab session.
Cell 2 reads your Kaggle credentials from the Secrets panel and sets them as environment variables so the Kaggle CLI can authenticate.
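The credential step might look roughly like the sketch below. It is not taken from the notebook: `export_secrets` is a hypothetical helper, and it takes the secret lookup as a parameter so the same code can be tried outside Colab. Inside Colab, the assumption is that secrets are read with `google.colab.userdata.get`.

```python
import os
from typing import Callable, Iterable, List


def export_secrets(
    get_secret: Callable[[str], str],
    required: Iterable[str],
    optional: Iterable[str] = (),
) -> List[str]:
    """Copy secrets into os.environ and return the names that were set.

    Hypothetical helper: `get_secret` is any callable that maps a secret
    name to its value (in Colab, assumed to be `userdata.get`).
    """
    exported = []
    for name in required:
        value = get_secret(name)  # lookup errors propagate for required secrets
        if not value:
            raise RuntimeError(f"Missing required secret: {name}")
        os.environ[name] = value
        exported.append(name)
    for name in optional:
        try:
            value = get_secret(name)
        except Exception:
            value = None  # an optional secret that is absent is simply skipped
        if value:
            os.environ[name] = value
            exported.append(name)
    return exported


# In Colab (assumption: secrets were added in the Secrets panel):
# from google.colab import userdata
# export_secrets(userdata.get, ["KAGGLE_USERNAME", "KAGGLE_KEY"], ["SCRAPINGBEE_KEY"])
```

The Kaggle CLI picks up `KAGGLE_USERNAME` and `KAGGLE_KEY` from the environment, so no `kaggle.json` file is needed when these variables are set.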
Cell 3 downloads and unzips the Kaggle Resume Dataset. It will appear as Resume/Resume.csv in your Colab file system.
Cell 4 loads the spaCy language model and defines the clean_and_anonymize() function used for PII redaction and text normalization.
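To make the redaction idea concrete, here is a small sketch of span-based masking. It is an illustration under stated assumptions, not the body of clean_and_anonymize(): `redact` and `REDACTED_LABELS` are hypothetical names, entities are replaced with a bracketed label rather than deleted, and the entity spans are supplied by hand so the example runs without a spaCy model.

```python
from typing import Iterable, Tuple

# Assumed label set for names, locations, and dates (spaCy's English models
# use PERSON, GPE, LOC, and DATE among their entity labels).
REDACTED_LABELS = {"PERSON", "GPE", "LOC", "DATE"}


def redact(text: str, entities: Iterable[Tuple[int, int, str]]) -> str:
    """Replace each sensitive (start, end, label) character span with [LABEL]."""
    pieces = []
    cursor = 0
    for start, end, label in sorted(entities):
        if label not in REDACTED_LABELS or start < cursor:
            continue  # keep non-sensitive entities; skip overlapping spans
        pieces.append(text[cursor:start])
        pieces.append(f"[{label}]")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


sample = "Jane Doe moved to Austin in 2019 and joined Acme."
# With spaCy these spans would come from:
#   [(e.start_char, e.end_char, e.label_) for e in nlp(sample).ents]
spans = [(0, 8, "PERSON"), (18, 24, "GPE"), (28, 32, "DATE"), (44, 48, "ORG")]
print(redact(sample, spans))
# [PERSON] moved to [GPE] in [DATE] and joined Acme.
```

Working on character offsets keeps the untouched text byte-for-byte identical, which makes it straightforward to lowercase and lemmatize afterwards.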
Cell 5 reads the dataset and processes the first 50 resumes through the anonymization and normalization pipeline. The results are stored in a list called processed_data. You can increase the batch size by changing .head(50) to a larger number.
If you have a ScrapingBee API key, Cell 6 scrapes a live job posting from SimplyHired and extracts the job description. If you do not have a key or prefer to skip this step, a fallback sample job description is used automatically. You can also upload your own resume PDF as my_resume.pdf to the Colab file browser for individual evaluation.
Cell 7 loads the SBERT model (all-MiniLM-L6-v2) and defines the get_match_score() function. It runs a sample test against the first 5 processed resumes and prints their match scores.
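The semantic score is a similarity between two embedding vectors. The sketch below assumes cosine similarity, the usual choice for SBERT embeddings, and should not be read as the notebook's get_match_score(). `match_score_sketch`, `toy_encode`, and `VOCAB` are hypothetical names, and the toy encoder stands in for the real model so the example runs without downloading anything.

```python
import math
from typing import Callable, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_score_sketch(
    resume_text: str,
    job_text: str,
    encode: Callable[[str], Sequence[float]],
) -> float:
    """Embed both texts with `encode` and compare them (hypothetical helper)."""
    return cosine_similarity(encode(resume_text), encode(job_text))


# Toy stand-in for an embedding model: counts of three fixed terms.
VOCAB = ["python", "sql", "sales"]


def toy_encode(text: str) -> List[float]:
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]


job = "python sql"
for resume in ["python sql python", "sales sales"]:
    print(resume, "->", round(match_score_sketch(resume, job, toy_encode), 3))
```

With sentence-transformers installed, `SentenceTransformer("all-MiniLM-L6-v2").encode` can be passed as `encode` in place of the toy encoder.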
Cell 8 adds keyword extraction on top of the base SBERT score. It defines refined_match_score() and demonstrates the difference between the baseline semantic score and the improved hybrid score on a sample resume.
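One plausible way to turn extracted keywords into a score is the fraction of job-description keywords that also appear in the resume. That definition is an assumption, since the exact overlap metric is not spelled out here, and `keyword_overlap` is a hypothetical name. The keyword sets are written by hand so the sketch runs without spaCy.

```python
from typing import Iterable


def keyword_overlap(resume_keywords: Iterable[str], job_keywords: Iterable[str]) -> float:
    """Share of job keywords found in the resume, compared case-insensitively."""
    job = {k.lower() for k in job_keywords}
    if not job:
        return 0.0  # nothing to match against
    resume = {k.lower() for k in resume_keywords}
    return len(job & resume) / len(job)


# With spaCy, a keyword set could be built from nouns and proper nouns:
#   {tok.lemma_.lower() for tok in nlp(text) if tok.pos_ in {"NOUN", "PROPN"}}
job_keywords = {"python", "tensorflow", "sql", "docker"}
resume_keywords = {"Python", "Keras", "SQL"}

print(keyword_overlap(resume_keywords, job_keywords))  # 0.5
```

Dividing by the number of job keywords, rather than by the union of both sets, avoids penalizing long resumes that mention many skills the posting does not ask for.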
Cell 9 is the final output cell. It ranks all processed resumes against the job description and displays the top 10 candidates in a formatted leaderboard table, sorted by final score descending.
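The ranking step amounts to a descending sort followed by a cut at ten rows. A minimal sketch, assuming the scores are already available as a mapping from a candidate identifier to a final score; `top_candidates`, `format_leaderboard`, and the `resume_N` identifiers are hypothetical.

```python
from typing import Dict, List, Tuple


def top_candidates(scores: Dict[str, float], k: int = 10) -> List[Tuple[str, float]]:
    """Return the k highest-scoring (candidate, score) pairs, best first."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]


def format_leaderboard(ranked: List[Tuple[str, float]]) -> str:
    """Render ranked pairs as a fixed-width text table."""
    lines = [f"{'Rank':<6}{'Candidate':<14}{'Score':>7}"]
    for rank, (candidate, score) in enumerate(ranked, start=1):
        lines.append(f"{rank:<6}{candidate:<14}{score:>7.3f}")
    return "\n".join(lines)


# Made-up scores purely for illustration.
scores = {"resume_3": 0.42, "resume_1": 0.77, "resume_2": 0.61}
print(format_leaderboard(top_candidates(scores)))
```

Because Python's sort is stable, candidates with identical scores keep their original order.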
To evaluate a specific resume PDF against a job description:
- Upload your PDF file to the Colab file browser and rename it my_resume.pdf
- Run Cell 6, which will detect the file and encode it as base64 for evaluation
- Pass the encoded resume into refined_match_score() alongside any job description text
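The file detection and base64 step can be sketched with the standard library alone. This is an illustration, not the notebook's Cell 6: `load_resume_base64` is a hypothetical helper, and it only covers detecting and encoding the file, not extracting text from the PDF.

```python
import base64
from pathlib import Path
from typing import Optional


def load_resume_base64(path: str = "my_resume.pdf") -> Optional[str]:
    """Return the file's bytes as a base64 string, or None if it is absent."""
    pdf = Path(path)
    if not pdf.is_file():
        return None
    return base64.b64encode(pdf.read_bytes()).decode("ascii")


encoded = load_resume_base64()
if encoded is None:
    print("my_resume.pdf not found; skipping individual evaluation")
else:
    print(f"Encoded resume: {len(encoded)} base64 characters")
```

Returning `None` for a missing file lets the rest of the pipeline fall back to batch evaluation without raising an error.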
| File | Description |
|---|---|
| OptiHireDataScraper.ipynb | Main notebook containing the full pipeline |
| README.md | This file |
| my_resume.pdf | Optional personal resume for individual evaluation |
| Resume/Resume.csv | Downloaded Kaggle dataset (generated at runtime) |
The system currently has no ground-truth relevance labels, so evaluation is heuristic rather than formally measured with precision or recall. The 70/30 weighting between semantic and keyword scores was chosen based on observed ranking quality and can be adjusted in the refined_match_score() function. The spaCy NER model may occasionally miss informal name formats or names in unconventionally structured resumes. The keyword component also does not account for related terms within a domain: Keras and TensorFlow, for example, are treated as unrelated keywords.