This project was implemented as part of a take-home technical assignment. All data used is synthetic CSV data created solely for demonstration purposes.
The core objective is to match mentees to mentors based on:
- Research domain compatibility
- Subdomain similarity
The system is designed to be deterministic, locally executable, and easy to evaluate. No API keys or external services are required to run this project.
The project follows a modular architecture separating the matching logic from the presentation layer:
- Backend (Python 3.10+ / FastAPI): Handles CSV parsing, data validation, and core matching algorithms. Uses scikit-learn for semantic text analysis.
- Frontend (Next.js / TypeScript): Provides a responsive interface for users to upload datasets and visualize matching results with confidence breakdowns.
- CLI Utility: Enables batch processing of matching tasks.
The API and frontend components are optional enhancements built to demonstrate how the core matching logic can be exposed and visualized.
The matching logic follows the assignment-specified weighted scoring system (0–100 scale) to ensure transparent and reproducible results. The total confidence score is derived from two components:
The system first evaluates the high-level research_domain field.
- Exact Match: Awards 70 points.
- Mismatch: Awards 0 points.
This all-or-nothing weighting prioritizes matches grounded in the same field of study, as required by the assignment.
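The domain rule above can be sketched in a few lines. This is a minimal illustration rather than the project's actual code: the function name domain_score is hypothetical, and ignoring case and surrounding whitespace is an assumption, since the text only says the match must be exact.

```python
def domain_score(mentee_domain: str, mentor_domain: str) -> int:
    """Return 70 points when the research_domain values match, otherwise 0."""
    # Assumption: case and surrounding whitespace are ignored before comparing.
    same = mentee_domain.strip().lower() == mentor_domain.strip().lower()
    return 70 if same else 0


same_field = domain_score("AI", "AI")        # 70
different_field = domain_score("AI", "Biology")  # 0
```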
The system then refines the match based on specific areas of interest (the subdomain field). The scoring cascades through the following checks in order:
- Exact Match (30 pts): Identical strings (case-insensitive).
- Containment (24 pts): One term is a substring of the other (e.g., "Vision" in "Computer Vision").
- Predefined Related Subdomains (21 pts): Uses mappings to capture common academic overlaps (e.g., NLP ↔ LLMs, Cryptography ↔ Network Security).
- Shared Categories (15 pts): Both subdomains map to a common parent category.
- Semantic Similarity (Variable): If no direct relationship is found, the system builds TF-IDF (Term Frequency-Inverse Document Frequency) vectors for the two terms and scores them by cosine similarity. This allows related concepts to be detected even when the terminology differs.
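The cascade above might look roughly like the following sketch. The lookup tables RELATED and CATEGORY are hypothetical stand-ins (the project's real mappings are not listed here), the function name subdomain_score is illustrative, and the semantic fallback is passed in as a callable because this README does not say how the variable score is scaled.

```python
from typing import Callable

# Hypothetical mappings for illustration only.
RELATED = {
    frozenset({"nlp", "llms"}),
    frozenset({"cryptography", "network security"}),
}
CATEGORY = {"object detection": "vision", "image segmentation": "vision"}


def subdomain_score(
    a: str,
    b: str,
    fallback: Callable[[str, str], float] = lambda x, y: 0.0,
) -> float:
    """Score two subdomains by trying each rule in order; the first hit wins."""
    x, y = a.strip().lower(), b.strip().lower()
    if not x or not y:
        return 0.0  # assumption: an empty subdomain earns no points
    if x == y:
        return 30  # exact match (case-insensitive)
    if x in y or y in x:
        return 24  # containment
    if frozenset({x, y}) in RELATED:
        return 21  # predefined related subdomains
    cx, cy = CATEGORY.get(x), CATEGORY.get(y)
    if cx is not None and cx == cy:
        return 15  # shared parent category
    return fallback(x, y)  # semantic similarity, scaling left to the caller
```

Under these assumed tables, subdomain_score("Vision", "Computer Vision") yields 24 and subdomain_score("NLP", "LLMs") yields 21. Keeping the fallback pluggable means the rule-based tiers can be tested without fitting a text model.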
The system produces a deterministic output file (JSON or CSV) containing the following key fields:
- mentee_name: Name of the student.
- matched_mentor: Name of the assigned mentor.
- confidence_score: A numeric score (0-100) indicating match quality.
- match_reason: A human-readable explanation of why the match was made.
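As a rough illustration of the CSV variant of this output, the sketch below serializes result rows with Python's standard csv module. The helper name results_to_csv and the sample row are hypothetical; only the four column names come from the field list above.

```python
import csv
import io

FIELDS = ["mentee_name", "matched_mentor", "confidence_score", "match_reason"]


def results_to_csv(rows: list[dict]) -> str:
    """Serialize match results to CSV text, preserving the given row order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical row: 70 (same domain) + 21 (related subdomains).
csv_text = results_to_csv([
    {
        "mentee_name": "Student A",
        "matched_mentor": "Prof B",
        "confidence_score": 91,
        "match_reason": "Same domain (AI), related subdomains (NLP, LLMs)",
    }
])
```

The csv module quotes any field that contains a comma, which is why a match_reason with commas appears in double quotes in the file.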
mentee_name,matched_mentor,confidence_score,match_reason
Aanya N,Dr. Sharma,95,"Same domain (AI) and same subdomain (Computer Vision)"

- Python 3.10 or higher
- Node.js 18 or higher (for the optional frontend)
- npm or yarn
The backend handles the core logic.
- Navigate to the backend directory:
  cd backend
- Install Python dependencies:
  pip install -r requirements.txt
- Run the CLI (recommended for evaluation). This lets you match CSV files directly from the terminal:
  python cli.py --mentees ../sample_mentees.csv --mentors ../sample_mentors.csv --output results.json
  Arguments:
  - --mentees: Path to the mentees CSV file (Required)
  - --mentors: Path to the mentors CSV file (Required)
  - --output: Destination path for the result file (Default: output.json)
  - --format: Output format, either json or csv (Default: json)
- (Optional) Run the API server:
  python main.py
  The API will be available at http://localhost:8000.
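For readers curious how the CLI arguments listed above could be declared, here is a minimal argparse sketch. It is not taken from cli.py: the function name build_parser and the help strings are assumptions, while the flags, required status, and defaults follow the argument list.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the four documented flags with their documented defaults."""
    parser = argparse.ArgumentParser(description="Match mentees to mentors from CSV files.")
    parser.add_argument("--mentees", required=True, help="Path to the mentees CSV file")
    parser.add_argument("--mentors", required=True, help="Path to the mentors CSV file")
    parser.add_argument("--output", default="output.json", help="Destination path for the result file")
    parser.add_argument("--format", choices=["json", "csv"], default="json", help="Output format")
    return parser


# Parsing an explicit argument list instead of sys.argv keeps the example self-contained.
args = build_parser().parse_args(
    ["--mentees", "../sample_mentees.csv", "--mentors", "../sample_mentors.csv"]
)
```

With only the two required flags supplied, args.output falls back to output.json and args.format to json.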
The frontend provides a graphical user interface for the system.
- Navigate to the root directory (containing package.json).
- Install dependencies:
  npm install
- Start the development server:
  npm run dev
  Access the application at http://localhost:3000.
Accepts JSON payloads of mentees and mentors to process matches.
Request Body:
{
  "mentees": [{"name": "Student A", "research_domain": "AI", "subdomain": "NLP", ...}],
  "mentors": [{"name": "Prof B", "research_domain": "AI", "subdomain": "LLMs", ...}]
}

Accepts multipart/form-data uploads of CSV files.
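To show how the two input styles relate, the sketch below turns CSV text into a JSON request body of the shape shown above. It assumes the CSV column headers are exactly the JSON field names (name, research_domain, subdomain), which this README does not confirm, and csv_to_records and build_payload are hypothetical helper names.

```python
import csv
import io
import json


def csv_to_records(csv_text: str) -> list[dict]:
    """Parse CSV text into one dict per row, keyed by the header names."""
    return [dict(row) for row in csv.DictReader(io.StringIO(csv_text))]


def build_payload(mentees_csv: str, mentors_csv: str) -> str:
    """Build a JSON body with 'mentees' and 'mentors' lists from two CSV texts."""
    payload = {
        "mentees": csv_to_records(mentees_csv),
        "mentors": csv_to_records(mentors_csv),
    }
    return json.dumps(payload, indent=2)


# Hypothetical one-row inputs mirroring the request body example.
body = build_payload(
    "name,research_domain,subdomain\nStudent A,AI,NLP\n",
    "name,research_domain,subdomain\nProf B,AI,LLMs\n",
)
```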
- Deterministic Logic: The matching service is designed to be deterministic, ensuring that the same input always produces the exact same output.
- Local Text Analysis: TF-IDF models are fitted locally on the provided batch, so semantic matching reflects the vocabulary of the input dataset and requires no external services.
- Type Safety: Pydantic models (Backend) and TypeScript interfaces (Frontend) are strictly enforced to minimize runtime errors and ensure data integrity.
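The local text analysis point can be illustrated with a small dependency-free sketch of TF-IDF fitted on a batch, followed by cosine similarity. This is a stand-in for illustration, not the project's scikit-learn code: tokenization is simplified to whitespace splitting, the function names are hypothetical, and the smoothed IDF formula follows the default used by scikit-learn's TfidfVectorizer.

```python
import math
from collections import Counter


def tfidf_vectors(docs: list[str]) -> list[dict[str, float]]:
    """Fit TF-IDF on the batch and return one L2-normalized sparse vector per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1.
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    vectors = []
    for tokens in tokenized:
        weights = {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors


def cosine(u: dict[str, float], v: dict[str, float]) -> float:
    """Cosine similarity of two unit-length vectors (a plain dot product)."""
    return sum(weight * v.get(term, 0.0) for term, weight in u.items())


vecs = tfidf_vectors(["machine learning", "deep learning", "marine biology"])
shared_term = cosine(vecs[0], vecs[1])  # positive: both contain "learning"
no_overlap = cosine(vecs[0], vecs[2])   # zero: no terms in common
```

Because the IDF weights are computed only from the batch passed in, the same batch always yields the same vectors, which is consistent with the deterministic design described above.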