This project is a simple yet complete demonstration of how to classify documents into categories such as:
- 📝 contract
- 📄 invoice
- 🏥 medical_record
It uses TF-IDF + Logistic Regression under the hood and includes:
- A FastAPI-based backend
- A Streamlit-based frontend
- Auto-training logic if no model is present
- PDF and raw text classification support
- ✅ Train from scratch (only if no model exists)
- ✅ Classify PDF or plain text files
- ✅ Expose REST API for integration
- ✅ Streamlit UI for easy testing
- ✅ Clean and modular project structure
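
The TF-IDF + Logistic Regression approach and the "train only if no model exists" behavior can be sketched roughly as follows. This is a minimal illustration, not the project's actual `train_model.py`: the function names `train` and `load_or_train` and the inline sample data are assumptions made for the example.

```python
from pathlib import Path

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training data; the project ships its own sample_dataset.csv.
SAMPLE_TEXTS = [
    "This contract is made between the Company and the Contractor.",
    "The parties agree to the terms of this service agreement.",
    "Invoice #45678 - Amount Due: $1,200.00",
    "Payment is due within 30 days of the invoice date.",
    "The patient presents with chronic lower back pain.",
    "Patient history includes a herniated disc and hypertension.",
]
SAMPLE_LABELS = [
    "contract", "contract",
    "invoice", "invoice",
    "medical_record", "medical_record",
]


def train(texts, labels):
    """Fit a TF-IDF vectorizer followed by a logistic regression classifier."""
    pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    pipeline.fit(texts, labels)
    return pipeline


def load_or_train(model_path, texts, labels):
    """Load a saved model, or train and save one if the file is missing."""
    model_path = Path(model_path)
    if model_path.exists():
        return joblib.load(model_path)
    model = train(texts, labels)
    model_path.parent.mkdir(parents=True, exist_ok=True)
    joblib.dump(model, model_path)
    return model
```

Calling `load_or_train("model/document_classifier_model.joblib", SAMPLE_TEXTS, SAMPLE_LABELS)` trains on first use and reuses the saved file afterwards; `model.predict([text])` then returns one label per input text.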
document-classifier/
├── app/ # FastAPI app
│ ├── main.py
│ ├── model.py
│ └── schemas.py
├── classifier/ # Model training and preprocessing
│ ├── train_model.py
│ ├── preprocess.py
│ └── sample_dataset.csv
├── model/ # Trained model file
│ └── document_classifier_model.joblib
├── streamlit_app/ # Streamlit interface
│ └── app.py
├── requirements.txt
└── README.md
- Install dependencies:
  pip install -r requirements.txt
- Run the FastAPI backend:
  uvicorn app.main:app --reload
- Run the Streamlit app (in another terminal):
  streamlit run streamlit_app/app.py

- Python 3
- Scikit-learn
- FastAPI
- Streamlit
- PyMuPDF (for reading PDFs)
Send raw text to get a classification result. Here are a few examples:
Contract:
This contract is made between the Company and the Contractor, and outlines the scope of services.
Invoice:
Invoice #45678 - Amount Due: $1,200.00 - Due Date: July 15, 2025
Medical Record:
The patient presents with chronic lower back pain and has a history of herniated disc.
{
"text": "This contract is made between the Company and the Contractor, and outlines the scope of services."
}

Upload a PDF file and get a predicted label.
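
A raw-text request with the JSON body shown above could be sent from Python along these lines. The URL is an assumption: it uses uvicorn's default host and port, and the route name `/predict` is hypothetical, so check `app/main.py` for the real path. The response is assumed to be JSON.

```python
import json
import urllib.request

# Assumption: local uvicorn defaults plus a hypothetical /predict route.
API_URL = "http://127.0.0.1:8000/predict"


def build_request(text: str, url: str = API_URL) -> urllib.request.Request:
    """Build a POST request carrying the text as a JSON body."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def classify(text: str, url: str = API_URL) -> dict:
    """Send the text to the running backend and return the decoded JSON reply."""
    with urllib.request.urlopen(build_request(text, url)) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the backend running, `classify("Invoice #45678 - Amount Due: $1,200.00")` would return whatever JSON the API responds with; the exact response fields depend on `app/schemas.py`.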

