Build an auto-complete system using N-gram language models trained on Twitter data.
This project implements a complete NLP pipeline:
- Part A: Data preprocessing (tokenization, vocabulary, OOV handling)
- Part B: N-gram language models (1-gram to 6-gram)
- Part C: Model evaluation (perplexity scores)
- Part D: Auto-complete system (word suggestions)
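As a rough illustration of what Part A involves, the sketch below tokenizes text, builds a vocabulary from frequent tokens, and maps everything else to an out-of-vocabulary marker. The function names, the `<unk>` token, the regex, and the `min_count` threshold are assumptions made for this example and are not taken from the project code.

```python
import re
from collections import Counter

UNK = "<unk>"  # assumed placeholder for out-of-vocabulary tokens


def tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    return re.findall(r"[a-z0-9']+|[^\sa-z0-9']", text.lower())


def build_vocab(tokenized_sentences, min_count=2):
    """Keep only tokens that occur at least min_count times."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    return {tok for tok, c in counts.items() if c >= min_count}


def replace_oov(tokenized_sentences, vocab):
    """Replace every token outside the vocabulary with UNK."""
    return [[tok if tok in vocab else UNK for tok in sent]
            for sent in tokenized_sentences]


sentences = [tokenize("I love NLP!"), tokenize("I love pizza")]
vocab = build_vocab(sentences, min_count=2)
print(replace_oov(sentences, vocab))
```

Raising `min_count` shrinks the vocabulary and sends more rare words to `<unk>`, which usually matters a lot for noisy tweet text.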
pip install -r requirements.txt
python -m src.main
Or in Google Colab:
exec(open('SUBMISSION.py').read())
NLP-AutoComplete-Project/
├── SUBMISSION.py # Main code
├── README.md
├── requirements.txt
├── .gitignore
├── data/
│ ├── raw/
│ │ └── training.1600000.processed.noemoticon.csv
│ └── processed/
└── src/
    ├── part_a_preprocessing/
    ├── part_b_language_model/
    ├── part_c_evaluation/
    └── part_d_autocomplete/
Sentiment140: 1.6M tweets
- Download: https://www.kaggle.com/datasets/kazanova/sentiment140
- Format: CSV without a header row; the tweet text is in column 5 (zero-based index, the sixth and last field)
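One possible way to read the tweet text from the raw file is sketched below using only the standard library. It assumes that "column 5" is a zero-based index and that the file is Latin-1 encoded; `load_tweets` and its parameters are hypothetical names for this example, not the project's actual loader.

```python
import csv


def load_tweets(path, limit=None, text_index=5, encoding="latin-1"):
    """Return the text field of each row, reading at most `limit` rows."""
    tweets = []
    with open(path, newline="", encoding=encoding) as f:
        for i, row in enumerate(csv.reader(f)):
            if limit is not None and i >= limit:
                break
            if len(row) > text_index:  # skip malformed short rows
                tweets.append(row[text_index])
    return tweets


# Example call (path as shown in the project tree):
# tweets = load_tweets(
#     "data/raw/training.1600000.processed.noemoticon.csv", limit=50_000)
```

Using `limit` keeps memory and training time manageable when experimenting with a subset of the 1.6M tweets.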
Typical output:
Train: 40,000 | Test: 10,000 | Vocab: 5,901
1-gram perplexity: 319.59
2-gram perplexity: 289.55
3-gram perplexity: 1256.35
'i love' → ['you', 'to', 'the', 'it', 'my']
'how are you' → ["'re", 'have', '!', 'are', '.']
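To make the perplexity and suggestion lines above more concrete, here is a minimal sketch of an N-gram model with add-k smoothing. The README does not state which smoothing the project uses, so add-k, the `<s>`/`<e>` boundary tokens, and all function names here are assumptions for illustration only.

```python
import math
from collections import Counter

START, END = "<s>", "<e>"  # assumed sentence boundary tokens


def pad(sent, n):
    """Add n-1 start tokens and one end token to a sentence."""
    return [START] * (n - 1) + list(sent) + [END]


def train(sentences, n):
    """Count n-grams and their (n-1)-token contexts."""
    ngram_counts, context_counts = Counter(), Counter()
    for sent in sentences:
        padded = pad(sent, n)
        for i in range(len(padded) - n + 1):
            ngram = tuple(padded[i:i + n])
            ngram_counts[ngram] += 1
            context_counts[ngram[:-1]] += 1
    return ngram_counts, context_counts


def prob(word, context, ngram_counts, context_counts, vocab_size, k=1.0):
    """Add-k smoothed estimate of P(word | context)."""
    context = tuple(context)
    numerator = ngram_counts[context + (word,)] + k
    return numerator / (context_counts[context] + k * vocab_size)


def perplexity(sentences, n, ngram_counts, context_counts, vocab_size, k=1.0):
    """exp of the average negative log-probability per predicted token."""
    log_sum, total = 0.0, 0
    for sent in sentences:
        padded = pad(sent, n)
        for i in range(n - 1, len(padded)):
            p = prob(padded[i], padded[i - n + 1:i],
                     ngram_counts, context_counts, vocab_size, k)
            log_sum += math.log(p)
            total += 1
    return math.exp(-log_sum / total)


def suggest(prefix, n, ngram_counts, context_counts, vocab, k=1.0, top=5):
    """Rank vocabulary words by smoothed probability after the prefix."""
    context = ([START] * (n - 1) + list(prefix))[-(n - 1):] if n > 1 else []
    score = lambda w: prob(w, context, ngram_counts, context_counts, len(vocab), k)
    # Sort by descending probability, breaking ties alphabetically.
    return sorted(vocab, key=lambda w: (-score(w), w))[:top]


corpus = [["i", "love", "you"], ["i", "love", "nlp"], ["i", "love", "you"]]
vocab = {tok for sent in corpus for tok in sent} | {END}
ngrams, contexts = train(corpus, n=2)
print(suggest(["i", "love"], 2, ngrams, contexts, vocab, top=2))
print(perplexity(corpus, 2, ngrams, contexts, len(vocab)))
```

Lower perplexity means the model assigns higher probability to the test text. With a fixed amount of training data, longer contexts are seen less often, so higher-order models lean more heavily on smoothing.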
Roshane Shahbaz (220444)
Mehvish Fatima
Natural Language Processing