NLP Auto-Complete Project

Build an auto-complete system using N-gram language models trained on Twitter data.

Overview

This project implements a complete NLP pipeline:

Part A: Data preprocessing (tokenization, vocabulary, OOV handling)
Part B: N-gram language models (1-gram to 6-gram)
Part C: Model evaluation (perplexity scores)
Part D: Auto-complete system (word suggestions)
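
The first two stages above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names, the `<unk>`, `<s>`, and `<e>` tokens, and the `min_count` threshold are all assumptions, and the whitespace tokenizer is a stand-in for whatever tokenizer the project really uses.

```python
from collections import Counter

UNK = "<unk>"              # assumed out-of-vocabulary placeholder
START, END = "<s>", "<e>"  # assumed sentence boundary tokens


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def build_vocab(tokenized_sentences, min_count=2):
    """Keep only tokens that occur at least min_count times."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    return {tok for tok, c in counts.items() if c >= min_count}


def replace_oov(tokenized_sentences, vocab):
    """Map every token outside the vocabulary to the UNK placeholder."""
    return [[tok if tok in vocab else UNK for tok in sent]
            for sent in tokenized_sentences]


def count_ngrams(tokenized_sentences, n):
    """Count n-grams (as tuples) over sentences padded with boundary tokens."""
    counts = Counter()
    for sent in tokenized_sentences:
        padded = [START] * (n - 1) + sent + [END]
        for i in range(len(padded) - n + 1):
            counts[tuple(padded[i:i + n])] += 1
    return counts


# Toy data standing in for tweets
sentences = [tokenize(t) for t in ["I love NLP", "i love you", "you love pizza"]]
vocab = build_vocab(sentences, min_count=2)
cleaned = replace_oov(sentences, vocab)
print(cleaned[0])                                 # ['i', 'love', '<unk>']
print(count_ngrams(cleaned, 2)[("i", "love")])    # 2
```

Replacing rare words with a single placeholder keeps the vocabulary small and gives the model a count for words it has never seen, at the cost of losing their identity.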

Setup

pip install -r requirements.txt

Run

python -m src.main

Or in Google Colab:

exec(open('SUBMISSION.py').read())

Project Structure

NLP-AutoComplete-Project/
├── SUBMISSION.py              # Main code
├── README.md
├── requirements.txt
├── .gitignore
├── data/
│   ├── raw/
│   │   └── training.1600000.processed.noemoticon.csv
│   └── processed/
└── src/
    ├── part_a_preprocessing/
    ├── part_b_language_model/
    ├── part_c_evaluation/
    └── part_d_autocomplete/

Dataset

Sentiment140: 1.6M tweets

Download: https://www.kaggle.com/datasets/kazanova/sentiment140
Format: CSV with no header row; the tweet text is the sixth (last) column, i.e. zero-based index 5
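
Reading the tweet text out of the CSV might look like the sketch below. It uses only the standard library; `load_tweets`, its parameters, and the `latin-1` encoding are assumptions made for illustration (the file is commonly read with that encoding, but check against your copy), and the text is taken from zero-based column index 5.

```python
import csv


def load_tweets(path, text_column=5, limit=None):
    """Return the text field of each row, optionally stopping after `limit` rows."""
    tweets = []
    # newline="" lets the csv module handle quoted fields containing line breaks
    with open(path, newline="", encoding="latin-1") as f:
        for i, row in enumerate(csv.reader(f)):
            if limit is not None and i >= limit:
                break
            tweets.append(row[text_column])
    return tweets
```

For example, `load_tweets("data/raw/training.1600000.processed.noemoticon.csv", limit=1000)` would read the first 1,000 rows; the limit value here is arbitrary.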

Results

Typical output:

Train: 40,000 | Test: 10,000 | Vocab: 5,901

1-gram perplexity: 319.59
2-gram perplexity: 289.55
3-gram perplexity: 1256.35

'i love' → ['you', 'to', 'the', 'it', 'my']
'how are you' → ["'re", 'have', '!', 'are', '.']
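
To make the two kinds of output above concrete, here is a rough sketch of how a perplexity score and a list of suggestions can be computed from n-gram counts. It assumes add-k smoothing and ranks suggestions by raw count; the README does not say which smoothing or ranking the project uses, so the function names, the boundary tokens, and `k` are all hypothetical.

```python
import math
from collections import Counter

START, END = "<s>", "<e>"  # assumed boundary tokens, not taken from the project


def count_ngrams(sentences, n):
    """Count n-grams over tokenized sentences padded with boundary tokens."""
    counts = Counter()
    for sent in sentences:
        padded = [START] * (n - 1) + sent + [END]
        for i in range(len(padded) - n + 1):
            counts[tuple(padded[i:i + n])] += 1
    return counts


def perplexity(sentence, ngram_counts, context_counts, vocab_size, n, k=1.0):
    """Perplexity of one tokenized sentence under an add-k smoothed n-gram model."""
    padded = [START] * (n - 1) + sentence + [END]
    log_prob = 0.0
    for i in range(n - 1, len(padded)):
        context = tuple(padded[i - n + 1:i])
        ngram = context + (padded[i],)
        prob = (ngram_counts[ngram] + k) / (context_counts[context] + k * vocab_size)
        log_prob += math.log(prob)
    num_predictions = len(padded) - (n - 1)
    # exp of the average negative log-probability per predicted token
    return math.exp(-log_prob / num_predictions)


def suggest(previous_tokens, ngram_counts, n, top_k=5):
    """Return the top_k most frequent words seen after the last n-1 tokens."""
    context = tuple(([START] * (n - 1) + previous_tokens)[-(n - 1):]) if n > 1 else ()
    candidates = Counter()
    for ngram, count in ngram_counts.items():
        if ngram[:-1] == context:
            candidates[ngram[-1]] += count
    return [word for word, _ in candidates.most_common(top_k)]


# Toy training data
train = [["i", "love", "you"], ["i", "love", "you"], ["i", "love", "it"]]
bigrams = count_ngrams(train, 2)
# Context counts: how often each word appears as the first element of a bigram
contexts = Counter()
for (first, _second), count in bigrams.items():
    contexts[(first,)] += count
vocab_size = 5  # i, love, you, it, <e>: hypothetical toy vocabulary

print(suggest(["i", "love"], bigrams, n=2, top_k=2))  # ['you', 'it']
print(perplexity(["i", "love", "you"], bigrams, contexts, vocab_size, n=2))
```

With this kind of smoothing, higher-order models spread probability mass over many unseen contexts when training data is limited, which is one plausible (but unconfirmed) reason a 3-gram score can come out worse than a 2-gram score.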

Authors

Instructor

Course

Semester Project

Deadline: 17 May, 2026
