demonayush11/interview-bot

GPT from Scratch — Story Pretraining & Interview Bot Fine-tuning

A complete end-to-end notebook that builds a GPT-2-style language model from scratch, pretrains it on the TinyStories dataset, and then fine-tunes it into a functional Interview Bot using instruction-following data.

Built in Google Colab with a T4 GPU.


What This Notebook Does

Stage               Description
1. Data Loading     Loads the TinyStories dataset (10k–50k stories) from HuggingFace
2. Tokenization     Builds a custom tokenizer, then upgrades to GPT-2's BPE tokenizer via tiktoken
3. Data Pipeline    Sliding window dataset (GPTDatasetV1) with configurable stride and context length
4. Embeddings       Token embeddings + positional embeddings from scratch
5. Attention        Implements scaled dot-product self-attention, causal masking, and multi-head attention
6. GPT Model        Full GPTModel built with transformer blocks, layer norm, and feed-forward layers
7. Pretraining      Trains the GPT on TinyStories for next-token prediction
8. Fine-tuning      Instruction fine-tunes the pretrained model on Alpaca-style Q&A data
9. Interview Bot    Final model answers CS/Python questions in an ask() interface
10. Export          Saves the model as gpt2_interview_bot.pth and gpt2_alpaca_base.pth

Project Structure

notebook/
├── Untitled4.ipynb          # Main notebook
├── gpt2_interview_bot.pth   # Fine-tuned model checkpoint
└── gpt2_alpaca_base.pth     # Pretrained base checkpoint (pre fine-tune)

Requirements

pip install datasets tiktoken torch

Library     Purpose
torch       Model building and training
tiktoken    GPT-2 BPE tokenizer
datasets    Loading TinyStories from HuggingFace

Model Architecture

The GPT model is built entirely from scratch, following the GPT-2 design:

  • Vocabulary size: 50,257 (GPT-2 BPE)
  • Context length: 256 tokens (pretraining) / 1024 tokens (fine-tuning)
  • Embedding dim: configurable (256 used for pretraining)
  • Attention: Multi-head causal self-attention with dropout
  • Blocks: Stacked transformer blocks with Layer Norm and GELU activation
  • Output: Linear projection back to vocabulary size
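
The attention bullet above can be made concrete with a small sketch. The block below is not the notebook's code: it is a single-head, pure-Python illustration of scaled dot-product attention with a causal mask, written with plain lists so it runs without PyTorch. The function name `causal_attention` is hypothetical, and it omits the learned query/key/value projections, multiple heads, and dropout that the notebook's model uses.

```python
import math


def causal_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask.

    Each argument is a list of vectors, one per token position.
    Illustrative only: no learned projections, heads, or dropout.
    """
    d_k = len(keys[0])
    outputs = []
    for i, q in enumerate(queries):
        # Causal mask: position i may only score positions 0..i.
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d_k)
            for j in range(i + 1)
        ]
        # Numerically stable softmax over the visible positions.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the visible value vectors.
        outputs.append([
            sum(w * values[j][k] for j, w in enumerate(weights))
            for k in range(len(values[0]))
        ])
    return outputs


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = causal_attention(x, x, x)
print(out[0])  # the first token can only attend to itself
```

Because each position only sees itself and earlier positions, changing a later token leaves earlier outputs unchanged, which is the property that makes next-token prediction training valid.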

Stage 1 — Pretraining on TinyStories

The model is pretrained on 10,000–50,000 short children's stories from the TinyStories dataset.

Stories are joined with <|endoftext|> separators and fed through a sliding window dataloader (GPTDatasetV1) for next-token prediction.
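
As a rough illustration of the sliding window idea, the sketch below builds (input, target) pairs from a flat list of token IDs, where each target is the input shifted one position to the right. It is a simplified stand-in for GPTDatasetV1, not a copy of it: the function name `sliding_windows` is hypothetical and it returns plain lists rather than tensors.

```python
def sliding_windows(token_ids, context_length, stride):
    """Return (input, target) pairs; each target is its input shifted by one token."""
    pairs = []
    # Stop early enough that the shifted target window still fits.
    for start in range(0, len(token_ids) - context_length, stride):
        inputs = token_ids[start : start + context_length]
        targets = token_ids[start + 1 : start + context_length + 1]
        pairs.append((inputs, targets))
    return pairs


for inputs, targets in sliding_windows(list(range(10)), context_length=4, stride=4):
    print(inputs, "->", targets)
```

A stride equal to the context length gives non-overlapping windows; a smaller stride produces more training examples at the cost of overlap between them.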

Tokenization evolution in the notebook:

  1. Custom regex-based tokenizer (SimpleTokenizerV1) — ~10,980 vocab tokens from TinyStories
  2. Upgraded to tiktoken GPT-2 BPE tokenizer — 50,257 vocab tokens
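
To show why the upgrade matters, here is a minimal sketch of a regex-based word tokenizer in the spirit of SimpleTokenizerV1. The class name `SimpleRegexTokenizer` and the exact split pattern are assumptions, not taken from the notebook. Its vocabulary is limited to tokens seen in the training text, so encoding an unseen word fails, whereas a BPE tokenizer can fall back to smaller subword pieces.

```python
import re


class SimpleRegexTokenizer:
    """Hypothetical stand-in for a word-level, regex-based tokenizer."""

    # Assumed pattern: split on common punctuation, double dashes, and whitespace.
    PATTERN = r'([,.:;?_!"()\']|--|\s)'

    def __init__(self, text):
        vocab = sorted(set(self._split(text)))
        self.str_to_int = {tok: i for i, tok in enumerate(vocab)}
        self.int_to_str = {i: tok for tok, i in self.str_to_int.items()}

    def _split(self, text):
        return [t.strip() for t in re.split(self.PATTERN, text) if t.strip()]

    def encode(self, text):
        # Raises KeyError for any token that was not in the training text.
        return [self.str_to_int[t] for t in self._split(text)]

    def decode(self, ids):
        text = " ".join(self.int_to_str[i] for i in ids)
        # Remove the space that the join put before punctuation.
        return re.sub(r"\s+([,.:;?!])", r"\1", text)


tok = SimpleRegexTokenizer("Once upon a time, there was a cat.")
ids = tok.encode("there was a time.")
print(ids, "->", tok.decode(ids))
```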

Stage 2 — Fine-tuning as an Interview Bot

After pretraining, the model is fine-tuned on instruction-following data using the Alpaca prompt format:

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
What is the difference between a list and a tuple in Python?

### Response:
A list is mutable while a tuple is immutable...
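
A small helper can produce this format for both training examples and inference prompts. The sketch below is an assumption about how such a helper might look; the name `format_prompt` is hypothetical and does not come from the notebook.

```python
def format_prompt(instruction, response=""):
    """Build an Alpaca-style prompt; pass a response for training examples."""
    prompt = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{instruction}"
        "\n\n### Response:\n"
    )
    return prompt + response


print(format_prompt("What is the difference between a list and a tuple in Python?"))
```

At inference time the response is left empty, so the prompt ends right after the "### Response:" header and the model continues from there.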

Key fine-tuning details:

  • Optimizer: AdamW (lr=5e-5, weight_decay=0.1)
  • Batch size: 8
  • Epochs: 1–2
  • Padding: Custom collate function masks padding tokens with ignore_index=-100 so they don't contribute to loss
  • Max sequence length: 1024 tokens
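
The padding bullet above can be sketched as follows. This is an illustrative, list-based version, not the notebook's collate function: the name `collate`, the use of token 50256 (GPT-2's <|endoftext|>) as the padding token, and the choice to keep the first padding token as a real "stop" target are all assumptions. In PyTorch, the cross-entropy loss skips targets equal to -100 by default, which is why that value is used as the mask.

```python
IGNORE_INDEX = -100   # target value that the loss skips
PAD_TOKEN_ID = 50256  # assumption: <|endoftext|> doubles as the padding token


def collate(batch, pad_id=PAD_TOKEN_ID, ignore_index=IGNORE_INDEX, max_length=1024):
    """Pad token-id lists to equal length and build shifted, masked targets."""
    longest = max(len(seq) for seq in batch) + 1  # +1 leaves room for one end token
    inputs_batch, targets_batch = [], []
    for seq in batch:
        padded = seq + [pad_id] * (longest - len(seq))
        inputs = padded[:-1]
        targets = padded[1:]  # next-token targets: inputs shifted by one
        # Assumed design choice: keep the first pad as a "stop" target,
        # and mask every later pad so it adds nothing to the loss.
        first_pad = len(seq) - 1
        targets = [
            t if i <= first_pad else ignore_index
            for i, t in enumerate(targets)
        ]
        inputs_batch.append(inputs[:max_length])
        targets_batch.append(targets[:max_length])
    return inputs_batch, targets_batch


inputs, targets = collate([[1, 2, 3], [4]])
print(inputs)
print(targets)
```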

Using the Interview Bot

After fine-tuning, use the ask() function to query the model:

questions = [
    "What is the difference between a list and a tuple in Python?",
    "Explain what a decorator is in Python.",
    "What is the time complexity of binary search?",
    "What is a deadlock in operating systems?",
]

for q in questions:
    print(f"Q: {q}")
    print(f"A: {ask(q)}")
    print("---")

Sample output:

Q: What is a deadlock in operating systems?
A: A deadlock is a situation in which a process tries to do more work
   than it has access to doing...

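Note: Because this is a small GPT-2-style model trained only briefly, answers tend to stay on topic but are often imprecise or wrong. The sample above, for instance, does not describe a deadlock, which is a state where processes wait indefinitely on resources held by one another. Fine-tuning on more data or for more epochs should improve quality.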
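
The README does not show how ask() is implemented, so the following is only a sketch of the greedy, EOS-aware decoding loop that such a function could wrap. The names `generate_greedy` and `next_token_logits` are hypothetical; `next_token_logits` stands in for a forward pass of the model that returns one score per vocabulary entry for the next token. A toy function replaces the real model here so that the loop can be run on its own.

```python
EOS_ID = 50256  # <|endoftext|> in the GPT-2 BPE vocabulary


def generate_greedy(next_token_logits, prompt_ids, max_new_tokens,
                    context_length=1024, eos_id=EOS_ID):
    """Append the highest-scoring token until EOS or the token budget is hit."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        # Only the most recent context_length tokens fit in the model's window.
        logits = next_token_logits(ids[-context_length:])
        next_id = max(range(len(logits)), key=logits.__getitem__)  # argmax
        if next_id == eos_id:
            break  # stop at end-of-text instead of generating past it
        ids.append(next_id)
    return ids[len(prompt_ids):]  # only the newly generated tokens


def toy_model(ids):
    """Toy stand-in for a model: always prefers the last token's id plus one."""
    logits = [0.0] * 5
    logits[min(ids[-1] + 1, 4)] = 1.0
    return logits


print(generate_greedy(toy_model, [0], max_new_tokens=10, eos_id=4))
```

In a real ask() function, the prompt would first be formatted and tokenized, and the returned IDs decoded back to text.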


Saving & Loading the Model

Save:

torch.save(model.state_dict(), "gpt2_interview_bot.pth")

Download from Colab:

from google.colab import files
files.download("gpt2_interview_bot.pth")

Reload:

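import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = GPTModel(BASE_CONFIG)  # must match the config used when saving
model.load_state_dict(torch.load("gpt2_interview_bot.pth", map_location=device))
model.to(device)
model.eval()  # disable dropout for inference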

Key Concepts Covered

  • Byte Pair Encoding (BPE) tokenization
  • Token + positional embeddings
  • Scaled dot-product attention and causal masking
  • Multi-head self-attention
  • Transformer blocks with residual connections
  • Next-token prediction pretraining
  • Instruction fine-tuning with masked loss (padding tokens ignored)
  • Greedy / EOS-aware text generation
