A transformer-based Small Language Model trained from scratch to generate creative children's stories. This project implements a GPT-style architecture using PyTorch and trains it on the TinyStories dataset.
This project demonstrates building a complete language model pipeline from scratch, including:
- Custom transformer architecture implementation
- Data preprocessing and tokenization
- Training loop with validation
- Text generation capabilities
The model is trained on the TinyStories dataset, which contains simple children's stories, making it perfect for training a small language model.
The model implements a GPT-style decoder-only transformer with the following components:
- Embedding Layer: Converts token IDs to dense vector representations
- Positional Encoding: Adds positional information using sinusoidal functions
- Multi-Head Attention: Implements self-attention mechanism with multiple heads
- Masked Multi-Head Attention: Prevents the model from looking ahead during training
- Feed-Forward Network: Two-layer MLP with GELU activation
- Layer Normalization: Stabilizes training
- Residual Connections: Enables deeper networks
- Embedding Dimension: 384
- Number of Layers: 6
- Number of Attention Heads: 6
- Vocabulary Size: 50,257 (GPT-2 tokenizer)
- Context Length: 128 tokens
- Dropout: 0.1
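Putting the components and hyperparameters above together, a single decoder block looks roughly like the sketch below. This is an illustrative reconstruction, not the notebook's code: it leans on PyTorch's built-in `nn.MultiheadAttention` for brevity (SLM.ipynb implements the attention math by hand), and the names `DecoderBlock` and `sinusoidal_positions` are hypothetical.

```python
import math
import torch
import torch.nn as nn

# Hyperparameters from the list above
N_EMBD, N_HEAD, BLOCK_SIZE, DROPOUT = 384, 6, 128, 0.1

def sinusoidal_positions(block_size, n_embd):
    """Fixed sinusoidal positional encodings (Vaswani et al., 2017)."""
    pos = torch.arange(block_size).unsqueeze(1)
    div = torch.exp(torch.arange(0, n_embd, 2) * (-math.log(10000.0) / n_embd))
    pe = torch.zeros(block_size, n_embd)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

class DecoderBlock(nn.Module):
    """Masked self-attention + GELU feed-forward, each wrapped in LayerNorm and a residual."""
    def __init__(self):
        super().__init__()
        self.ln1 = nn.LayerNorm(N_EMBD)
        self.attn = nn.MultiheadAttention(N_EMBD, N_HEAD, dropout=DROPOUT, batch_first=True)
        self.ln2 = nn.LayerNorm(N_EMBD)
        self.mlp = nn.Sequential(
            nn.Linear(N_EMBD, 4 * N_EMBD),
            nn.GELU(),
            nn.Linear(4 * N_EMBD, N_EMBD),
            nn.Dropout(DROPOUT),
        )

    def forward(self, x):
        T = x.size(1)
        # Causal mask: True marks future positions that must not be attended to
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal, need_weights=False)
        x = x + attn_out                 # residual connection
        x = x + self.mlp(self.ln2(x))    # residual connection
        return x

# Token embeddings plus positional encodings feed a stack of 6 such blocks
tok_emb = nn.Embedding(50257, N_EMBD)
idx = torch.randint(0, 50257, (1, BLOCK_SIZE))           # a dummy batch of token IDs
x = tok_emb(idx) + sinusoidal_positions(BLOCK_SIZE, N_EMBD)
x = DecoderBlock()(x)                                     # -> shape (1, 128, 384)
```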
```
transformer/
├── SLM.ipynb                  # Main training notebook
├── Untitled.ipynb             # Experimental notebook
├── best_model_params.pt       # Trained model weights (162 MB)
├── best_model_params (2).pt   # Alternative model checkpoint (120 MB)
├── train.bin                  # Training data (943 MB)
└── validation.bin             # Validation data (9.5 MB)
```
```bash
pip install torch numpy tiktoken datasets nltk tqdm matplotlib
```

The model uses the TinyStories dataset with GPT-2 tokenization:

```python
from datasets import load_dataset
import tiktoken

# Load dataset
ds = load_dataset("roneneldan/TinyStories")

# Initialize tokenizer
enc = tiktoken.get_encoding("gpt2")
```
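How the tokens end up in `train.bin` and `validation.bin` is not shown here; below is a minimal sketch of one common approach, streaming each story's GPT-2 token IDs into a flat uint16 file with an end-of-text token between stories. The helper name `write_tokens` is hypothetical, and the exact preprocessing lives in SLM.ipynb.

```python
import numpy as np

def write_tokens(dataset_split, out_path):
    """Tokenize every story and append its IDs (uint16 fits GPT-2's 50,257-token vocab)."""
    with open(out_path, "wb") as f:
        for example in dataset_split:
            ids = enc.encode_ordinary(example["text"]) + [enc.eot_token]  # end-of-text separator
            np.array(ids, dtype=np.uint16).tofile(f)

write_tokens(ds["train"], "train.bin")
write_tokens(ds["validation"], "validation.bin")
```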
The training process includes:

- Optimizer: Adam with learning rate 1e-4
- Loss Function: Cross-Entropy Loss
- Learning Rate Scheduler: ReduceLROnPlateau
- Training Iterations: 19,500
- Batch Size: 32
- Validation: Periodic validation with best-model checkpointing
```python
# Initialize model
config = GPTConfig(
    vocab_size=50257,
    block_size=128,
    n_layer=6,
    n_head=6,
    n_embd=384,
    dropout=0.1,
    bias=True
)
model = GPT(config)

# Train the model
# See SLM.ipynb for the complete training loop
```

```python
def generatetext(prompt, max_new_tokens=50, temperature=1.0, top_k=None):
    model = GPT(config)
    state_dict = torch.load("best_model_params (2).pt", map_location=device)
    model.load_state_dict(state_dict)
    model.to(device)
    model.eval()

    input_ids = torch.tensor([enc.encode(prompt)], device=device)
    with torch.inference_mode():
        output_ids = model.generate(
            idx=input_ids,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            top_k=top_k
        )
    return enc.decode(output_ids[0].tolist())

prompt = "Once upon a time"
generated_text = generatetext(prompt, max_new_tokens=200, temperature=0.8, top_k=50)
print(generated_text)
```

Sample Output:
```
Once upon a time. One day, there was a little girl named Timmy. One day, a little
girl said, but they got a car to play with his mommy.
When they went back and said, "Lily and her mom. "I need to play with me me?"
His mommy and said, "I can stay. He wanted to do you? I can come and said, "You
can do you too."
```
The model was trained with the following approach:
- Data Processing: Binary tokenized format using memory-mapped files for efficiency
- Validation Strategy: Regular validation checks with best model saving
- Loss Tracking: Both training and validation loss monitored
- Custom Transformer Implementation: Built from scratch without using high-level transformer libraries
- Efficient Data Loading: Memory-mapped binary files for handling large datasets
- Flexible Generation: Supports temperature and top-k sampling for diverse outputs (see the sampling sketch after this list)
- Model Checkpointing: Automatic saving of best model based on validation loss
- Visualization: Training/validation loss plotting for monitoring
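To make the temperature and top-k options concrete, here is a minimal sketch of a single sampling step in the style of a typical GPT generate loop; the notebook's `generate` method may organize this differently, and `sample_next_token` is an illustrative name.

```python
import torch
import torch.nn.functional as F

def sample_next_token(logits, temperature=1.0, top_k=None):
    """Draw the next token ID from last-position logits of shape (batch, vocab_size)."""
    logits = logits / temperature                        # <1.0 sharpens, >1.0 flattens the distribution
    if top_k is not None:
        kth = torch.topk(logits, min(top_k, logits.size(-1)))[0][..., -1, None]
        logits = logits.masked_fill(logits < kth, float("-inf"))   # keep only the top-k candidates
    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)       # shape (batch, 1)
```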
- Text is tokenized using the GPT-2 BPE tokenizer
- Sequences are stored in binary format (.bin files) for fast loading
- Memory-mapped arrays prevent RAM overflow
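A sketch of the batch loader this setup implies, assuming the .bin files hold a flat stream of uint16 token IDs; the helper name `get_batch` and the choice to re-open the memmap on every call are illustrative:

```python
import numpy as np
import torch

BLOCK_SIZE, BATCH_SIZE = 128, 32

def get_batch(split, device="cpu"):
    """Sample a random batch of (input, target) windows from the memory-mapped token stream."""
    data = np.memmap(f"{split}.bin", dtype=np.uint16, mode="r")   # never loads the whole file into RAM
    ix = torch.randint(len(data) - BLOCK_SIZE - 1, (BATCH_SIZE,))
    x = torch.stack([torch.from_numpy(data[i:i + BLOCK_SIZE].astype(np.int64)) for i in ix])
    y = torch.stack([torch.from_numpy(data[i + 1:i + 1 + BLOCK_SIZE].astype(np.int64)) for i in ix])
    return x.to(device), y.to(device)   # targets are the inputs shifted one token to the right
```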
- Implements causal (masked) self-attention
- Prevents information leakage from future tokens
- Multi-head attention with 6 heads for diverse representations
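A from-scratch illustration of this masked multi-head attention; the class below is a sketch, and the notebook's own module may differ in naming and detail:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    """Multi-head self-attention where each position may only attend to itself and earlier positions."""
    def __init__(self, n_embd=384, n_head=6, block_size=128, dropout=0.1):
        super().__init__()
        assert n_embd % n_head == 0
        self.n_head, self.head_dim = n_head, n_embd // n_head
        self.qkv = nn.Linear(n_embd, 3 * n_embd)   # joint projection for queries, keys, values
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)
        # Lower-triangular mask cached as a buffer: True where attention is allowed
        self.register_buffer("mask", torch.tril(torch.ones(block_size, block_size, dtype=torch.bool)))

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=2)
        # Split the embedding into heads: (B, n_head, T, head_dim)
        q = q.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim)   # scaled dot-product scores
        att = att.masked_fill(~self.mask[:T, :T], float("-inf"))     # hide future tokens
        att = self.dropout(F.softmax(att, dim=-1))
        out = (att @ v).transpose(1, 2).contiguous().view(B, T, C)   # re-merge the heads
        return self.proj(out)
```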
- AdamW optimizer with weight decay (0.1)
- Learning rate scheduling based on validation loss
- Gradient clipping for training stability
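A compressed sketch of how these pieces fit together, reusing `GPT`/`config` from the snippet above and the hypothetical `get_batch` loader sketched earlier; the 500-step validation interval, the `estimate_val_loss` helper, and the assumption that the model's forward pass returns logits are all illustrative, and the real loop is in SLM.ipynb.

```python
import torch
import torch.nn.functional as F

model = GPT(config).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.1)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min")
best_val_loss = float("inf")

for step in range(19_500):
    xb, yb = get_batch("train", device)                       # (batch, block_size) token IDs
    logits = model(xb)                                        # assumed shape: (batch, block_size, vocab_size)
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), yb.view(-1))

    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping for stability
    optimizer.step()

    if step % 500 == 0:                                       # periodic validation (interval is illustrative)
        val_loss = estimate_val_loss(model)                   # hypothetical helper averaging a few val batches
        scheduler.step(val_loss)                              # ReduceLROnPlateau keys off validation loss
        if val_loss < best_val_loss:                          # keep only the best checkpoint
            best_val_loss = val_loss
            torch.save(model.state_dict(), "best_model_params.pt")
```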
This project demonstrates:
- Understanding of transformer architecture
- Implementation of attention mechanisms
- Training large language models
- Text generation techniques
- PyTorch best practices
- Attention Is All You Need - Original Transformer Paper
- TinyStories Dataset
- GPT-2 Paper
Feel free to fork this repository and experiment with:
- Different model architectures
- Hyperparameter tuning
- Alternative datasets
- Enhanced generation strategies
This project is open source and available for educational purposes.
- TinyStories dataset creators
- OpenAI for the GPT-2 tokenizer
- PyTorch team for the excellent framework
Note: This is an educational project demonstrating transformer implementation from scratch. For production use cases, consider using established libraries like Hugging Face Transformers.