A transformer-based Small Language Model trained from scratch to generate creative children's stories. This project implements a GPT-style architecture using PyTorch and trains it on the TinyStories dataset.
This project demonstrates building a complete language model pipeline from scratch, including:
- Custom transformer architecture implementation
- Data preprocessing and tokenization
- Training loop with validation
- Text generation capabilities
The model is trained on the TinyStories dataset, which contains simple children's stories, making it perfect for training a small language model.
The model implements a GPT-style decoder-only transformer with the following components:
- Embedding Layer: Converts token IDs to dense vector representations
- Positional Encoding: Adds positional information using sinusoidal functions
- Multi-Head Attention: Implements self-attention mechanism with multiple heads
- Masked Multi-Head Attention: Prevents the model from looking ahead during training
- Feed-Forward Network: Two-layer MLP with GELU activation
- Layer Normalization: Stabilizes training
- Residual Connections: Enables deeper networks
- Embedding Dimension: 384
- Number of Layers: 6
- Number of Attention Heads: 6
- Vocabulary Size: 50,257 (GPT-2 tokenizer)
- Context Length: 128 tokens
- Dropout: 0.1
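Putting the components and hyperparameters above together, a single decoder block looks roughly like the sketch below. This is an illustrative reconstruction, not the notebook's code: it leans on PyTorch's built-in `nn.MultiheadAttention` for brevity (SLM.ipynb implements the attention math by hand), and the names `DecoderBlock` and `sinusoidal_positions` are hypothetical.

```python
import math
import torch
import torch.nn as nn

# Hyperparameters from the list above
N_EMBD, N_HEAD, BLOCK_SIZE, DROPOUT = 384, 6, 128, 0.1

def sinusoidal_positions(block_size, n_embd):
    """Fixed sinusoidal positional encodings (Vaswani et al., 2017)."""
    pos = torch.arange(block_size).unsqueeze(1)
    div = torch.exp(torch.arange(0, n_embd, 2) * (-math.log(10000.0) / n_embd))
    pe = torch.zeros(block_size, n_embd)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

class DecoderBlock(nn.Module):
    """Masked self-attention + GELU feed-forward, each wrapped in LayerNorm and a residual."""
    def __init__(self):
        super().__init__()
        self.ln1 = nn.LayerNorm(N_EMBD)
        self.attn = nn.MultiheadAttention(N_EMBD, N_HEAD, dropout=DROPOUT, batch_first=True)
        self.ln2 = nn.LayerNorm(N_EMBD)
        self.mlp = nn.Sequential(
            nn.Linear(N_EMBD, 4 * N_EMBD),
            nn.GELU(),
            nn.Linear(4 * N_EMBD, N_EMBD),
            nn.Dropout(DROPOUT),
        )

    def forward(self, x):
        T = x.size(1)
        # Causal mask: True marks future positions that must not be attended to
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal, need_weights=False)
        x = x + attn_out                 # residual connection
        x = x + self.mlp(self.ln2(x))    # residual connection
        return x

# Token embeddings plus positional encodings feed a stack of 6 such blocks
tok_emb = nn.Embedding(50257, N_EMBD)
idx = torch.randint(0, 50257, (1, BLOCK_SIZE))           # a dummy batch of token IDs
x = tok_emb(idx) + sinusoidal_positions(BLOCK_SIZE, N_EMBD)
x = DecoderBlock()(x)                                     # -> shape (1, 128, 384)
```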
```
transformer/
├── SLM.ipynb                  # Main training notebook
├── Untitled.ipynb             # Experimental notebook
├── best_model_params.pt       # Trained model weights (162 MB)
├── best_model_params (2).pt   # Alternative model checkpoint (120 MB)
├── train.bin                  # Training data (943 MB)
└── validation.bin             # Validation data (9.5 MB)
```
```bash
pip install torch numpy tiktoken datasets nltk tqdm matplotlib
```

The model uses the TinyStories dataset with GPT-2 tokenization:

```python
from datasets import load_dataset
import tiktoken

# Load dataset
ds = load_dataset("roneneldan/TinyStories")

# Initialize tokenizer
enc = tiktoken.get_encoding("gpt2")
```
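How the tokens end up in `train.bin` and `validation.bin` is not shown here; below is a minimal sketch of one common approach, streaming each story's GPT-2 token IDs into a flat uint16 file with an end-of-text token between stories. The helper name `write_tokens` is hypothetical, and the exact preprocessing lives in SLM.ipynb.

```python
import numpy as np

def write_tokens(dataset_split, out_path):
    """Tokenize every story and append its IDs (uint16 fits GPT-2's 50,257-token vocab)."""
    with open(out_path, "wb") as f:
        for example in dataset_split:
            ids = enc.encode_ordinary(example["text"]) + [enc.eot_token]  # end-of-text separator
            np.array(ids, dtype=np.uint16).tofile(f)

write_tokens(ds["train"], "train.bin")
write_tokens(ds["validation"], "validation.bin")
```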
The training process includes:

- Optimizer: Adam with learning rate 1e-4
- Loss Function: Cross-Entropy Loss
- Learning Rate Scheduler: ReduceLROnPlateau
- Training Iterations: 19,500
- Batch Size: 32
- Validation: Periodic validation with best-model checkpointing
```python
# Initialize model
config = GPTConfig(
    vocab_size=50257,
    block_size=128,
    n_layer=6,
    n_head=6,
    n_embd=384,
    dropout=0.1,
    bias=True
)
model = GPT(config)

# Train the model
# See SLM.ipynb for the complete training loop
```

```python
def generatetext(prompt, max_new_tokens=50, temperature=1.0, top_k=None):
    model = GPT(config)
    state_dict = torch.load("best_model_params (2).pt", map_location=device)
    model.load_state_dict(state_dict)
    model.to(device)
    model.eval()

    input_ids = torch.tensor([enc.encode(prompt)], device=device)
    with torch.inference_mode():
        output_ids = model.generate(
            idx=input_ids,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            top_k=top_k
        )
    return enc.decode(output_ids[0].tolist())

prompt = "Once upon a time"
generated_text = generatetext(prompt, max_new_tokens=200, temperature=0.8, top_k=50)
print(generated_text)
```

Sample Output:
```
Once upon a time. One day, there was a little girl named Timmy. One day, a little
girl said, but they got a car to play with his mommy.
When they went back and said, "Lily and her mom. "I need to play with me me?"
His mommy and said, "I can stay. He wanted to do you? I can come and said, "You
can do you too."
```
The model was trained with the following approach:
- Data Processing: Binary tokenized format using memory-mapped files for efficiency
- Validation Strategy: Regular validation checks with best model saving
- Loss Tracking: Both training and validation loss monitored
- Custom Transformer Implementation: Built from scratch without using high-level transformer libraries
- Efficient Data Loading: Memory-mapped binary files for handling large datasets
- Flexible Generation: Supports temperature and top-k sampling for diverse outputs (see the sampling sketch after this list)
- Model Checkpointing: Automatic saving of best model based on validation loss
- Visualization: Training/validation loss plotting for monitoring
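To make the temperature and top-k options concrete, here is a minimal sketch of a single sampling step in the style of a typical GPT generate loop; the notebook's `generate` method may organize this differently, and `sample_next_token` is an illustrative name.

```python
import torch
import torch.nn.functional as F

def sample_next_token(logits, temperature=1.0, top_k=None):
    """Draw the next token ID from last-position logits of shape (batch, vocab_size)."""
    logits = logits / temperature                        # <1.0 sharpens, >1.0 flattens the distribution
    if top_k is not None:
        kth = torch.topk(logits, min(top_k, logits.size(-1)))[0][..., -1, None]
        logits = logits.masked_fill(logits < kth, float("-inf"))   # keep only the top-k candidates
    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)       # shape (batch, 1)
```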
- Text is tokenized using the GPT-2 BPE tokenizer
- Sequences are stored in binary format (.bin files) for fast loading
- Memory-mapped arrays prevent RAM overflow
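A sketch of the batch loader this setup implies, assuming the .bin files hold a flat stream of uint16 token IDs; the helper name `get_batch` and the choice to re-open the memmap on every call are illustrative:

```python
import numpy as np
import torch

BLOCK_SIZE, BATCH_SIZE = 128, 32

def get_batch(split, device="cpu"):
    """Sample a random batch of (input, target) windows from the memory-mapped token stream."""
    data = np.memmap(f"{split}.bin", dtype=np.uint16, mode="r")   # never loads the whole file into RAM
    ix = torch.randint(len(data) - BLOCK_SIZE - 1, (BATCH_SIZE,))
    x = torch.stack([torch.from_numpy(data[i:i + BLOCK_SIZE].astype(np.int64)) for i in ix])
    y = torch.stack([torch.from_numpy(data[i + 1:i + 1 + BLOCK_SIZE].astype(np.int64)) for i in ix])
    return x.to(device), y.to(device)   # targets are the inputs shifted one token to the right
```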
- Implements causal (masked) self-attention
- Prevents information leakage from future tokens
- Multi-head attention with 6 heads for diverse representations
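A from-scratch illustration of this masked multi-head attention; the class below is a sketch, and the notebook's own module may differ in naming and detail:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    """Multi-head self-attention where each position may only attend to itself and earlier positions."""
    def __init__(self, n_embd=384, n_head=6, block_size=128, dropout=0.1):
        super().__init__()
        assert n_embd % n_head == 0
        self.n_head, self.head_dim = n_head, n_embd // n_head
        self.qkv = nn.Linear(n_embd, 3 * n_embd)   # joint projection for queries, keys, values
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)
        # Lower-triangular mask cached as a buffer: True where attention is allowed
        self.register_buffer("mask", torch.tril(torch.ones(block_size, block_size, dtype=torch.bool)))

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=2)
        # Split the embedding into heads: (B, n_head, T, head_dim)
        q = q.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim)   # scaled dot-product scores
        att = att.masked_fill(~self.mask[:T, :T], float("-inf"))     # hide future tokens
        att = self.dropout(F.softmax(att, dim=-1))
        out = (att @ v).transpose(1, 2).contiguous().view(B, T, C)   # re-merge the heads
        return self.proj(out)
```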
- AdamW optimizer with weight decay (0.1)
- Learning rate scheduling based on validation loss
- Gradient clipping for training stability
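A compressed sketch of how these pieces fit together, reusing `GPT`/`config` from the snippet above and the hypothetical `get_batch` loader sketched earlier; the 500-step validation interval, the `estimate_val_loss` helper, and the assumption that the model's forward pass returns logits are all illustrative, and the real loop is in SLM.ipynb.

```python
import torch
import torch.nn.functional as F

model = GPT(config).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.1)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min")
best_val_loss = float("inf")

for step in range(19_500):
    xb, yb = get_batch("train", device)                       # (batch, block_size) token IDs
    logits = model(xb)                                        # assumed shape: (batch, block_size, vocab_size)
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), yb.view(-1))

    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping for stability
    optimizer.step()

    if step % 500 == 0:                                       # periodic validation (interval is illustrative)
        val_loss = estimate_val_loss(model)                   # hypothetical helper averaging a few val batches
        scheduler.step(val_loss)                              # ReduceLROnPlateau keys off validation loss
        if val_loss < best_val_loss:                          # keep only the best checkpoint
            best_val_loss = val_loss
            torch.save(model.state_dict(), "best_model_params.pt")
```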
This project demonstrates:
- Understanding of transformer architecture
- Implementation of attention mechanisms
- Training large language models
- Text generation techniques
- PyTorch best practices
- Attention Is All You Need - Original Transformer Paper
- TinyStories Dataset
- GPT-2 Paper
Feel free to fork this repository and experiment with:
- Different model architectures
- Hyperparameter tuning
- Alternative datasets
- Enhanced generation strategies
This project is open source and available for educational purposes.
- TinyStories dataset creators
- OpenAI for the GPT-2 tokenizer
- PyTorch team for the excellent framework
Note: This is an educational project demonstrating transformer implementation from scratch. For production use cases, consider using established libraries like Hugging Face Transformers.