MoE-LLM is a comprehensive, ground-up implementation of a modern Large Language Model (LLM) architecture.
Unlike standard tutorials that implement the original 2017 Transformer, MoE-LLM aggregates state-of-the-art (SOTA) architectural improvements used in modern foundation models (like Llama 3, Mixtral, and PaLM) into a single, cohesive codebase.
The primary goal of this project is architectural transparency and modularity. It serves as a practical "living library" where developers can study, extract, and implement advanced techniques like Mixture of Experts, Rotary Embeddings, and SwiGLU activations using pure PyTorch.
## Features

This repository demonstrates how to implement the specific technologies that drive modern LLM performance (minimal sketches of several components follow the list):
- Mixture of Experts (MoE): Sparse MoE implementation featuring top-k noisy gating, shared experts, and auxiliary losses (load balancing and router z-loss) for high parameter efficiency (see the gating sketch after this list).
- Rotary Positional Embeddings (RoPE): Relative positional encoding for superior sequence-length generalization (sketch below).
- Grouped Query Attention (GQA): Inference-optimized attention that reduces KV-cache memory footprint while maintaining performance.
- Sliding Window Attention (SWA): Vectorized masking implementation for efficient processing of long sequences with local context windows (mask sketch below).
- SwiGLU Activation: Replaces standard ReLU/GELU with the Swish-Gated Linear Unit for improved convergence (sketch below).
- RMSNorm: Root Mean Square Normalization for training stability.
- ReZero (Residual Zero): Learnable residual scaling parameters to initialize deep networks effectively.
- LayerDrop: Structured dropout that skips entire layers during training, acting as a powerful regularizer.
- Gradient Checkpointing: Memory-efficient training allowing for deeper models on consumer hardware.
- Flash Attention Ready: Logic structured for easy integration with optimized attention kernels.
- Adaptive Curriculum Learning: Dynamic training stages that increase sequence length and batch size based on validation perplexity.
- Mixed Precision (AMP): Native `bfloat16` (Ampere+ GPUs) and `float16` support via `torch.cuda.amp`.
- Lion & AdamW Optimizers: Support for the SOTA Lion optimizer alongside standard AdamW.
- Gradient Accumulation: Simulates massive batch sizes on limited hardware.
- Byte-Level BPE: Robust subword tokenization with full Unicode support (via HuggingFace `tokenizers`).
- Zero-Dependency Fallback: Includes a character-level tokenizer ensuring the model runs purely on the Python standard library if needed.
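To make the headline features concrete, here are a few self-contained sketches in plain PyTorch. First, the noisy top-k gating idea behind the MoE layer; class and argument names here are illustrative, not the repo's exact API:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyTopKRouter(nn.Module):
    """Illustrative noisy top-k gate: routes each token to k experts."""
    def __init__(self, d_model: int, num_experts: int, k: int = 2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, num_experts, bias=False)
        self.noise = nn.Linear(d_model, num_experts, bias=False)

    def forward(self, x):                        # x: (num_tokens, d_model)
        logits = self.gate(x)
        if self.training:                        # noise spreads load across experts
            logits = logits + torch.randn_like(logits) * F.softplus(self.noise(x))
        topk_vals, topk_idx = logits.topk(self.k, dim=-1)
        weights = F.softmax(topk_vals, dim=-1)   # renormalize over chosen experts
        return weights, topk_idx                 # mix expert outputs with these weights
```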
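Next, rotary embeddings: instead of adding position vectors, RoPE rotates query/key feature pairs by position-dependent angles, so attention scores depend only on relative offsets. A minimal Llama-style sketch:

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate feature halves by position-dependent angles.
    x: (..., seq_len, d) with d even; apply identically to queries and keys."""
    seq_len, d = x.shape[-2], x.shape[-1]
    inv_freq = 1.0 / base ** (torch.arange(0, d, 2).float() / d)         # (d/2,)
    angles = torch.arange(seq_len).float()[:, None] * inv_freq[None, :]  # (seq, d/2)
    cos = torch.cat([angles.cos(), angles.cos()], dim=-1)                # (seq, d)
    sin = torch.cat([angles.sin(), angles.sin()], dim=-1)
    x1, x2 = x.chunk(2, dim=-1)
    return x * cos + torch.cat([-x2, x1], dim=-1) * sin
```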
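The sliding-window mask can be built without Python loops by broadcasting query positions against key positions; combined with the causal constraint, it restricts each token to the previous `window` positions:

```python
import torch

def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    """True where attention is allowed: causal AND within `window` tokens back."""
    i = torch.arange(seq_len)[:, None]   # query positions
    j = torch.arange(seq_len)[None, :]   # key positions
    return (j <= i) & (i - j < window)

# Usage: scores.masked_fill(~mask, float("-inf")) before the softmax
```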
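Finally, SwiGLU replaces the usual two-matrix feed-forward with a gated variant; the sketch below follows the standard formulation, `silu(x·W_gate) * (x·W_up)` projected back down:

```python
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Gated feed-forward block: down(silu(gate(x)) * up(x))."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```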
## Project Structure

The code is modularized to facilitate component extraction:

```
.
├── config.py        # Hyperparameters & Dataclasses
├── model.py         # Core Architecture (RoPE, MoE, TransformerBlock, SmartLLM)
├── tokenization.py  # SmartBPETokenizer & FallbackCharTokenizer
├── trainer.py       # Curriculum Trainer, Mixed Precision, & Dataset Logic
├── inference.py     # Generation Logic (Top-k, Top-p, Repetition Penalty)
└── main.py          # Entry point & CLI
```
## Installation
### Clone the repository:
```bash
git clone https://github.com/Saba-Kublashvili/MoE-LLM.git
cd MoE-LLM
```

### Install dependencies:
```bash
pip install torch tqdm

# Highly recommended for full feature set:
pip install tokenizers datasets lion-pytorch
```

## Training

To train the model (defaults to the databricks-dolly-15k dataset for instruction tuning):

```bash
python main.py
```

The script handles dataset downloading, tokenizer training, and the curriculum loop automatically.
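The curriculum grows sequence length and batch size as validation perplexity improves. A sketch of the stage-advance logic (the stage values here are illustrative, not the repo's actual schedule):

```python
# (seq_len, batch_size, validation-perplexity target to advance) — illustrative values
STAGES = [(128, 32, 8.0), (256, 16, 6.0), (512, 8, 4.5)]

def next_stage(stage_idx: int, val_ppl: float) -> int:
    """Advance to a longer-context stage once val perplexity beats the target."""
    _, _, target = STAGES[stage_idx]
    if val_ppl < target and stage_idx + 1 < len(STAGES):
        return stage_idx + 1
    return stage_idx
```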
## Interactive Chat

The entry point includes an interactive chat interface with advanced sampling controls:

```
Smart LLM Interactive Chat
You: Explain how MoE works.
Assistant: [Model Generation...]
```
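Those controls (top-k, top-p, repetition penalty) compose roughly as follows; this is a generic single-step sketch with illustrative defaults, not the exact code in `inference.py`:

```python
import torch

def sample_next_token(logits: torch.Tensor, generated: list[int],
                      top_k: int = 50, top_p: float = 0.9,
                      rep_penalty: float = 1.1) -> int:
    """One decoding step over a 1-D logits vector of size vocab."""
    logits = logits.clone()
    # Repetition penalty: dampen tokens that were already generated
    for tok in set(generated):
        logits[tok] = logits[tok] / rep_penalty if logits[tok] > 0 else logits[tok] * rep_penalty
    # Top-k: drop everything below the k-th largest logit
    kth = torch.topk(logits, top_k).values[-1]
    logits[logits < kth] = float("-inf")
    # Top-p (nucleus): keep the smallest prefix of sorted tokens covering top_p mass
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = probs.sort(descending=True)
    cutoff = sorted_probs.cumsum(dim=-1) > top_p
    cutoff[1:] = cutoff[:-1].clone()   # shift so the boundary token is kept
    cutoff[0] = False                  # always keep the most likely token
    probs[sorted_idx[cutoff]] = 0.0
    return torch.multinomial(probs / probs.sum(), num_samples=1).item()
```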
## Component Extraction

This project is designed as a library; you can import specific architectural components into your own projects:

```python
from model import MixtureOfExperts, ModelConfig
# Initialize a standalone MoE layer
config = ModelConfig(d_model=512, num_experts=4)
moe_layer = MixtureOfExperts(config)
```
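A quick smoke test of the extracted layer (a sketch: shapes are assumed to be `(batch, seq_len, d_model)` in and out, and some MoE layers also return an auxiliary loss, so check the forward signature in `model.py`):

```python
import torch

x = torch.randn(2, 16, 512)   # (batch, seq_len, d_model)
out = moe_layer(x)            # tokens routed through the top-k experts
print(out.shape)              # expected: torch.Size([2, 16, 512])
```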
The model architecture is fully defined in `config.py`. You can perform ablation studies by toggling specific SOTA features:

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    d_model: int = 512
    n_layers: int = 8
    # ... other fields (e.g. num_experts) omitted for brevity
    # Toggles
    use_rotary_embeddings: bool = True
    use_mixture_of_experts: bool = True
    use_sliding_window_attention: bool = True
    use_rezero: bool = True
```
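For instance, a dense ablation baseline drops the expert layers while keeping everything else fixed (using only the toggles shown above):

```python
# Dense baseline: identical model with the MoE layers switched off
dense_config = ModelConfig(use_mixture_of_experts=False)

# Residual-scaling ablation: disable ReZero, keep everything else
no_rezero_config = ModelConfig(use_rezero=False)
```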
## Contributing
Contributions are welcome. This project is an educational open-source resource. Possible contribution ideas include:
* ALiBi positional embeddings
* LoRA / QLoRA integration
* Triton kernel optimizations
## License
This project is licensed under the MIT License.