# MoE-LLM: Advanced Modular Transformer Architecture


MoE-LLM is a comprehensive, ground-up implementation of a modern Large Language Model (LLM) architecture.

Unlike standard tutorials that implement the original 2017 Transformer, MoE-LLM aggregates state-of-the-art (SOTA) architectural improvements used in modern foundation models (like Llama 3, Mixtral, and PaLM) into a single, cohesive codebase.

The primary goal of this project is architectural transparency and modularity. It serves as a practical "living library" where developers can study, extract, and implement advanced techniques like Mixture of Experts, Rotary Embeddings, and SwiGLU activations using pure PyTorch.

## Key Features & Technologies

This repository demonstrates how to implement the specific technologies that drive modern LLM performance:

### Neural Architecture

* Mixture of Experts (MoE): Sparse MoE implementation featuring top-k noisy gating, shared experts, and auxiliary losses for load balancing and router stability (z-loss), for high parameter efficiency. A gating sketch follows this list.
* Rotary Positional Embeddings (RoPE): Relative positional encoding for superior sequence-length generalization.
* Grouped Query Attention (GQA): Inference-optimized attention that reduces the KV-cache memory footprint while maintaining performance.
* Sliding Window Attention (SWA): Vectorized masking implementation for efficient processing of long sequences with local context windows.
* SwiGLU Activation: Replaces standard ReLU/GELU with the Swish-Gated Linear Unit for improved convergence.
* RMSNorm: Root Mean Square Normalization for training stability.
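
As an illustration of the routing idea, here is a minimal sketch of top-k noisy gating in pure PyTorch. Class and parameter names are hypothetical, not the ones used in `model.py`:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyTopKGate(nn.Module):
    """Sketch of noisy top-k routing: each token is sent to k of num_experts."""
    def __init__(self, d_model: int, num_experts: int, k: int = 2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, num_experts, bias=False)
        self.noise = nn.Linear(d_model, num_experts, bias=False)

    def forward(self, x: torch.Tensor):
        logits = self.gate(x)
        if self.training:
            # Input-dependent Gaussian noise encourages exploration across experts.
            logits = logits + torch.randn_like(logits) * F.softplus(self.noise(x))
        top_vals, top_idx = logits.topk(self.k, dim=-1)
        weights = F.softmax(top_vals, dim=-1)  # renormalize over the selected experts
        return weights, top_idx                # per-token routing weights and expert ids
```

In a full MoE layer, the selected experts' outputs are mixed with these weights, and the auxiliary losses are computed from the gate distribution.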

### Optimization & Training Stability

* ReZero (Residual Zero): Learnable residual scaling parameters, initialized to zero, so that deep networks start as identity mappings and train stably (sketched after this list).
* LayerDrop: Structured dropout that skips entire layers during training, acting as a powerful regularizer.
* Gradient Checkpointing: Memory-efficient training allowing for deeper models on consumer hardware.
* Flash Attention Ready: Attention logic structured for easy integration with optimized kernels.
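
The ReZero idea is small enough to show in full. A minimal sketch, with a hypothetical class name:

```python
import torch
import torch.nn as nn

class ReZeroResidual(nn.Module):
    """Sketch of a ReZero residual: y = x + alpha * f(x), alpha initialized to 0."""
    def __init__(self, sublayer: nn.Module):
        super().__init__()
        self.sublayer = sublayer
        self.alpha = nn.Parameter(torch.zeros(1))  # block starts as an identity mapping

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.alpha * self.sublayer(x)
```

Because every block is initially the identity, gradients flow cleanly through deep stacks at the start of training.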

### Robust Training Loop

* Adaptive Curriculum Learning: Dynamic training stages that increase sequence length and batch size based on validation perplexity.
* Mixed Precision (AMP): Native bfloat16 (Ampere+ GPUs) and float16 support via torch.cuda.amp.
* Lion & AdamW Optimizers: Supports both AdamW and the newer Lion optimizer (via lion-pytorch).
* Gradient Accumulation: Simulates large effective batch sizes on limited hardware (see the training-step sketch after this list).
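
To make the interaction between these pieces concrete, here is a hedged sketch of a mixed-precision training step with gradient accumulation. The `model`, `loader`, and `optimizer` objects are assumed; the actual loop lives in `trainer.py`:

```python
import torch

scaler = torch.cuda.amp.GradScaler()  # loss scaling for float16 (bfloat16 can skip it)
accum_steps = 8                       # effective batch = accum_steps * micro-batch size

for step, (inputs, targets) in enumerate(loader):     # loader, model, optimizer assumed defined
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(inputs, targets) / accum_steps   # average the loss over accumulated steps
    scaler.scale(loss).backward()                     # gradients accumulate across micro-batches
    if (step + 1) % accum_steps == 0:
        scaler.step(optimizer)                        # unscale gradients and apply the update
        scaler.update()
        optimizer.zero_grad(set_to_none=True)
```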

### Tokenization

* Byte-Level BPE: Robust subword tokenization with full Unicode support (via HuggingFace tokenizers).
* Zero-Dependency Fallback: A character-level tokenizer so the pipeline can run on the Python standard library alone if needed (sketched after this list).
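
A character-level fallback fits in a few lines of standard-library Python. A minimal sketch; the repository's `FallbackCharTokenizer` may differ in detail:

```python
class CharTokenizer:
    """Sketch of a zero-dependency fallback: one token id per distinct character."""
    def __init__(self, corpus: str):
        chars = sorted(set(corpus))
        self.stoi = {ch: i for i, ch in enumerate(chars)}
        self.itos = {i: ch for ch, i in self.stoi.items()}

    def encode(self, text: str) -> list[int]:
        return [self.stoi[ch] for ch in text]

    def decode(self, ids: list[int]) -> str:
        return "".join(self.itos[i] for i in ids)
```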

## Project Structure

The code is modularized to facilitate component extraction:

```
.
├── config.py       # Hyperparameters & Dataclasses
├── model.py        # Core Architecture (RoPE, MoE, TransformerBlock, SmartLLM)
├── tokenization.py # SmartBPETokenizer & FallbackCharTokenizer
├── trainer.py      # Curriculum Trainer, Mixed Precision, & Dataset Logic
├── inference.py    # Generation Logic (Top-k, Top-p, Repetition Penalty)
└── main.py         # Entry point & CLI
```



## Installation

### Clone the repository:

```bash
git clone https://github.com/Saba-Kublashvili/MoE-LLM.git
cd MoE-LLM
```

### Install dependencies:

```bash
pip install torch tqdm

# Highly recommended for the full feature set:
pip install tokenizers datasets lion-pytorch
```

## Usage

### 1. Training

To train the model (defaults to the databricks-dolly-15k dataset for instruction tuning):

```bash
python main.py
```

The script handles dataset download, tokenizer training, and the curriculum loop automatically.
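
The general shape of the curriculum logic, as a hedged sketch (stage values and thresholds below are hypothetical; see `trainer.py` for the real schedule):

```python
# Hypothetical stages: sequence length and batch size grow as validation perplexity drops.
STAGES = [
    {"seq_len": 128, "batch_size": 8,  "advance_below_ppl": 40.0},
    {"seq_len": 256, "batch_size": 16, "advance_below_ppl": 20.0},
    {"seq_len": 512, "batch_size": 32, "advance_below_ppl": None},  # final stage
]

def next_stage(stage: int, val_ppl: float) -> int:
    """Move to the next stage once validation perplexity clears the threshold."""
    threshold = STAGES[stage]["advance_below_ppl"]
    if threshold is not None and val_ppl < threshold:
        return stage + 1
    return stage
```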

### 2. Inference / Chat

The entry point includes an interactive chat interface with advanced sampling controls:

```
Smart LLM Interactive Chat
You: Explain how MoE works.
Assistant: [Model Generation...]
```
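
Under the hood, `inference.py` combines the sampling controls listed in the project structure above. A minimal sketch of one nucleus-sampling step with a repetition penalty (function and argument names are hypothetical):

```python
import torch

def sample_next(logits: torch.Tensor, generated: torch.Tensor,
                top_p: float = 0.9, rep_penalty: float = 1.1) -> int:
    """Sketch of top-p (nucleus) sampling over a (vocab,) logits vector."""
    logits = logits.clone()
    seen = logits[generated]  # penalize tokens that already appear in the output
    logits[generated] = torch.where(seen > 0, seen / rep_penalty, seen * rep_penalty)

    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = probs.sort(descending=True)
    cumulative = sorted_probs.cumsum(dim=-1)
    sorted_probs[cumulative - sorted_probs > top_p] = 0.0  # keep smallest set covering top_p
    sorted_probs /= sorted_probs.sum()
    return int(sorted_idx[torch.multinomial(sorted_probs, 1)])
```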

### 3. Extracting Components

This project is designed as a library. You can import specific architectural components into your own projects:

```python
from model import MixtureOfExperts, ModelConfig

# Initialize a standalone MoE layer
config = ModelConfig(d_model=512, num_experts=4)
moe_layer = MixtureOfExperts(config)
```

## Configuration

The model architecture is fully defined in config.py. You can perform ablation studies by toggling specific SOTA features:

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    d_model: int = 512
    n_layers: int = 8
    # Toggles
    use_rotary_embeddings: bool = True
    use_mixture_of_experts: bool = True
    use_sliding_window_attention: bool = True
    use_rezero: bool = True
```
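
For example, a hypothetical ablation that swaps the sparse MoE layers for a dense baseline while keeping everything else fixed (assuming the toggles shown above):

```python
from config import ModelConfig

dense_baseline = ModelConfig(use_mixture_of_experts=False)  # dense FFN instead of sparse experts
```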


## Contributing

Contributions are welcome; this project is meant as an educational open-source resource. Ideas include:

* ALiBi positional embeddings
* LoRA / QLoRA integration
* Triton kernel optimizations
 
## License

This project is licensed under the MIT License.

