MoE-LLM is a comprehensive, ground-up implementation of a modern Large Language Model (LLM) architecture.
Unlike standard tutorials that implement the original 2017 Transformer, MoE-LLM aggregates state-of-the-art (SOTA) architectural improvements used in modern foundation models (like Llama 3, Mixtral, and PaLM) into a single, cohesive codebase.
The primary goal of this project is architectural transparency and modularity. It serves as a practical "living library" where developers can study, extract, and implement advanced techniques like Mixture of Experts, Rotary Embeddings, and SwiGLU activations using pure PyTorch.
## Features

This repository demonstrates how to implement the specific technologies that drive modern LLM performance (minimal sketches of several components follow the list):
- Mixture of Experts (MoE): Sparse MoE implementation featuring top-k noisy gating, shared experts, and auxiliary losses (load balancing and router z-loss) for high parameter efficiency (see the gating sketch after this list).
- Rotary Positional Embeddings (RoPE): Relative positional encoding for superior sequence-length generalization (sketch below).
- Grouped Query Attention (GQA): Inference-optimized attention that reduces KV-cache memory footprint while maintaining performance.
- Sliding Window Attention (SWA): Vectorized masking implementation for efficient processing of long sequences with local context windows (mask sketch below).
- SwiGLU Activation: Replaces standard ReLU/GELU with the Swish-Gated Linear Unit for improved convergence (sketch below).
- RMSNorm: Root Mean Square Normalization for training stability.
- ReZero (Residual Zero): Learnable residual scaling parameters to initialize deep networks effectively.
- LayerDrop: Structured dropout that skips entire layers during training, acting as a powerful regularizer.
- Gradient Checkpointing: Memory-efficient training allowing for deeper models on consumer hardware.
- Flash Attention Ready: Logic structured for easy integration with optimized attention kernels.
- Adaptive Curriculum Learning: Dynamic training stages that increase sequence length and batch size based on validation perplexity.
- Mixed Precision (AMP): Native `bfloat16` (Ampere+ GPUs) and `float16` support via `torch.cuda.amp`.
- Lion & AdamW Optimizers: Support for the SOTA Lion optimizer alongside standard AdamW.
- Gradient Accumulation: Simulates massive batch sizes on limited hardware.
- Byte-Level BPE: Robust subword tokenization with full Unicode support (via HuggingFace `tokenizers`).
- Zero-Dependency Fallback: Includes a character-level tokenizer ensuring the model runs purely on the Python standard library if needed.
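To make the headline features concrete, here are a few self-contained sketches in plain PyTorch. First, the noisy top-k gating idea behind the MoE layer; class and argument names here are illustrative, not the repo's exact API:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyTopKRouter(nn.Module):
    """Illustrative noisy top-k gate: routes each token to k experts."""
    def __init__(self, d_model: int, num_experts: int, k: int = 2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, num_experts, bias=False)
        self.noise = nn.Linear(d_model, num_experts, bias=False)

    def forward(self, x):                        # x: (num_tokens, d_model)
        logits = self.gate(x)
        if self.training:                        # noise spreads load across experts
            logits = logits + torch.randn_like(logits) * F.softplus(self.noise(x))
        topk_vals, topk_idx = logits.topk(self.k, dim=-1)
        weights = F.softmax(topk_vals, dim=-1)   # renormalize over chosen experts
        return weights, topk_idx                 # mix expert outputs with these weights
```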
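Next, rotary embeddings: instead of adding position vectors, RoPE rotates query/key feature pairs by position-dependent angles, so attention scores depend only on relative offsets. A minimal Llama-style sketch:

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate feature halves by position-dependent angles.
    x: (..., seq_len, d) with d even; apply identically to queries and keys."""
    seq_len, d = x.shape[-2], x.shape[-1]
    inv_freq = 1.0 / base ** (torch.arange(0, d, 2).float() / d)         # (d/2,)
    angles = torch.arange(seq_len).float()[:, None] * inv_freq[None, :]  # (seq, d/2)
    cos = torch.cat([angles.cos(), angles.cos()], dim=-1)                # (seq, d)
    sin = torch.cat([angles.sin(), angles.sin()], dim=-1)
    x1, x2 = x.chunk(2, dim=-1)
    return x * cos + torch.cat([-x2, x1], dim=-1) * sin
```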
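The sliding-window mask can be built without Python loops by broadcasting query positions against key positions; combined with the causal constraint, it restricts each token to the previous `window` positions:

```python
import torch

def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    """True where attention is allowed: causal AND within `window` tokens back."""
    i = torch.arange(seq_len)[:, None]   # query positions
    j = torch.arange(seq_len)[None, :]   # key positions
    return (j <= i) & (i - j < window)

# Usage: scores.masked_fill(~mask, float("-inf")) before the softmax
```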
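Finally, SwiGLU replaces the usual two-matrix feed-forward with a gated variant; the sketch below follows the standard formulation, `silu(x·W_gate) * (x·W_up)` projected back down:

```python
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Gated feed-forward block: down(silu(gate(x)) * up(x))."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```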
## Project Structure

The code is modularized to facilitate component extraction:

```
.
├── config.py        # Hyperparameters & Dataclasses
├── model.py         # Core Architecture (RoPE, MoE, TransformerBlock, SmartLLM)
├── tokenization.py  # SmartBPETokenizer & FallbackCharTokenizer
├── trainer.py       # Curriculum Trainer, Mixed Precision, & Dataset Logic
├── inference.py     # Generation Logic (Top-k, Top-p, Repetition Penalty)
└── main.py          # Entry point & CLI
```
## Installation
### Clone the repository:
```bash
git clone https://github.com/Saba-Kublashvili/MoE-LLM.git
cd MoE-LLM
```

### Install dependencies:
```bash
pip install torch tqdm

# Highly recommended for full feature set:
pip install tokenizers datasets lion-pytorch
```

## Training

To train the model (defaults to the databricks-dolly-15k dataset for instruction tuning):

```bash
python main.py
```

The script handles dataset downloading, tokenizer training, and the curriculum loop automatically.
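The curriculum grows sequence length and batch size as validation perplexity improves. A sketch of the stage-advance logic (the stage values here are illustrative, not the repo's actual schedule):

```python
# (seq_len, batch_size, validation-perplexity target to advance) — illustrative values
STAGES = [(128, 32, 8.0), (256, 16, 6.0), (512, 8, 4.5)]

def next_stage(stage_idx: int, val_ppl: float) -> int:
    """Advance to a longer-context stage once val perplexity beats the target."""
    _, _, target = STAGES[stage_idx]
    if val_ppl < target and stage_idx + 1 < len(STAGES):
        return stage_idx + 1
    return stage_idx
```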
## Interactive Chat

The entry point includes an interactive chat interface with advanced sampling controls:

```
Smart LLM Interactive Chat
You: Explain how MoE works.
Assistant: [Model Generation...]
```
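Those controls (top-k, top-p, repetition penalty) compose roughly as follows; this is a generic single-step sketch with illustrative defaults, not the exact code in `inference.py`:

```python
import torch

def sample_next_token(logits: torch.Tensor, generated: list[int],
                      top_k: int = 50, top_p: float = 0.9,
                      rep_penalty: float = 1.1) -> int:
    """One decoding step over a 1-D logits vector of size vocab."""
    logits = logits.clone()
    # Repetition penalty: dampen tokens that were already generated
    for tok in set(generated):
        logits[tok] = logits[tok] / rep_penalty if logits[tok] > 0 else logits[tok] * rep_penalty
    # Top-k: drop everything below the k-th largest logit
    kth = torch.topk(logits, top_k).values[-1]
    logits[logits < kth] = float("-inf")
    # Top-p (nucleus): keep the smallest prefix of sorted tokens covering top_p mass
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = probs.sort(descending=True)
    cutoff = sorted_probs.cumsum(dim=-1) > top_p
    cutoff[1:] = cutoff[:-1].clone()   # shift so the boundary token is kept
    cutoff[0] = False                  # always keep the most likely token
    probs[sorted_idx[cutoff]] = 0.0
    return torch.multinomial(probs / probs.sum(), num_samples=1).item()
```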
## Component Extraction

This project is designed as a library; you can import specific architectural components into your own projects:

```python
from model import MixtureOfExperts, ModelConfig
# Initialize a standalone MoE layer
config = ModelConfig(d_model=512, num_experts=4)
moe_layer = MixtureOfExperts(config)
```
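A quick smoke test of the extracted layer (a sketch: shapes are assumed to be `(batch, seq_len, d_model)` in and out, and some MoE layers also return an auxiliary loss, so check the forward signature in `model.py`):

```python
import torch

x = torch.randn(2, 16, 512)   # (batch, seq_len, d_model)
out = moe_layer(x)            # tokens routed through the top-k experts
print(out.shape)              # expected: torch.Size([2, 16, 512])
```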
The model architecture is fully defined in `config.py`. You can perform ablation studies by toggling specific SOTA features:

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    d_model: int = 512
    n_layers: int = 8
    # ... other fields (e.g. num_experts) omitted for brevity
    # Toggles
    use_rotary_embeddings: bool = True
    use_mixture_of_experts: bool = True
    use_sliding_window_attention: bool = True
    use_rezero: bool = True
```
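For instance, a dense ablation baseline drops the expert layers while keeping everything else fixed (using only the toggles shown above):

```python
# Dense baseline: identical model with the MoE layers switched off
dense_config = ModelConfig(use_mixture_of_experts=False)

# Residual-scaling ablation: disable ReZero, keep everything else
no_rezero_config = ModelConfig(use_rezero=False)
```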
## Contributing
Contributions are welcome. This project is an educational open-source resource. Possible contribution ideas include:
* ALiBi positional embeddings
* LoRA / QLoRA integration
* Triton kernel optimizations
## License
This project is licensed under the MIT License.