LLMsaga is a fully open-source, vision-centric exploration of multimodal large language models (MLLMs). This project implements state-of-the-art techniques for integrating vision encoders with language models, enabling models to understand and reason about images alongside text.
- Project Overview
- Architecture
- Key Components
- Installation
- Quick Start
- Training
- Inference
- Evaluation
- Data Engine
- Configuration
- Project Structure
- License
LLMsaga is designed to create efficient, scalable multimodal language models that excel at vision-language tasks. Key highlights:
- Vision-Centric Design: Specialized handling of image inputs with adaptive vision sampling
- Multiple LLM Backends: Support for LLaMA, Mistral, Phi-3, Cohere, and Gemma
- Flexible Training: FSDP (Fully Sharded Data Parallel), TPU, and memory-efficient training options
- Production-Ready Serving: Gradio web interface and FastAPI-based model serving
- Comprehensive Evaluation: 20+ benchmark datasets for multimodal understanding
- Scalable Data Processing: Tools for synthetic data generation and curation
Input Image
↓
[Vision Encoder(s)] → [Vision Projector(s)] → [Vision Tokens]
↓
[Language Model]
↓
Output Text
Encode visual information from images into token representations:
- Multiple encoder support (CLIP, DINOv2, etc.)
- Configurable patch sizes and resolutions
- Token sampling strategies for efficiency
- Supports both global and patch-level features
Projects vision tokens into language model embedding space:
- Linear, MLP, and advanced projection architectures
- Learnable projection weights during pre-training
- Can be fine-tuned during instruction tuning
Adaptive sampling strategy for vision tokens:
- Dynamic token reduction based on image complexity
- Important region prioritization
- Memory-efficient processing of high-resolution images
Multiple LLM backends with vision-language fusion:
- CambrianLLaMA: LLaMA-based implementation
- CambrianMistral: Mistral-based implementation
- CambrianPhi3: Phi-3 based implementation
- CambrianGemma: Gemma-based implementation
- CambrianCohere: Cohere-based implementation
Central module containing all model definitions and utilities:
cambrian_arch.py: Base architecture class defining vision-language integrationbuilder.py: Factory functions for loading and building modelsconversation.py: Conversation templates and formatting utilitiesutils.py: Common utilities (tokenization, image processing, etc.)mm_utils.py: Multimodal utilities for vision processing
Multiple training backends for different hardware configurations:
train_fsdp.py: Fully Sharded Data Parallel training (multi-GPU)train_tpu.py: Google Cloud TPU trainingtrain_xformers.py: Memory-efficient training with xFormerscambrian_trainer.py: Custom trainer extending HuggingFace Trainer- Callbacks:
wandb_nan_alert_callback.py: NaN detection and alertinggcloud_rsync_callback.py: GCP integration for model syncing
Production-ready inference infrastructure:
gradio_web_server.py: Interactive Gradio web interfacecontroller.py: Distributed inference controllermodel_worker.py: Individual model inference workerscli.py: Command-line interface for inferencesglang_worker.py: SGLang integration for optimized inference
Data generation and processing pipeline:
generate_qa.py: Synthetic QA pair generationgenerate_vqa.py: Visual question answering data generationgenerate_topics.py: Topic-based data generationwikiflow.py: Wikipedia-based data pipelineprocess_json_files.py: JSON data processing utilities
Comprehensive benchmark suite with 20+ datasets:
- Benchmarks: MMBench, MME, MMMU, MMVet, SEED, ChartQA, DocVQA, TextVQA, GQA, AI2D, ScienceQA, and more
- Scripts:
run_benchmark.sh: Single benchmark runnerrun_all_benchmarks.sh: Batch evaluationconsolidate.py: Result aggregation and tabulation
- SLURM Integration: For cluster-based evaluation
- Python >= 3.8
- PyTorch >= 2.2.0
- CUDA 11.8+ (for GPU training/inference)
- 40GB+ GPU VRAM (for full model inference)
# Clone and navigate
cd /path/to/LLMsaga
# Install dependencies
pip install -e .
# For GPU-based training/inference
pip install -e ".[gpu]"
# For TPU training
pip install -e ".[tpu]"# For quantization and optimization
pip install bitsandbytes deepspeed
# For WandB experiment tracking
pip install wandb
# For inference visualization
pip install gradio gradio_client# Load a model and run inference
python -m cam.serve.cli \
--model-path /path/to/llmsaga-model \
--image-file /path/to/image.jpg \
--query "What is in this image?"# Start the Gradio web server
python -m cam.serve.gradio_web_server \
--controller http://localhost:10000 \
--model-list model_list.jsonfrom cam.model.builder import load_pretrained_model
from cam.utils import process_image
# Load model
model, processor, tokenizer = load_pretrained_model(
model_path="path/to/llmsaga-model",
model_base="llama-7b", # base model
)
# Process image
image = process_image("path/to/image.jpg")
# Generate response
with torch.no_grad():
output = model.generate(
images=[image],
prompts=["Describe this image"],
max_new_tokens=128
)Training is configured via JSON files specifying model architecture, data, and optimization:
cat > train_config.json << EOF
{
"model_name_or_path": "lmsys/vicuna-7b-v1.5",
"vision_tower": "openai/clip-vit-large-patch14-336",
"mm_vision_select_layer": -2,
"image_aspect_ratio": "pad",
"tune_mm_mlp_adapter": true,
"bf16": true,
"output_dir": "./checkpoints/cambrian-7b",
"num_train_epochs": 1,
"per_device_train_batch_size": 16,
"gradient_accumulation_steps": 4,
"learning_rate": 2e-5,
"warmup_ratio": 0.03,
"weight_decay": 0.0,
"lr_scheduler_type": "cosine"
}
EOFtorchrun --nproc_per_node 8 cam/train/train_fsdp.py \
--deepspeed scripts/zero3.json \
--model_name_or_path lmsys/vicuna-7b-v1.5 \
--vision_tower openai/clip-vit-large-patch14-336 \
--data_path data/train.json \
--image_folder data/images \
--output_dir ./checkpoints/llmsaga \
--num_train_epochs 1 \
--per_device_train_batch_size 16 \
--learning_rate 2e-5 \
--warmup_ratio 0.03 \
--dataloader_num_workers 4 \
--bf16python cam/train/train_tpu.py \
--model_name_or_path lmsys/vicuna-7b-v1.5 \
--vision_tower openai/clip-vit-large-patch14-336 \
--data_path data/train.json \
--image_folder data/images \
--output_dir ./checkpoints/llmsaga \
--num_train_epochs 1 \
--per_device_train_batch_size 16python cam/train/train_xformers.py \
--load_4bit \
--lora_r 64 \
--lora_alpha 16 \
--model_name_or_path lmsys/vicuna-7b-v1.5 \
--vision_tower openai/clip-vit-large-patch14-336 \
--data_path data/train.json \
--output_dir ./checkpoints/llmsagapython inference.py \llmsaga
--model-path /path/to/cambrian \
--image-folder /path/to/images \
--output-file results.json \
--batch-size 8# Start controller
python -m cam.serve.controller --host localhost --port 10000
# Start model workers (multiple for parallelization)
python -m cam.serve.model_worker \
--controller http://localhost:10000 \
--worker-address http:llmsaga
# Send requests
curl -X POST http://localhost:10000/api/v1/generate \
-H "Content-Type: application/json" \
-d '{
"model": "llmsaga
"model": "cambrian-7b",
"prompt": "What is in this image?",
"image": "base64_encoded_image"
}'bash eval/scripts/run_bencllmsaga \
--bench-name mmbench_en \
--output-dir ./results# Submit all benchmarks in parallel
bash eval/slurm/submit_all_benchmarks_parallel.bash \
--model-path /path/to/llmsagarks_parallel.bash \
--model-path /path/to/cambrian \
--output-dir ./results
# Wait for completion
# Consolidate results
python eval/scripts/consolidate.py \
--results-dir ./results \
--output results.csv| Benchmark | Type | Typical Task |
|---|---|---|
| MMBench | General | Multiple-choice vision QA |
| MME | Compositional | Recognition, OCR, knowledge |
| MMMU | University-level | Complex multimodal reasoning |
| MMVet | Diagnostic | Fine-grained capability testing |
| SEED | Diagnostic | Vision-language understanding |
| ChartQA | Chart Understanding | Quantitative reasoning |
| DocVQA | Document Analysis | Layout + text reasoning |
| TextVQA | OCR + Reasoning | Scene text understanding |
| GQA | Scene Graphs | Spatial reasoning |
| AI2D | Diagrams | Diagram understanding |
| ScienceQA | Scientific | Domain-specific reasoning |
| VQA v2 | General | General visual QA |
| COCO Captions | Captioning | Image description |
python dataEngine/generate_qa.py \
--input-file data/input_topics.txt \
--output-file data/qa_pairs.json \
--num-samples 10000python dataEngine/generate_vqa.py \
--images-dir /path/to/images \
--output-file data/vqa.jsonpython dataEngine/process_json_files.py \
--input-dir data/raw \
--output-dir data/processed \
--format llavapython dataEngine/wikiflow.py \
--source wikipedia \
--output-dir data/wiki_data \
--num-workers 8Key hyperparameters in training config:
vision_tower: Vision encoder (e.g.,openai/clip-vit-large-patch14-336)mm_vision_select_layer: Layer index from vision encoder (-1 for last, -2 for second-to-last)image_aspect_ratio: Image padding strategy (pad,crop, orpad_any)tune_mm_mlp_adapter: Whether to train multimodal projectorvision_select_features: Feature selection strategy
{
"load_in_4bit": true,
"bnb_4bit_compute_dtype": "float16",
"bnb_4bit_use_double_quant": true,
"bnb_4bit_quant_type": "nf4"
}scripts/zero2.json: ZeRO Stage 2 (CPU offloading)scripts/zero3.json: ZeRO Stage 3 (Full sharding)scripts/zero3_offload.json: ZeRO Stage 3 with CPU offloading
LLMsaga/
├── cam/ # Core model implementations
│ ├── model/
│ │ ├── cambrian_arch.py # Base architecture
│ │ ├── builder.py # Model factory functions
│ │ ├── language_model/ # LLM backends
│ │ ├── multimodal_encoder/ # Vision encoders
│ │ ├── multimodal_projector/ # Vision-text projectors
│ │ └── vision_sampler.py # Token sampling strategy
│ ├── serve/ # Inference infrastructure
│ │ ├── gradio_web_server.py # Web UI
│ │ ├── model_worker.py # Inference workers
│ │ ├── controller.py # Distributed controller
│ │ └── cli.py # CLI interface
│ ├── train/ # Training backends
│ │ ├── train_fsdp.py # Multi-GPU FSDP training
│ │ ├── train_tpu.py # TPU training
│ │ ├── train_xformers.py # Memory-efficient training
│ │ └── cambrian_trainer.py # Custom trainer
│ ├── constants.py # Model constants
│ └── utils.py # Common utilities
├── dataEngine/ # Data generation and processing
│ ├── generate_qa.py # QA synthesis
│ ├── generate_vqa.py # VQA synthesis
│ ├── wikiflow.py # Data pipeline
│ └── process_json_files.py # Data processing
├── eval/ # Evaluation infrastructure
│ ├── eval/ # Benchmark implementations
│ ├── scripts/ # Evaluation scripts
│ └── slurm/ # SLURM job templates
├── scripts/ # Training and infra scripts
│ ├── cambrian/ # Cambrian training scripts
│ ├── infra/ # Infrastructure scripts
│ └── zero*.json # DeepSpeed configs
├── inference.py # Batch inference script
├── clear.py # Cleanup utility
├── fsdp_config.json # FSDP configuration
├── pyproject.toml # Package metadata
└── README.md # This file
Models support 7B to 34B parameter configurations:
- 7B Models: 40GB GPU VRAM, ~15-20 tokens/sec inference
- 13B Models: 80GB GPU VRAM (or dual 40GB), ~8-12 tokens/sec inference
- 34B Models: Multiple 40GB+ GPUs, ~3-6 tokens/sec inference
- FSDP: Near-linear scaling with number of GPUs
- TPU: 2-3x speedup over single-GPU training on v3-32 pods
- LoRA Fine-tuning: 4-8x faster than full fine-tuning
LLMsaga
If you use this project in your research, please cite:
@article{llmsaga,
title={LLMsaga,
title={Cambrian: A Fully Open, Vision-Centric Exploration of Multimodal LLMs},
author={Liu, Haotian and others},
journal={arXiv preprint arXiv:XXXX.XXXXX},
year={2024}
}This project is licensed under the Apache License 2.0 - see LICENSE file for details.
Contributions are welcome! Please open issues for bugs or feature requests, and submit pull requests with improvements.
LLMsaga builds upon open-source projects including:
- HuggingFace Transformers
- PyTorch and PyTorch Lightning
- OpenAI CLIP
- Open source vision models (DINOv2, etc.)
- Community datasets (COCO, Flickr30K, etc.)
For more information and updates, visit the project repository. For more information, visit the official Cambrian website