CodeGen-Alpaca-1B

This repository documents the development of CodeGen-Alpaca-1B, a code generation model fine-tuned from StarCoderBase-1B. It was fine-tuned on a 2k-example subset of the CodeAlpaca dataset using QLoRA on Google Colab, which keeps training light enough for free-tier GPUs.

Demo: Hugging Face

Motivation

I wanted to build a lightweight code generation model that could:

Understand instruction-style prompts
Generate clean, runnable code in 60+ programming languages
Run on free-tier cloud GPUs (like Colab) without huge resource needs

That’s why I chose:

StarCoderBase-1B as the base model (compact, ideal for coding tasks)
CodeAlpaca (2k subset) as the fine-tuning dataset (instruction-style examples in 60+ programming languages, so the model can generate code in those languages)
QLoRA for fine-tuning (memory-efficient, GPU-friendly)

Dataset

Source: CodeAlpaca

Subset Used: 2k instructions (for faster training on Colab)
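
A minimal sketch of how a 2k subset could be drawn and turned into training text. CodeAlpaca records carry `instruction`, `input`, and `output` fields; the `### Instruction:` / `### Response:` template, the seeded random sampling, and the helper names `format_example` and `take_subset` are assumptions for illustration, not necessarily what was used for this model.

```python
import random

SUBSET_SIZE = 2000  # the "2k subset" mentioned above


def format_example(record: dict) -> str:
    """Render one {instruction, input, output} record as a single training string.

    The marker layout is an assumed Alpaca-style template.
    """
    instruction = record["instruction"].strip()
    context = (record.get("input") or "").strip()
    response = record["output"].strip()
    if context:
        return (
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{context}\n\n"
            f"### Response:\n{response}"
        )
    return f"### Instruction:\n{instruction}\n\n### Response:\n{response}"


def take_subset(records: list, size: int = SUBSET_SIZE, seed: int = 0) -> list:
    """Pick a reproducible random subset (assumed strategy).

    Returns a copy of all records when there are no more than `size` of them.
    """
    if len(records) <= size:
        return list(records)
    return random.Random(seed).sample(records, size)
```

With the dataset loaded as a list of dicts, `[format_example(r) for r in take_subset(records)]` would yield the strings to tokenize.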

Training Setup

Platform: Google Colab (T4 / A100 GPU)

Method: QLoRA (low-rank adapters trained on top of a 4-bit quantized base model, for memory efficiency)

Epochs: 1

Batch Size: Small (Colab-friendly)

Libraries: 🤗 Transformers + PEFT + Accelerate
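
A rough sketch of what a QLoRA configuration for this setup might look like. Only the base model and the single epoch come from the text above; the LoRA rank, alpha, dropout, target modules, and 4-bit settings are illustrative assumptions, and the helper names are hypothetical. It also assumes `bitsandbytes` is installed alongside the libraries listed.

```python
# Values marked "assumed" are illustrative, not the settings used for this model.
TRAINING_SETTINGS = {
    "base_model": "bigcode/starcoderbase-1b",
    "epochs": 1,                      # from the setup above
    "lora_r": 16,                     # assumed
    "lora_alpha": 32,                 # assumed
    "lora_dropout": 0.05,             # assumed
    "target_modules": ["c_attn", "c_proj", "c_fc"],  # assumed (GPTBigCode layer names)
}


def build_quant_config():
    """4-bit NF4 quantization config for loading the frozen base model."""
    # Imported lazily so this sketch can be read and imported without a GPU stack.
    import torch
    from transformers import BitsAndBytesConfig

    return BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.float16,  # float16 because the T4 lacks bfloat16 support
    )


def build_lora_config():
    """Low-rank adapter config; only these adapter weights are trained."""
    from peft import LoraConfig

    return LoraConfig(
        r=TRAINING_SETTINGS["lora_r"],
        lora_alpha=TRAINING_SETTINGS["lora_alpha"],
        lora_dropout=TRAINING_SETTINGS["lora_dropout"],
        target_modules=TRAINING_SETTINGS["target_modules"],
        bias="none",
        task_type="CAUSAL_LM",
    )
```

The quantization config would be passed as `quantization_config=` when loading the base model, and the LoRA config to `peft.get_peft_model`.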

Model Access

You can directly use the model here: CodeGen-Alpaca-1B on Hugging Face
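
A hedged sketch of loading and prompting the model, assuming the Hub repository hosts a PEFT (LoRA) adapter that is applied on top of `bigcode/starcoderbase-1b`. `ADAPTER_ID` is a hypothetical placeholder for the real Hub id, and the prompt template is an assumed Alpaca-style layout.

```python
BASE_MODEL_ID = "bigcode/starcoderbase-1b"
ADAPTER_ID = "your-username/CodeGen-Alpaca-1B"  # hypothetical placeholder, replace with the real Hub id


def build_prompt(instruction: str) -> str:
    """Wrap an instruction in the assumed Alpaca-style template."""
    return f"### Instruction:\n{instruction.strip()}\n\n### Response:\n"


def load_model(adapter_id: str = ADAPTER_ID):
    """Load the base model in float16 and attach the fine-tuned adapter."""
    # Imported lazily so the helpers above work without the GPU stack installed.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_ID)
    base = AutoModelForCausalLM.from_pretrained(
        BASE_MODEL_ID, torch_dtype=torch.float16, device_map="auto"
    )
    model = PeftModel.from_pretrained(base, adapter_id)
    model.eval()
    return tokenizer, model


def generate(tokenizer, model, instruction: str, max_new_tokens: int = 256) -> str:
    """Generate a completion; the returned text still includes the prompt."""
    inputs = tokenizer(build_prompt(instruction), return_tensors="pt").to(model.device)
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id,
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Typical use would be `tokenizer, model = load_model()` followed by `generate(tokenizer, model, "Write a function that reverses a string.")`.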

Requirements (Colab Setup)

If you are running this model on Google Colab, you’ll need to:

Go to the left sidebar and click the 🔑 (Secrets) tab.

Add a new secret named HF_TOKEN and set its value to your Hugging Face access token (created in your Hugging Face account settings).

Enable Notebook access for that secret.

Restart the Colab session.

Then log in inside the notebook:
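
A minimal sketch of the login step, assuming the secret is named HF_TOKEN as described above and that `huggingface_hub` is installed. The helper names `get_hf_token` and `login_to_hub` are hypothetical, and the environment-variable fallback is an addition for running outside Colab.

```python
import os


def get_hf_token() -> str:
    """Read HF_TOKEN from Colab Secrets, falling back to the environment."""
    try:
        from google.colab import userdata  # only importable inside Colab

        token = userdata.get("HF_TOKEN")
    except ImportError:
        # Not running in Colab: fall back to an environment variable.
        token = os.environ.get("HF_TOKEN")
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set; add it under Secrets and enable notebook access."
        )
    return token


def login_to_hub() -> None:
    """Authenticate this session with the Hugging Face Hub."""
    from huggingface_hub import login

    login(token=get_hf_token())
```

Calling `login_to_hub()` once at the top of the notebook should be enough for later downloads to authenticate.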

Results

Generates clean, runnable code in 60+ programming languages.

Output filtering ensures only code is returned (no instruction markers).

Runs smoothly on Colab with ~4.5 GB GPU memory usage.
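
The output filtering mentioned above could look roughly like the sketch below. It assumes Alpaca-style `### Instruction:` / `### Input:` / `### Response:` markers in the generated text; the marker strings and the `extract_code` helper are illustrative and may differ from the filtering actually used.

```python
RESPONSE_MARKER = "### Response:"
STOP_MARKERS = ("### Instruction:", "### Input:", "### Response:")


def extract_code(generated: str) -> str:
    """Return only the model's answer, without the prompt or instruction markers."""
    # Keep what follows the first response marker; if there is none, keep everything.
    _, found, tail = generated.partition(RESPONSE_MARKER)
    text = tail if found else generated

    # Cut at the earliest marker the model may have started emitting afterwards.
    cut = len(text)
    for marker in STOP_MARKERS:
        idx = text.find(marker)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut].strip()
```

Applied to a decoded generation that still contains the prompt, this returns just the code between the response marker and any following marker.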

License

bigcode-openrail-m
