LLMs for Code Generation with Retrieval-Augmented Generation (RAG)

This repository contains the source code and findings for the project "RP7.1: LLMs for Code Generation," completed for the Software Engineering 2 course at Politecnico di Milano (A.Y. 2025-2026).

The project investigates whether Retrieval-Augmented Generation (RAG) and advanced prompt engineering can improve the quality of library-centric code generated by Large Language Models (LLMs).

Project Team

Authors:

  • Giacomo Colosio
  • Patrizio Acquadro
  • Tito Nicola Drugman

Supervisors (Professors & PhDs):

  • Matteo Camilli
  • Davide Yi Xian Hu
  • Vincenzo Scotti
  • Giovanni Ennio Quattrocchi

Abstract

This project investigates whether retrieval-augmented generation (RAG) and prompt engineering improve large language models (LLMs) for library-centric code generation. Using CodeLlama-7B-Instruct, we evaluate seven prompt families under baseline (no retrieval) and RAG settings. The knowledge base contains over 7.6k code snippets from two target libraries (PySCF and SEED-Emulator). We compare three single-hop retrievers—BM25 (lexical), CodeBERT (semantic), and a hybrid model—and two multi-hop pipelines (decomposition and iterative refinement). Results show that semantic retrieval with CodeBERT consistently and substantially outperforms both the baseline and BM25. Minimal prompts work best for the baseline, suggesting that added structure without external context can introduce overhead, while advanced prompting combined with high-quality retrieval yields the best results.


Key Features

  • Retrieval-Augmented Generation (RAG): Enhances an LLM with external knowledge by retrieving relevant code snippets before generation.
  • Prompt Engineering: Explores seven distinct prompt families, from minimal task descriptions to checklist-driven and persona-based styles.
  • Advanced Retrieval Strategies:
    • Single-Hop: BM25 (lexical), CodeBERT with Cosine Similarity (semantic), and a Hybrid approach using Reciprocal Rank Fusion (RRF).
    • Multi-Hop: Decomposition ("divide-and-conquer") and Iterative Refinement pipelines to handle complex queries.
  • Comprehensive Evaluation: Uses the CodeBLEU metric to assess the syntactic and semantic quality of the generated code.
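
To make the hybrid fusion step concrete, here is a minimal sketch of Reciprocal Rank Fusion over two ranked lists. The function name, snippet ids, and the smoothing constant `k=60` (the value from the original RRF paper) are illustrative choices, not the repository's actual API:

```python
# Sketch of Reciprocal Rank Fusion (RRF): each document earns
# 1 / (k + rank) from every ranking it appears in, and the fused
# order is by total score. Names here are illustrative only.

def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists (best-first) into one ranking."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["snippet_a", "snippet_b", "snippet_c"]      # lexical retriever
codebert_ranking = ["snippet_b", "snippet_d", "snippet_a"]  # semantic retriever
fused = rrf_fuse([bm25_ranking, codebert_ranking])
# snippet_b ranks first: it is near the top of both lists.
```

Because RRF only uses ranks, not raw scores, it avoids having to calibrate BM25 scores against cosine similarities before combining them.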

Repository Structure

The repository is organized into modules responsible for different stages of the RAG pipeline:

├───data/                  # Datasets and knowledge base files
├───prompts/               # Python modules defining all prompt templates (v1-v9)
├───snippet_retrievers/    # Implementations of BM25, Cosine, and Hybrid retrievers
├───S6/                    # Multi-hop pipeline logic (decomposition and iterative)
├───codegen/               # Scripts for running the code generation pipeline
├───code_evaluation/       # CodeBLEU evaluation scripts
├───models/                # LLM loading, quantization, and client configurations
├───results/               # Output directories for generated code and evaluation scores
├───main.ipynb             # Main Jupyter Notebook to run the end-to-end pipeline
└───complete_code.ipynb    # A comprehensive notebook containing the entire project workflow

Getting Started

Prerequisites

The experiments were conducted on a workstation with the following specifications. A similar setup is recommended for reproducibility.

  • GPU: 2x NVIDIA GeForce RTX 5060 Ti (16 GB VRAM each) or equivalent
  • CPU: 16 Cores
  • RAM: 128 GB
  • OS: Windows/Linux
  • Python: 3.10+

Installation

  1. Clone the repository:

    git clone https://github.com/PatrizioAcquadro/RAG-Code-Generation.git
    cd RAG-Code-Generation
  2. Create and activate a virtual environment (recommended):

    python -m venv venv
    source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
  3. Install the required dependencies:

    pip install -r requirements.txt
  4. API Keys (Optional): If you plan to use cloud-based models (like Gemini or OpenRouter) for multi-hop strategies, create a .env file in the root directory and add your API keys:

    GOOGLE_API_KEY="your-google-api-key"
    OPENROUTER_API_KEY="your-openrouter-api-key"
    

How to Run

The entire experimental pipeline, from data preparation to evaluation, is orchestrated through Jupyter Notebooks.

  1. Launch Jupyter:

    jupyter notebook
  2. Open main.ipynb or complete_code.ipynb:

    • main.ipynb provides a structured, step-by-step walkthrough of the pipeline.
    • complete_code.ipynb contains the full, integrated codebase for a comprehensive view.
  3. Follow the notebook cells: The notebooks guide you through each phase:

    • Loading the LLM and tokenizer (with 4-bit quantization).
    • Preparing datasets and the knowledge base.
    • Running different retrieval strategies (BM25, Cosine, Hybrid, Multi-hop).
    • Generating prompts for each configuration.
    • Executing code generation.
    • Evaluating the outputs using the CodeBLEU metric.
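
To make the prompt-generation step concrete, here is a hedged sketch of how retrieved snippets might be stitched into a CodeLlama-Instruct prompt. The helper name and template wording are illustrative assumptions; only the `[INST] … [/INST]` wrapper is CodeLlama-Instruct's documented chat format:

```python
# Illustrative RAG prompt assembly: retrieved snippets are prepended
# as context before the task description. Template text is a sketch,
# not the repo's actual prompt family (v1-v9).

def build_rag_prompt(task, snippets):
    context = "\n\n".join(
        f"# Retrieved snippet {i + 1}\n{s}" for i, s in enumerate(snippets)
    )
    return (
        "[INST] You are given reference code from the target library.\n\n"
        f"{context}\n\n"
        f"Task: {task}\n"
        "Write Python code that completes the task. [/INST]"
    )

prompt = build_rag_prompt(
    "Compute the Hartree-Fock energy of H2 with PySCF.",
    ["from pyscf import gto, scf", "mf = scf.RHF(mol).run()"],
)
```

In the RAG configurations, the `snippets` argument would come from one of the retrievers above; in the baseline configuration, the context block is simply omitted.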

Key Findings

The study yielded several key insights into building effective RAG systems for code generation:

  • Semantic Retrieval is King: The semantic retriever (CodeBERT) achieved the highest and most consistent scores (~0.25 CodeBLEU), significantly outperforming the lexical (BM25) and baseline approaches.
  • Advanced RAG Rivals Pure Semantics: The multi-hop decomposition strategy, when paired with a detailed checklist-style prompt (v8), achieved a top-tier CodeBLEU score of 0.2505, demonstrating its effectiveness on complex tasks.
  • Prompt Quality Matters: For the baseline (no RAG), minimal and direct prompts performed best (0.1859 with v1). Adding structural complexity without relevant context snippets tended to confuse the model and degrade performance.
  • Hybrid Methods Offer a Middle Ground: The hybrid retriever (BM25 + CodeBERT) improved upon pure BM25 but could not match the performance of pure semantic retrieval, suggesting that for this dataset, the lexical signal added some noise.

For a detailed breakdown of scores across all variants and prompt versions, please refer to Table 1 in the project report PDF.
