LLMs for Code Generation with Retrieval-Augmented Generation (RAG)

This repository contains the source code and findings for the project "RP7.1: LLMs for Code Generation," completed for the Software Engineering 2 course at Politecnico di Milano (A.Y. 2025-2026).

The project investigates whether Retrieval-Augmented Generation (RAG) and advanced prompt engineering can improve the quality of library-centric code generated by Large Language Models (LLMs).

Project Team

Authors:

  • Giacomo Colosio
  • Patrizio Acquadro
  • Tito Nicola Drugman

Supervisors (Professors & PhDs):

  • Matteo Camilli
  • Davide Yi Xian Hu
  • Vincenzo Scotti
  • Giovanni Ennio Quattrocchi

Abstract

This project investigates whether retrieval-augmented generation (RAG) and prompt engineering improve large language models (LLMs) for library-centric code generation. Using CodeLlama-7B-Instruct, we evaluate seven prompt families under baseline (no retrieval) and RAG settings. The knowledge base contains over 7.6k code snippets from two target libraries (PySCF and SEED-Emulator). We compare three single-hop retrievers—BM25 (lexical), CodeBERT (semantic), and a hybrid model—and two multi-hop pipelines (decomposition and iterative refinement). Results show that semantic retrieval with CodeBERT consistently and substantially outperforms both the baseline and BM25. Minimal prompts work best for the baseline, suggesting that added structure without external context can introduce overhead, while advanced prompting combined with high-quality retrieval yields the best results.


Key Features

  • Retrieval-Augmented Generation (RAG): Enhances an LLM with external knowledge by retrieving relevant code snippets before generation.
  • Prompt Engineering: Explores seven distinct prompt families, from minimal task descriptions to checklist-driven and persona-based styles.
  • Advanced Retrieval Strategies:
    • Single-Hop: BM25 (lexical), CodeBERT with Cosine Similarity (semantic), and a Hybrid approach using Reciprocal Rank Fusion (RRF).
    • Multi-Hop: Decomposition ("divide-and-conquer") and Iterative Refinement pipelines to handle complex queries.
  • Comprehensive Evaluation: Uses the CodeBLEU metric to assess the syntactic and semantic quality of the generated code.
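
To make the hybrid fusion step concrete, here is a minimal sketch of Reciprocal Rank Fusion over two ranked lists. The function name, snippet ids, and the smoothing constant `k=60` (the value from the original RRF paper) are illustrative choices, not the repository's actual API:

```python
# Sketch of Reciprocal Rank Fusion (RRF): each document earns
# 1 / (k + rank) from every ranking it appears in, and the fused
# order is by total score. Names here are illustrative only.

def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists (best-first) into one ranking."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["snippet_a", "snippet_b", "snippet_c"]      # lexical retriever
codebert_ranking = ["snippet_b", "snippet_d", "snippet_a"]  # semantic retriever
fused = rrf_fuse([bm25_ranking, codebert_ranking])
# snippet_b ranks first: it is near the top of both lists.
```

Because RRF only uses ranks, not raw scores, it avoids having to calibrate BM25 scores against cosine similarities before combining them.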

Repository Structure

The repository is organized into modules responsible for different stages of the RAG pipeline:

├───data/                  # Datasets and knowledge base files
├───prompts/               # Python modules defining all prompt templates (v1-v9)
├───snippet_retrievers/    # Implementations of BM25, Cosine, and Hybrid retrievers
├───S6/                    # Multi-hop pipeline logic (decomposition and iterative)
├───codegen/               # Scripts for running the code generation pipeline
├───code_evaluation/       # CodeBLEU evaluation scripts
├───models/                # LLM loading, quantization, and client configurations
├───results/               # Output directories for generated code and evaluation scores
├───main.ipynb             # Main Jupyter Notebook to run the end-to-end pipeline
└───complete_code.ipynb    # A comprehensive notebook containing the entire project workflow

Getting Started

Prerequisites

The experiments were conducted on a workstation with the following specifications. A similar setup is recommended for reproducibility.

  • GPU: 2x NVIDIA GeForce RTX 5060 Ti (16 GB VRAM each) or equivalent
  • CPU: 16 Cores
  • RAM: 128 GB
  • OS: Windows/Linux
  • Python: 3.10+

Installation

  1. Clone the repository:

    git clone https://github.com/PatrizioAcquadro/RAG-Code-Generation.git
    cd RAG-Code-Generation
  2. Create and activate a virtual environment (recommended):

    python -m venv venv
    source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
  3. Install the required dependencies:

    pip install -r requirements.txt
  4. API Keys (Optional): If you plan to use cloud-based models (like Gemini or OpenRouter) for multi-hop strategies, create a .env file in the root directory and add your API keys:

    GOOGLE_API_KEY="your-google-api-key"
    OPENROUTER_API_KEY="your-openrouter-api-key"
    

How to Run

The entire experimental pipeline, from data preparation to evaluation, is orchestrated through Jupyter Notebooks.

  1. Launch Jupyter:

    jupyter notebook
  2. Open main.ipynb or complete_code.ipynb:

    • main.ipynb provides a structured, step-by-step walkthrough of the pipeline.
    • complete_code.ipynb contains the full, integrated codebase for a comprehensive view.
  3. Follow the notebook cells: The notebooks guide you through each phase:

    • Loading the LLM and tokenizer (with 4-bit quantization).
    • Preparing datasets and the knowledge base.
    • Running different retrieval strategies (BM25, Cosine, Hybrid, Multi-hop).
    • Generating prompts for each configuration.
    • Executing code generation.
    • Evaluating the outputs using the CodeBLEU metric.
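
To make the prompt-generation step concrete, here is a hedged sketch of how retrieved snippets might be stitched into a CodeLlama-Instruct prompt. The helper name and template wording are illustrative assumptions; only the `[INST] … [/INST]` wrapper is CodeLlama-Instruct's documented chat format:

```python
# Illustrative RAG prompt assembly: retrieved snippets are prepended
# as context before the task description. Template text is a sketch,
# not the repo's actual prompt family (v1-v9).

def build_rag_prompt(task, snippets):
    context = "\n\n".join(
        f"# Retrieved snippet {i + 1}\n{s}" for i, s in enumerate(snippets)
    )
    return (
        "[INST] You are given reference code from the target library.\n\n"
        f"{context}\n\n"
        f"Task: {task}\n"
        "Write Python code that completes the task. [/INST]"
    )

prompt = build_rag_prompt(
    "Compute the Hartree-Fock energy of H2 with PySCF.",
    ["from pyscf import gto, scf", "mf = scf.RHF(mol).run()"],
)
```

In the RAG configurations, the `snippets` argument would come from one of the retrievers above; in the baseline configuration, the context block is simply omitted.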

Key Findings

The study yielded several key insights into building effective RAG systems for code generation:

  • Semantic Retrieval is King: The semantic retriever (CodeBERT) achieved the highest and most consistent scores (~0.25 CodeBLEU), significantly outperforming the lexical (BM25) and baseline approaches.
  • Advanced RAG Rivals Pure Semantics: The multi-hop decomposition strategy, when paired with a detailed checklist-style prompt (v8), achieved a top-tier CodeBLEU score of 0.2505, demonstrating its effectiveness on complex tasks.
  • Prompt Quality Matters: For the baseline (no RAG), minimal and direct prompts performed best (0.1859 with v1). Adding structural complexity without relevant context snippets tended to confuse the model and degrade performance.
  • Hybrid Methods Offer a Middle Ground: The hybrid retriever (BM25 + CodeBERT) improved upon pure BM25 but could not match the performance of pure semantic retrieval, suggesting that for this dataset, the lexical signal added some noise.

For a detailed breakdown of scores across all variants and prompt versions, please refer to Table 1 in the project report PDF.
