Gabriel-Hong/DesignCodeRAG

Design-Code-RAG

Design-Code-RAG Interactive Session

A domain-specific RAG (Retrieval-Augmented Generation) system specialized for Korean Building Code (KBC) and seismic design guideline PDFs.

Index your PDF documents, then ask questions in natural language to receive evidence-based answers.

Question> How is the base shear force calculated?

╭──── Answer ─────────────────────────────────────────────╮
│ The base shear force is calculated using the following  │
│ formula:                                                │
│                                                         │
│ V = Cₛ × W                                             │
│                                                         │
│ Where,                                                  │
│ - V: Base shear force (kN)                              │
│ - Cₛ: Seismic response coefficient                     │
│ - W: Effective building weight (kN)                     │
│                                                         │
│ Source: seismic_test_5pages, p.3, 1.2 Design Loads      │
╰─────────────────────────────────────────────────────────╯

Key Features

  • Structure-Aware Chunking: keeps each table, formula, and "where" variable-definition block in a single chunk
  • Hybrid Search: combines BM25 keyword matching with kNN semantic search
  • Nori Korean Morphological Analysis: strips particles and endings and decompounds words to improve Korean search quality
  • Automatic Scanned PDF Detection: text PDFs are parsed with PyMuPDF; scanned PDFs go through GPT Vision OCR
  • Interactive Mode: keeps the previous conversation context so follow-up questions work

Prerequisites

  • Python 3.11+
  • Docker (for running Elasticsearch)
  • OpenAI API Key

Installation

1. Elasticsearch + Nori Plugin

docker run -d --name design-code-es \
  -p 9200:9200 \
  -e "discovery.type=single-node" \
  -e "xpack.security.enabled=false" \
  -e "ES_JAVA_OPTS=-Xms2g -Xmx2g" \
  elasticsearch:8.17.0

# Install Nori Korean morphological analyzer
docker exec design-code-es bin/elasticsearch-plugin install analysis-nori
docker restart design-code-es

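Alternatively, start Elasticsearch with the provided docker-compose.yml (the commands below assume the resulting container is named design-code-es):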

docker compose up -d
docker exec design-code-es bin/elasticsearch-plugin install analysis-nori
docker restart design-code-es

Verify Elasticsearch is running:

curl http://localhost:9200
# → You should see a "You Know, for Search" response
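
The curl check confirms that the node is up, but not that the Nori plugin survived the restart. The following is a minimal sketch, assuming the default URL above, that asks Elasticsearch's `_cat/plugins` endpoint whether `analysis-nori` is loaded. The helper names `has_nori` and `check_cluster` are illustrative and are not part of this project.

```python
import json
from urllib.request import urlopen


def has_nori(plugins: list[dict]) -> bool:
    """Return True if any node reports the analysis-nori plugin."""
    return any(p.get("component") == "analysis-nori" for p in plugins)


def check_cluster(es_url: str = "http://localhost:9200") -> bool:
    """Ask a running cluster whether Nori is installed (needs network access)."""
    # With format=json, _cat/plugins returns one object per (node, plugin) pair,
    # each carrying "name", "component", and "version" keys.
    with urlopen(f"{es_url}/_cat/plugins?format=json", timeout=5) as resp:
        plugins = json.load(resp)
    return has_nori(plugins)
```

Calling `check_cluster()` after `docker restart design-code-es` should return `True` once the node has finished starting; a `False` result suggests the plugin install step needs to be repeated.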

2. Install Python Packages

pip install -e .

3. Configure Environment Variables

cp .env.example .env

Add your OpenAI API key to the .env file:

OPENAI_API_KEY=sk-...
ES_URL=http://localhost:9200
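
To illustrate how these two values might be consumed, here is a small standard-library sketch that parses `.env`-style text and fails early when a required key is missing. It is an assumption about a reasonable loader, not the project's actual `src/config.py`; the function names `parse_env` and `require` are hypothetical.

```python
import os


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blanks and '#' comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first '=' only, so URLs and keys containing '=' survive.
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def require(values: dict[str, str], key: str) -> str:
    """Return a setting from the parsed file, falling back to the process environment."""
    value = values.get(key) or os.environ.get(key)
    if not value:
        raise RuntimeError(f"{key} is not set; add it to .env")
    return value
```

For example, `require(parse_env(open(".env").read()), "OPENAI_API_KEY")` would raise a clear error before any indexing work starts if the key were left blank.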

Usage

Document Indexing

# Index a PDF file
design-code-rag ingest data/documents/AIK_G.pdf

# Index all PDFs in a directory
design-code-rag ingest data/documents/

# Re-index (overwrite existing data)
design-code-rag ingest data/documents/AIK_G.pdf --force

Querying

# Single question
design-code-rag ask "How is the base shear force calculated?"

# Interactive mode
design-code-rag chat

Document Management

# List indexed documents
design-code-rag list

# Index statistics
design-code-rag stats

# Delete a document
design-code-rag delete AIK_G

Interactive Session

Start an interactive session with design-code-rag (no subcommand) or design-code-rag chat. Slash commands are available within the session:

Command                     Description
/ingest <path> [--force]    Index documents
/list                       List indexed documents
/stats                      Index statistics
/delete <doc_name>          Delete a document
/sources                    Review sources from the previous answer
/clear                      Clear conversation history
/help                       Help
/quit                       Exit session

Architecture

The system consists of two pipelines: the Ingest Pipeline, which turns PDFs into indexed chunks, and the Query Pipeline, which turns a question into a sourced answer.

Ingest Pipeline (Document → Vector DB)

PDF File
  │
  ├─ Text PDF  → PyMuPDF4LLM → Markdown conversion
  ├─ Scanned PDF → GPT Vision OCR → Markdown conversion
  │
  ▼
Structure-Aware Chunking (Chunker)
  ├─ Heading-based section splitting
  ├─ Table / formula / text block identification
  └─ Tables & formulas preserved whole; text split into chunks of up to 512 tokens
  │
  ▼
OpenAI Embedding → 1536-dimensional vectors
  │
  ▼
Elasticsearch Bulk Indexing
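
The chunking stage in the diagram above can be sketched roughly as follows. This is a simplified illustration, not the project's `chunker.py`: it assumes blocks are separated by blank lines, detects tables by a leading `|` and formulas by `$$` delimiters, and counts whitespace-separated words instead of real model tokens. "Where" variable-definition blocks and heading tracking are left out.

```python
def split_blocks(markdown: str) -> list[tuple[str, str]]:
    """Split Markdown into (kind, text) blocks separated by blank lines."""
    blocks = []
    for raw in markdown.split("\n\n"):
        text = raw.strip()
        if not text:
            continue
        if all(ln.lstrip().startswith("|") for ln in text.splitlines()):
            kind = "table"
        elif text.startswith("$$") and text.endswith("$$"):
            kind = "formula"
        else:
            kind = "text"
        blocks.append((kind, text))
    return blocks


def window(words: list[str], max_tokens: int, overlap: int) -> list[str]:
    """Slide a window of max_tokens words, advancing by max_tokens - overlap."""
    if overlap >= max_tokens:
        raise ValueError("overlap must be smaller than max_tokens")
    step = max_tokens - overlap
    out = []
    for start in range(0, len(words), step):
        out.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break  # the last window already reached the end of the block
    return out


def chunk(markdown: str, max_tokens: int = 512, overlap: int = 64) -> list[str]:
    """Keep tables and formulas whole; split plain text into overlapping windows."""
    chunks = []
    for kind, text in split_blocks(markdown):
        if kind in ("table", "formula"):
            chunks.append(text)
        else:
            chunks.extend(window(text.split(), max_tokens, overlap))
    return chunks
```

Keeping a table in one chunk means it can exceed the token limit; that trade-off favours retrieving an intact table over uniform chunk sizes.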

Query Pipeline (Question → Answer)

User Question
  │
  ▼
Query Embedding Generation (OpenAI Embedding)
  │
  ▼
Hybrid Search (Elasticsearch)
  ├─ BM25: Keyword matching (boost 0.3)
  └─ kNN: Cosine similarity (boost 0.7)
  │
  ▼
Top 5 chunks + question → LLM (GPT-5 Mini)
  │
  ▼
Evidence-based answer + sources (document name, page, section)
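
One way to express the hybrid search step is an Elasticsearch 8.x search body that carries both a `query` clause and a `knn` clause, each with its own boost. The sketch below only builds that request body as a plain dictionary. The field names `content` and `embedding`, the `num_candidates` value, and the function name are assumptions; the project's `es_store.py` may structure the query differently.

```python
def build_hybrid_query(
    question: str,
    query_vector: list[float],
    top_k: int = 5,
    bm25_boost: float = 0.3,
    knn_boost: float = 0.7,
    text_field: str = "content",      # assumed name of the Nori-analyzed text field
    vector_field: str = "embedding",  # assumed name of the dense_vector field
) -> dict:
    """Build a search body combining BM25 and approximate kNN."""
    return {
        "size": top_k,
        # Lexical side: a BM25-scored match query.
        "query": {"match": {text_field: {"query": question, "boost": bm25_boost}}},
        # Semantic side: approximate nearest neighbours on the embedding.
        "knn": {
            "field": vector_field,
            "query_vector": query_vector,
            "k": top_k,
            "num_candidates": max(50, top_k * 10),  # assumed candidate pool size
            "boost": knn_boost,
        },
    }
```

When both clauses are present, Elasticsearch adds the boosted scores together for documents matched by both sides, so with the defaults a hit scores roughly 0.3 × its BM25 score plus 0.7 × its vector similarity score. BM25 scores are unbounded while cosine-based scores are not, so the effective balance depends on the data as well as on the two weights.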

Project Structure

design-code-rag/
├── config/
│   ├── settings.yaml        # Technical settings: models, tokens, search weights
│   └── prompts.yaml         # System / answer / OCR prompt templates
├── src/
│   ├── config.py            # Config loader (YAML + .env → dataclass)
│   ├── ingest/
│   │   ├── pdf_parser.py    # Text PDF → Markdown
│   │   ├── ocr.py           # Scanned PDF → GPT Vision OCR
│   │   ├── chunker.py       # Structure-aware chunking (tables, formulas, sections)
│   │   ├── embedder.py      # OpenAI embeddings (batch / single)
│   │   └── pipeline.py      # Ingest orchestrator
│   ├── store/
│   │   ├── index_settings.py  # ES index mapping (Nori + dense_vector)
│   │   └── es_store.py        # ES CRUD + hybrid search
│   ├── query/
│   │   ├── retriever.py     # Query embedding → hybrid search
│   │   ├── generator.py     # Context formatting → LLM answer generation
│   │   └── pipeline.py      # Query orchestrator
│   └── cli/
│       ├── app.py           # Click CLI commands
│       └── display.py       # Rich console output
├── tests/                   # pytest tests
├── data/documents/          # PDF file storage
├── pyproject.toml           # Dependencies and build configuration
├── docker-compose.yml       # Elasticsearch Docker configuration
└── .env                     # API keys, ES connection info

Configuration

You can adjust key behaviors in config/settings.yaml without modifying any code:

Setting                     Default                   Description
llm.model                   gpt-5-mini                LLM for answer generation
embedding.model             text-embedding-3-small    Embedding model
embedding.dimensions        1536                      Embedding vector dimensions
chunking.max_tokens         512                       Maximum tokens per text chunk
chunking.overlap_tokens     64                        Overlap tokens between chunks
search.top_k                5                         Number of search results to return
search.bm25_boost           0.3                       BM25 keyword search weight
search.knn_boost            0.7                       kNN semantic search weight
conversation.max_history    5                         Number of recent turns to keep in interactive mode
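
To show how overrides could be layered on these defaults, the sketch below merges a nested mapping (the shape a YAML parser would typically return for settings.yaml) onto the default values listed above and rejects misspelled keys. It is illustrative only: the README describes the real loader as producing a dataclass, and the names `DEFAULTS`, `flatten`, and `resolve` are hypothetical.

```python
# Defaults copied from the settings listed above, keyed by dotted path.
DEFAULTS = {
    "llm.model": "gpt-5-mini",
    "embedding.model": "text-embedding-3-small",
    "embedding.dimensions": 1536,
    "chunking.max_tokens": 512,
    "chunking.overlap_tokens": 64,
    "search.top_k": 5,
    "search.bm25_boost": 0.3,
    "search.knn_boost": 0.7,
    "conversation.max_history": 5,
}


def flatten(nested: dict, prefix: str = "") -> dict:
    """Turn {'search': {'top_k': 8}} into {'search.top_k': 8}."""
    flat = {}
    for key, value in nested.items():
        dotted = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, dotted + "."))
        else:
            flat[dotted] = value
    return flat


def resolve(overrides: dict) -> dict:
    """Merge overrides onto DEFAULTS, rejecting unknown or inconsistent values."""
    flat = flatten(overrides)
    unknown = set(flat) - set(DEFAULTS)
    if unknown:
        raise KeyError(f"unknown settings: {sorted(unknown)}")
    settings = {**DEFAULTS, **flat}
    if settings["chunking.overlap_tokens"] >= settings["chunking.max_tokens"]:
        raise ValueError("chunking.overlap_tokens must be below chunking.max_tokens")
    return settings
```

For instance, `resolve({"search": {"top_k": 8}})` keeps every other default and returns 8 for `search.top_k`, while a typo such as `topk` raises a `KeyError` instead of being silently ignored.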

Tech Stack

Component                        Technology
LLM                              GPT-5 Mini (OpenAI)
Embedding                        text-embedding-3-small (1536 dimensions)
Vector Store                     Elasticsearch 8.17
Korean Morphological Analysis    Nori (decompound_mode: mixed)
PDF Parsing                      PyMuPDF + PyMuPDF4LLM
OCR                              GPT Vision
CLI                              Click + Rich
