Design-Code-RAG

A domain-specific RAG (Retrieval-Augmented Generation) system specialized for Korean Building Code (KBC) and seismic design guideline PDFs.

Index your PDF documents, then ask questions in natural language to receive evidence-based answers.

Question> How is the base shear force calculated?

╭──── Answer ─────────────────────────────────────────────╮
│ The base shear force is calculated using the following  │
│ formula:                                                │
│                                                         │
│ V = Cₛ × W                                             │
│                                                         │
│ Where,                                                  │
│ - V: Base shear force (kN)                              │
│ - Cₛ: Seismic response coefficient                     │
│ - W: Effective building weight (kN)                     │
│                                                         │
│ Source: seismic_test_5pages, p.3, 1.2 Design Loads      │
╰─────────────────────────────────────────────────────────╯

Key Features

Structure-Aware Chunking: Preserves tables, formulas, and "where" variable-definition blocks as single chunks
Hybrid Search: Combines BM25 keyword matching with kNN semantic search
Nori Korean Morphological Analysis: Strips particles and endings and decompounds words to improve Korean search quality
Automatic Scanned PDF Detection: Text PDFs are processed with PyMuPDF; scanned PDFs go through GPT Vision OCR
Interactive Mode: Keeps previous conversation context for follow-up questions
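
The scanned-PDF detection listed above could be approximated by checking how much extractable text each page yields. The sketch below is one possible heuristic, not necessarily the project's actual logic: the function name `looks_scanned` and both thresholds are assumptions, and the per-page text is passed in as plain strings (for example, collected with PyMuPDF's `page.get_text()`).

```python
def looks_scanned(
    page_texts: list[str],
    min_chars: int = 50,           # illustrative threshold, not a project default
    max_empty_ratio: float = 0.5,  # illustrative threshold, not a project default
) -> bool:
    """Return True when more than max_empty_ratio of pages have almost no text."""
    if not page_texts:
        return True  # nothing extractable, so fall back to OCR
    # A page counts as "empty" when its stripped text is shorter than min_chars.
    empty_pages = sum(1 for text in page_texts if len(text.strip()) < min_chars)
    return empty_pages / len(page_texts) > max_empty_ratio
```

A document flagged this way would be routed to OCR; everything else would go through the regular text parser.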

Prerequisites

Python 3.11+
Docker (for running Elasticsearch)
OpenAI API Key

Installation

1. Elasticsearch + Nori Plugin

docker run -d --name design-code-es \
  -p 9200:9200 \
  -e "discovery.type=single-node" \
  -e "xpack.security.enabled=false" \
  -e "ES_JAVA_OPTS=-Xms2g -Xmx2g" \
  elasticsearch:8.17.0

# Install Nori Korean morphological analyzer
docker exec design-code-es bin/elasticsearch-plugin install analysis-nori
docker restart design-code-es

Or use docker-compose.yml:

docker compose up -d
docker exec design-code-es bin/elasticsearch-plugin install analysis-nori
docker restart design-code-es

Verify Elasticsearch is running:

curl http://localhost:9200
# → The JSON response should include the tagline "You Know, for Search"
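
The same check can be scripted. The following is a minimal sketch using only the standard library; the function names `is_elasticsearch` and `check_cluster` are hypothetical, and it assumes security is disabled as in the `docker run` command above, so no credentials are sent.

```python
import json
from urllib.request import urlopen


def is_elasticsearch(payload: dict) -> bool:
    """Check the root-endpoint JSON for the Elasticsearch tagline."""
    return payload.get("tagline") == "You Know, for Search"


def check_cluster(url: str = "http://localhost:9200", timeout: float = 5.0) -> bool:
    """Fetch the root endpoint and report whether it looks like Elasticsearch."""
    with urlopen(url, timeout=timeout) as response:
        return is_elasticsearch(json.load(response))
```

Calling `check_cluster()` raises `urllib.error.URLError` if the container is not reachable yet, which usually means Elasticsearch is still starting up.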

2. Install Python Packages

pip install -e .

3. Configure Environment Variables

cp .env.example .env

Add your OpenAI API key to the .env file:

OPENAI_API_KEY=sk-...
ES_URL=http://localhost:9200
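
To catch a missing key before running an ingest, the `.env` file can be validated with a few lines of Python. This is a sketch under assumptions: `parse_env` and `missing_keys` are hypothetical helpers rather than part of the project, the parser handles only simple `KEY=VALUE` lines, and treating both variables as required is a guess.

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(
    values: dict[str, str],
    required: tuple[str, ...] = ("OPENAI_API_KEY", "ES_URL"),  # assumed required set
) -> list[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]
```

For example, `missing_keys(parse_env(open(".env").read()))` returns an empty list when both variables are set.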

Usage

Document Indexing

# Index a PDF file
design-code-rag ingest data/documents/AIK_G.pdf

# Index all PDFs in a directory
design-code-rag ingest data/documents/

# Re-index (overwrite existing data)
design-code-rag ingest data/documents/AIK_G.pdf --force

Querying

# Single question
design-code-rag ask "How is the base shear force calculated?"

# Interactive mode
design-code-rag chat

Document Management

# List indexed documents
design-code-rag list

# Index statistics
design-code-rag stats

# Delete a document
design-code-rag delete AIK_G

Interactive Session

Start an interactive session with design-code-rag (no subcommand) or design-code-rag chat. Slash commands are available within the session:

Command	Description
`/ingest <path> [--force]`	Index documents
`/list`	List indexed documents
`/stats`	Index statistics
`/delete <doc_name>`	Delete a document
`/sources`	Review sources from the previous answer
`/clear`	Clear conversation history
`/help`	Help
`/quit`	Exit session
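
One way to dispatch the slash commands in the table above is to split each input line with `shlex` and treat anything that does not start with `/` as a question. This is an illustrative sketch, not the project's CLI code; `parse_input` and the `"ask"` / `"unknown"` labels are hypothetical.

```python
import shlex

# Command names taken from the table above.
COMMANDS = {"/ingest", "/list", "/stats", "/delete", "/sources", "/clear", "/help", "/quit"}


def parse_input(line: str) -> tuple[str, list[str]]:
    """Split session input into (command, args); plain text becomes a question."""
    stripped = line.strip()
    if not stripped.startswith("/"):
        return ("ask", [stripped])
    parts = shlex.split(stripped)  # honors quotes, e.g. /delete "my doc"
    name, args = parts[0], parts[1:]
    if name not in COMMANDS:
        return ("unknown", [name])
    return (name, args)
```

Using `shlex.split` rather than `str.split` keeps quoted paths with spaces together as a single argument.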

Architecture

The system operates on two main pipelines: the Ingest Pipeline and the Query Pipeline.

Ingest Pipeline (Document → Vector DB)

PDF File
  │
  ├─ Text PDF  → PyMuPDF4LLM → Markdown conversion
  ├─ Scanned PDF → GPT Vision OCR → Markdown conversion
  │
  ▼
Structure-Aware Chunking (Chunker)
  ├─ Heading-based section splitting
  ├─ Table / formula / text block identification
  └─ Tables & formulas preserved whole; text split into chunks of up to 512 tokens (64-token overlap)
  │
  ▼
OpenAI Embedding → 1536-dimensional vectors
  │
  ▼
Elasticsearch Bulk Indexing
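
The chunking step of the ingest pipeline can be sketched roughly as follows. This is a simplified illustration, not the code in `chunker.py`: it approximates tokens with whitespace-separated words, detects tables by a leading `|` and formulas by a leading `$$`, and skips heading-based section splitting and "where" block handling. The function names are hypothetical.

```python
def split_text(words: list[str], max_tokens: int = 512, overlap: int = 64) -> list[str]:
    """Slide a window of max_tokens words, stepping by max_tokens - overlap."""
    if max_tokens <= overlap:
        raise ValueError("max_tokens must exceed overlap")
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break  # the last window already reached the end
    return chunks


def chunk_markdown(markdown: str, max_tokens: int = 512, overlap: int = 64) -> list[dict]:
    """Keep table and formula blocks whole; window everything else."""
    chunks = []
    for block in markdown.split("\n\n"):
        block = block.strip()
        if not block:
            continue
        if all(line.lstrip().startswith("|") for line in block.splitlines()):
            chunks.append({"type": "table", "text": block})
        elif block.startswith("$$"):
            chunks.append({"type": "formula", "text": block})
        else:
            for piece in split_text(block.split(), max_tokens, overlap):
                chunks.append({"type": "text", "text": piece})
    return chunks
```

Because each window starts `max_tokens - overlap` words after the previous one, consecutive text chunks share `overlap` words, which helps a sentence cut at a boundary remain retrievable from either side.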

Query Pipeline (Question → Answer)

User Question
  │
  ▼
Query Embedding Generation (OpenAI Embedding)
  │
  ▼
Hybrid Search (Elasticsearch)
  ├─ BM25: Keyword matching (boost 0.3)
  └─ kNN: Cosine similarity (boost 0.7)
  │
  ▼
Top 5 chunks + question → LLM (GPT-5 Mini)
  │
  ▼
Evidence-based answer + sources (document name, page, section)
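
For illustration, a hybrid request of the kind described in the query pipeline could be built like this. Elasticsearch 8.x accepts a `query` clause and a top-level `knn` clause in the same search body and sums their boosted scores. The field names `content` and `embedding`, the `num_candidates` rule, and the function name are assumptions; the project's `es_store.py` may construct the request differently.

```python
def build_hybrid_query(
    question: str,
    query_vector: list[float],
    top_k: int = 5,
    bm25_boost: float = 0.3,
    knn_boost: float = 0.7,
) -> dict:
    """Build a search body that combines a BM25 match with a kNN clause."""
    return {
        "size": top_k,
        # Keyword side: BM25 score scaled by bm25_boost.
        "query": {"match": {"content": {"query": question, "boost": bm25_boost}}},
        # Semantic side: vector similarity score scaled by knn_boost.
        "knn": {
            "field": "embedding",  # assumed dense_vector field name
            "query_vector": query_vector,
            "k": top_k,
            "num_candidates": max(50, top_k * 10),  # illustrative choice
            "boost": knn_boost,
        },
    }
```

The resulting dictionary can be sent as the JSON body of a `POST /<index>/_search` request. With the default boosts, vector similarity contributes more to the final ranking than keyword overlap.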

Project Structure

design-code-rag/
├── config/
│   ├── settings.yaml        # Technical settings: models, tokens, search weights
│   └── prompts.yaml         # System / answer / OCR prompt templates
├── src/
│   ├── config.py            # Config loader (YAML + .env → dataclass)
│   ├── ingest/
│   │   ├── pdf_parser.py    # Text PDF → Markdown
│   │   ├── ocr.py           # Scanned PDF → GPT Vision OCR
│   │   ├── chunker.py       # Structure-aware chunking (tables, formulas, sections)
│   │   ├── embedder.py      # OpenAI embeddings (batch / single)
│   │   └── pipeline.py      # Ingest orchestrator
│   ├── store/
│   │   ├── index_settings.py  # ES index mapping (Nori + dense_vector)
│   │   └── es_store.py        # ES CRUD + hybrid search
│   ├── query/
│   │   ├── retriever.py     # Query embedding → hybrid search
│   │   ├── generator.py     # Context formatting → LLM answer generation
│   │   └── pipeline.py      # Query orchestrator
│   └── cli/
│       ├── app.py           # Click CLI commands
│       └── display.py       # Rich console output
├── tests/                   # pytest tests
├── data/documents/          # PDF file storage
├── pyproject.toml           # Dependencies and build configuration
├── docker-compose.yml       # Elasticsearch Docker configuration
└── .env                     # API keys, ES connection info

Configuration

You can adjust key behaviors in config/settings.yaml without modifying any code:

Setting	Default	Description
`llm.model`	`gpt-5-mini`	LLM for answer generation
`embedding.model`	`text-embedding-3-small`	Embedding model
`embedding.dimensions`	`1536`	Embedding vector dimensions
`chunking.max_tokens`	`512`	Maximum tokens per text chunk
`chunking.overlap_tokens`	`64`	Overlap tokens between chunks
`search.top_k`	`5`	Number of search results to return
`search.bm25_boost`	`0.3`	BM25 keyword search weight
`search.knn_boost`	`0.7`	kNN semantic search weight
`conversation.max_history`	`5`	Number of recent turns to keep in interactive mode
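
As a rough picture of how such settings might map onto dataclasses, the sketch below applies a nested dictionary (for example, the result of parsing `settings.yaml`) on top of the defaults from the table. It covers only the `search` and `chunking` sections, and the class and function names are hypothetical; the real loader in `config.py` may be organized differently.

```python
from dataclasses import dataclass, field, fields


@dataclass
class SearchSettings:
    top_k: int = 5
    bm25_boost: float = 0.3
    knn_boost: float = 0.7


@dataclass
class ChunkingSettings:
    max_tokens: int = 512
    overlap_tokens: int = 64


@dataclass
class Settings:
    search: SearchSettings = field(default_factory=SearchSettings)
    chunking: ChunkingSettings = field(default_factory=ChunkingSettings)


def load_settings(raw: dict) -> Settings:
    """Overlay a nested dict on the defaults, rejecting unknown keys."""
    settings = Settings()
    for section in fields(settings):
        target = getattr(settings, section.name)
        known = {f.name for f in fields(target)}
        for key, value in (raw.get(section.name) or {}).items():
            if key not in known:
                raise KeyError(f"unknown setting: {section.name}.{key}")
            setattr(target, key, value)
    return settings
```

Rejecting unknown keys means a typo such as `search.topk` fails loudly instead of silently leaving the default in place.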

Tech Stack

Component	Technology
LLM	GPT-5 Mini (OpenAI)
Embedding	text-embedding-3-small (1536 dimensions)
Vector Store	Elasticsearch 8.17
Korean Morphological Analysis	Nori (decompound_mode: mixed)
PDF Parsing	PyMuPDF + PyMuPDF4LLM
OCR	GPT Vision
CLI	Click + Rich
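
To show how the Nori and vector-store rows fit together, here is a sketch of an index definition that pairs a Nori analyzer in `mixed` decompound mode with a 1536-dimensional `dense_vector` field. The tokenizer type, the `nori_part_of_speech` filter, and the `dense_vector` options are standard Elasticsearch features; the names `nori_mixed`, `korean`, `content`, and `embedding` are assumptions, and the actual mapping in `index_settings.py` may differ.

```python
def build_index_body(dims: int = 1536) -> dict:
    """Return settings and mappings for a Korean text field plus a vector field."""
    return {
        "settings": {
            "analysis": {
                "tokenizer": {
                    # "mixed" keeps the original compound and emits its parts.
                    "nori_mixed": {"type": "nori_tokenizer", "decompound_mode": "mixed"}
                },
                "analyzer": {
                    "korean": {
                        "type": "custom",
                        "tokenizer": "nori_mixed",
                        # Default stoptags drop particles and endings, among others.
                        "filter": ["nori_part_of_speech", "lowercase"],
                    }
                },
            }
        },
        "mappings": {
            "properties": {
                "content": {"type": "text", "analyzer": "korean"},
                "embedding": {
                    "type": "dense_vector",
                    "dims": dims,
                    "index": True,
                    "similarity": "cosine",
                },
            }
        },
    }
```

A body like this would be sent once when the index is created, before any documents are bulk-indexed.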
