A domain-specific RAG (Retrieval-Augmented Generation) system specialized for Korean Building Code (KBC) and seismic design guideline PDFs.
Index your PDF documents, then ask questions in natural language to receive evidence-based answers.
Question> How is the base shear force calculated?
╭──── Answer ─────────────────────────────────────────────╮
│ The base shear force is calculated using the following │
│ formula: │
│ │
│ V = Cₛ × W │
│ │
│ Where, │
│ - V: Base shear force (kN) │
│ - Cₛ: Seismic response coefficient │
│ - W: Effective building weight (kN) │
│ │
│ Source: seismic_test_5pages, p.3, 1.2 Design Loads │
╰─────────────────────────────────────────────────────────╯
- Structure-Aware Chunking — Preserves tables, formulas, and "where" variable definition blocks as single chunks
- Hybrid Search — Combines BM25 keyword matching + kNN semantic search
- Nori Korean Morphological Analysis — Removes particles/endings and decompounds words for better Korean search quality
- Automatic Scanned PDF Detection — Text PDFs are processed with PyMuPDF; scanned PDFs use GPT Vision OCR
- Interactive Mode — Maintains previous conversation context for follow-up questions
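The feature list does not say how scanned PDFs are told apart from text PDFs. One common heuristic, shown here as a rough sketch only, is to extract text from each page first and fall back to OCR when most pages yield almost nothing. The function name `looks_scanned` and both thresholds are hypothetical and are not taken from this project.

```python
def looks_scanned(page_texts, min_chars=50, max_sparse_ratio=0.5):
    """Return True when most pages carry almost no extractable text.

    page_texts: one string per page, as returned by a text extractor.
    min_chars / max_sparse_ratio: assumed thresholds, tune per corpus.
    """
    if not page_texts:
        return True  # nothing extractable at all: route to OCR
    # Count pages whose non-whitespace character count is below the threshold.
    sparse = sum(
        1 for text in page_texts if len("".join(text.split())) < min_chars
    )
    return sparse / len(page_texts) > max_sparse_ratio


if __name__ == "__main__":
    print(looks_scanned(["", "  \n", "3"]))   # mostly empty pages
    print(looks_scanned(["x" * 200] * 4))     # text-rich pages
```

A per-page ratio rather than a whole-document character count keeps a single text-heavy cover page from hiding an otherwise scanned document.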
- Python 3.11+
- Docker (for running Elasticsearch)
- OpenAI API Key
docker run -d --name design-code-es \
-p 9200:9200 \
-e "discovery.type=single-node" \
-e "xpack.security.enabled=false" \
-e "ES_JAVA_OPTS=-Xms2g -Xmx2g" \
elasticsearch:8.17.0
# Install Nori Korean morphological analyzer
docker exec design-code-es bin/elasticsearch-plugin install analysis-nori
docker restart design-code-es
Or use docker-compose.yml:
docker compose up -d
docker exec design-code-es bin/elasticsearch-plugin install analysis-nori
docker restart design-code-es
Verify Elasticsearch is running:
curl http://localhost:9200
# → You should see a "You Know, for Search" response
pip install -e .
cp .env.example .env
Add your OpenAI API key to the .env file:
OPENAI_API_KEY=sk-...
ES_URL=http://localhost:9200
# Index a PDF file
design-code-rag ingest data/documents/AIK_G.pdf
# Index all PDFs in a directory
design-code-rag ingest data/documents/
# Re-index (overwrite existing data)
design-code-rag ingest data/documents/AIK_G.pdf --force
# Single question
design-code-rag ask "How is the base shear force calculated?"
# Interactive mode
design-code-rag chat
# List indexed documents
design-code-rag list
# Index statistics
design-code-rag stats
# Delete a document
design-code-rag delete AIK_G
Start an interactive session with `design-code-rag` (no subcommand) or `design-code-rag chat`. Slash commands are available within the session:
| Command | Description |
|---|---|
| `/ingest <path> [--force]` | Index documents |
| `/list` | List indexed documents |
| `/stats` | Index statistics |
| `/delete <doc_name>` | Delete a document |
| `/sources` | Review sources from the previous answer |
| `/clear` | Clear conversation history |
| `/help` | Help |
| `/quit` | Exit session |
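As an illustration of how a session loop might tell slash commands apart from ordinary questions, here is a minimal sketch. It is not the project's actual dispatcher: the `parse_input` function and its return shape are hypothetical, and only the command names come from the table above.

```python
import shlex

# Command names as listed in the slash-command table.
COMMANDS = {
    "/ingest", "/list", "/stats", "/delete",
    "/sources", "/clear", "/help", "/quit",
}


def parse_input(line):
    """Classify one line of user input.

    Returns (kind, value, args) where kind is "question", "command",
    or "unknown" (a slash-prefixed word that is not a known command).
    """
    text = line.strip()
    if not text.startswith("/"):
        return ("question", text, [])
    try:
        # shlex keeps quoted paths with spaces together as one argument.
        parts = shlex.split(text)
    except ValueError:  # e.g. an unclosed quote
        parts = text.split()
    name, args = parts[0], parts[1:]
    kind = "command" if name in COMMANDS else "unknown"
    return (kind, name, args)


if __name__ == "__main__":
    print(parse_input('/ingest "data/my docs/a.pdf" --force'))
    print(parse_input("How is the base shear force calculated?"))
```

Anything that does not start with `/` is treated as a question, so a follow-up like "and for W?" never needs escaping.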
The system operates on two main pipelines: the Ingest Pipeline and the Query Pipeline.
PDF File
│
├─ Text PDF → PyMuPDF4LLM → Markdown conversion
├─ Scanned PDF → GPT Vision OCR → Markdown conversion
│
▼
Structure-Aware Chunking (Chunker)
├─ Heading-based section splitting
├─ Table / formula / text block identification
└─ Tables & formulas preserved whole; text split into chunks of at most 512 tokens
│
▼
OpenAI Embedding → 1536-dimensional vectors
│
▼
Elasticsearch Bulk Indexing
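The chunking stage of the ingest pipeline can be sketched roughly as follows. This is a simplified illustration, not the code in `chunker.py`: it assumes blocks are separated by blank lines, that formulas are delimited by `$$`, and it counts whitespace-separated words as a stand-in for real tokens. All function names are hypothetical.

```python
import re


def split_blocks(markdown):
    """Split Markdown into (kind, text) blocks separated by blank lines."""
    blocks = []
    for raw in re.split(r"\n\s*\n", markdown.strip()):
        text = raw.strip()
        lines = text.splitlines()
        if not lines:
            continue
        if lines[0].startswith("#"):
            kind = "heading"
        elif all(line.lstrip().startswith("|") for line in lines):
            kind = "table"
        elif text.startswith("$$"):
            kind = "formula"
        else:
            kind = "text"
        blocks.append((kind, text))
    return blocks


def split_text(text, max_tokens=512, overlap=64):
    """Split long text into overlapping windows (words approximate tokens)."""
    words = text.split()
    if len(words) <= max_tokens:
        return [text]
    step = max_tokens - overlap  # requires overlap < max_tokens
    return [
        " ".join(words[i:i + max_tokens])
        for i in range(0, len(words) - overlap, step)
    ]


def chunk_markdown(markdown, max_tokens=512, overlap=64):
    """Return chunks tagged with their section; tables and formulas stay whole."""
    chunks, section = [], ""
    blocks = split_blocks(markdown)
    i = 0
    while i < len(blocks):
        kind, text = blocks[i]
        if kind == "heading":
            section = text.lstrip("#").strip()
        elif kind == "formula":
            # Keep a following "where ..." definition block with its formula.
            nxt = blocks[i + 1] if i + 1 < len(blocks) else None
            if nxt and nxt[0] == "text" and nxt[1].lower().startswith("where"):
                text = text + "\n\n" + nxt[1]
                i += 1
            chunks.append({"section": section, "kind": kind, "text": text})
        elif kind == "table":
            chunks.append({"section": section, "kind": kind, "text": text})
        else:
            for piece in split_text(text, max_tokens, overlap):
                chunks.append({"section": section, "kind": kind, "text": piece})
        i += 1
    return chunks
```

Keeping the "where" block attached to its formula means a retrieved chunk carries both the equation and the meaning of its symbols, which is what the answer in the example at the top relies on.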
User Question
│
▼
Query Embedding Generation (OpenAI Embedding)
│
▼
Hybrid Search (Elasticsearch)
├─ BM25: Keyword matching (boost 0.3)
└─ kNN: Cosine similarity (boost 0.7)
│
▼
Top 5 chunks + question → LLM (GPT-5 Mini)
│
▼
Evidence-based answer + sources (document name, page, section)
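The hybrid search step can be expressed as a single Elasticsearch 8.x search request that carries both a `query` clause (BM25) and a top-level `knn` clause, each with its own `boost`. The sketch below only builds the request body; the field names `content` and `embedding` and the `num_candidates` rule are assumptions, and the project's `es_store.py` may build its query differently.

```python
def build_hybrid_query(question, query_vector, top_k=5,
                       bm25_boost=0.3, knn_boost=0.7):
    """Build a search body combining BM25 and approximate kNN.

    Elasticsearch adds the boosted scores of both clauses for documents
    that match either one. Field names here are assumed, not confirmed.
    """
    return {
        "size": top_k,
        "query": {
            "match": {
                "content": {"query": question, "boost": bm25_boost}
            }
        },
        "knn": {
            "field": "embedding",
            "query_vector": query_vector,
            "k": top_k,
            # Candidates examined per shard; must be at least k.
            "num_candidates": max(50, top_k * 10),
            "boost": knn_boost,
        },
    }


if __name__ == "__main__":
    body = build_hybrid_query("base shear force", [0.0] * 1536)
    print(body["query"], body["knn"]["k"])
```

The resulting dictionary can be passed as the body of a `_search` request. With boosts of 0.3 and 0.7, semantic similarity dominates the ranking while exact clause numbers or terms still lift a keyword match.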
design-code-rag/
├── config/
│ ├── settings.yaml # Technical settings: models, tokens, search weights
│ └── prompts.yaml # System / answer / OCR prompt templates
├── src/
│ ├── config.py # Config loader (YAML + .env → dataclass)
│ ├── ingest/
│ │ ├── pdf_parser.py # Text PDF → Markdown
│ │ ├── ocr.py # Scanned PDF → GPT Vision OCR
│ │ ├── chunker.py # Structure-aware chunking (tables, formulas, sections)
│ │ ├── embedder.py # OpenAI embeddings (batch / single)
│ │ └── pipeline.py # Ingest orchestrator
│ ├── store/
│ │ ├── index_settings.py # ES index mapping (Nori + dense_vector)
│ │ └── es_store.py # ES CRUD + hybrid search
│ ├── query/
│ │ ├── retriever.py # Query embedding → hybrid search
│ │ ├── generator.py # Context formatting → LLM answer generation
│ │ └── pipeline.py # Query orchestrator
│ └── cli/
│ ├── app.py # Click CLI commands
│ └── display.py # Rich console output
├── tests/ # pytest tests
├── data/documents/ # PDF file storage
├── pyproject.toml # Dependencies and build configuration
├── docker-compose.yml # Elasticsearch Docker configuration
└── .env # API keys, ES connection info
You can adjust key behaviors in config/settings.yaml without modifying any code:
| Setting | Default | Description |
|---|---|---|
| `llm.model` | `gpt-5-mini` | LLM for answer generation |
| `embedding.model` | `text-embedding-3-small` | Embedding model |
| `embedding.dimensions` | 1536 | Embedding vector dimensions |
| `chunking.max_tokens` | 512 | Maximum tokens per text chunk |
| `chunking.overlap_tokens` | 64 | Overlap tokens between chunks |
| `search.top_k` | 5 | Number of search results to return |
| `search.bm25_boost` | 0.3 | BM25 keyword search weight |
| `search.knn_boost` | 0.7 | kNN semantic search weight |
| `conversation.max_history` | 5 | Number of recent turns to keep in interactive mode |
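To illustrate how such settings might map onto a dataclass with defaults, here is a small sketch for the `search.*` group. It assumes the YAML file has already been parsed into a nested dictionary with a `search` section; that layout, the `SearchSettings` class, and `load_search_settings` are hypothetical and may not match `src/config.py`.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class SearchSettings:
    # Defaults mirror the settings table.
    top_k: int = 5
    bm25_boost: float = 0.3
    knn_boost: float = 0.7


def load_search_settings(raw):
    """Overlay the 'search' section of a parsed config dict on the defaults."""
    section = raw.get("search") or {}
    known = {f.name for f in fields(SearchSettings)}
    unknown = set(section) - known
    if unknown:
        # Fail loudly on typos instead of silently ignoring them.
        raise KeyError(f"unknown search settings: {sorted(unknown)}")
    return SearchSettings(**section)


if __name__ == "__main__":
    print(load_search_settings({}))
    print(load_search_settings({"search": {"top_k": 10}}))
```

Rejecting unknown keys is a deliberate choice in this sketch: a misspelled `knn_bost` would otherwise leave the default weight in place without any warning.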
| Component | Technology |
|---|---|
| LLM | GPT-5 Mini (OpenAI) |
| Embedding | text-embedding-3-small (1536 dimensions) |
| Vector Store | Elasticsearch 8.17 |
| Korean Morphological Analysis | Nori (decompound_mode: mixed) |
| PDF Parsing | PyMuPDF + PyMuPDF4LLM |
| OCR | GPT Vision |
| CLI | Click + Rich |
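For orientation, an index definition combining a Nori analyzer in `mixed` decompound mode with a 1536-dimensional `dense_vector` field might look roughly like the sketch below. The analyzer and tokenizer names (`korean`, `nori_mixed`), the field names, and the choice of token filters are assumptions; the actual mapping lives in `index_settings.py` and may differ.

```python
def build_index_body(dims=1536):
    """Return index settings and mappings for Korean text plus vectors."""
    return {
        "settings": {
            "analysis": {
                "tokenizer": {
                    # "mixed" keeps the compound and also emits its parts.
                    "nori_mixed": {
                        "type": "nori_tokenizer",
                        "decompound_mode": "mixed",
                    }
                },
                "analyzer": {
                    "korean": {
                        "type": "custom",
                        "tokenizer": "nori_mixed",
                        # nori_part_of_speech drops particles and endings.
                        "filter": ["nori_part_of_speech", "lowercase"],
                    }
                },
            }
        },
        "mappings": {
            "properties": {
                "content": {"type": "text", "analyzer": "korean"},
                "embedding": {
                    "type": "dense_vector",
                    "dims": dims,
                    "index": True,
                    "similarity": "cosine",
                },
            }
        },
    }


if __name__ == "__main__":
    body = build_index_body()
    print(body["mappings"]["properties"]["embedding"])
```

The `dims` value has to match the embedding model's output size, so changing `embedding.dimensions` or the embedding model generally means recreating the index and re-ingesting documents.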
