Korean financial text-table benchmark for evaluating tool-augmented agents on QA tasks
conda create -n sglang python=3.10
conda activate sglang
pip install torch==2.6.0 torchvision==0.21.0 --index-url https://download.pytorch.org/whl/cu124
pip install flashinfer-python -i https://flashinfer.ai/whl/cu124/torch2.6/
pip install sglang[srt]

Recommended: Keep vLLM in a separate Conda environment to avoid conflicts.
conda create -n vllm python=3.10
conda activate vllm
pip install torch==2.6.0 torchvision==0.21.0 --index-url https://download.pytorch.org/whl/cu124
pip install vllm==0.8.5

conda create -n kraft3qa python=3.10
conda activate kraft3qa
pip install -e .

python scripts/data/0_fetch_listed_companies.py

1. Retrieve Corporate Filings and Financial Statements from Open DART
Set your Open DART API key:
export OPENDART_API_KEY='YOUR_API_KEY'

Then run:
python scripts/data/1_fetch_dart_filings.py
python scripts/data/2_fetch_dart_finstates.py

python scripts/data/3_extract_business_from_dart_filings.py --outdir data/dart_filings/business/1_overview
python scripts/data/3_extract_business_from_dart_filings.py --outdir data/dart_filings/business/2_products+services
python scripts/data/3_extract_business_from_dart_filings.py --outdir data/dart_filings/business/3_materials+facilities
python scripts/data/3_extract_business_from_dart_filings.py --outdir data/dart_filings/business/4_sales+orders
python scripts/data/3_extract_business_from_dart_filings.py --outdir data/dart_filings/business/5_risk_management
python scripts/data/3_extract_business_from_dart_filings.py --outdir data/dart_filings/business/6_contracts+rnd
python scripts/data/3_extract_business_from_dart_filings.py --outdir data/dart_filings/business/7_others

Transform CSV-format financial statements into structured JSON+HTML tables:
python scripts/data/4_generate_finstate_tables.py

python scripts/data/5_extract_mda_from_dart_filings.py

Run SGLang server with Qwen3 32B:
python -m sglang.launch_server --model-path Qwen/Qwen3-32B --port 8000 --reasoning-parser qwen3

Then generate the QA dataset:
python scripts/data/6_make_qa_dataset.py

python -m sglang.launch_server --model-path Qwen/Qwen3-32B --port 8000 --reasoning-parser qwen3
python scripts/data/7_filter_qa_dataset.py --indir data/qa_raw --outdir data/qa_qwen3-filter

python -m sglang.launch_server --model-path LGAI-EXAONE/EXAONE-3.5-32B-Instruct --port 8000 --attention-backend triton
python scripts/data/7_filter_qa_dataset.py --indir data/qa_qwen3-filter --outdir data/qa_qwen3-filter+exaone3.5-filter

python scripts/data/8_shuffle_dataset_answers.py --indir data/qa_qwen3-filter+exaone3.5-filter --outdir dataset

| Model | Params | Accuracy (%) | Valid Response Rate (%) |
|---|---|---|---|
| Qwen3 (w/ Thinking) | 32B | 71.2 | 98.8 |
| Gemma 3 | 27B | 66.3 | 99.6 |
| EXAONE 3.5 | 32B | 62.2 | 95.1 |
| Kanana 1.5 | 8B | 54.2 | 92.9 |
| Llama 3.2 | 3B | 23.2 | 61.5 |
| HyperCLOVA X SEED | 1.5B | 14.1 | 41.8 |
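
The two metric columns above can be illustrated with a minimal sketch. It is not the benchmark's scoring code: the function name `score`, the use of `None` to mark a response that could not be parsed into an answer, and the choice to divide both metrics by the total number of questions are all assumptions made for illustration.

```python
from typing import Optional, Sequence


def score(golds: Sequence[str], preds: Sequence[Optional[str]]) -> dict:
    """Compute accuracy and valid response rate as percentages.

    Assumption: a prediction of None means the model's output could not be
    parsed into an answer, and both metrics use all questions as the
    denominator. The actual benchmark may define these differently.
    """
    if len(golds) != len(preds):
        raise ValueError("golds and preds must have the same length")
    total = len(golds)
    if total == 0:
        raise ValueError("at least one question is required")

    # A response is valid if it was parsed into some answer at all.
    valid = sum(p is not None for p in preds)
    # A response is correct if it is valid and matches the gold answer.
    correct = sum(p is not None and p == g for g, p in zip(golds, preds))

    return {
        "accuracy": 100.0 * correct / total,
        "valid_response_rate": 100.0 * valid / total,
    }


if __name__ == "__main__":
    # Hypothetical answers for four questions; the second response is invalid.
    print(score(["A", "B", "C", "D"], ["A", None, "C", "A"]))
```

Under this definition an invalid response also counts as incorrect, so accuracy can never exceed the valid response rate.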
If you use KRAFT3-QA, please cite:
@article{park2025kraft3qa,
title = {{KRAFT}³-{QA}: {Korean} financial text-table benchmark for evaluating tool-augmented agents on {QA} tasks},
author = {Park, Seungjae and Cho, Sung-Bae and Kim, Ha-Young},
journal = {Journal of the Korea Society of Computer and Information},
volume = {30},
number = {8},
pages = {29--39},
year = {2025},
doi = {10.9708/jksci.2025.30.08.029},
language = {ko},
}

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2023R1A2C200337911 and No. RS-2023-00220762).