한국어 | English
⚠️ Unofficial community package. This project is not an official distribution of edwardkim/rhwp. The namerhwp-pythonis used so that the upstream maintainer can take therhwpname on PyPI if they choose to publish officially. Please file bugs in the rhwp core upstream at edwardkim/rhwp/issues.
PyO3 Python bindings for rhwp, a Rust-based parser and renderer for HWP/HWPX documents (Korean word processor format, 한컴오피스).
- PyPI package:
rhwp-python - Python import:
import rhwp - Rust core: pinned via git submodule at
external/rhwp
- HWP + HWPX — The major Python alternative
pyhwpsupports only HWP5 and has been unmaintained since 2016. rhwp handles both formats with the same API. - 62× faster than
pyhwpon HWP5 text extraction (96 ms vs 5,980 ms, sandbox benchmark). - LangChain-ready —
rhwp.integrations.langchain.HwpLoaderships as optional extras, plugging directly into RAG pipelines. - Typed —
py.typed+.pyistubs, pyright clean.
- Python 3.10+ (abi3-py310 wheel covers 3.10 through 3.13+)
- No runtime Python dependencies for the core API
rhwp-python[langchain]extras pull inlangchain-core>=0.2only
pip install rhwp-python
# or
uv add rhwp-pythonimport rhwp
# Parse HWP / HWPX — file loading + parsing released from GIL
doc = rhwp.parse("report.hwp")
print(doc.section_count, doc.paragraph_count, doc.page_count)
# Text
full_text: str = doc.extract_text() # non-empty paragraphs joined by "\n"
paragraphs: list[str] = doc.paragraphs() # raw list, includes empty paragraphs
# SVG rendering — one page, or all pages
svg_page0: str = doc.render_svg(page=0)
all_svgs: list[str] = doc.render_all_svg()
written: list[str] = doc.export_svg("output/", prefix="page")
# → writes page_001.svg, page_002.svg, ... (single-page: page.svg)
# PDF rendering — returns Python `bytes`, not list[int]
pdf: bytes = doc.render_pdf()
byte_size: int = doc.export_pdf("output.pdf")rhwp.Document(path) works identically to rhwp.parse(path).
pip install "rhwp-python[langchain]"from rhwp.integrations.langchain import HwpLoader
# Whole document as a single Document (default — single mode)
docs = HwpLoader("report.hwp").load()
# One Document per non-empty paragraph (for RAG chunking — paragraph mode)
docs = HwpLoader("report.hwp", mode="paragraph").load()
# lazy_load: Document objects yielded on-the-fly (O(1) peak memory in paragraph mode)
for d in HwpLoader("report.hwp", mode="paragraph").lazy_load():
index_into_vector_store(d) # your pipeline
# Plugs into the standard LangChain text splitter
from langchain_text_splitters import RecursiveCharacterTextSplitter
chunks = RecursiveCharacterTextSplitter(chunk_size=500).split_documents(docs)Metadata preserved on every Document: source, section_count, paragraph_count,
page_count, rhwp_version, plus paragraph_index in paragraph mode.
Release build on Apple M2 (8 cores). Parse = file read + full parse + Document creation.
Workload: 9 files (aift.hwp 5.5 MB + table-vpos-01.hwpx 359 KB + tac-img-02.hwpx 3.96 MB, ×3).
| Workers | Parse time | Speedup vs serial |
|---|---|---|
| 1 | 268 ms | 1.00× (baseline) |
| 2 | 141 ms | 1.91× |
| 4 | 97 ms | 2.76× |
| 8 | 67 ms | 4.01× |
parse() and the PDF conversion step release the GIL via py.detach, so
ThreadPoolExecutor scales with cores for batch workloads. PDF rendering itself is
CPU/allocator-bound inside usvg + pdf-writer, so parallelization yields only
약 1.1× on 2–3 workers — see benches/bench_gil.py for reproducible measurement.
Operational constraints (Document single-threaded model, async entry point,
PDF stdout noise) and unimplemented areas are summarized in
KNOWN_ISSUES.md (Korean). In-progress / planned items live
in the active spec index at docs/roadmap/.
This project consumes the rhwp Rust core as a git submodule at external/rhwp.
git clone --recurse-submodules https://github.com/DanMeon/rhwp-python
cd rhwp-python
# install dev+testing+linting tools
uv sync --no-install-project --group all
uv run maturin develop --release
# test (core + LangChain, excluding slow PDF tests)
uv run pytest tests/ -m "not slow"
# PDF rendering tests
uv run pytest tests/ -m slow
# type check
uv run pyright python/ tests/
# benchmark GIL release
uv run python benches/bench_gil.py 2>&1 | grep -v -E "(DEBUG_TAB_POS|LAYOUT_OVERFLOW)"If you forgot --recurse-submodules at clone time:
git submodule update --init --recursiveTest fixtures live in the submodule at external/rhwp/samples/; tests/conftest.py
reads from that path.
For the full build / test / contribute flow, see CONTRIBUTING_EN.md.
This Python package and the rhwp Rust core are versioned independently.
rhwp.version() returns this package's version; rhwp.rhwp_core_version()
returns the Rust core version bundled in the pinned submodule.
MIT. Copyright holders: Edward Kim (rhwp Rust core) and DanMeon (rhwp-python bindings). See LICENSE.
- Bindings source + issues: https://github.com/DanMeon/rhwp-python
- rhwp Rust core: https://github.com/edwardkim/rhwp