Chunky — Markdown-Aware RAG Chunker

A pipeline tool that turns Markdown documents into embedding-ready chunks, builds a provenance knowledge graph, and lets you query your corpus through a Streamlit dashboard. Built to be used as both a GUI and a standalone Python library.

Part I — User Guide

Running the App

Everything is handled by a single launcher script. Clone the repository, then:

python run.py

That is the entire start-up sequence. run.py will:

Read requirements.txt and install all missing dependencies via pip.
Create the sessions/ and tests/ folders if they do not already exist.
Start the Streamlit server on http://localhost:8501 and open a browser tab automatically.

On repeat runs you can skip the install step for a faster launch:

python run.py --no-install

To use a different port (for example when 8501 is already taken):

python run.py --port 8502
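
The start-up sequence and its two flags can be pictured with a short sketch. This is not the actual run.py: the function names plan_launch and ensure_folders are hypothetical, and the commands are only assembled as lists rather than executed, so the logic can be inspected safely.

```python
import argparse
import sys
from pathlib import Path


def plan_launch(argv: list[str]) -> list[list[str]]:
    """Return the commands a launcher like run.py might execute, in order."""
    parser = argparse.ArgumentParser(description="Launcher sketch")
    parser.add_argument("--no-install", action="store_true",
                        help="skip the pip install step")
    parser.add_argument("--port", type=int, default=8501,
                        help="port for the Streamlit server")
    args = parser.parse_args(argv)

    commands: list[list[str]] = []
    if not args.no_install:
        # Install whatever requirements.txt lists, using the current interpreter.
        commands.append([sys.executable, "-m", "pip", "install",
                         "-r", "requirements.txt"])
    # Start Streamlit on the requested port.
    commands.append([sys.executable, "-m", "streamlit", "run", "app.py",
                     "--server.port", str(args.port)])
    return commands


def ensure_folders(root: Path) -> None:
    """Create the runtime folders if they are missing (hypothetical helper)."""
    for name in ("sessions", "tests"):
        (root / name).mkdir(parents=True, exist_ok=True)


for command in plan_launch(["--no-install", "--port", "8502"]):
    print(" ".join(command))
```

A real launcher would pass each command list to subprocess.run; keeping the planning step separate makes the flag handling easy to test.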

What you will see

The sidebar always shows your current progress through six stages and lets you save or load a full session at any time. The main area changes to show only the controls that are relevant to the active stage. Every transition to the next stage requires an explicit button press — adjusting sliders and inputs does not trigger any computation on its own.

Part II — Pipeline Walkthrough

Stage 1 — Upload

Drag a Markdown file onto the uploader (.md only). The app immediately parses the heading structure and shows a collapsible document preview with the detected heading tree. This parse is instantaneous and requires no configuration.
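
To give a feel for what the heading-tree parse produces, here is a minimal sketch that derives a breadcrumb for every heading. It uses a plain regular expression instead of the markdown-it-py parser that the project declares as a dependency, handles only ATX headings (lines starting with #), and the name heading_outline is hypothetical.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.+?)\s*$")
FENCE = "`" * 3  # built programmatically so this example nests cleanly


def heading_outline(markdown: str) -> list[tuple[int, list[str]]]:
    """Return (level, breadcrumb) for every ATX heading outside code fences."""
    outline: list[tuple[int, list[str]]] = []
    stack: list[tuple[int, str]] = []  # open ancestors as (level, title)
    in_code = False
    for line in markdown.splitlines():
        if line.lstrip().startswith((FENCE, "~~~")):
            in_code = not in_code  # toggle on every fence line
            continue
        match = None if in_code else HEADING.match(line)
        if match is None:
            continue
        level, title = len(match.group(1)), match.group(2)
        # Pop siblings and deeper headings so the stack holds only ancestors.
        while stack and stack[-1][0] >= level:
            stack.pop()
        stack.append((level, title))
        outline.append((level, [t for _, t in stack]))
    return outline


sample = "\n".join([
    "# Guide",
    "Intro text.",
    "## Install",
    FENCE,
    "# not a heading, just a shell comment",
    FENCE,
    "## Usage",
    "### Flags",
])
for level, crumb in heading_outline(sample):
    print(level, " > ".join(crumb))
```

The fence tracking matters because a shell comment inside a code block would otherwise be mistaken for an H1.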

Press Proceed to Configure when you are satisfied that the document was read correctly.

Stage 2 — Configure

Choose a preset to set all knobs at once:

Preset	max_tokens	overlap	splitting strategy	emphasis
Precision	200	25 %	sentence greedy	both
Balanced	400	20 %	paragraph first	both
Speed	512	10 %	paragraph first	strip
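
Read as configuration objects, the three presets might look like the sketch below. PresetSketch is an illustrative stand-in rather than the real ChunkerConfig, the identifier spellings follow the ChunkerConfig field table in Part III, and the overlap budget calculation assumes the previous chunk is at the max_tokens ceiling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PresetSketch:
    """Illustrative stand-in for a config object; not the real ChunkerConfig."""
    max_tokens: int
    overlap_pct: float      # fraction, so 25 % becomes 0.25
    splitting_unit: str
    emphasis_mode: str


PRESET_SKETCHES = {
    "Precision": PresetSketch(200, 0.25, "sentence_greedy", "both"),
    "Balanced": PresetSketch(400, 0.20, "paragraph_first", "both"),
    "Speed": PresetSketch(512, 0.10, "paragraph_first", "strip"),
}


def overlap_budget(preset: PresetSketch) -> int:
    """Upper bound on overlap tokens, assuming a full-size previous chunk."""
    return int(preset.max_tokens * preset.overlap_pct)


for name, preset in PRESET_SKETCHES.items():
    print(f"{name}: up to {overlap_budget(preset)} overlap tokens")
```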

The Simple tab exposes the two most impactful knobs — max_tokens and overlap_pct — as sliders with live descriptions. The Advanced tab reveals every option including splitting_unit, emphasis_mode, include_breadcrumb, split_intro_separately, min_chunk_tokens, and cross_section_overlap. Each control has a one-sentence explanation shown inline.
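
To make the overlap_pct knob concrete, the sketch below picks trailing whole sentences from the previous chunk until the overlap budget is used up. It is an approximation under stated assumptions: the sentence splitter is a naive regex, the token estimate uses the word count × 1.3 rule from the Chunk field table with truncation to an integer, and the names estimate_tokens and overlap_prefix are hypothetical.

```python
import re


def estimate_tokens(text: str) -> int:
    """Word count x 1.3, truncated to an int (the rounding is an assumption)."""
    return int(len(text.split()) * 1.3)


def overlap_prefix(previous_chunk: str, overlap_pct: float) -> str:
    """Trailing whole sentences of previous_chunk within the overlap budget."""
    budget = estimate_tokens(previous_chunk) * overlap_pct
    # Naive sentence splitter: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", previous_chunk.strip())
    kept: list[str] = []
    used = 0
    for sentence in reversed(sentences):
        cost = estimate_tokens(sentence)
        if used + cost > budget:
            break  # never cut inside a sentence
        kept.insert(0, sentence)
        used += cost
    return " ".join(kept)


previous = ("Alpha beta gamma delta epsilon zeta eta theta. "
            "Iota kappa lambda mu. Nu xi.")
print(repr(overlap_prefix(previous, 0.25)))
print(repr(overlap_prefix(previous, 0.50)))
```

Because the cut always lands on a sentence boundary, the actual overlap is usually somewhat smaller than the nominal percentage.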

Press Run Chunker to execute the pipeline.

Stage 3 — Chunk Review

A summary row shows total chunk count, mean / min / max token estimates, and the number of chunks that needed a force-split (an overlong sentence hard-cut at a word boundary, flagged as long_sentence_split on the chunk). Below it, a D3 histogram of the token distribution is drawn, with a red vertical line marking your max_tokens ceiling.
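
The summary row boils down to a handful of aggregates. A minimal sketch, assuming each chunk is reduced to a (token_estimate, long_sentence_split) pair; the sample rows and the summarise helper are made up for illustration.

```python
from statistics import mean

# Hypothetical rows: (token_estimate, long_sentence_split) per chunk.
chunk_rows = [(120, False), (398, False), (400, True), (64, False)]


def summarise(rows: list[tuple[int, bool]], max_tokens: int) -> dict:
    """Aggregate the numbers a chunk-review summary row would display."""
    tokens = [t for t, _ in rows]
    return {
        "chunks": len(rows),
        "mean_tokens": round(mean(tokens), 1) if tokens else 0,
        "min_tokens": min(tokens, default=0),
        "max_tokens_seen": max(tokens, default=0),
        "force_split": sum(1 for _, split in rows if split),
        # Chunks to the right of the red ceiling line in the histogram.
        "over_ceiling": sum(1 for t in tokens if t > max_tokens),
    }


print(summarise(chunk_rows, max_tokens=400))
```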

The full chunk table is sortable and filterable by heading level and is_section_intro. Selecting any row opens a detail panel that shows every field of the Chunk dataclass. A breadcrumb tree view mirrors the document's heading structure and lets you navigate by section.

Press Proceed to Embed to continue or Re-run with new config to change settings.

Stage 4 — Embed

Choose an embedding backend:

TF-IDF + LSA (default) — fully local, no downloads, runs in seconds. Produces 128-dimensional vectors using bigram TF-IDF followed by Truncated SVD. Suitable for development, testing, and smaller corpora.
SentenceTransformer (BGE) — downloads BAAI/bge-base-en-v1.5 on first run (~440 MB). Produces 768-dimensional vectors with substantially better semantic fidelity. Recommended for production.

A live log stream shows progress. After embedding, a UMAP 2D scatter plot is rendered via D3 — each point is a chunk, coloured by heading level, and hovering shows the breadcrumb and first 80 characters of the chunk text.

Press Proceed to Build KG to continue.

Stage 5 — Knowledge Graph

The provenance graph is built from the parsed section tree and the chunk list. It is visualised as a D3 force-directed graph inside an iframe, with a toolbar that lets you toggle node labels, hide sections or chunks, freeze the layout, and reset the view. Clicking any node shows its full attributes in a tooltip.

Edge types and their colours:

Edge	Colour	Meaning
`contains_section`	blue	parent section → child section
`has_chunk`	green	section node → its chunks
`next_in_section`	amber	chunk → next chunk from the same section
`parent_context`	red (dashed)	chunk → first chunk of its parent section

The parent_context edge is what makes context-expansion during retrieval possible.
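
The four edge types can be derived from two small inputs: which section is the parent of which, and which section each chunk came from. The sketch below does this with plain tuples instead of the networkx graph the project uses. It assumes that a chunk's parent context is the first chunk of the section one level up; the sample data and the build_edges name are hypothetical.

```python
from collections import defaultdict

# Hypothetical input: section -> parent section (None for the root),
# and chunks as (chunk_id, section_id) in document order.
section_parent = {"doc": None, "install": "doc", "usage": "doc"}
chunks = [("c0", "doc"), ("c1", "install"), ("c2", "install"), ("c3", "usage")]


def build_edges(section_parent, chunks):
    """Return (source, target, edge_type) triples for the four edge types."""
    edges = []
    by_section = defaultdict(list)
    for chunk_id, section_id in chunks:
        by_section[section_id].append(chunk_id)

    for section_id, parent_id in section_parent.items():
        if parent_id is not None:
            edges.append((parent_id, section_id, "contains_section"))

    for section_id, ids in by_section.items():
        for chunk_id in ids:
            edges.append((section_id, chunk_id, "has_chunk"))
        for current, following in zip(ids, ids[1:]):
            edges.append((current, following, "next_in_section"))
        # Assumption: parent context is the first chunk of the section one
        # level up, and only exists when that section produced any chunks.
        parent_id = section_parent.get(section_id)
        if parent_id is not None and by_section.get(parent_id):
            first_parent_chunk = by_section[parent_id][0]
            for chunk_id in ids:
                edges.append((chunk_id, first_parent_chunk, "parent_context"))
    return edges


for edge in build_edges(section_parent, chunks):
    print(edge)
```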

Press Proceed to Query to continue.

Stage 6 — Query

Type a natural language query, set the top-k slider (1–20), and optionally enable Include parent context to follow parent_context edges and automatically add parent intro chunks alongside the top-k hits. Press Run Query to search the ChromaDB in-memory collection.
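
What Include parent context adds to a result list can be sketched in a few lines. This is a simplified stand-in for the real expansion, which works on the knowledge graph: here the parent_context edges are reduced to a plain dict, and the name expand_hits is hypothetical.

```python
def expand_hits(top_k_ids: list[str], parent_of: dict[str, str]) -> list[str]:
    """Append each hit's parent intro chunk, keeping order and skipping repeats."""
    result = list(top_k_ids)
    seen = set(top_k_ids)
    for chunk_id in top_k_ids:
        parent_id = parent_of.get(chunk_id)
        if parent_id is not None and parent_id not in seen:
            result.append(parent_id)
            seen.add(parent_id)
    return result


# Hypothetical parent_context edges: chunk id -> parent intro chunk id.
parent_of = {"c1": "c0", "c2": "c0", "c3": "c0"}
print(expand_hits(["c2", "c3"], parent_of))
```

Two hits that share a parent pull in the parent intro only once, so the result list grows by at most one chunk per distinct parent section.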

Results are displayed as ranked chunk cards, each showing the similarity score, breadcrumb path, token count, full text, and any emphasized_terms that were extracted.

Session Save and Load

Use Save session in the sidebar to download a .chunky file containing your document, configuration, chunks, embeddings (base64-encoded), and the knowledge graph. Load session restores the full state including the ChromaDB collection — no recomputation is needed. Save config only saves a lightweight JSON suitable for sharing settings without any document data.
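
Base64-encoding is what lets binary embedding data travel inside a text-based session file. The sketch below shows one way such a round trip could work using only the standard library. The payload layout (shape, dtype, data) and the function names are assumptions, not the actual .chunky format, and the bytes are written in the machine's native byte order.

```python
import base64
import json
from array import array


def encode_vectors(vectors: list[list[float]]) -> dict:
    """Pack rows as float32 bytes and base64-encode them for JSON."""
    flat = array("f", [x for row in vectors for x in row])
    return {
        "shape": [len(vectors), len(vectors[0]) if vectors else 0],
        "dtype": "float32",
        "data": base64.b64encode(flat.tobytes()).decode("ascii"),
    }


def decode_vectors(payload: dict) -> list[list[float]]:
    """Reverse encode_vectors: base64 text back to a list of rows."""
    rows, dim = payload["shape"]
    flat = array("f")
    flat.frombytes(base64.b64decode(payload["data"]))
    return [list(flat[i * dim:(i + 1) * dim]) for i in range(rows)]


vectors = [[0.5, -0.25], [1.0, 0.0]]
blob = json.dumps({"embeddings": encode_vectors(vectors)})
restored = decode_vectors(json.loads(blob)["embeddings"])
print(restored == vectors)
```

The sample values are exactly representable in float32, so the comparison holds; arbitrary Python floats would come back rounded to float32 precision.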

Test Suite

The Tests tab is always accessible. Press Run all tests to execute the 12-test suite against a fixed set of Markdown inputs and expected outputs. Results are shown as a table with pass/fail status and a diff on failure.

Part III — Developer Reference

Repository Layout

chunky/
├── chunker.py        # Pure parsing and chunking logic. Zero GUI or vector-store dependencies.
├── graph.py          # Provenance KG builder. Depends only on networkx and chunker.
├── app.py            # Streamlit GUI. Orchestrates all other modules.
├── run.py            # One-click launcher. Installs deps and starts Streamlit.
├── requirements.txt  # Pinned dependency list.
├── sessions/         # Created at runtime. Stores .chunky session files.
└── tests/            # Created at runtime. Houses the test JSON fixture.

The layering rule is strict: chunker.py has no knowledge of Streamlit, ChromaDB, or sentence-transformers. graph.py has no knowledge of Streamlit or ChromaDB. app.py is the only layer that imports everything.

`chunker.py` — Public API

from pathlib import Path

from chunker import chunk_markdown, ChunkerConfig, PRESETS

# Read the document once; `text` is reused by every call below
text = Path("doc.md").read_text(encoding="utf-8")

# Basic usage (default config)
chunks, tree = chunk_markdown(text, "doc.md")

# With a custom config
config = ChunkerConfig(max_tokens=300, overlap_pct=0.20, emphasis_mode="both")
chunks, tree = chunk_markdown(text, "doc.md", config)

# Using a preset
chunks, tree = chunk_markdown(text, "doc.md", PRESETS["Precision"])

ChunkerConfig fields:

Field	Type	Default	Description
`max_tokens`	int	400	Hard ceiling on chunk size.
`min_chunk_tokens`	int	20	Chunks below this are discarded unless they are the only output from their section.
`splitting_unit`	str	`"paragraph_first"`	`"paragraph_first"` tries to keep whole paragraphs together; `"sentence_greedy"` splits to sentence level immediately.
`overlap_pct`	float	0.20	Fraction of the previous chunk's tokens to prepend as overlap. Overlap is cut at a clean sentence boundary.
`cross_section_overlap`	bool	False	Allow overlap to cross section boundaries.
`emphasis_mode`	str	`"both"`	`"inline"` rewrites an emphasized `term` as `term (important)`; `"suffix"` appends `Key terms: …`; `"both"` does both; `"strip"` removes the emphasis markers.
`include_breadcrumb`	bool	True	Prepend `Title > H1 > H2` line to each chunk's text.
`split_intro_separately`	bool	True	Emit intro text (paragraphs before child headings) as its own chunk with `is_section_intro=True`.
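
The four emphasis_mode values are easiest to compare side by side on one sentence. The sketch below is an illustration, not the chunker's implementation: it recognises only asterisk-style bold and italic markers, assumes that "suffix" drops the markers from the body, and assumes the Key terms line is joined with commas on a new line.

```python
import re

# Bold (**term**) first, then italic (*term*); underscore variants omitted.
EMPHASIS = re.compile(r"\*\*(.+?)\*\*|\*(.+?)\*")


def apply_emphasis(body: str, mode: str) -> str:
    """Apply one of the four emphasis modes to a chunk body."""
    if mode not in {"inline", "suffix", "both", "strip"}:
        raise ValueError(f"unknown emphasis_mode: {mode!r}")
    terms = [bold or italic for bold, italic in EMPHASIS.findall(body)]
    if mode in {"inline", "both"}:
        # Replace the markers with an explicit "(important)" tag.
        body = EMPHASIS.sub(
            lambda m: f"{m.group(1) or m.group(2)} (important)", body)
    else:
        # "strip" and (by assumption) "suffix" just drop the markers.
        body = EMPHASIS.sub(lambda m: m.group(1) or m.group(2), body)
    if mode in {"suffix", "both"} and terms:
        body += "\nKey terms: " + ", ".join(terms)
    return body


sentence = "Use **HNSW** for *fast* search."
for mode in ("inline", "suffix", "both", "strip"):
    print(f"{mode}: {apply_emphasis(sentence, mode)!r}")
```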

Chunk fields:

Field	Type	Description
`id`	str	SHA-256 hash of breadcrumb + raw body, truncated to 16 characters. Deterministic: the same input always yields the same ID.
`text`	str	Final text for the embedder: overlap prefix + breadcrumb + body + emphasis suffix.
`raw_text`	str	Body text before any emphasis transformation.
`breadcrumb`	list[str]	Heading path from document root to this section.
`level`	int	Heading depth (0 = document root, 1 = H1, …).
`is_section_intro`	bool	True when the node has children and this chunk is the intro text.
`sequence_in_section`	int	0-indexed position among chunks from the same section node.
`overlap_prefix`	str	Trailing text of the previous chunk prepended as context. Empty for the first chunk of a section.
`emphasized_terms`	list[str]	Bold and italic spans extracted from the raw body.
`long_sentence_split`	bool	True if a sentence had to be hard-cut at a word boundary because it exceeded `max_tokens` on its own.
`token_estimate`	int	Approximate token count of `.text` (word count × 1.3).
`char_count`	int	Character count of `.text`.
`source_node_id`	str	ID of the `SectionNode` this chunk was produced from.
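
A deterministic, content-derived ID of the kind described for the id field could be produced as follows. The exact key layout is an assumption: this sketch joins the breadcrumb with " > ", separates it from the body with a newline, and keeps the first 16 hex characters of the digest. The function name chunk_id is hypothetical.

```python
import hashlib


def chunk_id(breadcrumb: list[str], raw_text: str) -> str:
    """16 hex characters of SHA-256 over breadcrumb + raw body."""
    # Assumption: breadcrumb joined with " > ", then a newline, then the body.
    key = " > ".join(breadcrumb) + "\n" + raw_text
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]


first = chunk_id(["Guide", "Install"], "Run the launcher script.")
again = chunk_id(["Guide", "Install"], "Run the launcher script.")
moved = chunk_id(["Guide", "Usage"], "Run the launcher script.")
print(first, first == again, first == moved)
```

Including the breadcrumb in the hash means identical paragraphs under different headings receive different IDs, while re-running the chunker on an unchanged document reproduces the same IDs.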

`graph.py` — Public API

from graph import build_graph, graph_summary, expand_with_parent_context

# Build from chunker output
G = build_graph(tree, chunks)

# Summarise
print(graph_summary(G))
# {'total_nodes': 42, 'section_nodes': 10, 'chunk_nodes': 32,
#  'total_edges': 67, 'edge_types': {'contains_section': 9, ...}}

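# Expand retrieved chunk IDs with parent context.
# top_k_ids is the list of chunk IDs returned by retrieval; chunks_by_id is
# assumed here to be a dict mapping Chunk.id to Chunk.
chunks_by_id = {c.id: c for c in chunks}
enriched = expand_with_parent_context(G, top_k_ids, chunks_by_id)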

Graph serialisation to JSON and back:

from graph import graph_to_dict, graph_from_dict
d = graph_to_dict(G)           # plain dict — safe to json.dumps()
G2 = graph_from_dict(d)        # reconstructs the DiGraph

Embedding Backends

The GUI supports two backends selectable at the embed stage. Both produce normalised float32 vectors stored in st.session_state.embeddings as a NumPy array of shape (n_chunks, dim).

TF-IDF + LSA (embed_backend = "tfidf"): bigram TF-IDF vectoriser followed by TruncatedSVD to 128 components, L2-normalised. The fitted TfidfVectorizer and TruncatedSVD objects are kept in ss.vectorizer and ss.svd so that query vectors are transformed with the same vocabulary and projection as the chunk vectors.
SentenceTransformer (embed_backend = "sbert"): uses BAAI/bge-base-en-v1.5 via sentence_transformers. Produces 768-dimensional vectors.
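
Normalising the vectors is what lets similarity search use a plain dot product: for unit-length vectors the dot product equals the cosine similarity. A small pure-Python check of that identity, with made-up two-dimensional vectors and hypothetical helper names:

```python
import math


def l2_normalise(vector: list[float]) -> list[float]:
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity computed directly from the raw vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


chunk_vec = [3.0, 4.0]
query_vec = [4.0, 3.0]
fast = dot(l2_normalise(chunk_vec), l2_normalise(query_vec))
print(math.isclose(fast, cosine(chunk_vec, query_vec)))
```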

To swap in a different SentenceTransformer model, change the model string in embed_texts_sbert() and embed_query() in app.py.

ChromaDB Integration

The app builds an in-memory ChromaDB collection at query time from the stored embeddings and metadata. For a persistent HNSW index suitable for production use, the Export HNSW option in the embed stage calls _build_persistent_hnsw(), which writes a chromadb.PersistentClient database to a local directory.

Test Suite Architecture

Tests live in app.py under _run_tests(). Each test is a (name, markdown_input, config, assertion_fn) tuple. The assertion function receives the full list[Chunk] and returns (passed: bool, message: str).

The 12 built-in tests cover: flat documents with no headings, single-heading documents, three-level nesting, intro text before child headings, paragraph-level splitting with overlap, sentence-level fallback, force-splits of overlong sentences, emphasis extraction, empty headings, code block opacity, overlap correctness, and a full mixed-document smoke test.

To add a test, append a new entry to the TESTS list in _run_tests() following the existing pattern.
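
The (name, markdown_input, config, assertion_fn) pattern can be tried outside the app with a toy setup. Everything below is a stand-in: FakeChunk, fake_chunker, and run_tests are hypothetical names, the config is a plain dict, and the toy chunker simply emits one chunk per paragraph.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FakeChunk:
    """Stand-in for the real Chunk; only the fields the assertion touches."""
    text: str
    token_estimate: int


def fake_chunker(markdown: str, config: dict) -> list[FakeChunk]:
    """Toy chunker: one chunk per blank-line-separated paragraph."""
    parts = [p.strip() for p in markdown.split("\n\n") if p.strip()]
    return [FakeChunk(p, int(len(p.split()) * 1.3)) for p in parts]


def no_chunk_over_limit(chunks: list[FakeChunk]) -> tuple[bool, str]:
    """Assertion function: receives the chunk list, returns (passed, message)."""
    worst = max((c.token_estimate for c in chunks), default=0)
    return worst <= 400, f"largest chunk has ~{worst} tokens"


# Each entry follows the (name, markdown_input, config, assertion_fn) shape.
TESTS = [
    ("flat document", "First paragraph.\n\nSecond paragraph.",
     {"max_tokens": 400}, no_chunk_over_limit),
]


def run_tests(tests, chunker: Callable) -> list[tuple[str, bool, str]]:
    """Run every test tuple and collect (name, passed, message) rows."""
    results = []
    for name, markdown_input, config, assertion_fn in tests:
        passed, message = assertion_fn(chunker(markdown_input, config))
        results.append((name, passed, message))
    return results


for row in run_tests(TESTS, fake_chunker):
    print(row)
```

Since the assertion function sees only the chunk list, any limit it checks has to be written into the function itself or captured in a closure.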

Contribution Guidelines

Keep chunker.py and graph.py free of any GUI, Streamlit, or vector-store imports. Both files must be importable in a plain Python environment with only their declared dependencies (markdown-it-py, networkx).
All new ChunkerConfig fields must have a docstring-style inline comment and a corresponding entry in the Advanced config tab in app.py.
New tests should be added to the TESTS list alongside existing ones — do not replace existing tests.
D3 visualisations live in dedicated _d3_* helper functions that return raw HTML strings injected via st.components.v1.html. Keep D3 code self-contained within those helpers.
Session serialisation (session_to_bytes / _trigger_session_load) must be updated whenever new persistent state fields are added.
