A pipeline tool that turns Markdown documents into embedding-ready chunks, builds a provenance knowledge graph, and lets you query your corpus through a Streamlit dashboard. It can be used both through the GUI and as a standalone Python library.
Everything is handled by a single launcher script. Clone the repository, then:

    python run.py

That is the entire start-up sequence. `run.py` will:

- Read `requirements.txt` and install all missing dependencies via `pip`.
- Create the `sessions/` and `tests/` folders if they do not already exist.
- Start the Streamlit server on `http://localhost:8501` and open a browser tab automatically.
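
The three start-up steps can be sketched roughly as follows. This is an illustrative outline, not the actual contents of `run.py`: the helper names `ensure_folders` and `plan_commands` are hypothetical, and it assumes that the GUI entry point is `app.py` and that Streamlit is started through `python -m streamlit run` with its `--server.port` option. The commands are only built, not executed.

```python
import sys
from pathlib import Path


def ensure_folders(root: Path) -> None:
    """Create the runtime folders if they are missing (hypothetical helper)."""
    for name in ("sessions", "tests"):
        (root / name).mkdir(parents=True, exist_ok=True)


def plan_commands(root: Path, port: int = 8501, install: bool = True) -> list[list[str]]:
    """Return the commands a launcher might run, in order, without executing them."""
    commands = []
    if install:
        # Assumption: dependencies are installed with the interpreter's own pip.
        commands.append(
            [sys.executable, "-m", "pip", "install", "-r", str(root / "requirements.txt")]
        )
    # Assumption: app.py is the Streamlit entry point.
    commands.append(
        [sys.executable, "-m", "streamlit", "run", str(root / "app.py"),
         "--server.port", str(port)]
    )
    return commands
```

A launcher built this way would call `ensure_folders(root)` and then pass each planned command to `subprocess.run`.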
On repeat runs you can skip the install step for a faster launch:

    python run.py --no-install

To use a different port (for example when 8501 is already taken):

    python run.py --port 8502

The sidebar always shows your current progress through six stages and lets you save or load a full session at any time. The main area changes to show only the controls that are relevant to the active stage. Every transition to the next stage requires an explicit button press; adjusting sliders and inputs does not trigger any computation on its own.
Drag a Markdown file onto the uploader (.md only). The app immediately parses the heading structure and shows a collapsible document preview with the detected heading tree. This parse is instantaneous and requires no configuration.
Press Proceed to Configure when you are satisfied that the document was read correctly.
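
To make the heading-tree parse concrete, here is a minimal sketch that extracts an outline from Markdown text. It is not the chunker's own parser: it uses a regular expression instead of a full Markdown parser, handles only ATX (`#`) headings, and the function name `heading_outline` is made up for illustration.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.+?)\s*$")
# Fence markers are built programmatically so this example contains no literal fences.
FENCE_MARKERS = ("`" * 3, "~" * 3)


def heading_outline(markdown: str) -> list[tuple[int, list[str]]]:
    """Return (level, breadcrumb) for each ATX heading, skipping fenced code."""
    outline = []
    levels: list[int] = []  # levels of the currently open headings
    path: list[str] = []    # titles of the currently open headings
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE_MARKERS):
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        match = HEADING.match(line)
        if not match:
            continue
        level, title = len(match.group(1)), match.group(2)
        # Close every open heading at the same or a deeper level.
        while levels and levels[-1] >= level:
            levels.pop()
            path.pop()
        levels.append(level)
        path.append(title)
        outline.append((level, list(path)))
    return outline
```

Each breadcrumb is the path from the top-level heading down to the current one, which is the same shape as the tree shown in the document preview.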
Choose a preset to set all knobs at once:
| Preset | max_tokens | overlap | splitting strategy | emphasis |
|---|---|---|---|---|
| Precision | 200 | 25 % | sentence greedy | both |
| Balanced | 400 | 20 % | paragraph first | both |
| Speed | 512 | 10 % | paragraph first | strip |
The Simple tab exposes the two most impactful knobs — max_tokens and overlap_pct — as sliders with live descriptions. The Advanced tab reveals every option including splitting_unit, emphasis_mode, include_breadcrumb, split_intro_separately, min_chunk_tokens, and cross_section_overlap. Each control has a one-sentence explanation shown inline.
Press Run Chunker to execute the pipeline.
A summary row shows total chunk count, mean / min / max token estimates, and the number of chunks that were force-split at sentence boundaries. Below it, a D3 histogram of token distribution is drawn — a red vertical line marks your max_tokens ceiling.
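
The numbers in the summary row can be derived from the chunk list alone. The sketch below assumes that the force-split count corresponds to the `long_sentence_split` flag on each chunk; `ChunkStub` and `summarise` are hypothetical names used only for this example.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class ChunkStub:
    """Stand-in carrying only the two Chunk fields the summary needs."""
    token_estimate: int
    long_sentence_split: bool = False


def summarise(chunks):
    """Compute count, mean/min/max token estimates, and the force-split count."""
    tokens = [c.token_estimate for c in chunks]
    return {
        "count": len(tokens),
        "mean": mean(tokens) if tokens else 0,
        "min": min(tokens, default=0),
        "max": max(tokens, default=0),
        # Assumption: a force-split chunk is one flagged long_sentence_split.
        "force_split": sum(c.long_sentence_split for c in chunks),
    }
```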
The full chunk table is sortable and filterable by heading level and is_section_intro. Selecting any row opens a detail panel that shows every field of the Chunk dataclass. A breadcrumb tree view mirrors the document's heading structure and lets you navigate by section.
Press Proceed to Embed to continue or Re-run with new config to change settings.
Choose an embedding backend:
- TF-IDF + LSA (default): fully local, no downloads, runs in seconds. Produces 128-dimensional vectors using bigram TF-IDF followed by Truncated SVD. Suitable for development, testing, and smaller corpora.
- SentenceTransformer (BGE): downloads `BAAI/bge-base-en-v1.5` on first run (~440 MB). Produces 768-dimensional vectors with substantially better semantic fidelity. Recommended for production.
A live log stream shows progress. After embedding, a UMAP 2D scatter plot is rendered via D3 — each point is a chunk, coloured by heading level, and hovering shows the breadcrumb and first 80 characters of the chunk text.
Press Proceed to Build KG to continue.
The provenance graph is built from the parsed section tree and the chunk list. It is visualised as a D3 force-directed graph inside an iframe, with a toolbar that lets you toggle node labels, hide sections or chunks, freeze the layout, and reset the view. Clicking any node shows its full attributes in a tooltip.
Edge types and their colours:
| Edge | Colour | Meaning |
|---|---|---|
| `contains_section` | blue | parent section → child section |
| `has_chunk` | green | section node → its chunks |
| `next_in_section` | amber | chunk → next chunk from the same section |
| `parent_context` | red (dashed) | chunk → first chunk of its parent section |
The parent_context edge is what makes context-expansion during retrieval possible.
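
As an illustration of how the four edge types relate to each other, the sketch below derives them from a tiny section tree using plain tuples. It is not the library's graph builder (which produces a networkx graph); the input shapes `parent_of` and `chunks_in` and the function name `build_edges` are assumptions made for this example.

```python
def build_edges(parent_of, chunks_in):
    """Derive typed edges from a section tree.

    parent_of: section id -> parent section id (None for the root)
    chunks_in: section id -> ordered list of chunk ids
    Returns a list of (source, target, edge_type) tuples.
    """
    edges = []
    for section, parent in parent_of.items():
        if parent is not None:
            edges.append((parent, section, "contains_section"))
        ids = chunks_in.get(section, [])
        for chunk_id in ids:
            edges.append((section, chunk_id, "has_chunk"))
        # Link each chunk to the one that follows it in the same section.
        for current, following in zip(ids, ids[1:]):
            edges.append((current, following, "next_in_section"))
        # Link each chunk to the first chunk of the parent section, if any.
        parent_chunks = chunks_in.get(parent, []) if parent is not None else []
        if parent_chunks:
            for chunk_id in ids:
                edges.append((chunk_id, parent_chunks[0], "parent_context"))
    return edges
```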
Press Proceed to Query to continue.
Type a natural language query, set the top-k slider (1–20), and optionally enable Include parent context to follow parent_context edges and automatically add parent intro chunks alongside the top-k hits. Press Run Query to search the ChromaDB in-memory collection.
Results are displayed as ranked chunk cards, each showing the similarity score, breadcrumb path, token count, full text, and any emphasized_terms that were extracted.
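
The effect of Include parent context can be sketched as a small post-processing step on the ranked hit list. This is a simplified illustration rather than the library's own `expand_with_parent_context`: it assumes the `parent_context` edges have already been flattened into a dictionary, and `expand_hits` is a hypothetical name.

```python
def expand_hits(top_k_ids, parent_context):
    """Append each hit's parent intro chunk once, keeping the ranked hits first.

    parent_context maps a chunk id to the id of the first chunk of its
    parent section (the target of its parent_context edge).
    """
    result = list(top_k_ids)
    seen = set(result)
    for chunk_id in top_k_ids:
        parent_id = parent_context.get(chunk_id)
        if parent_id is not None and parent_id not in seen:
            result.append(parent_id)
            seen.add(parent_id)
    return result
```

Parents that are already among the hits are not duplicated, so the ranking of the original top-k results is preserved.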
Use Save session in the sidebar to download a .chunky file containing your document, configuration, chunks, embeddings (base64-encoded), and the knowledge graph. Load session restores the full state including the ChromaDB collection — no recomputation is needed. Save config only saves a lightweight JSON suitable for sharing settings without any document data.
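
Base64-encoding is what lets binary embedding data travel inside a text-based session file. The sketch below shows one way such a round trip could work using only the standard library. The actual layout of a `.chunky` file is not documented here, so the field names `shape`, `dtype`, and `data` and both function names are assumptions.

```python
import base64
import json
from array import array


def encode_embeddings(rows):
    """Pack a list of equal-length float rows into a JSON-safe dict."""
    # array("f") stores C floats (32-bit on common platforms) in native byte order.
    flat = array("f", [value for row in rows for value in row])
    return {
        "shape": [len(rows), len(rows[0]) if rows else 0],
        "dtype": "float32",
        "data": base64.b64encode(flat.tobytes()).decode("ascii"),
    }


def decode_embeddings(payload):
    """Inverse of encode_embeddings: rebuild the list of rows."""
    flat = array("f")
    flat.frombytes(base64.b64decode(payload["data"]))
    n_rows, dim = payload["shape"]
    return [list(flat[i * dim:(i + 1) * dim]) for i in range(n_rows)]


# The encoded payload survives a JSON round trip unchanged.
restored = decode_embeddings(json.loads(json.dumps(encode_embeddings([[0.5, -0.25]]))))
```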
The Tests tab is always accessible. Press Run all tests to execute the 12-test suite against a fixed set of Markdown inputs and expected outputs. Results are shown as a table with pass/fail status and a diff on failure.

    chunky/
    ├── chunker.py        # Pure parsing and chunking logic. Zero GUI or vector-store dependencies.
    ├── graph.py          # Provenance KG builder. Depends only on networkx and chunker.
    ├── app.py            # Streamlit GUI. Orchestrates all other modules.
    ├── run.py            # One-click launcher. Installs deps and starts Streamlit.
    ├── requirements.txt  # Pinned dependency list.
    ├── sessions/         # Created at runtime. Stores .chunky session files.
    └── tests/            # Created at runtime. Houses the test JSON fixture.

The layering rule is strict: chunker.py has no knowledge of Streamlit, ChromaDB, or sentence-transformers. graph.py has no knowledge of Streamlit or ChromaDB. app.py is the only layer that imports everything.

    from pathlib import Path

    from chunker import chunk_markdown, ChunkerConfig, PRESETS

    # Read the source document once and reuse the text below
    text = Path("doc.md").read_text(encoding="utf-8")

    # Basic usage
    chunks, tree = chunk_markdown(text, "doc.md")

    # With a custom config
    config = ChunkerConfig(max_tokens=300, overlap_pct=0.20, emphasis_mode="both")
    chunks, tree = chunk_markdown(text, "doc.md", config)

    # Using a preset
    chunks, tree = chunk_markdown(text, "doc.md", PRESETS["Precision"])

`ChunkerConfig` fields:
| Field | Type | Default | Description |
|---|---|---|---|
| `max_tokens` | int | 400 | Hard ceiling on chunk size. |
| `min_chunk_tokens` | int | 20 | Chunks below this are discarded unless they are the only output from their section. |
| `splitting_unit` | str | `"paragraph_first"` | `"paragraph_first"` tries to keep whole paragraphs together; `"sentence_greedy"` splits to sentence level immediately. |
| `overlap_pct` | float | 0.20 | Fraction of the previous chunk's tokens to prepend as overlap. Overlap is cut at a clean sentence boundary. |
| `cross_section_overlap` | bool | False | Allow overlap to cross section boundaries. |
| `emphasis_mode` | str | `"both"` | `"inline"` transforms `**term**` → `term (important)`; `"suffix"` appends `Key terms: …`; `"both"` does both; `"strip"` removes markers. |
| `include_breadcrumb` | bool | True | Prepend `Title > H1 > H2` line to each chunk's text. |
| `split_intro_separately` | bool | True | Emit intro text (paragraphs before child headings) as its own chunk with `is_section_intro=True`. |
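
To illustrate what `overlap_pct` controls, the sketch below picks an overlap prefix from the tail of the previous chunk without cutting through a sentence. It is one possible reading of the description, not the chunker's code: the regex sentence splitter, the words × 1.3 token heuristic, and the function names are assumptions.

```python
import re


def estimate_tokens(text: str) -> int:
    """Rough token count: word count times 1.3, truncated."""
    return int(len(text.split()) * 1.3)


def overlap_prefix(previous_text: str, overlap_pct: float) -> str:
    """Return trailing whole sentences of previous_text within the overlap budget."""
    budget = int(estimate_tokens(previous_text) * overlap_pct)
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", previous_text.strip())
    picked = []
    used = 0
    # Walk backwards so the sentences closest to the next chunk are kept first.
    for sentence in reversed(sentences):
        cost = estimate_tokens(sentence)
        if used + cost > budget:
            break
        picked.append(sentence)
        used += cost
    return " ".join(reversed(picked))
```

Because only whole sentences are taken, the actual overlap is usually somewhat smaller than the requested fraction, and it is empty when even the last sentence exceeds the budget.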
Chunk fields:
| Field | Type | Description |
|---|---|---|
| `id` | str | 16-character SHA-256 of breadcrumb + raw body. Deterministic. |
| `text` | str | Final text for the embedder: overlap prefix + breadcrumb + body + emphasis suffix. |
| `raw_text` | str | Body text before any emphasis transformation. |
| `breadcrumb` | list[str] | Heading path from document root to this section. |
| `level` | int | Heading depth (0 = document root, 1 = H1, …). |
| `is_section_intro` | bool | True when the node has children and this chunk is the intro text. |
| `sequence_in_section` | int | 0-indexed position among chunks from the same section node. |
| `overlap_prefix` | str | Trailing text of the previous chunk prepended as context. Empty for the first chunk of a section. |
| `emphasized_terms` | list[str] | Bold and italic spans extracted from the raw body. |
| `long_sentence_split` | bool | True if a sentence had to be hard-cut at a word boundary because it exceeded `max_tokens` on its own. |
| `token_estimate` | int | Approximate token count of `.text` (word count × 1.3). |
| `char_count` | int | Character count of `.text`. |
| `source_node_id` | str | ID of the `SectionNode` this chunk was produced from. |
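
A deterministic ID of the kind described for `id` could be computed as in the sketch below. The exact recipe is an assumption: the table does not say how the breadcrumb and body are joined or which encoding of the digest is truncated, so this example joins with `" > "` and a newline and keeps the first 16 hexadecimal characters.

```python
import hashlib


def chunk_id(breadcrumb: list[str], raw_text: str) -> str:
    """Derive a stable 16-character ID from the heading path and the raw body."""
    # Assumption: breadcrumb parts are joined with " > ", then a newline, then the body.
    payload = " > ".join(breadcrumb) + "\n" + raw_text
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
```

Since the ID depends only on content, re-chunking an unchanged document yields the same IDs, while any edit to the body or heading path produces a new one.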

    from graph import build_graph, graph_summary, expand_with_parent_context

    # Build from chunker output
    G = build_graph(tree, chunks)

    # Summarise
    print(graph_summary(G))
    # {'total_nodes': 42, 'section_nodes': 10, 'chunk_nodes': 32,
    #  'total_edges': 67, 'edge_types': {'contains_section': 9, ...}}

    # Expand retrieved chunk IDs with parent context
    enriched = expand_with_parent_context(G, top_k_ids, chunks_by_id)

Graph serialisation to JSON and back:

    from graph import graph_to_dict, graph_from_dict

    d = graph_to_dict(G)     # plain dict, safe to json.dumps()
    G2 = graph_from_dict(d)  # reconstructs the DiGraph

The GUI supports two backends selectable at the embed stage. Both produce normalised float32 vectors stored in `st.session_state.embeddings` as a NumPy array of shape `(n_chunks, dim)`.
- TF-IDF + LSA (`embed_backend = "tfidf"`): bigram TF-IDF vectoriser followed by `TruncatedSVD` to 128 components, L2-normalised. The fitted `TfidfVectorizer` and `TruncatedSVD` objects are stored in `ss.vectorizer` and `ss.svd` so that query vectors are produced consistently.
- SentenceTransformer (`embed_backend = "sbert"`): uses `BAAI/bge-base-en-v1.5` via `sentence_transformers`. Produces 768-dimensional vectors.
To swap in a different SentenceTransformer model, change the model string in embed_texts_sbert() and embed_query() in app.py.
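
Because both backends emit L2-normalised vectors, cosine similarity between a query and every chunk reduces to a dot product. The sketch below shows that idea with plain NumPy; it is a stand-in for illustration and not how the app performs search, and the names `l2_normalise` and `top_k` are hypothetical.

```python
import numpy as np


def l2_normalise(matrix):
    """Return a float32 copy of matrix with every row scaled to unit length."""
    matrix = np.asarray(matrix, dtype=np.float32)
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    # Guard against division by zero for all-zero rows.
    return matrix / np.maximum(norms, 1e-12)


def top_k(query_vec, embeddings, k):
    """Rank rows of embeddings (n_chunks, dim) by cosine similarity to query_vec."""
    scores = embeddings @ query_vec  # dot product equals cosine for unit vectors
    order = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in order]
```

The query vector must come from the same backend, and for TF-IDF + LSA from the same fitted objects, as the chunk vectors; otherwise the two live in different vector spaces and the scores are meaningless.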
The app builds an in-memory ChromaDB collection at query time from the stored embeddings and metadata. For a persistent HNSW index suitable for production use, the Export HNSW option in the embed stage calls _build_persistent_hnsw(), which writes a chromadb.PersistentClient database to a local directory.
Tests live in app.py under _run_tests(). Each test is a (name, markdown_input, config, assertion_fn) tuple. The assertion function receives the full list[Chunk] and returns (passed: bool, message: str).
The 12 built-in tests cover: flat documents with no headings, single-heading documents, three-level nesting, intro text before child headings, paragraph-level splitting with overlap, sentence-level fallback, force-splits of overlong sentences, emphasis extraction, empty headings, code block opacity, overlap correctness, and a full mixed-document smoke test.
To add a test, append a new entry to the TESTS list in _run_tests() following the existing pattern.
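
A new entry might look like the sketch below. The tuple layout and the `(passed, message)` return value follow the description above; everything else is illustrative: `ChunkStub` stands in for the real `Chunk` class, the assertion and its 50-token ceiling are invented, and the config slot is left as `None` where a real entry would pass a `ChunkerConfig`.

```python
from dataclasses import dataclass


@dataclass
class ChunkStub:
    """Stand-in for Chunk with only the fields this assertion reads."""
    text: str
    token_estimate: int


def assert_under_ceiling(chunks, ceiling=50):
    """Assertion function: return (passed, message) for a list of chunks."""
    too_big = [c for c in chunks if c.token_estimate > ceiling]
    if too_big:
        return False, f"{len(too_big)} chunk(s) exceed {ceiling} tokens"
    return True, f"all {len(chunks)} chunk(s) within {ceiling} tokens"


# (name, markdown_input, config, assertion_fn)
NEW_TEST = (
    "respects token ceiling",
    "# Title\n\nSome body text.",
    None,  # placeholder; a real entry would carry a ChunkerConfig here
    assert_under_ceiling,
)
```

Returning a descriptive message on both success and failure keeps the results table readable without opening the diff.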
- Keep `chunker.py` and `graph.py` free of any GUI, Streamlit, or vector-store imports. Both files must be importable in a plain Python environment with only their declared dependencies (`markdown-it-py`, `networkx`).
- All new `ChunkerConfig` fields must have a docstring-style inline comment and a corresponding entry in the Advanced config tab in `app.py`.
- New tests should be added to the `TESTS` list alongside existing ones; do not replace existing tests.
- D3 visualisations live in dedicated `_d3_*` helper functions that return raw HTML strings injected via `st.components.v1.html`. Keep D3 code self-contained within those helpers.
- Session serialisation (`session_to_bytes` / `_trigger_session_load`) must be updated whenever new persistent state fields are added.