effeemmeelle/Chunkdown

Chunky — Markdown-Aware RAG Chunker

A pipeline tool that turns Markdown documents into embedding-ready chunks, builds a provenance knowledge graph, and lets you query your corpus through a Streamlit dashboard. It can be used both through the GUI and as a standalone Python library.


Part I — User Guide

Running the App

Everything is handled by a single launcher script. Clone the repository, then:

python run.py

That is the entire start-up sequence. run.py will:

  1. Read requirements.txt and install all missing dependencies via pip.
  2. Create the sessions/ and tests/ folders if they do not already exist.
  3. Start the Streamlit server on http://localhost:8501 and open a browser tab automatically.

On repeat runs you can skip the install step for a faster launch:

python run.py --no-install

To use a different port (for example when 8501 is already taken):

python run.py --port 8502

What you will see

The sidebar always shows your current progress through six stages and lets you save or load a full session at any time. The main area changes to show only the controls that are relevant to the active stage. Every transition to the next stage requires an explicit button press — adjusting sliders and inputs does not trigger any computation on its own.


Part II — Pipeline Walkthrough

Stage 1 — Upload

Drag a Markdown file onto the uploader (.md only). The app immediately parses the heading structure and shows a collapsible document preview with the detected heading tree. This parse is instantaneous and requires no configuration.

Press Proceed to Configure when you are satisfied that the document was read correctly.

Stage 2 — Configure

Choose a preset to set all knobs at once:

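| Preset    | max_tokens | overlap | splitting strategy | emphasis |
|-----------|------------|---------|--------------------|----------|
| Precision | 200        | 25 %    | sentence greedy    | both     |
| Balanced  | 400        | 20 %    | paragraph first    | both     |
| Speed     | 512        | 10 %    | paragraph first    | strip    |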

The Simple tab exposes the two most impactful knobs — max_tokens and overlap_pct — as sliders with live descriptions. The Advanced tab reveals every option including splitting_unit, emphasis_mode, include_breadcrumb, split_intro_separately, min_chunk_tokens, and cross_section_overlap. Each control has a one-sentence explanation shown inline.
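
To make the effect of overlap_pct concrete, here is a minimal sketch of how an overlap prefix could be taken from the end of the previous chunk and cut at a sentence boundary. It is an illustration only, not the code in chunker.py: the function names, the naive regex sentence splitter, and the rule of keeping whole trailing sentences that fit within the budget are all assumptions.

```python
import re


def estimate_tokens(text):
    """Rough token estimate: word count x 1.3 (truncated to an int)."""
    return int(len(text.split()) * 1.3)


def overlap_prefix(previous_chunk, overlap_pct):
    """Return trailing whole sentences of previous_chunk within the overlap budget."""
    budget = estimate_tokens(previous_chunk) * overlap_pct
    # Naive splitter: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", previous_chunk.strip())
    kept = []
    used = 0
    # Walk backwards so the prefix comes from the end of the previous chunk.
    for sentence in reversed(sentences):
        cost = estimate_tokens(sentence)
        if used + cost > budget:
            break
        kept.append(sentence)
        used += cost
    # Restore reading order before joining.
    return " ".join(reversed(kept))
```

With this rule, a larger overlap_pct carries more trailing sentences into the next chunk, and a value of 0 yields an empty prefix.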

Press Run Chunker to execute the pipeline.

Stage 3 — Chunk Review

A summary row shows total chunk count, mean / min / max token estimates, and the number of chunks that were force-split at sentence boundaries. Below it, a D3 histogram of token distribution is drawn — a red vertical line marks your max_tokens ceiling.
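
The figures in the summary row can be reproduced from the chunk list with a few lines of Python. The sketch below uses plain dicts as stand-ins for Chunk objects and assumes that the force-split count corresponds to the long_sentence_split flag; the summarize function itself is hypothetical.

```python
from statistics import mean


def summarize(chunks):
    """Compute summary-row figures from a non-empty list of chunk-like dicts."""
    estimates = [c["token_estimate"] for c in chunks]
    return {
        "count": len(chunks),
        "mean": mean(estimates),
        "min": min(estimates),
        "max": max(estimates),
        # Assumption: force-split chunks are those flagged long_sentence_split.
        "force_split": sum(1 for c in chunks if c["long_sentence_split"]),
    }
```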

The full chunk table is sortable and filterable by heading level and is_section_intro. Selecting any row opens a detail panel that shows every field of the Chunk dataclass. A breadcrumb tree view mirrors the document's heading structure and lets you navigate by section.

Press Proceed to Embed to continue or Re-run with new config to change settings.

Stage 4 — Embed

Choose an embedding backend:

  • TF-IDF + LSA (default) — fully local, no downloads, runs in seconds. Produces 128-dimensional vectors using bigram TF-IDF followed by Truncated SVD. Suitable for development, testing, and smaller corpora.
  • SentenceTransformer (BGE) — downloads BAAI/bge-base-en-v1.5 on first run (~440 MB). Produces 768-dimensional vectors with substantially better semantic fidelity. Recommended for production.

A live log stream shows progress. After embedding, a UMAP 2D scatter plot is rendered via D3 — each point is a chunk, coloured by heading level, and hovering shows the breadcrumb and first 80 characters of the chunk text.

Press Proceed to Build KG to continue.

Stage 5 — Knowledge Graph

The provenance graph is built from the parsed section tree and the chunk list. It is visualised as a D3 force-directed graph inside an iframe, with a toolbar that lets you toggle node labels, hide sections or chunks, freeze the layout, and reset the view. Clicking any node shows its full attributes in a tooltip.

Edge types and their colours:

| Edge             | Colour       | Meaning                                 |
|------------------|--------------|-----------------------------------------|
| contains_section | blue         | parent section → child section          |
| has_chunk        | green        | section node → its chunks               |
| next_in_section  | amber        | chunk → next chunk from the same section |
| parent_context   | red (dashed) | chunk → first chunk of its parent section |

The parent_context edge is what makes context-expansion during retrieval possible.
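
As a rough illustration of context expansion, the sketch below follows parent_context edges for a list of retrieved chunk IDs. It uses a plain list of (source, target, edge_type) triples instead of the networkx graph that graph.py builds, and expand_with_parents is a hypothetical helper, not the library's expand_with_parent_context.

```python
def expand_with_parents(hit_ids, edges):
    """Add the target of each hit's parent_context edge, without duplicates.

    edges is a list of (source, target, edge_type) triples.
    """
    # Assumption: each chunk has at most one outgoing parent_context edge.
    parent_of = {src: dst for src, dst, kind in edges if kind == "parent_context"}
    expanded = []
    seen = set()
    for chunk_id in hit_ids:
        # Place the parent intro chunk directly before the hit it gives context to.
        for candidate in (parent_of.get(chunk_id), chunk_id):
            if candidate is not None and candidate not in seen:
                seen.add(candidate)
                expanded.append(candidate)
    return expanded
```

Placing the parent chunk before the hit is a design choice of this sketch; the real function may order or deduplicate results differently.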

Press Proceed to Query to continue.

Stage 6 — Query

Type a natural language query, set the top-k slider (1–20), and optionally enable Include parent context to follow parent_context edges and automatically add parent intro chunks alongside the top-k hits. Press Run Query to search the ChromaDB in-memory collection.
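
Conceptually, the top-k search ranks chunks by similarity between the query vector and each chunk vector. The brute-force sketch below shows that idea in plain Python; it is a stand-in for the ChromaDB lookup, not how the app queries the collection, and it assumes cosine similarity as the score. The function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunk_vecs, k):
    """Return (chunk_id, score) pairs for the k most similar chunks."""
    scored = [(cid, cosine(query_vec, vec)) for cid, vec in chunk_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```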

Results are displayed as ranked chunk cards, each showing the similarity score, breadcrumb path, token count, full text, and any emphasized_terms that were extracted.

Session Save and Load

Use Save session in the sidebar to download a .chunky file containing your document, configuration, chunks, embeddings (base64-encoded), and the knowledge graph. Load session restores the full state, including the ChromaDB collection, so no recomputation is needed. Save config saves only a lightweight JSON file, suitable for sharing settings without any document data.
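
Base64-encoding embeddings for a JSON-friendly session file can be sketched as follows. The actual .chunky layout is not documented here, so the payload keys ("shape", "data") and both function names are assumptions, and the sketch uses the standard-library array module where the app presumably works with NumPy arrays.

```python
import base64
import json
from array import array


def encode_embeddings(rows):
    """Pack equal-length float vectors as base64 float32 bytes plus their shape."""
    flat = array("f", [value for row in rows for value in row])
    return {
        "shape": [len(rows), len(rows[0]) if rows else 0],
        # Bytes are in the platform's native byte order.
        "data": base64.b64encode(flat.tobytes()).decode("ascii"),
    }


def decode_embeddings(payload):
    """Inverse of encode_embeddings."""
    flat = array("f")
    flat.frombytes(base64.b64decode(payload["data"]))
    n_rows, dim = payload["shape"]
    return [list(flat[i * dim:(i + 1) * dim]) for i in range(n_rows)]


# The encoded payload is plain JSON, so it can sit next to config and chunks.
serialized = json.dumps(encode_embeddings([[0.5, 0.25], [1.0, -2.0]]))
restored = decode_embeddings(json.loads(serialized))
```

Storing the shape alongside the bytes lets the loader rebuild the (n_chunks, dim) layout without guessing the embedding dimension.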

Test Suite

The Tests tab is always accessible. Press Run all tests to execute the 12-test suite against a fixed set of Markdown inputs and expected outputs. Results are shown as a table with pass/fail status and a diff on failure.


Part III — Developer Reference

Repository Layout

chunky/
├── chunker.py        # Pure parsing and chunking logic. Zero GUI or vector-store dependencies.
├── graph.py          # Provenance KG builder. Depends only on networkx and chunker.
├── app.py            # Streamlit GUI. Orchestrates all other modules.
├── run.py            # One-click launcher. Installs deps and starts Streamlit.
├── requirements.txt  # Pinned dependency list.
├── sessions/         # Created at runtime. Stores .chunky session files.
└── tests/            # Created at runtime. Houses the test JSON fixture.

The layering rule is strict: chunker.py has no knowledge of Streamlit, ChromaDB, or sentence-transformers. graph.py has no knowledge of Streamlit or ChromaDB. app.py is the only layer that imports everything.

chunker.py — Public API

from pathlib import Path

from chunker import chunk_markdown, ChunkerConfig, PRESETS

# Read the document once; text is reused in every call below
text = Path("doc.md").read_text(encoding="utf-8")

# Basic usage
chunks, tree = chunk_markdown(text, "doc.md")

# With a custom config
config = ChunkerConfig(max_tokens=300, overlap_pct=0.20, emphasis_mode="both")
chunks, tree = chunk_markdown(text, "doc.md", config)

# Using a preset
chunks, tree = chunk_markdown(text, "doc.md", PRESETS["Precision"])

ChunkerConfig fields:

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| max_tokens | int | 400 | Hard ceiling on chunk size. |
| min_chunk_tokens | int | 20 | Chunks below this are discarded unless they are the only output from their section. |
| splitting_unit | str | "paragraph_first" | "paragraph_first" tries to keep whole paragraphs together; "sentence_greedy" splits to sentence level immediately. |
| overlap_pct | float | 0.20 | Fraction of the previous chunk's tokens to prepend as overlap. Overlap is cut at a clean sentence boundary. |
| cross_section_overlap | bool | False | Allow overlap to cross section boundaries. |
| emphasis_mode | str | "both" | "inline" transforms **term** into term (important); "suffix" appends Key terms: …; "both" does both; "strip" removes markers. |
| include_breadcrumb | bool | True | Prepend a Title > H1 > H2 line to each chunk's text. |
| split_intro_separately | bool | True | Emit intro text (paragraphs before child headings) as its own chunk with is_section_intro=True. |
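
For readers who want to see the fields and presets side by side, here is a stand-in dataclass that mirrors the documented defaults and the three presets from the Configure stage. It is not the real ChunkerConfig: the names ConfigSketch and PRESETS_SKETCH are hypothetical, and the validation rules in __post_init__ are assumptions added for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ConfigSketch:
    """Stand-in for ChunkerConfig using the documented defaults."""
    max_tokens: int = 400
    min_chunk_tokens: int = 20
    splitting_unit: str = "paragraph_first"
    overlap_pct: float = 0.20
    cross_section_overlap: bool = False
    emphasis_mode: str = "both"
    include_breadcrumb: bool = True
    split_intro_separately: bool = True

    def __post_init__(self):
        # Assumed sanity checks; the real class may validate differently or not at all.
        if self.splitting_unit not in ("paragraph_first", "sentence_greedy"):
            raise ValueError(f"unknown splitting_unit: {self.splitting_unit!r}")
        if self.emphasis_mode not in ("inline", "suffix", "both", "strip"):
            raise ValueError(f"unknown emphasis_mode: {self.emphasis_mode!r}")
        if not 0.0 <= self.overlap_pct < 1.0:
            raise ValueError("overlap_pct must be in [0, 1)")


# Values taken from the preset table in the Configure stage.
PRESETS_SKETCH = {
    "Precision": ConfigSketch(max_tokens=200, overlap_pct=0.25,
                              splitting_unit="sentence_greedy"),
    "Balanced": ConfigSketch(max_tokens=400, overlap_pct=0.20),
    "Speed": ConfigSketch(max_tokens=512, overlap_pct=0.10, emphasis_mode="strip"),
}

# Derive a variant from a preset without mutating it.
custom = replace(PRESETS_SKETCH["Balanced"], max_tokens=300)
```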

Chunk fields:

| Field | Type | Description |
|-------|------|-------------|
| id | str | 16-character SHA-256 of breadcrumb + raw body. Deterministic. |
| text | str | Final text for the embedder: overlap prefix + breadcrumb + body + emphasis suffix. |
| raw_text | str | Body text before any emphasis transformation. |
| breadcrumb | list[str] | Heading path from document root to this section. |
| level | int | Heading depth (0 = document root, 1 = H1, …). |
| is_section_intro | bool | True when the node has children and this chunk is the intro text. |
| sequence_in_section | int | 0-indexed position among chunks from the same section node. |
| overlap_prefix | str | Trailing text of the previous chunk prepended as context. Empty for the first chunk of a section. |
| emphasized_terms | list[str] | Bold and italic spans extracted from the raw body. |
| long_sentence_split | bool | True if a sentence had to be hard-cut at a word boundary because it exceeded max_tokens on its own. |
| token_estimate | int | Approximate token count of .text (word count × 1.3). |
| char_count | int | Character count of .text. |
| source_node_id | str | ID of the SectionNode this chunk was produced from. |
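
The id and token_estimate fields can be illustrated with a short sketch. The exact recipe used by chunker.py is not spelled out above, so the separator used to join the breadcrumb, the choice of the first 16 hex characters of the digest, and the truncation to an int are assumptions; both function names are hypothetical.

```python
import hashlib


def chunk_id(breadcrumb, raw_text):
    """Deterministic 16-character ID derived from SHA-256 of breadcrumb + raw body."""
    # Assumed key layout: breadcrumb joined with " > ", newline, then the body.
    key = " > ".join(breadcrumb) + "\n" + raw_text
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]


def token_estimate(text):
    """Approximate token count: word count x 1.3, truncated to an int."""
    return int(len(text.split()) * 1.3)
```

Because the ID depends only on the breadcrumb and body, re-chunking an unchanged document with the same settings gives the same IDs, which is what makes them usable as stable keys in a vector store.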

graph.py — Public API

from graph import build_graph, graph_summary, expand_with_parent_context

# Build from chunker output
G = build_graph(tree, chunks)

# Summarise
print(graph_summary(G))
# {'total_nodes': 42, 'section_nodes': 10, 'chunk_nodes': 32,
#  'total_edges': 67, 'edge_types': {'contains_section': 9, ...}}

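# Expand retrieved chunk IDs with parent context.
# top_k_ids is the list of chunk IDs returned by retrieval;
# chunks_by_id is assumed here to map each Chunk.id to its Chunk.
chunks_by_id = {c.id: c for c in chunks}
enriched = expand_with_parent_context(G, top_k_ids, chunks_by_id)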

Graph serialisation to JSON and back:

from graph import graph_to_dict, graph_from_dict
d = graph_to_dict(G)           # plain dict — safe to json.dumps()
G2 = graph_from_dict(d)        # reconstructs the DiGraph

Embedding Backends

The GUI supports two backends selectable at the embed stage. Both produce normalised float32 vectors stored in st.session_state.embeddings as a NumPy array of shape (n_chunks, dim).

  • TF-IDF + LSA (embed_backend = "tfidf"): bigram TF-IDF vectoriser followed by TruncatedSVD to 128 components, L2-normalised. The fitted TfidfVectorizer and TruncatedSVD objects are stored in ss.vectorizer and ss.svd so that query vectors are produced consistently.
  • SentenceTransformer (embed_backend = "sbert"): uses BAAI/bge-base-en-v1.5 via sentence_transformers. Produces 768-dimensional vectors.
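
The TF-IDF + LSA backend can be approximated with scikit-learn as shown below. This is a sketch under assumptions, not the code in app.py: the function names are hypothetical, ngram_range=(1, 2) is one reading of "bigram TF-IDF", and the component count is capped for small corpora because TruncatedSVD cannot return more components than the matrix supports.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize


def fit_tfidf_lsa(texts, n_components=128):
    """Fit TF-IDF + TruncatedSVD and return the fitted objects and unit vectors."""
    vectorizer = TfidfVectorizer(ngram_range=(1, 2))
    tfidf = vectorizer.fit_transform(texts)
    # Small corpora get fewer than n_components dimensions.
    k = max(1, min(n_components, min(tfidf.shape) - 1))
    svd = TruncatedSVD(n_components=k, random_state=0)
    vectors = normalize(svd.fit_transform(tfidf)).astype("float32")
    return vectorizer, svd, vectors


def embed_query_tfidf(query, vectorizer, svd):
    """Project a query with the same fitted objects so it shares the chunk space."""
    vec = svd.transform(vectorizer.transform([query]))
    return normalize(vec).astype("float32")[0]
```

Reusing the fitted vectoriser and SVD for queries is the point made above: fitting new objects on the query alone would place it in a different vector space from the chunks.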

To swap in a different SentenceTransformer model, change the model string in embed_texts_sbert() and embed_query() in app.py.

ChromaDB Integration

The app builds an in-memory ChromaDB collection at query time from the stored embeddings and metadata. For a persistent HNSW index suitable for production use, the Export HNSW option in the embed stage calls _build_persistent_hnsw(), which writes a chromadb.PersistentClient database to a local directory.

Test Suite Architecture

Tests live in app.py under _run_tests(). Each test is a (name, markdown_input, config, assertion_fn) tuple. The assertion function receives the full list[Chunk] and returns (passed: bool, message: str).

The 12 built-in tests cover: flat documents with no headings, single-heading documents, three-level nesting, intro text before child headings, paragraph-level splitting with overlap, sentence-level fallback, force-splits of overlong sentences, emphasis extraction, empty headings, code block opacity, overlap correctness, and a full mixed-document smoke test.

To add a test, append a new entry to the TESTS list in _run_tests() following the existing pattern.
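
A new entry following the (name, markdown_input, config, assertion_fn) pattern might look like the sketch below. The test name, the Markdown input, and the assertion function are hypothetical, config is left as None as a placeholder for a ChunkerConfig, and SimpleNamespace objects stand in for real Chunk instances so the assertion can be tried in isolation.

```python
from types import SimpleNamespace


def assert_no_oversized(chunks, max_tokens=400):
    """Example assertion_fn: every chunk must respect the token ceiling."""
    too_big = [c.id for c in chunks if c.token_estimate > max_tokens]
    if too_big:
        return False, f"{len(too_big)} chunk(s) exceed {max_tokens} tokens: {too_big}"
    return True, "all chunks within max_tokens"


# Hypothetical TESTS entry: (name, markdown_input, config, assertion_fn).
NEW_TEST = (
    "respects max_tokens",
    "# Title\n\nSome body text.\n",
    None,  # placeholder; the real suite would pass a ChunkerConfig here
    assert_no_oversized,
)

# Try the assertion against stand-in chunks before wiring it into the suite.
fake_chunks = [
    SimpleNamespace(id="a1", token_estimate=120),
    SimpleNamespace(id="b2", token_estimate=380),
]
passed, message = NEW_TEST[3](fake_chunks)
```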

Contribution Guidelines

  • Keep chunker.py and graph.py free of any GUI, Streamlit, or vector-store imports. Both files must be importable in a plain Python environment with only their declared dependencies (markdown-it-py, networkx).
  • All new ChunkerConfig fields must have a docstring-style inline comment and a corresponding entry in the Advanced config tab in app.py.
  • New tests should be added to the TESTS list alongside existing ones — do not replace existing tests.
  • D3 visualisations live in dedicated _d3_* helper functions that return raw HTML strings injected via st.components.v1.html. Keep D3 code self-contained within those helpers.
  • Session serialisation (session_to_bytes / _trigger_session_load) must be updated whenever new persistent state fields are added.
