aaaronmiller · kilo-code-bot · Feb 15, 2026
diff --git a/scratch/01-embedding-model-stack.md b/scratch/01-embedding-model-stack.md
@@ -0,0 +1,187 @@
+# Topic: Embedding Model Stack
+
+## Summary
+Configuration, initialization, and integration of the Qwen3-VL multimodal embedding models and reranker for the RAG v3.0 system.
+
+---
+
+## Primary Model: Qwen3-VL-Embedding-8B
+
+> **Source:** [opus-prd1-v3.md](../opus-prd1-v3.md), [opus-prd2-v3.md](../opus-prd2-v3.md), [opus-prd3-v3.md](../opus-prd3-v3.md), [docs/UNIFIED_PRD.md](../docs/UNIFIED_PRD.md)
+
+- **Model ID:** `Qwen/Qwen3-VL-Embedding-8B`
+- **Released:** January 7-8, 2026 (arXiv:2601.04720)
+- **Parameters:** 8.14B
+- **Layers:** 36
+- **Architecture:** Dual-Tower (qwen3_vl)
+- **Context Length:** 32,768 tokens (default `max_length`: 8,192)
+- **Native Embedding Dimensions:** 4096
+- **MRL Support:** Yes — options: [256, 512, 1024, 2048, 4096]
+  - Storage dimension: 1024 (truncated for MemVid efficiency)
+  - Retrieval dimension: 2048 (higher precision for queries)
+- **Quantization:** bf16 (recommended), fp16, int8, int4
+- **Instruction-Aware:** Yes
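+
+A minimal sketch of MRL dimension truncation, assuming the usual Matryoshka practice of keeping the leading components and re-normalizing to unit length. `truncate_mrl` is a hypothetical helper, not part of the Qwen3-VL API; the dimension options come from the list above.
+
+```python
+import math
+
+MRL_DIMS = (256, 512, 1024, 2048, 4096)  # options listed above
+
+def truncate_mrl(vec, dim):
+    """Keep the first `dim` components and L2-normalize the result."""
+    if dim not in MRL_DIMS:
+        raise ValueError(f"unsupported MRL dimension: {dim}")
+    if dim > len(vec):
+        raise ValueError("cannot truncate to more dimensions than the vector has")
+    head = vec[:dim]
+    norm = math.sqrt(sum(x * x for x in head))
+    return [x / norm for x in head] if norm > 0 else list(head)
+
+full = [1.0] * 4096                # stand-in for a native 4096-d embedding
+stored = truncate_mrl(full, 1024)  # storage dimension from the list above
+```
+
+Two vectors have to be truncated to the same dimension before they can be compared, so a 2048-d query would need the same treatment as the stored vectors it is scored against.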
+
+### Benchmarks
+| Benchmark | Score |
+|-----------|-------|
+| MMEB-V2 | 77.8 (Rank #1) |
+| MMTEB | 67.88 |
+| Image Retrieval | 80.0 |
+| Video Retrieval | 67.1 |
+| VisDoc Retrieval | 82.4 |
+
+### Supported Input Modalities
+- Pure text
+- Pure image
+- Pure video
+- Text + image (mixed)
+- Text + video (mixed)
+- Image + video (mixed)
+- Text + image + video (mixed)
+- Screenshots (treated as images with OCR awareness)
+
+### Vision Configuration
+> **Source:** [opus-prd2-v3.md](../opus-prd2-v3.md)
+
+- `min_pixels`: 4096
+- `max_pixels`: 1,843,200 (1280×1440)
+- `total_video_pixels`: 7,864,320
+- `default_fps`: 1.0
+- `default_frames`: 64
+- `max_frames`: 64
+
+### Inference Configuration
+> **Source:** [opus-prd2-v3.md](../opus-prd2-v3.md)
+
+- `torch_dtype`: bfloat16
+- `attn_implementation`: flash_attention_2
+- `device_map`: auto
+
+### Architecture Details
+> **Source:** [opus-prd3-v3.md](../opus-prd3-v3.md)
+
+- Extracts `[EOS]` token hidden state from last layer as final representation
+- Cross-modal pretraining with unified modality projection
+- Integrates supervised tasks, masked modeling, and multimodal alignment objectives
+- Enables efficient independent encoding for large-scale retrieval
+
+---
+
+## Boundary Detection Model: Qwen3-Embedding-0.6B
+
+> **Source:** [opus-prd2-v3.md](../opus-prd2-v3.md), [docs/UNIFIED_PRD.md](../docs/UNIFIED_PRD.md)
+
+- **Model ID:** `Qwen/Qwen3-Embedding-0.6B`
+- **Type:** Text-only
+- **Parameters:** 595.8M
+- **Native Dimensions:** 1024
+- **Purpose:** Cheap/fast similarity detection for semantic chunking boundary detection
+
+---
+
+## Reranker: Qwen3-VL-Reranker-8B
+
+> **Source:** [opus-prd1-v3.md](../opus-prd1-v3.md), [opus-prd2-v3.md](../opus-prd2-v3.md)
+
+- **Model ID:** `Qwen/Qwen3-VL-Reranker-8B`
+- **Parameters:** 8.14B
+- **Layers:** 36
+- **Architecture:** Single-Tower with Cross-Attention
+- **Input:** (Query, Document) pairs — both can be mixed-modal
+- **Output:** Relevance score (via yes/no token generation probability)
+- **Supported Modalities:** text, image, video, mixed
+- **Inference:** bfloat16, flash_attention_2
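+
+One common way to turn yes/no token logits into a relevance score is a two-way softmax. The sketch below assumes the two logits have already been read from the model output; `relevance_score` and its arguments are hypothetical names, not the reranker's actual interface.
+
+```python
+import math
+
+def relevance_score(yes_logit: float, no_logit: float) -> float:
+    """Probability of "yes" when only the yes/no tokens are considered."""
+    m = max(yes_logit, no_logit)       # subtract the max for numerical stability
+    e_yes = math.exp(yes_logit - m)
+    e_no = math.exp(no_logit - m)
+    return e_yes / (e_yes + e_no)
+```
+
+Equal logits give a score of 0.5, and the score approaches 1.0 as the "yes" logit dominates.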
+
+### Smaller Variant: Qwen3-VL-Reranker-2B
+- **Parameters:** 2.13B
+- Same architecture (Single-Tower)
+
+---
+
+## Fallback Model: Qwen3-Embedding-8B (Text-Only)
+
+> **Source:** [opus-prd2-v3.md](../opus-prd2-v3.md)
+
+- **Model ID:** `Qwen/Qwen3-Embedding-8B`
+- **Type:** Text-only
+- **Parameters:** 7.57B
+- **Native Dimensions:** 4096
+- **MTEB Score:** 70.58 (Rank #1)
+- **Note:** Higher MTEB score than VL model (70.58 vs 67.88) but lacks multimodal capabilities
+
+---
+
+## Model Initialization Code
+
+> **Source:** [chatgpt5.2-prd.md](../chatgpt5.2-prd.md) (Phase 1), [opus-prd1-v3.md](../opus-prd1-v3.md)
+
+```python
+import torch
+from src.models.qwen3_vl_embedding import Qwen3VLEmbedder
+from src.models.qwen3_vl_reranker import Qwen3VLReranker
+
+# Primary Embedding Model
+embedder = Qwen3VLEmbedder(
+    model_name_or_path="Qwen/Qwen3-VL-Embedding-8B",
+    max_length=8192,
+    min_pixels=4096,
+    max_pixels=1843200,
+    total_pixels=7864320,
+    fps=1.0,
+    num_frames=64,
+    max_frames=64,
+    torch_dtype=torch.bfloat16,
+    attn_implementation="flash_attention_2"
+)
+
+# Precision Reranker
+reranker = Qwen3VLReranker(
+    model_name_or_path="Qwen/Qwen3-VL-Reranker-8B",
+    torch_dtype=torch.bfloat16,
+    attn_implementation="flash_attention_2"
+)
+```
+
+---
+
+## Alternative Models Considered
+
+> **Source:** [chatgpt5.2-prd.md](../chatgpt5.2-prd.md)
+
+- **Gemini Text-Embedding-001:** Upcoming model (replacing text-embedding-004), expected January 16, 2026. Considered as alternative/complement.
+- **Qwen3-VL-Embedding-2B:** Lightweight variant (2.13B params, 2048 dims, MMEB-V2: 73.2)
+
+---
+
+## Cost Analysis
+
+> **Source:** [docs/UNIFIED_PRD.md](../docs/UNIFIED_PRD.md), [chatgpt5.2-prd.md](../chatgpt5.2-prd.md)
+
+| Component | Model | Cost |
+|-----------|-------|------|
+| Embedding | Qwen3-VL-Embedding-8B | ~$0.03/1M tokens* |
+| Reranking | Qwen3-VL-Reranker-8B | ~$0.05/1M tokens* |
+| Ingestion (35MB) | One-time | ~$0.10 |
+| Queries (10K/day, annual) | - | ~$5.00 |
+
+*Estimated. The models are not yet on OpenRouter, so they require self-hosting or waiting for API availability.
+
+---
+
+## Implementation Requirements
+
+1. Set up Qwen3-VL-Embedding-8B environment with flash_attention_2
+2. Implement model wrapper classes (`Qwen3VLEmbedder`, `Qwen3VLReranker`)
+3. Support MRL dimension truncation for storage vs retrieval
+4. Implement multimodal input preprocessing (text, image, video, mixed)
+5. Add fallback to text-only model on multimodal failure
+6. Integrate with OpenRouter for remote inference once the models are available there (see the hosting conflict below)
+
+---
+
+## Conflicts / Ambiguities
+
+- **⚠️ Dimension mismatch:** chatgpt5.2-prd.md mentions "1526 or 3746 or 3182" as possible embedding sizes — these don't match the actual Qwen3-VL dimensions (4096 native, MRL options: 256/512/1024/2048/4096). The opus PRDs provide the correct values.
+- **⚠️ Gemini alternative:** chatgpt5.2-prd.md suggests potentially using both Qwen and Gemini embeddings. No other document addresses dual-embedding strategy.
+- **⚠️ Hosting:** chatgpt5.2-prd.md assumes OpenRouter availability; cost estimates are speculative since the model may require self-hosting.
diff --git a/scratch/02-chunking-strategies.md b/scratch/02-chunking-strategies.md
@@ -0,0 +1,202 @@
+# Topic: Chunking Strategies
+
+## Summary
+Seven distinct chunking methods for processing different content types (text, code, mixed-modal) into the RAG system. Includes configuration, routing logic, and the four-layer epistemic scaffolding model.
+
+---
+
+## Conceptual Framework: Four-Layer Epistemic Scaffolding
+
+> **Source:** [opus-prd3-v3.md](../opus-prd3-v3.md), [docs/UNIFIED_PRD.md](../docs/UNIFIED_PRD.md)
+
+Chunking has evolved into a four-layer system:
+1. **Fixed-length chunking** — mechanical, deterministic
+2. **Sentence/semantic-unit chunking** — linguistic awareness
+3. **Semantic coherence chunking (agentic)** — meaning-aware boundaries
+4. **Recursive hierarchical chunking (agentic)** — document-structure-aware
+
+---
+
+## Method 1: Fixed-Size Chunking
+
+> **Source:** [opus-prd2-v3.md](../opus-prd2-v3.md), [chatgpt5.2-prd.md](../chatgpt5.2-prd.md), [docs/UNIFIED_PRD.md](../docs/UNIFIED_PRD.md)
+
+- **Window tokens:** 512
+- **Overlap tokens:** 50
+- **Applies to:** configuration files, data files
+- **Modalities:** text only
+- **Agent required:** No — can be done programmatically
+
+### Implementation Notes
+> **Source:** [chatgpt5.2-prd.md](../chatgpt5.2-prd.md)
+
+- Length-based chunking can be done programmatically without an LLM agent
+- Simplest method; serves as the fallback when AST chunking fails
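+
+A minimal sketch of the 512/50 sliding window, assuming the input is already a list of tokens (the source does not name a tokenizer). `fixed_size_chunks` is a hypothetical helper name.
+
+```python
+def fixed_size_chunks(tokens, window=512, overlap=50):
+    """Split a token list into windows of `window` tokens sharing `overlap` tokens."""
+    if not 0 <= overlap < window:
+        raise ValueError("overlap must be non-negative and smaller than window")
+    step = window - overlap
+    chunks = []
+    for start in range(0, len(tokens), step):
+        chunks.append(tokens[start:start + window])
+        if start + window >= len(tokens):
+            break  # this window already reaches the end of the input
+    return chunks
+
+chunks = fixed_size_chunks(list(range(1000)))
+```
+
+With 1,000 tokens this yields three chunks, and each chunk starts with the last 50 tokens of the previous one.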
+
+---
+
+## Method 2: Sentence-Based Chunking
+
+> **Source:** [docs/UNIFIED_PRD.md](../docs/UNIFIED_PRD.md), [chatgpt5.2-prd.md](../chatgpt5.2-prd.md)
+
+- **Window size:** 3 sentences
+- **Min chunk tokens:** 128
+- **Max chunk tokens:** 2048
+- **Agent required:** No — can be done programmatically
+
+---
+
+## Method 3: Semantic Chunking (Agentic)
+
+> **Source:** [opus-prd2-v3.md](../opus-prd2-v3.md), [docs/UNIFIED_PRD.md](../docs/UNIFIED_PRD.md), [chatgpt5.2-prd.md](../chatgpt5.2-prd.md)
+
+- **Similarity threshold:** 0.75
+- **Window size:** 3 sentences
+- **Boundary detection model:** `Qwen/Qwen3-Embedding-0.6B`
+- **Min chunk tokens:** 128
+- **Max chunk tokens:** 2048
+- **Applies to:** documentation, research papers
+- **Modalities:** text only
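+- **Agent required:** Disputed. chatgpt5.2-prd.md says yes; opus-prd2-v3.md treats boundary detection as algorithmic (embedding similarity against the threshold), so no full LLM agent is needed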
+
+### How It Works
+> **Source:** [opus-prd3-v3.md](../opus-prd3-v3.md)
+
+Uses embedding similarity between adjacent sentence windows to detect topic shifts. When similarity drops below threshold (0.75), a chunk boundary is placed. The lightweight 0.6B model handles boundary detection cheaply.
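+
+The boundary rule above can be sketched as follows. This is an illustration under stated assumptions: `embed` is any callable that maps text to a vector (a toy word counter stands in for the 0.6B model here), the function names are hypothetical, and the min/max chunk token limits are not enforced.
+
+```python
+import math
+
+def cosine(a, b):
+    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
+    dot = sum(x * y for x, y in zip(a, b))
+    na = math.sqrt(sum(x * x for x in a))
+    nb = math.sqrt(sum(y * y for y in b))
+    return dot / (na * nb) if na and nb else 0.0
+
+def semantic_chunks(sentences, embed, window=3, threshold=0.75):
+    """Start a new chunk wherever adjacent sentence windows fall below `threshold`."""
+    if not sentences:
+        return []
+    chunks, current = [], [sentences[0]]
+    for i in range(1, len(sentences)):
+        left = embed(" ".join(sentences[max(0, i - window):i]))
+        right = embed(" ".join(sentences[i:i + window]))
+        if cosine(left, right) < threshold:
+            chunks.append(current)  # topic shift: close the current chunk
+            current = []
+        current.append(sentences[i])
+    chunks.append(current)
+    return chunks
+
+def toy_embed(text):
+    # Toy stand-in for the embedding model: counts of two topic words.
+    words = text.lower().split()
+    return [words.count("cat"), words.count("dog")]
+
+sentences = ["cat cat", "cat", "cat cat", "dog", "dog dog", "dog"]
+groups = semantic_chunks(sentences, toy_embed, window=1)
+```
+
+The toy run uses `window=1` only because the corpus is tiny; it splits the six sentences into one "cat" group and one "dog" group.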
+
+---
+
+## Method 4: Recursive Hierarchical Chunking (Agentic)
+
+> **Source:** [opus-prd2-v3.md](../opus-prd2-v3.md), [docs/UNIFIED_PRD.md](../docs/UNIFIED_PRD.md), [chatgpt5.2-prd.md](../chatgpt5.2-prd.md)
+
+- **Chunk size tokens:** 1024
+- **Overlap tokens:** 100
+- **Separators** (in priority order):
+  1. `"\n\n"` — Paragraphs
+  2. `"\n"` — Lines
+  3. `". "` — Sentences
+  4. `" "` — Words (last resort)
+- **Applies to:** documentation, conversation
+- **Modalities:** text only
+- **Agent required:** Yes — requires understanding of document structure
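+
+A minimal sketch of the separator-priority recursion, assuming length is measured in whitespace-separated words rather than model tokens. It omits the 100-token overlap and does not merge small pieces back together; `recursive_split` is a hypothetical name.
+
+```python
+SEPARATORS = ("\n\n", "\n", ". ", " ")  # priority order from the list above
+
+def word_count(text):
+    return len(text.split())
+
+def recursive_split(text, max_len=1024, separators=SEPARATORS, length=word_count):
+    """Split on the coarsest separator first, recursing only into oversized pieces."""
+    if length(text) <= max_len:
+        return [text] if text.strip() else []
+    if not separators:
+        return [text]  # nothing finer to split on; keep the oversized piece
+    sep, finer = separators[0], separators[1:]
+    pieces = []
+    for part in text.split(sep):
+        pieces.extend(recursive_split(part, max_len, finer, length))
+    return pieces
+
+pieces = recursive_split("a b c\n\nd e f g h i", max_len=3)
+```
+
+In this example the first paragraph fits and is kept whole, while the second falls through to word-level splitting, which is why a real implementation would also merge adjacent small pieces up to the size limit.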
+
+---
+
+## Method 5: AST Structural Chunking (Code)
+
+> **Source:** [opus-prd2-v3.md](../opus-prd2-v3.md), [docs/UNIFIED_PRD.md](../docs/UNIFIED_PRD.md)
+
+### Supported Languages & Parsers
+
+| Language | Parser | AST Nodes |
+|----------|--------|-----------|
+| Python | tree-sitter-python | function_definition, class_definition, decorated_definition |
+| TypeScript | tree-sitter-typescript | function_declaration, class_declaration, method_definition, interface_declaration |
+| JavaScript | tree-sitter-javascript | function_declaration, class_declaration, method_definition |
+| Go | tree-sitter-go | function_declaration, method_declaration, type_declaration |
+| Rust | tree-sitter-rust | function_item, impl_item, struct_item, trait_item |
+| Java | tree-sitter-java | method_declaration, class_declaration, constructor_declaration, interface_declaration |
+
+### Configuration
+- `prepend_parent_context`: true
+- `preserve_docstrings`: true
+- `preserve_imports`: true
+- `extract_dependencies`: true
+- `compute_complexity`: true
+- `fallback_to_fixed`: true (falls back to fixed-size 512 tokens on parse failure)
+
+### Applies to
+- Content types: code
+- Modalities: text
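+
+The source specifies tree-sitter parsers. The sketch below swaps in Python's built-in `ast` module so that it runs without extra dependencies, which means it covers Python sources only. `ast_chunks` and the chunk dictionary keys are hypothetical.
+
+```python
+import ast
+
+def ast_chunks(source: str):
+    """Return one chunk per top-level function or class, or None on parse failure."""
+    try:
+        tree = ast.parse(source)
+    except SyntaxError:
+        return None  # the caller would fall back to fixed-size chunking
+    chunks = []
+    for node in tree.body:
+        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
+            chunks.append({
+                "name": node.name,
+                "kind": type(node).__name__,
+                # Source text of the definition; decorators are not included.
+                "text": ast.get_source_segment(source, node),
+                "docstring": ast.get_docstring(node),
+            })
+    return chunks
+```
+
+Returning `None` on a parse error mirrors the `fallback_to_fixed` setting above: the caller decides how to chunk the file instead.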
+
+---
+
+## Method 6: Multimodal Boundary Detection (NEW)
+
+> **Source:** [opus-prd2-v3.md](../opus-prd2-v3.md), [docs/UNIFIED_PRD.md](../docs/UNIFIED_PRD.md)
+
+- **Visual context window:** 1 paragraph before/after
+- **Caption detection:** true
+- **Figure reference detection:** true
+- **Preserve figure-caption pairs:** true
+- **Applies to:** documentation, research papers
+- **Modalities:** mixed_text_image, mixed_all
+
+### Purpose
+Detects boundaries between text and visual content in mixed documents. Ensures figures, diagrams, and their captions are kept together as coherent chunks.
+
+---
+
+## Method 7: Screenshot-Code Fusion (NEW)
+
+> **Source:** [opus-prd2-v3.md](../opus-prd2-v3.md), [docs/UNIFIED_PRD.md](../docs/UNIFIED_PRD.md)
+
+- **Matching strategies:**
+  - `filename_similarity` — match screenshots to code files by name
+  - `ocr_text_matching` — extract text from screenshots, match to code
+  - `reference_comment_detection` — find code comments referencing screenshots
+- **Applies to:** code
+- **Modalities:** mixed_text_image
+
+### Purpose
+Fuses UI screenshots with the code that generates them, creating cross-modal chunks that link visual output to source code.
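+
+As an illustration of the `filename_similarity` strategy only, the sketch below compares file stems with the standard library's `difflib`. The helper name, the 0.6 cutoff, and the example paths are assumptions, not values from the PRDs.
+
+```python
+from difflib import SequenceMatcher
+from pathlib import PurePath
+
+def best_code_match(screenshot, code_files, min_ratio=0.6):
+    """Return the code file whose stem is most similar to the screenshot's stem."""
+    shot = PurePath(screenshot).stem.lower()
+    scored = [
+        (SequenceMatcher(None, shot, PurePath(path).stem.lower()).ratio(), path)
+        for path in code_files
+    ]
+    ratio, match = max(scored, default=(0.0, None))
+    return match if ratio >= min_ratio else None
+
+match = best_code_match("shots/login_form.png", ["src/login_form.tsx", "src/cart.tsx"])
+```
+
+Screenshots with no sufficiently similar file name return `None`, leaving them for the OCR or comment-based strategies.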
+
+---
+
+## Content Type → Chunking Method Routing
+
+> **Source:** [opus-prd2-v3.md](../opus-prd2-v3.md) (domains section)
+
+| Domain | Chunking Methods |
+|--------|-----------------|
+| Prompts | semantic, fixed_size |
+| Codebase | ast_structural, fixed_size, screenshot_code_fusion |
+| Research | recursive_hierarchical, semantic, multimodal_boundary |
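+
+The routing table can be expressed as a plain lookup. In this sketch the fixed-size default for unknown domains is an assumption, and the source does not say whether the order within a row implies priority.
+
+```python
+ROUTING = {
+    "prompts": ["semantic", "fixed_size"],
+    "codebase": ["ast_structural", "fixed_size", "screenshot_code_fusion"],
+    "research": ["recursive_hierarchical", "semantic", "multimodal_boundary"],
+}
+
+def methods_for(domain: str) -> list[str]:
+    """Look up the chunking methods for a domain (case-insensitive)."""
+    # Assumption: unknown domains fall back to fixed-size chunking.
+    return ROUTING.get(domain.lower(), ["fixed_size"])
+```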
+
+---
+
+## Asynchronous / Multi-Agent Chunking
+
+> **Source:** [chatgpt5.2-prd.md](../chatgpt5.2-prd.md)
+
+### Agent Assignment by Method
+- **Fixed-size & Sentence-based:** Programmatic (no LLM needed)
+- **Semantic chunking:** Requires LLM intelligence — can use Haiku/Flash-class model
+- **Recursive hierarchical:** Requires higher intelligence — Sonnet/Pro-class model recommended
+
+### Key Questions from Requirements
+- Can semantic and recursive hierarchical chunking be done in a single pass by one agent, or do they require separate passes?
+- Should files be processed asynchronously? The user suggests this is ideal given the multi-file corpus
+
+### Model Recommendations for Agentic Chunking
+> **Source:** [chatgpt5.2-prd.md](../chatgpt5.2-prd.md)
+
+- Sonnet/Gemini Pro class: For recursive hierarchical chunking
+- Haiku/Gemini Flash class: For semantic chunking
+- Free models (e.g., MIMO V2 via OpenRouter): For simpler tasks
+
+---
+
+## Quad Encoding (MemVid-Specific Chunking)
+
+> **Source:** [gemini-prd.md](../gemini-prd.md) (Appendix I)
+
+MemVid uses "Quad Encoding" — encoding the same content at four resolutions:
+
+| Resolution | What it Encodes | Agent Query Type |
+|-----------|----------------|-----------------|
+| Word (Token) | Keywords & Entities | Exact definitions, variable names |
+| Sentence | Discrete Facts | Return types, specific error codes |
+| Paragraph | Local Context | How a flow handles edge cases |
+| Boundary | Relationships & Flow | What connects between sections |
+
+Quad Encoding runs during sleep-time compute rather than in real time, because encoding at four resolutions quadruples the embedding cost.
+
+---
+
+## Conflicts / Ambiguities
+
+- **⚠️ Chunk size inconsistency:** chatgpt5.2-prd.md mentions "1.5-3K tokens" for chunks; opus-prd2-v3.md specifies 512 tokens (fixed), 1024 tokens (recursive), 128-2048 tokens (semantic). The AGGREGATION_PLAN.md lists yet another set: "1.5-3k tokens with 200-400 token overlap" for fixed-size. The opus-prd2 YAML config should be treated as authoritative.
+- **⚠️ Number of methods:** UNIFIED_PRD.md lists 7 methods; chatgpt5.2-prd.md discusses 4 core methods; opus-prd2-v3.md defines 6 in YAML config. The 7-method list (adding sentence-based as distinct from semantic) is the most complete.
+- **⚠️ Agentic vs programmatic:** chatgpt5.2-prd.md suggests semantic and recursive hierarchical need LLM agents; opus-prd2-v3.md treats semantic chunking as algorithmic (embedding similarity threshold). Resolution: semantic chunking uses the lightweight 0.6B model algorithmically, not a full LLM agent.