187 changes: 187 additions & 0 deletions scratch/01-embedding-model-stack.md
@@ -0,0 +1,187 @@
# Topic: Embedding Model Stack

## Summary
Configuration, initialization, and integration of the Qwen3-VL multimodal embedding models and reranker for the RAG v3.0 system.

---

## Primary Model: Qwen3-VL-Embedding-8B

> **Source:** [opus-prd1-v3.md](../opus-prd1-v3.md), [opus-prd2-v3.md](../opus-prd2-v3.md), [opus-prd3-v3.md](../opus-prd3-v3.md), [docs/UNIFIED_PRD.md](../docs/UNIFIED_PRD.md)

- **Model ID:** `Qwen/Qwen3-VL-Embedding-8B`
- **Released:** January 7-8, 2026 (arXiv:2601.04720)
- **Parameters:** 8.14B
- **Layers:** 36
- **Architecture:** Dual-Tower (qwen3_vl)
- **Context Length:** 32,768 tokens (default 8,192)
- **Native Embedding Dimensions:** 4096
- **MRL Support:** Yes — options: [256, 512, 1024, 2048, 4096]
  - Storage dimension: 1024 (truncated for MemVid efficiency)
  - Retrieval dimension: 2048 (higher precision for queries)
- **Quantization:** bf16 (recommended), fp16, int8, int4
- **Instruction-Aware:** Yes
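
MRL truncation is usually done by keeping the leading components of the native vector and re-normalizing. The sketch below assumes that convention; `truncate_mrl` and `MRL_DIMS` are hypothetical names, not part of any Qwen API.

```python
import math

# MRL options listed above for Qwen3-VL-Embedding-8B.
MRL_DIMS = (256, 512, 1024, 2048, 4096)


def truncate_mrl(vector, dim):
    """Keep the first `dim` components and L2-normalize the result."""
    if dim not in MRL_DIMS:
        raise ValueError(f"dim must be one of {MRL_DIMS}, got {dim}")
    if dim > len(vector):
        raise ValueError("cannot truncate to more dimensions than the vector has")
    head = list(vector[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # an all-zero vector cannot be normalized
    return [x / norm for x in head]


if __name__ == "__main__":
    native = [1.0] * 4096
    stored = truncate_mrl(native, 1024)  # storage dimension
    query = truncate_mrl(native, 2048)   # retrieval dimension
    print(len(stored), len(query))
```

A similarity score is only defined between vectors of equal length, so a 2048-dimension query would have to be truncated again to 1024 before comparison with stored vectors (or the stored vectors kept at 2048). The source documents do not say which is intended.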

### Benchmarks
| Benchmark | Score |
|-----------|-------|
| MMEB-V2 | 77.8 (Rank #1) |
| MMTEB | 67.88 |
| Image Retrieval | 80.0 |
| Video Retrieval | 67.1 |
| VisDoc Retrieval | 82.4 |

### Supported Input Modalities
- Pure text
- Pure image
- Pure video
- Text + image (mixed)
- Text + video (mixed)
- Image + video (mixed)
- Text + image + video (mixed)
- Screenshots (treated as images with OCR awareness)

### Vision Configuration
> **Source:** [opus-prd2-v3.md](../opus-prd2-v3.md)

- `min_pixels`: 4096
- `max_pixels`: 1,843,200 (1280×1440)
- `total_video_pixels`: 7,864,320
- `default_fps`: 1.0
- `default_frames`: 64
- `max_frames`: 64
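
As a rough illustration of how the `min_pixels` / `max_pixels` budget constrains an input image, the sketch below rescales dimensions while preserving aspect ratio. It is a simplification: `fit_to_pixel_budget` is a hypothetical helper, and the real processor's resize logic (for example rounding to patch multiples) is not reproduced here.

```python
import math

MIN_PIXELS = 4096
MAX_PIXELS = 1_843_200  # 1280 x 1440


def fit_to_pixel_budget(width, height, min_pixels=MIN_PIXELS, max_pixels=MAX_PIXELS):
    """Return (width, height) scaled so that the area falls inside the budget."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    area = width * height
    if area > max_pixels:
        scale = math.sqrt(max_pixels / area)
        # Round down so the result never exceeds max_pixels.
        return max(1, math.floor(width * scale)), max(1, math.floor(height * scale))
    if area < min_pixels:
        scale = math.sqrt(min_pixels / area)
        # Round up so the result never falls below min_pixels.
        return math.ceil(width * scale), math.ceil(height * scale)
    return width, height


if __name__ == "__main__":
    print(fit_to_pixel_budget(4000, 3000))
```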

### Inference Configuration
> **Source:** [opus-prd2-v3.md](../opus-prd2-v3.md)

- `torch_dtype`: bfloat16
- `attn_implementation`: flash_attention_2
- `device_map`: auto

### Architecture Details
> **Source:** [opus-prd3-v3.md](../opus-prd3-v3.md)

- Extracts `[EOS]` token hidden state from last layer as final representation
- Cross-modal pretraining with unified modality projection
- Integrates supervised tasks, masked modeling, and multimodal alignment objectives
- Enables efficient independent encoding for large-scale retrieval
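
The `[EOS]` extraction in the first bullet amounts to last-token pooling. A minimal sketch in plain Python follows, assuming the `[EOS]` token is the last position with a non-zero attention mask; `last_token_embedding` is a hypothetical name and real code would operate on tensors.

```python
import math


def last_token_embedding(hidden_states, attention_mask):
    """Pool one sequence by taking the hidden state of its last real token.

    hidden_states: list of per-token vectors from the last layer.
    attention_mask: list of 0/1 flags, 1 for real tokens, 0 for padding.
    """
    real_positions = [i for i, m in enumerate(attention_mask) if m == 1]
    if not real_positions:
        raise ValueError("sequence contains only padding")
    vector = hidden_states[real_positions[-1]]
    norm = math.sqrt(sum(x * x for x in vector))
    # L2-normalize so that dot products behave as cosine similarities.
    return [x / norm for x in vector] if norm else list(vector)


if __name__ == "__main__":
    states = [[1.0, 0.0], [0.0, 2.0], [9.0, 9.0]]
    print(last_token_embedding(states, [1, 1, 0]))  # right-padded input
```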

---

## Boundary Detection Model: Qwen3-Embedding-0.6B

> **Source:** [opus-prd2-v3.md](../opus-prd2-v3.md), [docs/UNIFIED_PRD.md](../docs/UNIFIED_PRD.md)

- **Model ID:** `Qwen/Qwen3-Embedding-0.6B`
- **Type:** Text-only
- **Parameters:** 595.8M
- **Native Dimensions:** 1024
- **Purpose:** Cheap/fast similarity detection for semantic chunking boundary detection

---

## Reranker: Qwen3-VL-Reranker-8B

> **Source:** [opus-prd1-v3.md](../opus-prd1-v3.md), [opus-prd2-v3.md](../opus-prd2-v3.md)

- **Model ID:** `Qwen/Qwen3-VL-Reranker-8B`
- **Parameters:** 8.14B
- **Layers:** 36
- **Architecture:** Single-Tower with Cross-Attention
- **Input:** (Query, Document) pairs — both can be mixed-modal
- **Output:** Relevance score (via yes/no token generation probability)
- **Supported Modalities:** text, image, video, mixed
- **Inference:** bfloat16, flash_attention_2
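
The yes/no scoring in the **Output** bullet can be read as a two-way softmax over the logits of the "yes" and "no" tokens. The sketch below assumes that reading; `relevance_score` is a hypothetical helper, and extracting the two logits from the model is out of scope here.

```python
import math


def relevance_score(yes_logit, no_logit):
    """Probability of "yes" under a softmax restricted to the yes/no tokens."""
    # Subtract the max for numerical stability before exponentiating.
    m = max(yes_logit, no_logit)
    e_yes = math.exp(yes_logit - m)
    e_no = math.exp(no_logit - m)
    return e_yes / (e_yes + e_no)


if __name__ == "__main__":
    print(relevance_score(2.0, -1.0))
```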

### Smaller Variant: Qwen3-VL-Reranker-2B
- **Parameters:** 2.13B
- Same architecture (Single-Tower)

---

## Fallback Model: Qwen3-Embedding-8B (Text-Only)

> **Source:** [opus-prd2-v3.md](../opus-prd2-v3.md)

- **Model ID:** `Qwen/Qwen3-Embedding-8B`
- **Type:** Text-only
- **Parameters:** 7.57B
- **Native Dimensions:** 4096
- **MTEB (Multilingual) Score:** 70.58 (Rank #1)
- **Note:** Scores higher than the VL model on multilingual MTEB, listed as MMTEB in the benchmark table above (70.58 vs 67.88), but lacks multimodal capabilities

---

## Model Initialization Code

> **Source:** [chatgpt5.2-prd.md](../chatgpt5.2-prd.md) (Phase 1), [opus-prd1-v3.md](../opus-prd1-v3.md)

```python
import torch

# Project-local wrapper classes (see Implementation Requirements)
from src.models.qwen3_vl_embedding import Qwen3VLEmbedder
from src.models.qwen3_vl_reranker import Qwen3VLReranker

# Primary Embedding Model
embedder = Qwen3VLEmbedder(
    model_name_or_path="Qwen/Qwen3-VL-Embedding-8B",
    max_length=8192,       # default context length (model supports up to 32,768)
    min_pixels=4096,
    max_pixels=1843200,    # 1280 x 1440
    total_pixels=7864320,  # total_video_pixels budget
    fps=1.0,
    num_frames=64,
    max_frames=64,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

# Precision Reranker
reranker = Qwen3VLReranker(
    model_name_or_path="Qwen/Qwen3-VL-Reranker-8B",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
```

---

## Alternative Models Considered

> **Source:** [chatgpt5.2-prd.md](../chatgpt5.2-prd.md)

- **Gemini Text-Embedding-001:** Upcoming model (replacing text-embedding-004), expected January 16, 2026. Considered as alternative/complement.
- **Qwen3-VL-Embedding-2B:** Lightweight variant (2.13B params, 2048 dims, MMEB-V2: 73.2)

---

## Cost Analysis

> **Source:** [docs/UNIFIED_PRD.md](../docs/UNIFIED_PRD.md), [chatgpt5.2-prd.md](../chatgpt5.2-prd.md)

| Component | Model | Cost |
|-----------|-------|------|
| Embedding | Qwen3-VL-Embedding-8B | ~$0.03/1M tokens* |
| Reranking | Qwen3-VL-Reranker-8B | ~$0.05/1M tokens* |
| Ingestion (35MB) | One-time | ~$0.10 |
| Queries (10K/day, annual) | - | ~$5.00 |

*Estimated; the models are not yet on OpenRouter, so this requires self-hosting or waiting for API availability.

---

## Implementation Requirements

1. Set up Qwen3-VL-Embedding-8B environment with flash_attention_2
2. Implement model wrapper classes (`Qwen3VLEmbedder`, `Qwen3VLReranker`)
3. Support MRL dimension truncation for storage vs retrieval
4. Implement multimodal input preprocessing (text, image, video, mixed)
5. Add fallback to text-only model on multimodal failure
6. Integrate with OpenRouter for remote inference

---

## Conflicts / Ambiguities

- **⚠️ Dimension mismatch:** chatgpt5.2-prd.md mentions "1526 or 3746 or 3182" as possible embedding sizes — these don't match the actual Qwen3-VL dimensions (4096 native, MRL options: 256/512/1024/2048/4096). The opus PRDs provide the correct values.
- **⚠️ Gemini alternative:** chatgpt5.2-prd.md suggests potentially using both Qwen and Gemini embeddings. No other document addresses dual-embedding strategy.
- **⚠️ Hosting:** chatgpt5.2-prd.md assumes OpenRouter availability; cost estimates are speculative since the model may require self-hosting.
202 changes: 202 additions & 0 deletions scratch/02-chunking-strategies.md
@@ -0,0 +1,202 @@
# Topic: Chunking Strategies

## Summary
Seven distinct chunking methods for processing different content types (text, code, mixed-modal) into the RAG system. Includes configuration, routing logic, and the four-layer epistemic scaffolding model.

---

## Conceptual Framework: Four-Layer Epistemic Scaffolding

> **Source:** [opus-prd3-v3.md](../opus-prd3-v3.md), [docs/UNIFIED_PRD.md](../docs/UNIFIED_PRD.md)

Chunking has evolved into a four-layer system:
1. **Fixed-length chunking** — mechanical, deterministic
2. **Sentence/semantic-unit chunking** — linguistic awareness
3. **Semantic coherence chunking (agentic)** — meaning-aware boundaries
4. **Recursive hierarchical chunking (agentic)** — document-structure-aware

---

## Method 1: Fixed-Size Chunking

> **Source:** [opus-prd2-v3.md](../opus-prd2-v3.md), [chatgpt5.2-prd.md](../chatgpt5.2-prd.md), [docs/UNIFIED_PRD.md](../docs/UNIFIED_PRD.md)

- **Window tokens:** 512
- **Overlap tokens:** 50
- **Applies to:** configuration files, data files
- **Modalities:** text only
- **Agent required:** No — can be done programmatically

### Implementation Notes
> **Source:** [chatgpt5.2-prd.md](../chatgpt5.2-prd.md)

- Length-based chunking can be done programmatically without an LLM agent
- Simplest method; also serves as the fallback when AST chunking fails to parse (`fallback_to_fixed` in Method 5)
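
A minimal sketch of the sliding window described above, using the 512/50 settings. It works on a pre-tokenized list, so the tokenizer is left as an assumption; `fixed_size_chunks` is a hypothetical name.

```python
def fixed_size_chunks(tokens, window=512, overlap=50):
    """Split a token list into windows that share `overlap` tokens."""
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("require window > 0 and 0 <= overlap < window")
    stride = window - overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # the last window already reaches the end of the input
    return chunks


if __name__ == "__main__":
    parts = fixed_size_chunks(list(range(1000)))
    print([len(p) for p in parts])
```

Each chunk starts `window - overlap` tokens after the previous one, so consecutive chunks repeat exactly `overlap` tokens.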

---

## Method 2: Sentence-Based Chunking

> **Source:** [docs/UNIFIED_PRD.md](../docs/UNIFIED_PRD.md), [chatgpt5.2-prd.md](../chatgpt5.2-prd.md)

- **Window size:** 3 sentences
- **Min chunk tokens:** 128
- **Max chunk tokens:** 2048
- **Agent required:** No — can be done programmatically
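
One possible reading of these settings is sketched below: sentences are grouped until a chunk holds at least `window` sentences and `min_tokens` tokens, and a chunk is closed early if adding a sentence would exceed `max_tokens`. The regex sentence splitter and whitespace token count are assumptions, and `sentence_chunks` is a hypothetical name.

```python
import re


def split_sentences(text):
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def sentence_chunks(text, window=3, min_tokens=128, max_tokens=2048):
    chunks, current = [], []
    for sentence in split_sentences(text):
        candidate = current + [sentence]
        n_tokens = sum(len(s.split()) for s in candidate)
        if current and n_tokens > max_tokens:
            # Close the chunk before it grows past the maximum.
            chunks.append(" ".join(current))
            current = [sentence]
            continue
        current = candidate
        if len(current) >= window and n_tokens >= min_tokens:
            chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))  # trailing remainder, may be short
    return chunks


if __name__ == "__main__":
    sample = "One two. Three four. Five six. Seven eight."
    print(sentence_chunks(sample, min_tokens=1))
```

A single sentence longer than `max_tokens` is kept whole in this sketch; splitting it further would need one of the other methods.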

---

## Method 3: Semantic Chunking (Agentic)

> **Source:** [opus-prd2-v3.md](../opus-prd2-v3.md), [docs/UNIFIED_PRD.md](../docs/UNIFIED_PRD.md), [chatgpt5.2-prd.md](../chatgpt5.2-prd.md)

- **Similarity threshold:** 0.75
- **Window size:** 3 sentences
- **Boundary detection model:** `Qwen/Qwen3-Embedding-0.6B`
- **Min chunk tokens:** 128
- **Max chunk tokens:** 2048
- **Applies to:** documentation, research papers
- **Modalities:** text only
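- **Agent required:** Disputed. chatgpt5.2-prd.md says yes; opus-prd2-v3.md treats boundary detection as algorithmic, using an embedding similarity threshold (see Conflicts / Ambiguities)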

### How It Works
> **Source:** [opus-prd3-v3.md](../opus-prd3-v3.md)

Uses embedding similarity between adjacent sentence windows to detect topic shifts. When similarity drops below threshold (0.75), a chunk boundary is placed. The lightweight 0.6B model handles boundary detection cheaply.
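
The thresholding step can be sketched as follows. `toy_embed` is a bag-of-words stand-in so the example runs without a model; in the real pipeline it would be replaced by a call to `Qwen/Qwen3-Embedding-0.6B`. The function names are hypothetical, and the min/max chunk token limits are not enforced here.

```python
import math
from collections import Counter


def toy_embed(text):
    """Stand-in embedding: word counts. NOT the real 0.6B model."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(sentences, embed=toy_embed, window=3, threshold=0.75):
    """Place a boundary wherever adjacent sentence windows are dissimilar."""
    if not sentences:
        return []
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        left = " ".join(sentences[max(0, i - window):i])   # window before position i
        right = " ".join(sentences[i:i + window])           # window starting at i
        if cosine(embed(left), embed(right)) < threshold:
            chunks.append(current)  # topic shift detected: close the chunk
            current = []
        current.append(sentences[i])
    chunks.append(current)
    return chunks


if __name__ == "__main__":
    docs = ["cat cat", "cat cat", "dog dog", "dog dog"]
    print(semantic_chunks(docs, window=1))
```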

---

## Method 4: Recursive Hierarchical Chunking (Agentic)

> **Source:** [opus-prd2-v3.md](../opus-prd2-v3.md), [docs/UNIFIED_PRD.md](../docs/UNIFIED_PRD.md), [chatgpt5.2-prd.md](../chatgpt5.2-prd.md)

- **Chunk size tokens:** 1024
- **Overlap tokens:** 100
- **Separators** (in priority order):
  1. `"\n\n"` (paragraphs)
  2. `"\n"` (lines)
  3. `". "` (sentences)
  4. `" "` (words, last resort)
- **Applies to:** documentation, conversation
- **Modalities:** text only
- **Agent required:** Yes — requires understanding of document structure
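
The separator cascade can be expressed without an agent as a recursive splitter. This is a sketch under assumptions: tokens are counted by whitespace, the 100-token overlap is omitted for brevity, and `recursive_split` is a hypothetical name.

```python
SEPARATORS = ("\n\n", "\n", ". ", " ")


def count_tokens(text):
    """Whitespace token count; a real tokenizer would differ."""
    return len(text.split())


def recursive_split(text, chunk_size=1024, separators=SEPARATORS):
    """Split on the coarsest separator first, recursing only where needed."""
    if count_tokens(text) <= chunk_size or not separators:
        return [text] if text.strip() else []
    sep, finer = separators[0], separators[1:]
    pieces = []
    for part in text.split(sep):
        pieces.extend(recursive_split(part, chunk_size, finer))
    # Greedily merge neighbouring pieces back together up to chunk_size.
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if current and count_tokens(candidate) > chunk_size:
            chunks.append(current)
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    doc = "a b c\n\nd e f\n\ng h i"
    print(recursive_split(doc, chunk_size=6))
```

Separators are consumed at chunk boundaries, so a chunk that ends at a sentence split loses its trailing period in this sketch.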

---

## Method 5: AST Structural Chunking (Code)

> **Source:** [opus-prd2-v3.md](../opus-prd2-v3.md), [docs/UNIFIED_PRD.md](../docs/UNIFIED_PRD.md)

### Supported Languages & Parsers

| Language | Parser | AST Nodes |
|----------|--------|-----------|
| Python | tree-sitter-python | function_definition, class_definition, decorated_definition |
| TypeScript | tree-sitter-typescript | function_declaration, class_declaration, method_definition, interface_declaration |
| JavaScript | tree-sitter-javascript | function_declaration, class_declaration, method_definition |
| Go | tree-sitter-go | function_declaration, method_declaration, type_declaration |
| Rust | tree-sitter-rust | function_item, impl_item, struct_item, trait_item |
| Java | tree-sitter-java | method_declaration, class_declaration, constructor_declaration, interface_declaration |

### Configuration
- `prepend_parent_context`: true
- `preserve_docstrings`: true
- `preserve_imports`: true
- `extract_dependencies`: true
- `compute_complexity`: true
- `fallback_to_fixed`: true (falls back to fixed-size 512 tokens on parse failure)

### Applies to
- Content types: code
- Modalities: text
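
To illustrate structural chunking with a fixed-size fallback, the sketch below uses Python's standard-library `ast` module as a stand-in for `tree-sitter-python`, so it runs without extra dependencies. It covers only top-level functions and classes (including decorators), ignores the other configuration flags, and `ast_chunks` is a hypothetical name.

```python
import ast


def ast_chunks(source, fallback_window=512):
    """Return one chunk per top-level function or class definition."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        # fallback_to_fixed: plain fixed-size windows over whitespace tokens.
        tokens = source.split()
        return [
            " ".join(tokens[i:i + fallback_window])
            for i in range(0, len(tokens), fallback_window)
        ]
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # Start at the first decorator so decorated definitions stay whole.
            start = min([d.lineno for d in node.decorator_list] + [node.lineno])
            chunks.append("\n".join(lines[start - 1:node.end_lineno]))
    return chunks


if __name__ == "__main__":
    code = "import os\n\n@dec\ndef f():\n    return 1\n\nclass C:\n    pass\n"
    for chunk in ast_chunks(code):
        print(chunk, end="\n---\n")
```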

---

## Method 6: Multimodal Boundary Detection (NEW)

> **Source:** [opus-prd2-v3.md](../opus-prd2-v3.md), [docs/UNIFIED_PRD.md](../docs/UNIFIED_PRD.md)

- **Visual context window:** 1 paragraph before/after
- **Caption detection:** true
- **Figure reference detection:** true
- **Preserve figure-caption pairs:** true
- **Applies to:** documentation, research papers
- **Modalities:** mixed_text_image, mixed_all

### Purpose
Detects boundaries between text and visual content in mixed documents. Ensures figures, diagrams, and their captions are kept together as coherent chunks.
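
A minimal sketch of keeping figure-caption pairs together, assuming the document has already been parsed into an ordered list of text and image blocks. The block schema, the caption regex, and `group_figures` are all hypothetical; the source does not specify them.

```python
import re

# Assumed caption convention: "Figure 3: ..." or "Fig. 3 ...".
CAPTION_RE = re.compile(r"^(figure|fig\.)\s*\d+", re.IGNORECASE)


def group_figures(blocks, context=1):
    """Attach a caption and nearby paragraphs to each image block.

    blocks: ordered list of {"type": "text" | "image", "value": str}.
    context: number of blocks before/after the image to inspect.
    """
    chunks = []
    for i, block in enumerate(blocks):
        if block["type"] != "image":
            continue
        lo, hi = max(0, i - context), min(len(blocks), i + context + 1)
        texts = [b["value"] for b in blocks[lo:hi] if b["type"] == "text"]
        caption = next((t for t in texts if CAPTION_RE.match(t)), None)
        chunks.append({
            "image": block["value"],
            "caption": caption,
            "context": [t for t in texts if t != caption],
        })
    return chunks


if __name__ == "__main__":
    page = [
        {"type": "text", "value": "Intro paragraph."},
        {"type": "image", "value": "arch.png"},
        {"type": "text", "value": "Figure 1: System architecture."},
        {"type": "text", "value": "Unrelated."},
    ]
    print(group_figures(page))
```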

---

## Method 7: Screenshot-Code Fusion (NEW)

> **Source:** [opus-prd2-v3.md](../opus-prd2-v3.md), [docs/UNIFIED_PRD.md](../docs/UNIFIED_PRD.md)

- **Matching strategies:**
  - `filename_similarity`: match screenshots to code files by name
  - `ocr_text_matching`: extract text from screenshots, match to code
  - `reference_comment_detection`: find code comments referencing screenshots
- **Applies to:** code
- **Modalities:** mixed_text_image

### Purpose
Fuses UI screenshots with the code that generates them, creating cross-modal chunks that link visual output to source code.
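
Of the three strategies, `filename_similarity` is the simplest to illustrate. The sketch below compares file stems with the standard-library `difflib.SequenceMatcher`; the `min_ratio` cutoff of 0.6 and the name `match_screenshots` are assumptions, not values from the PRDs.

```python
from difflib import SequenceMatcher
from pathlib import PurePosixPath


def match_screenshots(screenshots, code_files, min_ratio=0.6):
    """Map each screenshot path to its most similarly named code file."""
    matches = {}
    for shot in screenshots:
        shot_stem = PurePosixPath(shot).stem.lower()
        best, best_ratio = None, 0.0
        for code in code_files:
            code_stem = PurePosixPath(code).stem.lower()
            ratio = SequenceMatcher(None, shot_stem, code_stem).ratio()
            if ratio > best_ratio:
                best, best_ratio = code, ratio
        # Leave the screenshot unmatched when nothing is similar enough.
        matches[shot] = best if best_ratio >= min_ratio else None
    return matches


if __name__ == "__main__":
    print(match_screenshots(
        ["shots/login_form.png", "shots/zzz.png"],
        ["src/login_form.tsx", "src/utils.ts"],
    ))
```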

---

## Content Type → Chunking Method Routing

> **Source:** [opus-prd2-v3.md](../opus-prd2-v3.md) (domains section)

| Domain | Chunking Methods |
|--------|-----------------|
| Prompts | semantic, fixed_size |
| Codebase | ast_structural, fixed_size, screenshot_code_fusion |
| Research | recursive_hierarchical, semantic, multimodal_boundary |
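
The routing table maps directly onto a lookup. In this sketch the list order mirrors the table; whether that order implies priority is not stated in the source, and `ROUTING` / `methods_for` are hypothetical names.

```python
# Domain -> chunking methods, copied from the routing table.
ROUTING = {
    "prompts": ["semantic", "fixed_size"],
    "codebase": ["ast_structural", "fixed_size", "screenshot_code_fusion"],
    "research": ["recursive_hierarchical", "semantic", "multimodal_boundary"],
}


def methods_for(domain):
    """Return the chunking methods configured for a domain."""
    try:
        return ROUTING[domain.lower()]
    except KeyError:
        raise ValueError(f"unknown domain: {domain!r}") from None


if __name__ == "__main__":
    print(methods_for("Codebase"))
```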

---

## Asynchronous / Multi-Agent Chunking

> **Source:** [chatgpt5.2-prd.md](../chatgpt5.2-prd.md)

### Agent Assignment by Method
- **Fixed-size & Sentence-based:** Programmatic (no LLM needed)
- **Semantic chunking:** Requires LLM intelligence — can use Haiku/Flash-class model
- **Recursive hierarchical:** Requires higher intelligence — Sonnet/Pro-class model recommended

### Key Questions from Requirements
- Can semantic and recursive hierarchical chunking be done in a single pass by one agent, or do they require separate passes?
- Should files be processed asynchronously? The user suggests this is ideal given the multi-file corpus

### Model Recommendations for Agentic Chunking
> **Source:** [chatgpt5.2-prd.md](../chatgpt5.2-prd.md)

- Sonnet/Gemini Pro class: For recursive hierarchical chunking
- Haiku/Gemini Flash class: For semantic chunking
- Free models (e.g., MIMO V2 via OpenRouter): For simpler tasks
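
Asynchronous processing across files could look like the sketch below, which bounds concurrency with a semaphore and tags each job with the model tier suggested above. `chunk_file` is a placeholder that does no real chunking, and the tier labels and function names are hypothetical.

```python
import asyncio

# Method -> agent tier, following the assignment in this section.
AGENT_TIER = {
    "fixed_size": "programmatic",
    "sentence": "programmatic",
    "semantic": "haiku_or_flash_class",
    "recursive_hierarchical": "sonnet_or_pro_class",
}


async def chunk_file(path, method):
    """Placeholder worker: a real version would read and chunk the file."""
    await asyncio.sleep(0)  # yield control, standing in for I/O or an API call
    return {"path": path, "method": method, "tier": AGENT_TIER[method]}


async def chunk_corpus(jobs, max_concurrency=4):
    """Run (path, method) jobs concurrently, at most max_concurrency at once."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(path, method):
        async with semaphore:
            return await chunk_file(path, method)

    # gather returns results in the same order as the submitted jobs.
    return await asyncio.gather(*(bounded(p, m) for p, m in jobs))


if __name__ == "__main__":
    demo = [("docs/guide.md", "semantic"), ("config/app.yaml", "fixed_size")]
    print(asyncio.run(chunk_corpus(demo)))
```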

---

## Quad Encoding (MemVid-Specific Chunking)

> **Source:** [gemini-prd.md](../gemini-prd.md) (Appendix I)

MemVid uses "Quad Encoding" — encoding the same content at four resolutions:

| Resolution | What it Encodes | Agent Query Type |
|-----------|----------------|-----------------|
| Word (Token) | Keywords & Entities | Exact definitions, variable names |
| Sentence | Discrete Facts | Return types, specific error codes |
| Paragraph | Local Context | How a flow handles edge cases |
| Boundary | Relationships & Flow | What connects between sections |

Because the same content is embedded four times (4x embedding cost), Quad Encoding runs during sleep-time compute rather than in real time.
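
The four resolutions can be sketched as four views of the same text. The word, sentence, and paragraph splits follow the table; the "boundary" unit is an assumption here (the last sentence of one paragraph joined to the first sentence of the next), since the source does not define it precisely. `quad_units` is a hypothetical name.

```python
import re


def quad_units(text):
    """Produce the units to embed at each of the four resolutions."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    sentences_by_par = [
        [s for s in re.split(r"(?<=[.!?])\s+", p) if s] for p in paragraphs
    ]
    sentences = [s for group in sentences_by_par for s in group]
    words = re.findall(r"\w+", text)
    # Assumed definition: one unit per junction between adjacent paragraphs.
    boundaries = [
        prev[-1] + " " + nxt[0]
        for prev, nxt in zip(sentences_by_par, sentences_by_par[1:])
    ]
    return {
        "word": words,
        "sentence": sentences,
        "paragraph": paragraphs,
        "boundary": boundaries,
    }


if __name__ == "__main__":
    print(quad_units("A b. C d.\n\nE f."))
```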

---

## Conflicts / Ambiguities

- **⚠️ Chunk size inconsistency:** chatgpt5.2-prd.md mentions "1.5-3K tokens" for chunks; opus-prd2-v3.md specifies 512 tokens (fixed), 1024 tokens (recursive), 128-2048 tokens (semantic). The AGGREGATION_PLAN.md lists yet another set: "1.5-3k tokens with 200-400 token overlap" for fixed-size. The opus-prd2 YAML config should be treated as authoritative.
- **⚠️ Number of methods:** UNIFIED_PRD.md lists 7 methods; chatgpt5.2-prd.md discusses 4 core methods; opus-prd2-v3.md defines 6 in YAML config. The 7-method list (adding sentence-based as distinct from semantic) is the most complete.
- **⚠️ Agentic vs programmatic:** chatgpt5.2-prd.md suggests semantic and recursive hierarchical need LLM agents; opus-prd2-v3.md treats semantic chunking as algorithmic (embedding similarity threshold). Resolution: semantic chunking uses the lightweight 0.6B model algorithmically, not a full LLM agent.