From bb15d3e2e961e18591e346bd9d1d4ffe496a5474 Mon Sep 17 00:00:00 2001 From: "kiloconnect[bot]" <240665456+kiloconnect[bot]@users.noreply.github.com> Date: Sun, 15 Feb 2026 06:18:00 +0000 Subject: [PATCH] Add scratch files: decompose project docs into 13 implementation topics Create topic-focused scratch files consolidating all requirements from 8 source documents into bite-sized, implementable chunks: - 01: Embedding Model Stack (Qwen3-VL-Embedding-8B, reranker, fallbacks) - 02: Chunking Strategies (7 methods: fixed, sentence, semantic, recursive, AST, multimodal, fusion) - 03: Metadata Schema (12-dimension chunk metadata with TypeScript interfaces) - 04: Database Schema (SQLite tables, indexes, relationships) - 05: MemVid Storage (H.265 video encoding, quad-encoding, FAISS) - 06: Memory Hierarchy (Hot/Warm/Cold: ByteRover, Graphiti, MemVid) - 07: Retrieval Pipeline (two-stage recall + rerank, hybrid search, cross-modal) - 08: Orchestration & Concurrency (agentic swarm, workers, MCP servers) - 09: Proxy/Shim (Claude wrapper, context injection, sanitization) - 10: Sleep-Time Compute (autonomous refinement loops, Tribunal, Mutator) - 11: Quality Assurance (validation, error handling, verification queries) - 12: Domain Configuration (prompts/codebase/research routing) - 13: Strategic Integrations (hypergraph, active inference, formal verification) TASK_INDEX.md provides implementation order, dependency graph, complexity estimates, and cross-document conflict resolution. 
--- scratch/01-embedding-model-stack.md | 187 ++++++++++++++++ scratch/02-chunking-strategies.md | 202 +++++++++++++++++ scratch/03-metadata-schema.md | 285 ++++++++++++++++++++++++ scratch/04-database-schema.md | 253 +++++++++++++++++++++ scratch/05-memvid-storage.md | 206 +++++++++++++++++ scratch/06-memory-hierarchy.md | 123 ++++++++++ scratch/07-retrieval-pipeline.md | 169 ++++++++++++++ scratch/08-orchestration-concurrency.md | 160 +++++++++++++ scratch/09-proxy-shim.md | 103 +++++++++ scratch/10-sleep-time-compute.md | 138 ++++++++++++ scratch/11-quality-assurance.md | 120 ++++++++++ scratch/12-domain-configuration.md | 125 +++++++++++ scratch/13-strategic-integrations.md | 96 ++++++++ scratch/TASK_INDEX.md | 122 ++++++++++ 14 files changed, 2289 insertions(+) create mode 100644 scratch/01-embedding-model-stack.md create mode 100644 scratch/02-chunking-strategies.md create mode 100644 scratch/03-metadata-schema.md create mode 100644 scratch/04-database-schema.md create mode 100644 scratch/05-memvid-storage.md create mode 100644 scratch/06-memory-hierarchy.md create mode 100644 scratch/07-retrieval-pipeline.md create mode 100644 scratch/08-orchestration-concurrency.md create mode 100644 scratch/09-proxy-shim.md create mode 100644 scratch/10-sleep-time-compute.md create mode 100644 scratch/11-quality-assurance.md create mode 100644 scratch/12-domain-configuration.md create mode 100644 scratch/13-strategic-integrations.md create mode 100644 scratch/TASK_INDEX.md diff --git a/scratch/01-embedding-model-stack.md b/scratch/01-embedding-model-stack.md new file mode 100644 index 0000000..b8a8074 --- /dev/null +++ b/scratch/01-embedding-model-stack.md @@ -0,0 +1,187 @@ +# Topic: Embedding Model Stack + +## Summary +Configuration, initialization, and integration of the Qwen3-VL multimodal embedding models and reranker for the RAG v3.0 system. 
+ +--- + +## Primary Model: Qwen3-VL-Embedding-8B + +> **Source:** [opus-prd1-v3.md](../opus-prd1-v3.md), [opus-prd2-v3.md](../opus-prd2-v3.md), [opus-prd3-v3.md](../opus-prd3-v3.md), [docs/UNIFIED_PRD.md](../docs/UNIFIED_PRD.md) + +- **Model ID:** `Qwen/Qwen3-VL-Embedding-8B` +- **Released:** January 7-8, 2026 (arXiv:2601.04720) +- **Parameters:** 8.14B +- **Layers:** 36 +- **Architecture:** Dual-Tower (qwen3_vl) +- **Context Length:** 32,768 tokens (default 8,192) +- **Native Embedding Dimensions:** 4096 +- **MRL Support:** Yes — options: [256, 512, 1024, 2048, 4096] + - Storage dimension: 1024 (truncated for MemVid efficiency) + - Retrieval dimension: 2048 (higher precision for queries) +- **Quantization:** bf16 (recommended), fp16, int8, int4 +- **Instruction-Aware:** Yes + +### Benchmarks +| Benchmark | Score | +|-----------|-------| +| MMEB-V2 | 77.8 (Rank #1) | +| MMTEB | 67.88 | +| Image Retrieval | 80.0 | +| Video Retrieval | 67.1 | +| VisDoc Retrieval | 82.4 | + +### Supported Input Modalities +- Pure text +- Pure image +- Pure video +- Text + image (mixed) +- Text + video (mixed) +- Image + video (mixed) +- Text + image + video (mixed) +- Screenshots (treated as images with OCR awareness) + +### Vision Configuration +> **Source:** [opus-prd2-v3.md](../opus-prd2-v3.md) + +- `min_pixels`: 4096 +- `max_pixels`: 1,843,200 (1280×1440) +- `total_video_pixels`: 7,864,320 +- `default_fps`: 1.0 +- `default_frames`: 64 +- `max_frames`: 64 + +### Inference Configuration +> **Source:** [opus-prd2-v3.md](../opus-prd2-v3.md) + +- `torch_dtype`: bfloat16 +- `attn_implementation`: flash_attention_2 +- `device_map`: auto + +### Architecture Details +> **Source:** [opus-prd3-v3.md](../opus-prd3-v3.md) + +- Extracts `[EOS]` token hidden state from last layer as final representation +- Cross-modal pretraining with unified modality projection +- Integrates supervised tasks, masked modeling, and multimodal alignment objectives +- Enables efficient independent encoding for 
large-scale retrieval + +--- + +## Boundary Detection Model: Qwen3-Embedding-0.6B + +> **Source:** [opus-prd2-v3.md](../opus-prd2-v3.md), [docs/UNIFIED_PRD.md](../docs/UNIFIED_PRD.md) + +- **Model ID:** `Qwen/Qwen3-Embedding-0.6B` +- **Type:** Text-only +- **Parameters:** 595.8M +- **Native Dimensions:** 1024 +- **Purpose:** Cheap/fast similarity detection for semantic chunking boundary detection + +--- + +## Reranker: Qwen3-VL-Reranker-8B + +> **Source:** [opus-prd1-v3.md](../opus-prd1-v3.md), [opus-prd2-v3.md](../opus-prd2-v3.md) + +- **Model ID:** `Qwen/Qwen3-VL-Reranker-8B` +- **Parameters:** 8.14B +- **Layers:** 36 +- **Architecture:** Single-Tower with Cross-Attention +- **Input:** (Query, Document) pairs — both can be mixed-modal +- **Output:** Relevance score (via yes/no token generation probability) +- **Supported Modalities:** text, image, video, mixed +- **Inference:** bfloat16, flash_attention_2 + +### Smaller Variant: Qwen3-VL-Reranker-2B +- **Parameters:** 2.13B +- Same architecture (Single-Tower) + +--- + +## Fallback Model: Qwen3-Embedding-8B (Text-Only) + +> **Source:** [opus-prd2-v3.md](../opus-prd2-v3.md) + +- **Model ID:** `Qwen/Qwen3-Embedding-8B` +- **Type:** Text-only +- **Parameters:** 7.57B +- **Native Dimensions:** 4096 +- **MTEB Score:** 70.58 (Rank #1) +- **Note:** Higher MTEB score than VL model (70.58 vs 67.88) but lacks multimodal capabilities + +--- + +## Model Initialization Code + +> **Source:** [chatgpt5.2-prd.md](../chatgpt5.2-prd.md) (Phase 1), [opus-prd1-v3.md](../opus-prd1-v3.md) + +```python +import torch +from src.models.qwen3_vl_embedding import Qwen3VLEmbedder +from src.models.qwen3_vl_reranker import Qwen3VLReranker + +# Primary Embedding Model +embedder = Qwen3VLEmbedder( + model_name_or_path="Qwen/Qwen3-VL-Embedding-8B", + max_length=8192, + min_pixels=4096, + max_pixels=1843200, + total_pixels=7864320, + fps=1.0, + num_frames=64, + max_frames=64, + torch_dtype=torch.bfloat16, + attn_implementation="flash_attention_2" 
+) + +# Precision Reranker +reranker = Qwen3VLReranker( + model_name_or_path="Qwen/Qwen3-VL-Reranker-8B", + torch_dtype=torch.bfloat16, + attn_implementation="flash_attention_2" +) +``` + +--- + +## Alternative Models Considered + +> **Source:** [chatgpt5.2-prd.md](../chatgpt5.2-prd.md) + +- **Gemini Text-Embedding-001:** Upcoming model (replacing text-embedding-004), expected January 16, 2026. Considered as alternative/complement. +- **Qwen3-VL-Embedding-2B:** Lightweight variant (2.13B params, 2048 dims, MMEB-V2: 73.2) + +--- + +## Cost Analysis + +> **Source:** [docs/UNIFIED_PRD.md](../docs/UNIFIED_PRD.md), [chatgpt5.2-prd.md](../chatgpt5.2-prd.md) + +| Component | Model | Cost | +|-----------|-------|------| +| Embedding | Qwen3-VL-Embedding-8B | ~$0.03/1M tokens* | +| Reranking | Qwen3-VL-Reranker-8B | ~$0.05/1M tokens* | +| Ingestion (35MB) | One-time | ~$0.10 | +| Queries (10K/day, annual) | - | ~$5.00 | + +*Estimated; the models are not yet on OpenRouter, so this requires self-hosting or waiting for API availability. + +--- + +## Implementation Requirements + +1. Set up Qwen3-VL-Embedding-8B environment with flash_attention_2 +2. Implement model wrapper classes (`Qwen3VLEmbedder`, `Qwen3VLReranker`) +3. Support MRL dimension truncation for storage vs retrieval +4. Implement multimodal input preprocessing (text, image, video, mixed) +5. Add fallback to text-only model on multimodal failure +6. Integrate with OpenRouter for remote inference + +--- + +## Conflicts / Ambiguities + +- **⚠️ Dimension mismatch:** chatgpt5.2-prd.md mentions "1526 or 3746 or 3182" as possible embedding sizes; these don't match the actual Qwen3-VL dimensions (4096 native, MRL options: 256/512/1024/2048/4096). The opus PRDs provide the correct values. +- **⚠️ Gemini alternative:** chatgpt5.2-prd.md suggests potentially using both Qwen and Gemini embeddings. No other document addresses dual-embedding strategy. 
+- **⚠️ Hosting:** chatgpt5.2-prd.md assumes OpenRouter availability; cost estimates are speculative since the model may require self-hosting. diff --git a/scratch/02-chunking-strategies.md b/scratch/02-chunking-strategies.md new file mode 100644 index 0000000..e90692c --- /dev/null +++ b/scratch/02-chunking-strategies.md @@ -0,0 +1,202 @@ +# Topic: Chunking Strategies + +## Summary +Seven distinct chunking methods for processing different content types (text, code, mixed-modal) into the RAG system. Includes configuration, routing logic, and the four-layer epistemic scaffolding model. + +--- + +## Conceptual Framework: Four-Layer Epistemic Scaffolding + +> **Source:** [opus-prd3-v3.md](../opus-prd3-v3.md), [docs/UNIFIED_PRD.md](../docs/UNIFIED_PRD.md) + +Chunking has evolved into a four-layer system: +1. **Fixed-length chunking** — mechanical, deterministic +2. **Sentence/semantic-unit chunking** — linguistic awareness +3. **Semantic coherence chunking (agentic)** — meaning-aware boundaries +4. 
**Recursive hierarchical chunking (agentic)** — document-structure-aware + +--- + +## Method 1: Fixed-Size Chunking + +> **Source:** [opus-prd2-v3.md](../opus-prd2-v3.md), [chatgpt5.2-prd.md](../chatgpt5.2-prd.md), [docs/UNIFIED_PRD.md](../docs/UNIFIED_PRD.md) + +- **Window tokens:** 512 +- **Overlap tokens:** 50 +- **Applies to:** configuration files, data files +- **Modalities:** text only +- **Agent required:** No — can be done programmatically + +### Implementation Notes +> **Source:** [chatgpt5.2-prd.md](../chatgpt5.2-prd.md) + +- Length-based chunking can be done programmatically without an LLM agent +- Simplest method, serves as fallback for AST chunking failures + +--- + +## Method 2: Sentence-Based Chunking + +> **Source:** [docs/UNIFIED_PRD.md](../docs/UNIFIED_PRD.md), [chatgpt5.2-prd.md](../chatgpt5.2-prd.md) + +- **Window size:** 3 sentences +- **Min chunk tokens:** 128 +- **Max chunk tokens:** 2048 +- **Agent required:** No — can be done programmatically + +--- + +## Method 3: Semantic Chunking (Agentic) + +> **Source:** [opus-prd2-v3.md](../opus-prd2-v3.md), [docs/UNIFIED_PRD.md](../docs/UNIFIED_PRD.md), [chatgpt5.2-prd.md](../chatgpt5.2-prd.md) + +- **Similarity threshold:** 0.75 +- **Window size:** 3 sentences +- **Boundary detection model:** `Qwen/Qwen3-Embedding-0.6B` +- **Min chunk tokens:** 128 +- **Max chunk tokens:** 2048 +- **Applies to:** documentation, research papers +- **Modalities:** text only +- **Agent required:** Yes — requires intelligence for boundary detection + +### How It Works +> **Source:** [opus-prd3-v3.md](../opus-prd3-v3.md) + +Uses embedding similarity between adjacent sentence windows to detect topic shifts. When similarity drops below threshold (0.75), a chunk boundary is placed. The lightweight 0.6B model handles boundary detection cheaply. 
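
The threshold rule above can be sketched in a few lines of Python. This is a minimal illustration, not the project's implementation: `semantic_chunks`, `cosine_similarity`, and the `embed` callable are hypothetical names, the windows are assumed to be non-overlapping groups of `window_size` sentences, and the 128/2048 token limits are not enforced here. In the real pipeline `embed` would wrap `Qwen/Qwen3-Embedding-0.6B`; any function returning a vector works for the sketch.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_chunks(
    sentences: List[str],
    embed: Callable[[str], Vector],  # hypothetical: wraps the 0.6B model
    window_size: int = 3,
    threshold: float = 0.75,
) -> List[str]:
    """Split sentences into chunks where adjacent windows diverge in meaning."""
    if not sentences:
        return []
    # Assumption: windows are non-overlapping groups of `window_size` sentences.
    windows = [
        sentences[i:i + window_size]
        for i in range(0, len(sentences), window_size)
    ]
    vectors = [embed(" ".join(window)) for window in windows]

    chunks = [list(windows[0])]
    for prev_vec, cur_vec, window in zip(vectors, vectors[1:], windows[1:]):
        if cosine_similarity(prev_vec, cur_vec) < threshold:
            # Similarity dropped below the threshold: start a new chunk.
            chunks.append(list(window))
        else:
            # Same topic: extend the current chunk.
            chunks[-1].extend(window)
    return [" ".join(chunk) for chunk in chunks]
```

Calling `semantic_chunks(sentences, embed)` with the defaults mirrors the configured values (3-sentence windows, 0.75 threshold). A production version would additionally merge chunks below the minimum token count and split those above the maximum.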
+ +--- + +## Method 4: Recursive Hierarchical Chunking (Agentic) + +> **Source:** [opus-prd2-v3.md](../opus-prd2-v3.md), [docs/UNIFIED_PRD.md](../docs/UNIFIED_PRD.md), [chatgpt5.2-prd.md](../chatgpt5.2-prd.md) + +- **Chunk size tokens:** 1024 +- **Overlap tokens:** 100 +- **Separators** (in priority order): + 1. `"\n\n"` — Paragraphs + 2. `"\n"` — Lines + 3. `". "` — Sentences + 4. `" "` — Words (last resort) +- **Applies to:** documentation, conversation +- **Modalities:** text only +- **Agent required:** Yes — requires understanding of document structure + +--- + +## Method 5: AST Structural Chunking (Code) + +> **Source:** [opus-prd2-v3.md](../opus-prd2-v3.md), [docs/UNIFIED_PRD.md](../docs/UNIFIED_PRD.md) + +### Supported Languages & Parsers + +| Language | Parser | AST Nodes | +|----------|--------|-----------| +| Python | tree-sitter-python | function_definition, class_definition, decorated_definition | +| TypeScript | tree-sitter-typescript | function_declaration, class_declaration, method_definition, interface_declaration | +| JavaScript | tree-sitter-javascript | function_declaration, class_declaration, method_definition | +| Go | tree-sitter-go | function_declaration, method_declaration, type_declaration | +| Rust | tree-sitter-rust | function_item, impl_item, struct_item, trait_item | +| Java | tree-sitter-java | method_declaration, class_declaration, constructor_declaration, interface_declaration | + +### Configuration +- `prepend_parent_context`: true +- `preserve_docstrings`: true +- `preserve_imports`: true +- `extract_dependencies`: true +- `compute_complexity`: true +- `fallback_to_fixed`: true (falls back to fixed-size 512 tokens on parse failure) + +### Applies to +- Content types: code +- Modalities: text + +--- + +## Method 6: Multimodal Boundary Detection (NEW) + +> **Source:** [opus-prd2-v3.md](../opus-prd2-v3.md), [docs/UNIFIED_PRD.md](../docs/UNIFIED_PRD.md) + +- **Visual context window:** 1 paragraph before/after +- **Caption detection:** 
true +- **Figure reference detection:** true +- **Preserve figure-caption pairs:** true +- **Applies to:** documentation, research papers +- **Modalities:** mixed_text_image, mixed_all + +### Purpose +Detects boundaries between text and visual content in mixed documents. Ensures figures, diagrams, and their captions are kept together as coherent chunks. + +--- + +## Method 7: Screenshot-Code Fusion (NEW) + +> **Source:** [opus-prd2-v3.md](../opus-prd2-v3.md), [docs/UNIFIED_PRD.md](../docs/UNIFIED_PRD.md) + +- **Matching strategies:** + - `filename_similarity` — match screenshots to code files by name + - `ocr_text_matching` — extract text from screenshots, match to code + - `reference_comment_detection` — find code comments referencing screenshots +- **Applies to:** code +- **Modalities:** mixed_text_image + +### Purpose +Fuses UI screenshots with the code that generates them, creating cross-modal chunks that link visual output to source code. + +--- + +## Content Type → Chunking Method Routing + +> **Source:** [opus-prd2-v3.md](../opus-prd2-v3.md) (domains section) + +| Domain | Chunking Methods | +|--------|-----------------| +| Prompts | semantic, fixed_size | +| Codebase | ast_structural, fixed_size, screenshot_code_fusion | +| Research | recursive_hierarchical, semantic, multimodal_boundary | + +--- + +## Asynchronous / Multi-Agent Chunking + +> **Source:** [chatgpt5.2-prd.md](../chatgpt5.2-prd.md) + +### Agent Assignment by Method +- **Fixed-size & Sentence-based:** Programmatic (no LLM needed) +- **Semantic chunking:** Requires LLM intelligence — can use Haiku/Flash-class model +- **Recursive hierarchical:** Requires higher intelligence — Sonnet/Pro-class model recommended + +### Key Questions from Requirements +- Can semantic and recursive hierarchical chunking be done in a single pass by one agent, or do they require separate passes? 
+- The user suggests asynchronous processing across files is ideal given the multi-file corpus + +### Model Recommendations for Agentic Chunking +> **Source:** [chatgpt5.2-prd.md](../chatgpt5.2-prd.md) + +- Sonnet/Gemini Pro class: For recursive hierarchical chunking +- Haiku/Gemini Flash class: For semantic chunking +- Free models (e.g., MIMO V2 via OpenRouter): For simpler tasks + +--- + +## Quad Encoding (MemVid-Specific Chunking) + +> **Source:** [gemini-prd.md](../gemini-prd.md) (Appendix I) + +MemVid uses "Quad Encoding" — encoding the same content at four resolutions: + +| Resolution | What it Encodes | Agent Query Type | +|-----------|----------------|-----------------| +| Word (Token) | Keywords & Entities | Exact definitions, variable names | +| Sentence | Discrete Facts | Return types, specific error codes | +| Paragraph | Local Context | How a flow handles edge cases | +| Boundary | Relationships & Flow | What connects between sections | + +This is done during sleep-time compute (not real-time) due to 4x embedding cost. + +--- + +## Conflicts / Ambiguities + +- **⚠️ Chunk size inconsistency:** chatgpt5.2-prd.md mentions "1.5-3K tokens" for chunks; opus-prd2-v3.md specifies 512 tokens (fixed), 1024 tokens (recursive), 128-2048 tokens (semantic). The AGGREGATION_PLAN.md lists yet another set: "1.5-3k tokens with 200-400 token overlap" for fixed-size. The opus-prd2 YAML config should be treated as authoritative. +- **⚠️ Number of methods:** UNIFIED_PRD.md lists 7 methods; chatgpt5.2-prd.md discusses 4 core methods; opus-prd2-v3.md defines 6 in YAML config. The 7-method list (adding sentence-based as distinct from semantic) is the most complete. +- **⚠️ Agentic vs programmatic:** chatgpt5.2-prd.md suggests semantic and recursive hierarchical need LLM agents; opus-prd2-v3.md treats semantic chunking as algorithmic (embedding similarity threshold). Resolution: semantic chunking uses the lightweight 0.6B model algorithmically, not a full LLM agent. 
\ No newline at end of file diff --git a/scratch/03-metadata-schema.md b/scratch/03-metadata-schema.md new file mode 100644 index 0000000..6d20688 --- /dev/null +++ b/scratch/03-metadata-schema.md @@ -0,0 +1,285 @@ +# Topic: Metadata Schema (12 Dimensions) + +## Summary +The 12-dimensional chunk metadata schema for the RAG v3.0 system, including TypeScript interfaces and YAML configuration for enabling/disabling dimensions. + +--- + +## Schema Overview + +> **Source:** [opus-prd1-v3.md](../opus-prd1-v3.md) (Phase 2), [docs/SCHEMA_REFERENCE.md](../docs/SCHEMA_REFERENCE.md), [docs/UNIFIED_PRD.md](../docs/UNIFIED_PRD.md) + +The metadata schema has 12 dimensions, each capturing a different aspect of chunk information: + +``` +1. IDENTITY — Unique identification and versioning +2. PROVENANCE — Complete audit trail +3. CONTENT — What the chunk contains +4. STRUCTURE — How the chunk was created +5. HIERARCHY — Document structure preservation +6. SEMANTIC — Extracted meaning and classification +7. CODE_SPECIFIC — Code-only metadata (AST, complexity, imports) +8. MULTIMODAL — Cross-modal relationships +9. EMBEDDING — Vector representation metadata +10. GRAPH — Knowledge graph relationships +11. QUALITY — Quality metrics and validation +12. 
RETRIEVAL — Retrieval analytics and feedback +``` + +--- + +## Dimension 1: IDENTITY + +> **Source:** [docs/SCHEMA_REFERENCE.md](../docs/SCHEMA_REFERENCE.md), [opus-prd1-v3.md](../opus-prd1-v3.md) + +| Field | Type | Description | +|-------|------|-------------| +| chunk_id | string | UUID v7 (time-sortable) | +| content_hash | string | SHA-256 of raw content (deduplication) | +| version | number | Incremental version for updates | +| parent_chunk_id | string/null | If this is a sub-chunk | +| root_document_id | string | Original document this came from | +| corpus_id | string | Which corpus/domain (prompts/code/research) | + +**Config:** `generate_uuid_v7: true`, `compute_content_hash: true` + +--- + +## Dimension 2: PROVENANCE + +> **Source:** [docs/SCHEMA_REFERENCE.md](../docs/SCHEMA_REFERENCE.md), [opus-prd1-v3.md](../opus-prd1-v3.md) + +| Field | Type | Description | +|-------|------|-------------| +| source_uri | string | file://path or https://url | +| source_type | enum | local_file, git_repo, web_url, api, user_upload | +| git_metadata | object | repository, commit_sha, branch, timestamp, author, file_path | +| author | object | name, email, organization | +| license | string | SPDX identifier | +| created_at | string | ISO 8601 | +| modified_at | string | ISO 8601 | +| ingested_at | string | ISO 8601 | +| ingestion_pipeline_version | string | e.g., "3.0.0" | + +**Config:** `git_integration: true`, `track_authors: true`, `track_license: true` + +--- + +## Dimension 3: CONTENT + +> **Source:** [docs/SCHEMA_REFERENCE.md](../docs/SCHEMA_REFERENCE.md) + +| Field | Type | Description | +|-------|------|-------------| +| content_type | enum | code, documentation, research_paper, prompt, configuration, data, conversation, mixed | +| modalities | Modality[] | text, image, video, audio, screenshot, diagram | +| primary_modality | Modality | Dominant modality | +| language.natural | string | ISO 639-1 (e.g., 'en') | +| language.programming | string | e.g., 'python', 
'typescript' | +| mime_type | string | e.g., 'text/markdown' | +| byte_size | number | Size in bytes | +| encoding | string | e.g., 'utf-8' | + +**Config:** `detect_language: true`, `detect_modalities: true` + +--- + +## Dimension 4: STRUCTURE + +> **Source:** [docs/SCHEMA_REFERENCE.md](../docs/SCHEMA_REFERENCE.md) + +| Field | Type | Description | +|-------|------|-------------| +| chunking_method | enum | fixed_size, sentence_based, semantic, recursive_hierarchical, ast_structural, multimodal_boundary, manual | +| chunking_config | object | target_tokens, overlap_tokens, similarity_threshold, separators | +| token_count | number | Token count | +| char_count | number | Character count | +| word_count | number | Word count | +| line_count | number | Line count | +| overlap.previous_chunk_id | string | Previous chunk reference | +| overlap.previous_overlap_tokens | number | Overlap with previous | +| overlap.next_chunk_id | string | Next chunk reference | +| overlap.next_overlap_tokens | number | Overlap with next | +| boundaries.start_offset | number | Byte offset in source | +| boundaries.end_offset | number | End byte offset | +| boundaries.start_line | number | Start line number | +| boundaries.end_line | number | End line number | + +**Config:** `track_overlaps: true`, `track_boundaries: true` + +--- + +## Dimension 5: HIERARCHY + +> **Source:** [docs/SCHEMA_REFERENCE.md](../docs/SCHEMA_REFERENCE.md) + +| Field | Type | Description | +|-------|------|-------------| +| depth_level | number | 0=root, 1=section, 2=subsection... 
| +| section_path | string[] | e.g., ["Chapter 1", "Introduction", "Background"] | +| heading_text | string | Current section heading | +| parent_heading | string | Parent section heading | +| document_position | object | section_index, chunk_index_in_section, total_chunks_in_section, global_chunk_index, total_document_chunks | +| sibling_chunk_ids | string[] | Other chunks at same level | +| child_chunk_ids | string[] | Sub-chunks if hierarchical | + +**Config:** `max_depth: 10`, `track_siblings: true` + +--- + +## Dimension 6: SEMANTIC + +> **Source:** [docs/SCHEMA_REFERENCE.md](../docs/SCHEMA_REFERENCE.md) + +| Field | Type | Description | +|-------|------|-------------| +| topic_cluster_id | string | Cluster assignment from topic modeling | +| topic_keywords | string[] | Top keywords for this topic | +| topic_confidence | number | Confidence score | +| entities | NamedEntity[] | Extracted named entities with type, confidence, offsets | +| keywords | object[] | term, tfidf_score, is_technical | +| summary | string | Auto-generated 1-2 sentence summary | +| intent_classification | object | primary_intent (explanation/tutorial/reference), confidence | +| sentiment | object | polarity (-1 to 1), subjectivity (0 to 1) | +| reading_level | string | technical, beginner, expert | + +**Entity Types:** PERSON, ORG, PRODUCT, TECH, CONCEPT, LOCATION, DATE, CODE_ELEMENT + +**Config:** `extract_entities: true`, `extract_keywords: true`, `generate_summaries: true`, `classify_intent: true` + +--- + +## Dimension 7: CODE_SPECIFIC + +> **Source:** [docs/SCHEMA_REFERENCE.md](../docs/SCHEMA_REFERENCE.md) + +| Field | Type | Description | +|-------|------|-------------| +| ast_node_type | enum | module, class_definition, function_definition, method_definition, etc. 
| +| parent_scope | string | e.g., "ClassName.method_name" | +| fully_qualified_name | string | e.g., "module.ClassName.method_name" | +| signature | string | Function/method signature | +| return_type | string | Return type | +| parameters | object[] | name, type, default_value | +| imports | object[] | module, items, is_relative | +| exports | string[] | Exported symbols | +| docstring | object | summary, params, returns, raises, examples | +| complexity | object | cyclomatic, cognitive, lines_of_code, lines_of_comments | +| dependencies | object | internal (same codebase), external (packages) | +| test_coverage | object | covered, test_file, coverage_percentage | + +**Config:** `extract_docstrings: true`, `compute_complexity: true`, `track_dependencies: true`, `track_test_coverage: false` + +--- + +## Dimension 8: MULTIMODAL + +> **Source:** [docs/SCHEMA_REFERENCE.md](../docs/SCHEMA_REFERENCE.md) + +| Field | Type | Description | +|-------|------|-------------| +| visual_elements | VisualElement[] | figure, table, diagram, screenshot, equation, chart | +| referenced_images | string[] | Image chunk IDs referenced | +| referenced_code_blocks | string[] | Code chunk IDs referenced | +| referenced_videos | string[] | Video chunk IDs referenced | +| cross_modal_links | CrossModalLink[] | Links between modalities | +| diagram_analysis | object | diagram_type, extracted_nodes, extracted_relationships | +| ocr_extraction | object | full_text, confidence, language_detected | + +**CrossModalLink relationship types:** references, illustrates, implements, documents, derives_from, related_to + +**Config:** `extract_visual_elements: true`, `run_ocr: true`, `detect_diagram_types: true`, `build_cross_modal_links: true` + +--- + +## Dimension 9: EMBEDDING + +> **Source:** [docs/SCHEMA_REFERENCE.md](../docs/SCHEMA_REFERENCE.md) + +| Field | Type | Description | +|-------|------|-------------| +| model_id | string | e.g., "qwen/qwen3-vl-embedding-8b" | +| model_version | string | 
Model version | +| native_dimensions | number | Original output dims (e.g., 4096) | +| stored_dimensions | number | After MRL truncation (e.g., 1024) | +| mrl_truncated | boolean | Whether MRL was applied | +| quantization | string | bf16, fp16, int8, int4 | +| instruction_used | string | The instruction prefix used | +| embedding_hash | string | Hash of the embedding vector | +| embedded_at | string | ISO 8601 timestamp | + +**Config:** `track_model_version: true`, `compute_embedding_hash: true` + +--- + +## Dimension 10: GRAPH + +> **Source:** [docs/SCHEMA_REFERENCE.md](../docs/SCHEMA_REFERENCE.md) + +| Field | Type | Description | +|-------|------|-------------| +| incoming_refs | object[] | source_chunk_id, relationship_type, weight | +| outgoing_refs | object[] | target_chunk_id, relationship_type, weight | +| semantic_neighbors | object[] | chunk_id, similarity_score, model_id | +| coreference_chain | string | Coreference chain ID | +| dependency_graph | object | upstream_ids, downstream_ids | + +**Config:** `compute_semantic_neighbors: true`, `neighbor_top_k: 10`, `track_coreferences: false` (expensive, optional) + +--- + +## Dimension 11: QUALITY + +> **Source:** [docs/SCHEMA_REFERENCE.md](../docs/SCHEMA_REFERENCE.md) + +| Field | Type | Description | +|-------|------|-------------| +| confidence_score | number | Overall confidence (0-1) | +| validation_status | enum | valid, warning, error, pending | +| error_flags | string[] | List of detected issues | +| review_status | enum | auto_approved, needs_review, reviewed, rejected | +| chunking_quality | object | coherence_score, completeness_score, boundary_quality | + +**Config:** `validate_chunks: true`, `compute_coherence: true` + +--- + +## Dimension 12: RETRIEVAL + +> **Source:** [docs/SCHEMA_REFERENCE.md](../docs/SCHEMA_REFERENCE.md) + +| Field | Type | Description | +|-------|------|-------------| +| access_count | number | Times retrieved | +| retrieval_success_rate | number | How often selected after 
retrieval | +| user_feedback_score | number | Aggregated user feedback | +| freshness_decay | number | Time-based relevance decay | +| last_accessed_at | string | ISO 8601 | + +**Config:** `track_access: true`, `track_feedback: true`, `compute_freshness_decay: true` + +--- + +## YAML Configuration Reference + +> **Source:** [opus-prd2-v3.md](../opus-prd2-v3.md), [docs/SCHEMA_REFERENCE.md](../docs/SCHEMA_REFERENCE.md) + +All 12 dimensions can be individually enabled/disabled via YAML config under `metadata.dimensions.<dimension>.enabled`. + +--- + +## Implementation Requirements + +1. Define TypeScript interfaces for all 12 dimensions (reference code in SCHEMA_REFERENCE.md) +2. Implement Pydantic models (Python) matching the TypeScript interfaces +3. Build metadata extraction pipeline for each dimension +4. Create configuration loader for enabling/disabling dimensions +5. Implement content_hash computation (SHA-256) +6. Implement UUID v7 generation for chunk_id + +--- + +## Conflicts / Ambiguities + +- **⚠️ Schema completeness:** The UNIFIED_PRD.md schema overview shows fewer fields per dimension than the full TypeScript interfaces in SCHEMA_REFERENCE.md. The TypeScript interfaces are the authoritative source. +- **⚠️ Hierarchy fields:** The overview diagram shows `sibling_ids[]` but the TypeScript interface uses `sibling_chunk_ids` and adds `child_chunk_ids`. Use the TypeScript interface names. \ No newline at end of file diff --git a/scratch/04-database-schema.md b/scratch/04-database-schema.md new file mode 100644 index 0000000..f8a2b17 --- /dev/null +++ b/scratch/04-database-schema.md @@ -0,0 +1,253 @@ +# Topic: Database Schema (SQLite) + +## Summary +SQLite database schema optimized for MemVid video-encoded storage, including all tables, relationships, and indexes for the RAG v3.0 system. 
+ +--- + +## Tables Overview + +> **Source:** [docs/SCHEMA_REFERENCE.md](../docs/SCHEMA_REFERENCE.md) + +| Table | Purpose | +|-------|---------| +| chunks | Core chunk storage with essential fields | +| embeddings | Multiple embedding versions per chunk, MemVid integration | +| chunk_relationships | Normalized graph relationships | +| semantic_neighbors | Precomputed similar-chunk retrieval | +| cross_modal_links | Multimodal retrieval support | +| entities | Entity definitions | +| chunk_entities | Entity-to-chunk mapping with offsets | +| retrieval_events | Analytics tracking | +| memvid_indices | MemVid video-encoded storage mapping | + +--- + +## Table: chunks + +> **Source:** [docs/SCHEMA_REFERENCE.md](../docs/SCHEMA_REFERENCE.md) + +```sql +CREATE TABLE chunks ( + chunk_id TEXT PRIMARY KEY, + content_hash TEXT NOT NULL, + version INTEGER DEFAULT 1, + corpus_id TEXT NOT NULL, + root_document_id TEXT NOT NULL, + raw_content TEXT NOT NULL, + content_type TEXT NOT NULL, + modalities TEXT NOT NULL, -- JSON array + primary_modality TEXT NOT NULL, + token_count INTEGER NOT NULL, + chunking_method TEXT NOT NULL, + parent_chunk_id TEXT, + depth_level INTEGER DEFAULT 0, + created_at TEXT NOT NULL, + modified_at TEXT NOT NULL, + ingested_at TEXT NOT NULL, + metadata_json TEXT NOT NULL, -- Complete ChunkMetadata object + FOREIGN KEY (parent_chunk_id) REFERENCES chunks(chunk_id) +); +``` + +**Design notes:** +- `metadata_json` stores the complete 12-dimension metadata as JSON for flexibility +- Core fields are denormalized for fast queries without JSON parsing +- `modalities` stored as JSON array string + +--- + +## Table: embeddings + +> **Source:** [docs/SCHEMA_REFERENCE.md](../docs/SCHEMA_REFERENCE.md) + +```sql +CREATE TABLE embeddings ( + embedding_id TEXT PRIMARY KEY, + chunk_id TEXT NOT NULL, + model_id TEXT NOT NULL, + dimensions INTEGER NOT NULL, + mrl_truncated INTEGER DEFAULT 0, + quantization TEXT, + vector BLOB, -- Or reference to MemVid frame + 
memvid_frame_index INTEGER, -- If stored in MemVid + memvid_file TEXT, -- Which .mp4 file + instruction_used TEXT, + embedded_at TEXT NOT NULL, + embedding_hash TEXT NOT NULL, + FOREIGN KEY (chunk_id) REFERENCES chunks(chunk_id) +); +``` + +**Design notes:** +- Supports both direct BLOB storage and MemVid frame references +- Multiple embeddings per chunk (different models, dimensions) + +--- + +## Table: chunk_relationships + +> **Source:** [docs/SCHEMA_REFERENCE.md](../docs/SCHEMA_REFERENCE.md) + +```sql +CREATE TABLE chunk_relationships ( + relationship_id TEXT PRIMARY KEY, + source_chunk_id TEXT NOT NULL, + target_chunk_id TEXT NOT NULL, + relationship_type TEXT NOT NULL, + weight REAL DEFAULT 1.0, + evidence TEXT, + created_at TEXT NOT NULL, + FOREIGN KEY (source_chunk_id) REFERENCES chunks(chunk_id), + FOREIGN KEY (target_chunk_id) REFERENCES chunks(chunk_id) +); +``` + +--- + +## Table: semantic_neighbors + +> **Source:** [docs/SCHEMA_REFERENCE.md](../docs/SCHEMA_REFERENCE.md) + +```sql +CREATE TABLE semantic_neighbors ( + chunk_id TEXT NOT NULL, + neighbor_chunk_id TEXT NOT NULL, + similarity_score REAL NOT NULL, + computed_at TEXT NOT NULL, + model_id TEXT NOT NULL, + PRIMARY KEY (chunk_id, neighbor_chunk_id, model_id), + FOREIGN KEY (chunk_id) REFERENCES chunks(chunk_id), + FOREIGN KEY (neighbor_chunk_id) REFERENCES chunks(chunk_id) +); +``` + +--- + +## Table: cross_modal_links + +> **Source:** [docs/SCHEMA_REFERENCE.md](../docs/SCHEMA_REFERENCE.md) + +```sql +CREATE TABLE cross_modal_links ( + link_id TEXT PRIMARY KEY, + source_chunk_id TEXT NOT NULL, + target_chunk_id TEXT NOT NULL, + source_modality TEXT NOT NULL, + target_modality TEXT NOT NULL, + relationship_type TEXT NOT NULL, + confidence REAL NOT NULL, + anchor_text TEXT, + FOREIGN KEY (source_chunk_id) REFERENCES chunks(chunk_id), + FOREIGN KEY (target_chunk_id) REFERENCES chunks(chunk_id) +); +``` + +--- + +## Tables: entities & chunk_entities + +> **Source:** 
[docs/SCHEMA_REFERENCE.md](../docs/SCHEMA_REFERENCE.md) + +```sql +CREATE TABLE entities ( + entity_id TEXT PRIMARY KEY, + entity_text TEXT NOT NULL, + entity_type TEXT NOT NULL, + canonical_name TEXT, + knowledge_base_id TEXT +); + +CREATE TABLE chunk_entities ( + chunk_id TEXT NOT NULL, + entity_id TEXT NOT NULL, + mention_text TEXT NOT NULL, + start_offset INTEGER NOT NULL, + end_offset INTEGER NOT NULL, + confidence REAL NOT NULL, + PRIMARY KEY (chunk_id, entity_id, start_offset), + FOREIGN KEY (chunk_id) REFERENCES chunks(chunk_id), + FOREIGN KEY (entity_id) REFERENCES entities(entity_id) +); +``` + +--- + +## Table: retrieval_events + +> **Source:** [docs/SCHEMA_REFERENCE.md](../docs/SCHEMA_REFERENCE.md) + +```sql +CREATE TABLE retrieval_events ( + event_id TEXT PRIMARY KEY, + chunk_id TEXT NOT NULL, + query_text TEXT, + query_embedding_hash TEXT, + retrieval_rank INTEGER, + rerank_score REAL, + was_selected INTEGER, + user_feedback INTEGER, -- -1, 0, 1 + timestamp TEXT NOT NULL, + FOREIGN KEY (chunk_id) REFERENCES chunks(chunk_id) +); +``` + +--- + +## Table: memvid_indices + +> **Source:** [docs/SCHEMA_REFERENCE.md](../docs/SCHEMA_REFERENCE.md) + +```sql +CREATE TABLE memvid_indices ( + memvid_file TEXT NOT NULL, + frame_index INTEGER NOT NULL, + chunk_id TEXT NOT NULL, + embedding_id TEXT NOT NULL, + corpus_id TEXT NOT NULL, + PRIMARY KEY (memvid_file, frame_index), + FOREIGN KEY (chunk_id) REFERENCES chunks(chunk_id), + FOREIGN KEY (embedding_id) REFERENCES embeddings(embedding_id) +); +``` + +--- + +## Indexes + +> **Source:** [docs/SCHEMA_REFERENCE.md](../docs/SCHEMA_REFERENCE.md) + +```sql +CREATE INDEX idx_chunks_corpus ON chunks(corpus_id); +CREATE INDEX idx_chunks_content_type ON chunks(content_type); +CREATE INDEX idx_chunks_document ON chunks(root_document_id); +CREATE INDEX idx_chunks_parent ON chunks(parent_chunk_id); +CREATE INDEX idx_embeddings_chunk ON embeddings(chunk_id); +CREATE INDEX idx_embeddings_model ON embeddings(model_id); +CREATE 
INDEX idx_relationships_source ON chunk_relationships(source_chunk_id); +CREATE INDEX idx_relationships_target ON chunk_relationships(target_chunk_id); +CREATE INDEX idx_relationships_type ON chunk_relationships(relationship_type); +CREATE INDEX idx_neighbors_similarity ON semantic_neighbors(similarity_score DESC); +CREATE INDEX idx_cross_modal_source ON cross_modal_links(source_chunk_id); +CREATE INDEX idx_cross_modal_modality ON cross_modal_links(source_modality, target_modality); +CREATE INDEX idx_entities_type ON entities(entity_type); +CREATE INDEX idx_chunk_entities_entity ON chunk_entities(entity_id); +``` + +--- + +## Implementation Requirements + +1. Create SQLite database initialization script with all tables +2. Create migration system for schema versioning +3. Implement data access layer (DAL) with CRUD operations for each table +4. Add JSON validation for `metadata_json` and `modalities` fields +5. Implement content_hash-based deduplication logic +6. Build query helpers for common access patterns (by corpus, by document, by content_type) + +--- + +## Conflicts / Ambiguities + +- **⚠️ SQLite vs other databases:** The schema is SQLite-specific, but gemini-prd.md mentions FalkorDB (via Bolt Protocol) for Graphiti and Qdrant/FAISS for vector search. The SQLite schema appears to be for the chunk metadata store only, not for vector search or graph queries. +- **⚠️ Vector storage:** The embeddings table stores vectors as BLOB, but actual vector similarity search would use FAISS/HNSW indexes (see MemVid topic). SQLite is the metadata store, not the vector search engine. 
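The content_hash-based deduplication listed in the Implementation Requirements above could look roughly like the following Python sketch, which uses only the standard-library `sqlite3` and `hashlib` modules. Several things here are assumptions for illustration and are not part of the source schema: the helper names `init_db` and `insert_chunk_if_new`, the trimmed-down column set, and the unique index on `(corpus_id, content_hash)` (the source does not say whether duplicates are scoped per corpus). `uuid.uuid4()` stands in for the UUID v7 generation the metadata topic calls for.

```python
import hashlib
import sqlite3
import uuid
from datetime import datetime, timezone


def content_hash(raw_content: str) -> str:
    """Return the SHA-256 hex digest of the UTF-8 encoded chunk text."""
    return hashlib.sha256(raw_content.encode("utf-8")).hexdigest()


def init_db(conn: sqlite3.Connection) -> None:
    """Create a reduced `chunks` table (assumption: only a subset of columns)."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS chunks (
            chunk_id     TEXT PRIMARY KEY,
            content_hash TEXT NOT NULL,
            corpus_id    TEXT NOT NULL,
            raw_content  TEXT NOT NULL,
            ingested_at  TEXT NOT NULL
        )
        """
    )
    # Assumption: duplicates are detected per corpus, enforced by a unique index.
    conn.execute(
        "CREATE UNIQUE INDEX IF NOT EXISTS idx_chunks_dedup "
        "ON chunks(corpus_id, content_hash)"
    )


def insert_chunk_if_new(conn: sqlite3.Connection, corpus_id: str, raw_content: str):
    """Insert a chunk unless identical content already exists in the corpus.

    Returns (chunk_id, inserted), where inserted is False for a duplicate.
    """
    digest = content_hash(raw_content)
    row = conn.execute(
        "SELECT chunk_id FROM chunks WHERE corpus_id = ? AND content_hash = ?",
        (corpus_id, digest),
    ).fetchone()
    if row is not None:
        return row[0], False

    chunk_id = str(uuid.uuid4())  # stand-in for UUID v7
    conn.execute(
        "INSERT INTO chunks (chunk_id, content_hash, corpus_id, raw_content, ingested_at) "
        "VALUES (?, ?, ?, ?, ?)",
        (chunk_id, digest, corpus_id, raw_content, datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()
    return chunk_id, True
```

Checking for an existing row before inserting keeps the original `chunk_id` stable for callers, while the unique index acts as a backstop if two writers race.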
diff --git a/scratch/05-memvid-storage.md b/scratch/05-memvid-storage.md new file mode 100644 index 0000000..fb435a4 --- /dev/null +++ b/scratch/05-memvid-storage.md @@ -0,0 +1,206 @@ +# Topic: MemVid Storage (Video-Encoded Vector Storage) + +## Summary +MemVid is the Cold Memory storage layer that encodes chunks and embeddings into H.265 compressed video files (QR frames) for massive compression. Includes encoder configuration, quad-encoding strategy, and file organization. + +--- + +## What is MemVid? + +> **Source:** [gemini-prd.md](../gemini-prd.md), [docs/UNIFIED_PRD.md](../docs/UNIFIED_PRD.md) + +MemVid is a deep archive (90d+) storage system that uses H.265 compressed video (QR frames) to store massive datasets. It provides: +- 50-100x compression over raw storage +- Quad-encoded vectors for high-fidelity retrieval +- Portable archive files (.mp4) + +--- + +## Encoder Configuration + +> **Source:** [opus-prd2-v3.md](../opus-prd2-v3.md) + +```yaml +memvid: + encoder: + codec: "hevc" # H.265 + crf: 18 # Constant Rate Factor (quality) + gop: 30 # Group of Pictures + preset: "medium" # Encoding speed/quality tradeoff + + vector_config: + input_dimensions: 4096 # Native Qwen3-VL output + storage_dimensions: 1024 # MRL-truncated for efficiency + similarity_sort: true # Sort vectors for better compression + + features: + parallel_segments: true + smart_recall: true + text_search: true + hnsw_index: true +``` + +--- + +## File Organization + +> **Source:** [opus-prd2-v3.md](../opus-prd2-v3.md) + +Three separate MemVid files by domain: + +| File | Domain | Content | +|------|--------|---------| +| `codebase.mp4` | Codebase | Multi-repository source code and configs | +| `research.mp4` | Research | Research papers, documentation, diagrams | +| `prompts.mp4` | Prompts | User inputs and prompts to LLMs | + +--- + +## Encoding Process (The "Freeze" Transition) + +> **Source:** [gemini-prd.md](../gemini-prd.md) + +The Warm → Cold transition ("Freeze") follows this 
process: + +1. **Deconstruction:** Serialize Graphiti nodes into JSON +2. **Rendering:** Generate QR Code images (PNGs) of the JSON data + - QR Code Version 40, High Error Correction +3. **Quad-Encoding:** Generate 4 vector layers per content block +4. **Stitching:** Compile images into an H.265 `.mp4` video file + +### QR Code Specifications +> **Source:** [gemini-prd.md](../gemini-prd.md) + +- **Version:** 40 (maximum capacity) +- **Error Correction:** High +- **Format:** PNG images compiled into video frames + +--- + +## Quad-Encoding Strategy + +> **Source:** [gemini-prd.md](../gemini-prd.md) (Appendix I) + +Each content block is encoded at four resolutions with separate FAISS indices: + +| Layer | Resolution | What it Encodes | Use Case | +|-------|-----------|----------------|----------| +| 1 | Word/Token | Keywords & Entities | Exact definitions, variable names | +| 2 | Sentence/Fact | Discrete Facts | Return types, specific values | +| 3 | Paragraph/Context | Local Context | How flows handle edge cases | +| 4 | Boundary | Relationships & Flow | Cross-section connections | + +### Why Quad-Encoding? 
+- **Needle in a Haystack Fix:** Word/Sentence vectors allow precise fact retrieval without paragraph dilution +- **Context Drift Fix:** Boundary vectors encode concept edges, preventing information loss at chunk boundaries + +### Storage Architecture: "Heavy Index, Light Payload" +- **Index (4x larger):** 4 FAISS indices per video file +- **Payload (100x smaller):** H.265 compressed video stores actual content +- **Result:** Trading cheap disk space for high intelligence density + +--- + +## MP4 RAG Encoder Implementation + +> **Source:** [gemini-prd.md](../gemini-prd.md) (Appendix II) + +```python +class MP4RAGEncoder: + def __init__(self, frame_width=1920, frame_height=1080): + self.frame_width = frame_width + self.frame_height = frame_height + + def text_to_frame(self, text, chunk_id): + """Convert text chunk to image frame""" + # Render text with word wrap onto image + # Add chunk_id as QR code or metadata overlay + + def encode_chunks_to_mp4(self, chunks, embeddings, metadata, output_path): + """Encode all chunks into MP4 with H.265 compression""" + # Write video with H.265 (HEVC) codec + # Store chunk index and embeddings as sidecar files + + def decode_frame(self, mp4_path, frame_number): + """Quickly seek to specific frame and extract text""" + # OCR the frame to get text back +``` + +### Sidecar Files +Each `.mp4` file has companion files: +- `*_index.json` — Maps frame numbers to chunk IDs and metadata +- `*_embeddings.npy` — NumPy array of embedding vectors + +--- + +## MemVid Index Table (SQLite) + +> **Source:** [docs/SCHEMA_REFERENCE.md](../docs/SCHEMA_REFERENCE.md) + +```sql +CREATE TABLE memvid_indices ( + memvid_file TEXT NOT NULL, + frame_index INTEGER NOT NULL, + chunk_id TEXT NOT NULL, + embedding_id TEXT NOT NULL, + corpus_id TEXT NOT NULL, + PRIMARY KEY (memvid_file, frame_index), + FOREIGN KEY (chunk_id) REFERENCES chunks(chunk_id), + FOREIGN KEY (embedding_id) REFERENCES embeddings(embedding_id) +); +``` + +--- + +## Integration with 
Sleep-Time Compute + +> **Source:** [gemini-prd.md](../gemini-prd.md) + +Quad-encoding is computationally expensive (4x embedding time) and cannot be done in real-time. The workflow: + +1. **Live (ByteRover):** Simple paragraph chunks (fast, good enough) +2. **Sleep Time (Daemon):** Explodes content into 4 layers, embeds all, encodes to MemVid +3. **Next Day:** Agent has "super-resolution" access to yesterday's work + +--- + +## Retrieval from MemVid: The "Zoom" Pattern + +> **Source:** [gemini-prd.md](../gemini-prd.md) + +Cascading lookup strategy (not all 4 layers at once): + +1. **Scout (Paragraph Layer):** Find general concepts — broad context +2. **Snipe (Sentence Layer):** Check specific facts in the region — precise lines +3. **Stitch (Boundary Layer):** Retrieve boundary vectors to see what connects next + +--- + +## Implementation Requirements + +1. Implement MP4RAGEncoder class with H.265 encoding +2. Implement QR code generation for JSON serialization +3. Build quad-encoding pipeline (4 FAISS indices per video) +4. Create sidecar file management (index.json, embeddings.npy) +5. Implement frame seeking and OCR-based decoding +6. Build the "Zoom" pattern retrieval logic +7. Integrate with sleep-time daemon for batch encoding + +--- + +## Dependencies + +- FFmpeg (H.265/HEVC encoding) +- OpenCV (cv2) for video I/O +- FAISS for vector indexing +- Pillow for image generation +- QR code library (qrcode or similar) +- Ghostscript (for QR rendering) + +--- + +## Conflicts / Ambiguities + +- **⚠️ QR vs text rendering:** gemini-prd.md describes QR code frames, but the MP4RAGEncoder code in Appendix II renders text directly onto frames. These are two different approaches — QR is more robust for data integrity, text rendering is simpler. The QR approach (from Section 3.4) appears to be the intended production approach. 
+- **⚠️ Vector storage location:** opus-prd2-v3.md mentions `hnsw_index: true` as a MemVid feature, but gemini-prd.md describes separate FAISS indices. These may be complementary (HNSW within FAISS). +- **⚠️ Sidecar vs embedded:** The code example uses sidecar files for index/embeddings, but the SQLite memvid_indices table provides a database-backed alternative. Both approaches may coexist. diff --git a/scratch/06-memory-hierarchy.md b/scratch/06-memory-hierarchy.md new file mode 100644 index 0000000..b66ddcf --- /dev/null +++ b/scratch/06-memory-hierarchy.md @@ -0,0 +1,123 @@ +# Topic: Three-Tiered Memory Hierarchy + +## Summary +The Hot/Warm/Cold memory architecture using ByteRover, Graphiti, and MemVid, including data lifecycle transitions and graduation protocols. + +--- + +## Architecture Overview + +> **Source:** [gemini-prd.md](../gemini-prd.md), [docs/UNIFIED_PRD.md](../docs/UNIFIED_PRD.md) + +| Tier | Component | Role | Retention | Storage Format | Optimized For | +|------|-----------|------|-----------|---------------|---------------| +| Hot | ByteRover | Active Context | 0-24h | JSONL (filesystem) | Speed (grep/find) | +| Warm | Graphiti | Knowledge Graph | 7-90d | Property Graph (FalkorDB) | Relationships | +| Cold | MemVid | Deep Archive | 90d+ | H.265 video (QR frames) | Compression/Density | + +--- + +## Hot Memory: ByteRover + +> **Source:** [gemini-prd.md](../gemini-prd.md) + +- **Type:** Filesystem-based active context +- **Location:** `~/.byterover/inbox/` +- **Format:** JSONL with strict Pydantic schema +- **Schema fields:** type, summary, content, tags, timestamp +- **Purpose:** Stores live "Working Memory," active Git branches, and "in-flight" ideas +- **Optimization:** Speed via grep/find (no database overhead) +- **Retention:** Purged nightly unless related to active Git branch + +### Live Usage +During active work, ByteRover uses simple paragraph chunks — fast and good enough for real-time context injection. 
+ +--- + +## Warm Memory: Graphiti + +> **Source:** [gemini-prd.md](../gemini-prd.md) + +- **Type:** Temporal Knowledge Graph +- **Backend:** FalkorDB (via Bolt Protocol) +- **Node Types:** Concept, Pattern, Decision, DecisionNode, PatternNode +- **Edge Types:** IMPLEMENTS, DEPRECATES, DEPENDS_ON, MITIGATES +- **Purpose:** Stores structured relationships, "Skill" storage, and "Lineage" +- **Retention:** 7-90 days active; nodes >30 days inactive become candidates for archival + +### Tombstone Pointers +When nodes are archived to MemVid, they are replaced with lightweight "Tombstone Pointers" (e.g., `See Archive W42`) to maintain graph connectivity. + +--- + +## Cold Memory: MemVid + +> **Source:** [gemini-prd.md](../gemini-prd.md), [opus-prd2-v3.md](../opus-prd2-v3.md) + +See [scratch/05-memvid-storage.md](./05-memvid-storage.md) for full details. + +- **Type:** Deep archive with video-encoded storage +- **Format:** H.265 compressed video with QR frames +- **Vector Index:** Quad-encoded (4 FAISS indices per video) +- **Metadata:** Sidecar JSON + SQLite memvid_indices table +- **Retention:** Permanent + +--- + +## Transition A: The "Digest" (Hot → Warm) + +> **Source:** [gemini-prd.md](../gemini-prd.md), [docs/UNIFIED_PRD.md](../docs/UNIFIED_PRD.md) + +- **Trigger:** Nightly "Sleep Cycle" Daemon (or system idle > 15 minutes) +- **Input:** Raw interaction logs from ByteRover (cleaned via Proxy) +- **Process (The Dreamer):** + 1. **Structuring:** Convert raw logs into strict Graphiti Nodes (DecisionNode, PatternNode) + 2. **Filtering:** Discard "chatter" (conversational noise). Keep only "Solved Problems" and "Architectural Decisions" +- **Output:** New Nodes added to Graphiti. 
Raw logs purged from ByteRover (unless related to active Git branch) + +--- + +## Transition B: The "Freeze" (Warm → Cold) + +> **Source:** [gemini-prd.md](../gemini-prd.md), [docs/UNIFIED_PRD.md](../docs/UNIFIED_PRD.md) + +- **Trigger:** Weekly "Archivist" Job (Sunday) +- **Input:** Stale Graphiti nodes (>30 days inactive) + Curated "Gold Standard" datasets +- **Process (The Renderer):** + 1. **Deconstruction:** Serialize nodes into JSON + 2. **Rendering:** Generate QR Code images (PNGs) of the JSON data + 3. **Quad-Encoding:** Generate 4 vector layers (Token, Fact, Context, Boundary) + 4. **Stitching:** Compile images into an H.265 `.mp4` video file +- **Output:** Portable MemVid archive file. Stale nodes replaced with Tombstone Pointers in Graphiti. + +--- + +## Data Flow Diagram + +``` +User Input → Proxy → ByteRover (Hot, 0-24h) + ↓ [Nightly Digest] + Graphiti (Warm, 7-90d) + ↓ [Weekly Freeze] + MemVid (Cold, 90d+) +``` + +--- + +## Implementation Requirements + +1. Implement ByteRover filesystem layer with JSONL read/write and Pydantic validation +2. Set up FalkorDB container for Graphiti (docker-compose) +3. Define Graphiti node and edge schemas +4. Implement the "Digest" transition daemon (Hot → Warm) +5. Implement the "Freeze" transition daemon (Warm → Cold) +6. Build Tombstone Pointer system for archived nodes +7. Implement idle detection trigger (system idle > 15 minutes) + +--- + +## Conflicts / Ambiguities + +- **⚠️ Retention periods:** gemini-prd.md says Hot is 0-24h and Warm is 7-90h (hours), but UNIFIED_PRD.md says Warm is 7-90d (days) and the Freeze trigger is >30 days inactive. The "h" in gemini-prd.md appears to be a typo; days is the intended unit based on context. +- **⚠️ Graph database:** gemini-prd.md specifies FalkorDB via Bolt Protocol. No other document confirms this choice. The containerization section mentions docker-compose for FalkorDB. 
+- **⚠️ ByteRover location:** gemini-prd.md uses `~/.byterover/inbox/`, a hard-coded Unix-style home path that works on macOS and Linux but not natively on Windows. 
Should be configurable. diff --git a/scratch/07-retrieval-pipeline.md b/scratch/07-retrieval-pipeline.md new file mode 100644 index 0000000..d06d412 --- /dev/null +++ b/scratch/07-retrieval-pipeline.md @@ -0,0 +1,169 @@ +# Topic: Retrieval Pipeline + +## Summary +Two-stage retrieval system with broad recall, hybrid search, precision reranking, and cross-modal retrieval capabilities. + +--- + +## Pipeline Overview + +> **Source:** [opus-prd2-v3.md](../opus-prd2-v3.md), [docs/UNIFIED_PRD.md](../docs/UNIFIED_PRD.md) + +``` +Query → Stage 1: Recall (top 100) → Hybrid Search → Stage 2: Rerank (top 10) → Results +``` + +--- + +## Stage 1: Broad Recall + +> **Source:** [opus-prd2-v3.md](../opus-prd2-v3.md) + +```yaml +recall: + model: "primary" # Qwen3-VL-Embedding-8B + top_k: 100 + similarity_threshold: 0.5 + multimodal_query_support: true +``` + +- Embed the query using Qwen3-VL-Embedding-8B +- Retrieve top 100 candidates by vector similarity +- Minimum similarity threshold: 0.5 +- Supports multimodal queries (text, image, mixed) + +--- + +## Hybrid Search + +> **Source:** [opus-prd2-v3.md](../opus-prd2-v3.md) + +```yaml +hybrid: + enabled: true + vector_weight: 0.7 + keyword_weight: 0.3 + keyword_method: "bm25" +``` + +- Combines vector similarity (70%) with BM25 keyword matching (30%) +- Improves recall for exact-match queries that pure vector search might miss + +--- + +## Stage 2: Precision Reranking + +> **Source:** [opus-prd2-v3.md](../opus-prd2-v3.md), [opus-prd1-v3.md](../opus-prd1-v3.md) + +```yaml +reranking: + enabled: true + model: "reranker" # Qwen3-VL-Reranker-8B + top_k_input: 100 + top_k_output: 10 + multimodal_rerank: true +``` + +- Takes 100 candidates from recall stage +- Uses Qwen3-VL-Reranker-8B (Single-Tower, Cross-Attention) +- Outputs top 10 most relevant results +- Supports multimodal reranking (query and documents can be mixed-modal) +- Relevance score via yes/no token generation probability + +--- + +## Cross-Modal Retrieval + +> **Source:** 
[opus-prd2-v3.md](../opus-prd2-v3.md), [opus-prd1-v3.md](../opus-prd1-v3.md) + +```yaml +cross_modal: + enabled: true + query_modalities: ["text", "image", "mixed"] + result_modalities: ["text", "image", "code", "mixed"] +``` + +Enables queries like: +- Text query → retrieve images/code/video +- Image query → retrieve related text/code +- Mixed query (text + image) → retrieve any modality + +> **Source:** [opus-prd3-v3.md](../opus-prd3-v3.md) + +Example cross-modal queries: +- "Find the diagram referenced by the code comment describing the vectorized kernel" +- "Find video clips illustrating the algorithm described in Section 3.2" +- "Find the code implementing the architecture in this screenshot" + +--- + +## Performance Targets + +> **Source:** [opus-prd2-v3.md](../opus-prd2-v3.md) + +| Metric | Target | +|--------|--------| +| Max latency | 550ms (text), 600ms (image), 700ms (mixed) | +| Min relevance score | 0.6 | + +--- + +## MemVid "Zoom" Pattern Retrieval + +> **Source:** [gemini-prd.md](../gemini-prd.md) + +For MemVid (Cold Memory) retrieval, use cascading lookup across quad-encoded layers: + +1. **Scout (Paragraph Layer):** Find general concepts — broad context +2. **Snipe (Sentence Layer):** Check specific facts in the region +3. **Stitch (Boundary Layer):** Retrieve boundary vectors to see cross-section connections + +--- + +## Query Classification & Routing + +> **Source:** [chatgpt5.2-prd.md](../chatgpt5.2-prd.md) + +The retrieval system needs to: +1. Classify the user's query intent +2. Determine which domain(s) to search (prompts, codebase, research) +3. Select appropriate retrieval strategy based on query type +4. Route to the correct MemVid file(s) and chunking method indices + +### Model for Query Classification +> **Source:** [chatgpt5.2-prd.md](../chatgpt5.2-prd.md) + +A model is needed to categorize user requests and determine which RAG ingestion methodology to use for retrieval. 
The user suggests this could potentially be done recursively/iteratively. + +--- + +## Context Injection (Proxy Integration) + +> **Source:** [gemini-prd.md](../gemini-prd.md) + +The Proxy enriches queries by: +1. Running intent classification +2. Querying Graphiti (Warm) + ByteRover (Hot) +3. Prepending relevant context as a "System Note" + +--- + +## Implementation Requirements + +1. Implement vector similarity search (FAISS/HNSW) +2. Implement BM25 keyword search +3. Build hybrid search score combiner (0.7 vector + 0.3 keyword) +4. Integrate Qwen3-VL-Reranker-8B for precision reranking +5. Build cross-modal query support +6. Implement query classification/routing logic +7. Build the "Zoom" pattern for MemVid retrieval +8. Implement latency monitoring and optimization +9. Build retrieval analytics tracking (retrieval_events table) + +--- + +## Conflicts / Ambiguities + +- **⚠️ Query routing model:** chatgpt5.2-prd.md asks what model class is needed for query classification but doesn't specify one. No other document provides a concrete answer. This needs to be determined during implementation. +- **⚠️ Latency targets vary:** 550ms for text queries, 600ms for image, 700ms for mixed — these are from verification_queries in opus-prd2-v3.md. The general target is 550ms. Mixed-modal queries may need relaxed targets. +- **⚠️ Sub-second vs 550ms:** chatgpt5.2-prd.md mentions "sub-second latency" as a goal; opus-prd2-v3.md specifies 550ms. These are compatible but 550ms is the stricter target. diff --git a/scratch/08-orchestration-concurrency.md b/scratch/08-orchestration-concurrency.md new file mode 100644 index 0000000..ac88d19 --- /dev/null +++ b/scratch/08-orchestration-concurrency.md @@ -0,0 +1,160 @@ +# Topic: Orchestration & Concurrency + +## Summary +Headless agentic orchestration via Claude Code, concurrency settings, batching strategies, MCP server configurations, and the agentic swarm architecture. 
+ +--- + +## Headless Operation + +> **Source:** [opus-prd2-v3.md](../opus-prd2-v3.md), [chatgpt5.2-prd.md](../chatgpt5.2-prd.md) + +```yaml +orchestration: + headless: + enabled: true + logic_file: "orchestration_logic_v3.md" + checkpoint_interval_minutes: 5 +``` + +- Orchestration logic is defined in natural language (markdown file) +- Provider-agnostic — designed for Anthropic agentic SDK +- Deployed via Claude Code headless CLI with monitoring layer (e.g., autoclot) +- Uses `.claude/claude.md` based skills, plugins, and MCP tools + +--- + +## Concurrency Settings + +> **Source:** [opus-prd2-v3.md](../opus-prd2-v3.md) + +```yaml +concurrency: + max_files: 50 + modality_detector_workers: 2 + content_router_workers: 2 + code_specialist_workers: 8 + text_specialist_workers: 8 + multimodal_specialist_workers: 4 + graph_builder_workers: 2 + integration_workers: 2 +``` + +### Worker Roles + +| Worker Type | Count | Purpose | +|------------|-------|---------| +| Modality Detector | 2 | Detect content modalities in incoming files | +| Content Router | 2 | Route content to appropriate chunking pipeline | +| Code Specialist | 8 | AST parsing, code chunking, dependency extraction | +| Text Specialist | 8 | Semantic/recursive/sentence chunking for text | +| Multimodal Specialist | 4 | Multimodal boundary detection, screenshot-code fusion | +| Graph Builder | 2 | Entity extraction, relationship building, semantic neighbors | +| Integration | 2 | Final assembly, quality validation, storage | + +--- + +## Batching Configuration + +> **Source:** [opus-prd2-v3.md](../opus-prd2-v3.md) + +```yaml +batching: + embedding_batch_size: 16 # Smaller for multimodal (memory constraints) + integration_buffer_size: 50 + integration_flush_timeout_ms: 5000 +``` + +--- + +## MCP Server Configuration + +> **Source:** [opus-prd2-v3.md](../opus-prd2-v3.md) + +| MCP Server | Description | Config | +|-----------|-------------|--------| +| filesystem-mcp | Sandboxed file system access | — | +| 
git-mcp | Git history for provenance | — | +| embedding-mcp | Unified multimodal embedding API | default_model: Qwen3-VL-Embedding-8B | +| memvid-mcp | Video-encoded vector storage | — | +| entity-mcp | Named entity extraction | — | + +--- + +## Agentic Swarm Architecture + +> **Source:** [chatgpt5.2-prd.md](../chatgpt5.2-prd.md), [docs/UNIFIED_PRD.md](../docs/UNIFIED_PRD.md) + +### Ingestion Pipeline Agents + +The orchestrator coordinates specialized agents for the ingestion pipeline: + +1. **Modality Detector Agent** — Classifies incoming content type and modalities +2. **Content Router Agent** — Routes to appropriate chunking strategy +3. **Chunking Agents** (per method): + - Fixed-size chunker (programmatic, no LLM) + - Sentence-based chunker (programmatic, no LLM) + - Semantic chunker (uses Qwen3-Embedding-0.6B for boundary detection) + - Recursive hierarchical chunker (may need Sonnet/Pro-class LLM) + - AST structural chunker (tree-sitter based, programmatic) + - Multimodal boundary chunker (needs VL model) + - Screenshot-code fusion chunker (needs VL model + OCR) +4. **Entity Extraction Agent** — NER and relationship extraction +5. **Graph Builder Agent** — Knowledge graph construction +6. **Quality Validation Agent** — Chunk quality scoring +7. 
**Integration Agent** — Final assembly and storage + +### Asynchronous Processing +> **Source:** [chatgpt5.2-prd.md](../chatgpt5.2-prd.md) + +- Files should be processed asynchronously (multi-file corpus) +- Different chunking methods can run in parallel on different files +- Embedding can be batched across chunks + +--- + +## Orchestration Logic File + +> **Source:** [chatgpt5.2-prd.md](../chatgpt5.2-prd.md) + +The orchestration logic should be: +- Natural language (provider-agnostic) +- Markdown-based configuration +- Compatible with Anthropic agentic SDK +- Deployable via Claude Code headless CLI + +--- + +## Key Directories + +> **Source:** [docs/UNIFIED_PRD.md](../docs/UNIFIED_PRD.md) + +``` +src/models/ — Qwen3-VL model implementations +src/chunking/ — Multi-strategy chunking algorithms +src/memvid/ — Video-encoded storage system +src/graphiti/ — Knowledge graph implementation +src/orchestration/ — Agentic swarm coordination logic +src/retrieval/ — Two-stage retrieval pipeline +``` + +--- + +## Implementation Requirements + +1. Create orchestration logic markdown file +2. Implement worker pool with configurable concurrency +3. Build content routing logic (content type → chunking method) +4. Implement async file processing pipeline +5. Set up MCP server integrations +6. Build checkpoint/resume system (5-minute intervals) +7. Implement embedding batching with configurable batch size +8. Create monitoring/logging for worker status + +--- + +## Conflicts / Ambiguities + +- **⚠️ Agent vs programmatic:** chatgpt5.2-prd.md envisions LLM agents for semantic/hierarchical chunking, but opus-prd2-v3.md treats these as algorithmic processes with configurable parameters. The implementation should use algorithmic approaches with LLM fallback for edge cases. +- **⚠️ Orchestration tool:** chatgpt5.2-prd.md mentions "autoclot or something" as monitoring layer. This is vague — the specific monitoring tool needs to be determined. 
+- **⚠️ Worker counts:** The concurrency settings (8 code + 8 text + 4 multimodal = 20 specialist workers) assume significant compute resources. May need to be tuned for M3 Max MacBook Pro. diff --git a/scratch/09-proxy-shim.md b/scratch/09-proxy-shim.md new file mode 100644 index 0000000..ea8b033 --- /dev/null +++ b/scratch/09-proxy-shim.md @@ -0,0 +1,103 @@ +# Topic: Proxy / Shim (The Gatekeeper) + +## Summary +The Claude-Proxy wrapper that intercepts user prompts and model outputs, injects context from memory, and sanitizes data for storage. This is the central hub of the system. + +--- + +## Role + +> **Source:** [gemini-prd.md](../gemini-prd.md), [docs/UNIFIED_PRD.md](../docs/UNIFIED_PRD.md) + +| Component | Role | Responsibility | +|-----------|------|---------------| +| The Proxy (Shim) | The Gatekeeper | Wraps `claude` command. Intercepts user prompts and model outputs. Injects context from memory. Sanitizes data via Pydantic schemas before storage. | + +--- + +## Proxy Logic Flow + +> **Source:** [gemini-prd.md](../gemini-prd.md) (Section 3.2) + +1. **Intercept:** Capture `stdin` (User Prompt) +2. **Enrich:** + - Run Classification (Intent Detection) + - Query Graphiti (Warm) + ByteRover (Hot) + - Inject: Prepend relevant context as a "System Note" +3. **Execute:** Pass modified payload to the real `claude` binary +4. **Capture:** Read the resulting `stdout` and log files +5. **Sanitize:** Pass output to Local LLM (Structure Gate) to strip noise +6. 
**Ingest:** Write structured JSON to `~/.byterover/inbox/` + +--- + +## Architecture + +> **Source:** [gemini-prd.md](../gemini-prd.md) (Section 3.1) + +The Claude-Proxy (Python) sits at the center with spokes: + +- **North:** StdIO Interface (User Terminal) +- **South:** Anthropic API (Claude Code Execution) +- **East (Storage):** + - ByteRover Interface (File I/O) + - Graphiti Interface (Bolt Protocol to FalkorDB) + - MemVid Interface (FFmpeg + FAISS) +- **West (Compute):** + - OpenRouter API (Sleep-Time Models) + - Local LLM (Ollama — Pydantic Guardrails) + +--- + +## User Experience + +> **Source:** [gemini-prd.md](../gemini-prd.md) (Section 1.6) + +- **Transparent Operation:** User types `claude` as normal. Proxy handles all complexity invisibly. +- **Context Injection:** "God Mode" automatically prepends relevant Hot/Warm memory based on intent classification. +- **Feedback Loop:** If user explicitly praises/scolds the agent, the Proxy tags that interaction for high-priority processing by the Tribunal during sleep. + +--- + +## Installation + +> **Source:** [gemini-prd.md](../gemini-prd.md) (Section 3.5) + +A single `install.sh` script that: +1. Sets up Python `venv` +2. Installs `ffmpeg`, `ghostscript` (for QR) +3. Aliases `claude` to `python ~/.bin/claude_proxy.py` +4. Registers the `sleep_daemon` with `launchd` + +--- + +## Data Sanitization + +> **Source:** [gemini-prd.md](../gemini-prd.md) + +- Uses Pydantic schemas for strict data validation +- Local LLM (Ollama) acts as Structure Gate to strip conversational noise +- Output format: Structured JSON with fields: type, summary, content, tags, timestamp + +--- + +## Implementation Requirements + +1. Create `claude_proxy.py` wrapper script +2. Implement stdin/stdout interception +3. Build intent classification module +4. Implement context retrieval from ByteRover (Hot) and Graphiti (Warm) +5. Build context injection (System Note prepending) +6. Implement output capture and sanitization via Pydantic +7. 
Build Local LLM integration (Ollama) for structure gating +8. Create JSONL writer for ByteRover inbox +9. Implement feedback detection (praise/scold tagging) +10. Create `install.sh` for turnkey setup + +--- + +## Conflicts / Ambiguities + +- **⚠️ Local LLM dependency:** The proxy requires a local LLM (Ollama) for sanitization. This adds a dependency that may not be available on all systems. Could be made optional with a simpler regex-based fallback. +- **⚠️ macOS-specific:** `launchd` registration is macOS-only. Linux would need systemd, Windows would need a service. Should be abstracted. +- **⚠️ Claude binary wrapping:** Assumes the `claude` CLI binary exists and can be wrapped. The exact interception mechanism depends on Claude Code's CLI interface. diff --git a/scratch/10-sleep-time-compute.md b/scratch/10-sleep-time-compute.md new file mode 100644 index 0000000..9e1bd29 --- /dev/null +++ b/scratch/10-sleep-time-compute.md @@ -0,0 +1,138 @@ +# Topic: Sleep-Time Compute & Self-Improvement Loops + +## Summary +The autonomous background processing system that runs during idle/sleep periods to refine skills, process memories, and improve the system without human intervention. + +--- + +## Overview + +> **Source:** [gemini-prd.md](../gemini-prd.md) (Sections 1.5, 3.3) + +Sleep-Time Compute is the engine of self-improvement. It operates autonomously to upgrade the system's intelligence during idle periods, using free/cheap models via OpenRouter. + +--- + +## Sleep-Time Daemon Architecture + +> **Source:** [gemini-prd.md](../gemini-prd.md) (Section 3.3) + +A background service (`launchd` on macOS) with a state machine: + +| State | Trigger | Action | +|-------|---------|--------| +| IDLE | Default | Monitoring system load | +| DREAMING | Daily | Processing ByteRover inbox → Graphiti | +| EVOLVING | Nightly | Full refinement loop (5 steps) | + +### EVOLVING Steps +1. **Curiosity Module** generates task list (identifies knowledge gaps) +2. 
**Creator** generates artifacts via OpenRouter (free models) +3. **Tribunal** (Parallel Async) critiques artifacts +4. **Mutator** updates `~/.skills/*.md` files +5. **Archivist** renders approved artifacts to `~/.memvid/staging` + +--- + +## Loop 1: The Simulator (Correction) + +> **Source:** [gemini-prd.md](../gemini-prd.md) (Section 1.5) + +- **Input:** Failed tests/specs from the day's active work +- **Action:** Spawns a temporary git branch. Retries the failed spec using infinite time/retries. +- **Result:** Upon success, creates a "Solution Node" in Graphiti +- **Purpose:** Automatically fixes failures encountered during the day + +--- + +## Loop 2: The Professor (Synthesis) + +> **Source:** [gemini-prd.md](../gemini-prd.md) (Section 1.5) + +- **Input:** High-quality external repositories (e.g., `shadcn/ui`, `actix-web`) +- **Action:** "Reverse Engineers" the code to generate Synthetic PRDs +- **Result:** Stores pairs of `{Synthetic_PRD} -> {Perfect_Code}` in MemVid for future RAG retrieval +- **Purpose:** Learns from exemplary codebases + +--- + +## Loop 3: The Evolutionary Forge (Creation) + +> **Source:** [gemini-prd.md](../gemini-prd.md) (Section 1.5) + +- **Input:** "Madlib" Inspiration Queue (Randomized Topic + Style + Constraint) +- **Action:** + 1. **Draft:** Creator Model generates artifact + 2. **Gate:** Taste Oracle checks novelty (rejects if too similar/dissimilar to Gold Standard) + 3. **Critique:** Tribunal (Personas) attacks the draft + 4. 
**Mutate:** If score < 95, Mutator rewrites the Skill File (Prompt) +- **Result:** A graduated "Skill File" v2.0 and a high-quality artifact for the archive + +--- + +## The Tribunal (The Critic) + +> **Source:** [gemini-prd.md](../gemini-prd.md), [docs/UNIFIED_PRD.md](../docs/UNIFIED_PRD.md) + +A dynamic graph of adversarial personas that critique generated artifacts: +- **Security Zealot** — Attacks security vulnerabilities +- **Pedant** — Checks correctness and precision +- **Visionary** — Evaluates innovation and forward-thinking + +Runs in parallel async during sleep cycles. + +--- + +## The Mutator (The Evolution) + +> **Source:** [gemini-prd.md](../gemini-prd.md), [docs/UNIFIED_PRD.md](../docs/UNIFIED_PRD.md) + +- Uses Genetic Algorithms to rewrite "Skill Files" (prompts) +- Based on Tribunal feedback scores +- Skill Files stored at `~/.skills/*.md` + +--- + +## The Taste Oracle (The Quality Gate) + +> **Source:** [gemini-prd.md](../gemini-prd.md), [docs/UNIFIED_PRD.md](../docs/UNIFIED_PRD.md) + +- Vector-based novelty detector +- Compares outputs against a "Gold Standard" baseline in MemVid +- Rejects derivative or hallucinated work +- Uses cosine similarity in embedding space + +--- + +## Cost Strategy + +> **Source:** [gemini-prd.md](../gemini-prd.md) (Section 1.2) + +- Uses Free/OpenRouter tiers for "Heavy Hitter" models during sleep cycles +- Models: DeepSeek, Qwen, Mistral (free tier) +- Zero-cost autonomous improvement + +--- + +## Implementation Requirements + +1. Implement sleep-time daemon (background service) +2. Build state machine (IDLE → DREAMING → EVOLVING) +3. Implement idle detection (system idle > 15 minutes) +4. Build the Simulator loop (failed test retry on temp branches) +5. Build the Professor loop (external repo analysis → synthetic PRDs) +6. Build the Evolutionary Forge loop (creation + critique + mutation) +7. Implement the Tribunal with configurable adversarial personas +8. 
Implement the Mutator with genetic algorithm-based prompt rewriting +9. Implement the Taste Oracle with vector novelty detection +10. Build the Curiosity Module (knowledge gap detection) +11. Create Skill File management system (`~/.skills/*.md`) +12. Integrate with OpenRouter free tier for sleep-time models + +--- + +## Conflicts / Ambiguities + +- **⚠️ Curiosity Module vs Madlib:** The Curiosity Module (Active Inference) is listed as a strategic integration in gemini-prd.md Section 2, replacing the random "Madlib" generator. But Loop 3 still references "Madlib Inspiration Queue." The Curiosity Module is the intended upgrade path. +- **⚠️ Score threshold:** Loop 3 uses "score < 95" as the mutation threshold. This seems very high — may need calibration. +- **⚠️ Platform dependency:** `launchd` is macOS-only. Needs cross-platform daemon support. diff --git a/scratch/11-quality-assurance.md b/scratch/11-quality-assurance.md new file mode 100644 index 0000000..b277f0b --- /dev/null +++ b/scratch/11-quality-assurance.md @@ -0,0 +1,120 @@ +# Topic: Quality Assurance & Error Handling + +## Summary +Chunk validation, coherence scoring, verification queries, outlier detection, and error handling procedures for the RAG v3.0 system. 
+ +--- + +## Chunk Validation + +> **Source:** [opus-prd2-v3.md](../opus-prd2-v3.md) + +```yaml +quality: + validation: + validate_all_chunks: true + min_coherence_score: 0.6 + min_completeness_score: 0.5 + flag_outlier_embeddings: true +``` + +- All chunks are validated after creation +- Minimum coherence score: 0.6 (how well the chunk holds together semantically) +- Minimum completeness score: 0.5 (whether the chunk contains a complete thought) +- Outlier embeddings are flagged for review + +--- + +## Quality Metadata (Dimension 11) + +> **Source:** [docs/SCHEMA_REFERENCE.md](../docs/SCHEMA_REFERENCE.md) + +Each chunk carries quality metadata: + +| Field | Type | Description | +|-------|------|-------------| +| confidence_score | number | Overall confidence (0-1) | +| validation_status | enum | valid, warning, error, pending | +| error_flags | string[] | List of detected issues | +| review_status | enum | auto_approved, needs_review, reviewed, rejected | +| chunking_quality.coherence_score | number | Semantic coherence | +| chunking_quality.completeness_score | number | Thought completeness | +| chunking_quality.boundary_quality | number | How clean the chunk boundaries are | + +--- + +## Verification Queries + +> **Source:** [opus-prd2-v3.md](../opus-prd2-v3.md) + +Post-ingestion verification queries to validate the system works correctly: + +| Query | Type | Expected | Max Latency | +|-------|------|----------|-------------| +| "database connection setup" | text | codebase domain | 550ms | +| "system architecture diagram" | text | image/mixed modalities | 600ms | +| "Find code that implements this UI" + test_screenshot.png | mixed | codebase domain | 700ms | + +--- + +## Error Handling + +> **Source:** [opus-prd2-v3.md](../opus-prd2-v3.md) + +### API Rate Limiting +```yaml +api_rate_limit: + initial_backoff_seconds: 5 + max_backoff_seconds: 60 + max_retries: 5 +``` + +### Embedding Failures +```yaml +embedding_failure: + retry_count: 3 + fallback_to_text_only: 
true +``` +- Retry up to 3 times +- Fall back to text-only Qwen3-Embedding-8B if multimodal embedding fails + +### Parse Failures +```yaml +parse_failure: + log_file: "ingestion_errors.log" + continue_on_error: true + quarantine_failed: true +``` +- Log errors to `ingestion_errors.log` +- Continue processing other files on error +- Quarantine failed files for manual review + +### Multimodal Failures +```yaml +multimodal_failure: + fallback_to_text_only: true + log_visual_errors: true +``` +- Fall back to text-only processing +- Log visual processing errors separately + +--- + +## Implementation Requirements + +1. Implement chunk coherence scoring algorithm +2. Implement chunk completeness scoring algorithm +3. Build outlier embedding detection (statistical outlier in vector space) +4. Create validation pipeline that runs after each chunk creation +5. Implement verification query test suite +6. Build exponential backoff retry logic for API calls +7. Implement fallback chain (multimodal → text-only) +8. Create error quarantine system for failed files +9. Build ingestion error logging + +--- + +## Conflicts / Ambiguities + +- **⚠️ Scoring algorithms undefined:** The documents specify minimum scores (0.6 coherence, 0.5 completeness) but don't define how these scores are computed. Implementation needs to determine the scoring methodology (e.g., embedding-based coherence, LLM-based completeness). +- **⚠️ Outlier detection method:** "flag_outlier_embeddings" is specified but the detection method (z-score, IQR, isolation forest, etc.) is not defined. 
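As an illustration of the open question above about `flag_outlier_embeddings`, here is a minimal sketch of the z-score option, one of the candidates the note names. It is an assumption rather than a method the source documents specify. The function name, the `z_threshold` default of 3.0, and the choice to score each vector by its distance to the batch centroid are all hypothetical. The sketch uses only the standard library; a real pipeline over 4096-dimensional vectors would more likely use a vectorized library.

```python
import math
from statistics import mean, pstdev


def flag_outlier_embeddings(vectors, z_threshold=3.0):
    """Return indices of vectors whose centroid distance has a z-score above z_threshold.

    Hypothetical helper: the source only says outliers are "flagged for review"
    and leaves the detection method undefined.
    """
    if len(vectors) < 2:
        return []
    dim = len(vectors[0])
    # Centroid of the batch, computed per dimension.
    centroid = [mean(v[i] for v in vectors) for i in range(dim)]
    # Euclidean distance of each embedding to the centroid.
    distances = [math.dist(v, centroid) for v in vectors]
    mu = mean(distances)
    sigma = pstdev(distances)
    if sigma == 0:
        # All embeddings are equally far from the centroid; nothing stands out.
        return []
    return [i for i, d in enumerate(distances) if (d - mu) / sigma > z_threshold]


if __name__ == "__main__":
    batch = [[1.0, 0.0]] * 20 + [[10.0, 10.0]]
    print(flag_outlier_embeddings(batch))
```

A z-score on a single scalar (distance to centroid) is cheap but assumes the batch forms one cluster. Mixed-domain batches might need a per-domain centroid or one of the other listed methods (IQR, isolation forest).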
diff --git a/scratch/12-domain-configuration.md b/scratch/12-domain-configuration.md new file mode 100644 index 0000000..5a21faa --- /dev/null +++ b/scratch/12-domain-configuration.md @@ -0,0 +1,125 @@ +# Topic: Domain Configuration & Content Routing + +## Summary +Configuration for the three content domains (prompts, codebase, research), including per-domain chunking methods, storage files, retention policies, and content type detection. + +--- + +## Domain Definitions + +> **Source:** [opus-prd2-v3.md](../opus-prd2-v3.md) + +### Domain: Prompts +```yaml +- name: "prompts" + description: "User inputs and prompts to LLMs" + storage: "prompts.mp4" + chunking_methods: + - "semantic" + - "fixed_size" + retention: "30_days_rolling" + multimodal: false +``` + +### Domain: Codebase +```yaml +- name: "codebase" + description: "Multi-repository source code and configs" + storage: "codebase.mp4" + chunking_methods: + - "ast_structural" + - "fixed_size" + - "screenshot_code_fusion" + retention: "version_controlled" + multimodal: true + cross_reference: true +``` + +### Domain: Research +```yaml +- name: "research" + description: "Research papers, documentation, diagrams" + storage: "research.mp4" + chunking_methods: + - "recursive_hierarchical" + - "semantic" + - "multimodal_boundary" + retention: "permanent" + multimodal: true +``` + +--- + +## Content Type Detection + +> **Source:** [chatgpt5.2-prd.md](../chatgpt5.2-prd.md), [opus-prd2-v3.md](../opus-prd2-v3.md) + +### Supported File Types +- **Markdown (.md)** — Primary format for writing/documentation +- **Python (.py)** — Code +- **JavaScript/TypeScript (.js/.ts)** — Code +- **DOCX** — Documents (not large portion) +- **PDF** — Research papers (not large portion) +- **Config files** — Various formats + +### Content Type Mapping + +| File Type | Content Type | Domain | Chunking Methods | +|-----------|-------------|--------|-----------------| +| .md (writing) | documentation | research | recursive_hierarchical, semantic, 
multimodal_boundary | +| .md (prompts) | prompt | prompts | semantic, fixed_size | +| .py, .js, .ts | code | codebase | ast_structural, fixed_size | +| .json, .yaml, .toml | configuration | codebase | fixed_size | +| .pdf | research_paper | research | recursive_hierarchical, semantic | +| .docx | documentation | research | recursive_hierarchical, semantic | + +--- + +## Multi-Repository Setup + +> **Source:** [chatgpt5.2-prd.md](../chatgpt5.2-prd.md) + +- Codebase is stored in a multi-repo setup (not monorepo) +- Each repository should be tracked separately for provenance +- Git metadata (commit SHA, branch, author) captured per chunk + +--- + +## Corpus Size Estimates + +> **Source:** [chatgpt5.2-prd.md](../chatgpt5.2-prd.md), [gemini-prd.md](../gemini-prd.md) + +| Metric | Value | +|--------|-------| +| Initial text corpus | ~35MB | +| Initial documents (gemini estimate) | 500 docs × 150 pages = 75,000 pages | +| Weekly growth (gemini estimate) | +100 docs × 150 pages = +15,000 pages/week | + +--- + +## Retention Policies + +| Domain | Policy | Description | +|--------|--------|-------------| +| Prompts | 30-day rolling | Older prompts archived to MemVid | +| Codebase | Version-controlled | Tied to git history, never deleted | +| Research | Permanent | Always retained | + +--- + +## Implementation Requirements + +1. Implement content type detector (file extension + content analysis) +2. Build domain router (content type → domain → chunking methods) +3. Configure per-domain MemVid files +4. Implement retention policy enforcement +5. Build multi-repo ingestion support with git provenance tracking +6. Create domain-specific embedding instructions (instruction-aware model) + +--- + +## Conflicts / Ambiguities + +- **⚠️ Corpus size discrepancy:** chatgpt5.2-prd.md says ~35MB of text files; gemini-prd.md estimates 75,000 pages initially with 15,000 pages/week growth. These may refer to different corpora or different time horizons. 
+- **⚠️ Prompt vs documentation detection:** Both prompts and documentation can be markdown files. The routing logic needs a way to distinguish user prompts from documentation (possibly by source directory or metadata). +- **⚠️ Code tokenization:** chatgpt5.2-prd.md asks whether different tokenization is needed for config vs script vs library files. The opus-prd2 config uses fixed_size for configs and ast_structural for code, which implicitly answers this. diff --git a/scratch/13-strategic-integrations.md b/scratch/13-strategic-integrations.md new file mode 100644 index 0000000..80ab5e9 --- /dev/null +++ b/scratch/13-strategic-integrations.md @@ -0,0 +1,96 @@ +# Topic: Strategic Integrations (Advanced Features) + +## Summary +Five advanced integrations to push the architecture from "Advanced" to "State-of-the-Art": Hypergraph Knowledge, Active Inference, Formal Verification, Model Merging, and Contrastive Value Alignment. + +--- + +## Overview + +> **Source:** [gemini-prd.md](../gemini-prd.md) (Section 2) + +These are future-proofing add-ons, not core requirements. They represent the upgrade path from the base system. + +--- + +## 1. Hypergraph Knowledge Representation + +> **Source:** [gemini-prd.md](../gemini-prd.md) + +- **What:** Standard graphs use triplets (A → B). Hypergraphs allow a single edge to connect _multiple_ nodes (Code + PRD + Timestamp + Author). +- **Why:** Code is rarely binary. A function depends on a library, a requirement, and a specific node version simultaneously. +- **Integration:** Use Hypergraph RAG in the Graphiti layer to allow "n-ary" relationships, reducing the number of "hops" needed to understand complex dependencies. +- **Implementation:** Update Graphiti schema to support "Hyperedges" (Node-to-Edge connections) + +--- + +## 2. Active Inference Curiosity Module (Fristonian AI) + +> **Source:** [gemini-prd.md](../gemini-prd.md) + +- **What:** Replaces the random "Madlib" generator in the Evolutionary Forge. 
The agent calculates "Free Energy" (uncertainty) across its knowledge base. +- **Why:** The agent should learn _what it realizes it doesn't know_. If it knows React but not Svelte, the Curiosity Module detects that gap and generates a targeted learning task. +- **Integration:** A "Curiosity Daemon" runs before Sleep Time, identifying sparse areas in the Graphiti vector space and generating targeted learning tasks. + +--- + +## 3. Formal Verification Gate (VeriGuard Protocol) + +> **Source:** [gemini-prd.md](../gemini-prd.md) + +- **What:** Uses a mathematical prover (Coq, Lean, or lightweight Python-based CrossHair) to verify code correctness. +- **Why:** "95% Confidence" is subjective. "Mathematically Proven" is absolute. +- **Integration:** The Tribunal gains a **Math-Persona** that demands the agent write assertions. If assertions fail formal verification, the artifact is rejected immediately. +- **Implementation:** Add a `verify.py` hook in the Tribunal loop. + +--- + +## 4. Automated Model Merging (The "Frankenstein" Strategy) + +> **Source:** [gemini-prd.md](../gemini-prd.md) + +- **What:** Techniques like TIES-Merging or DARE allow merging weights of different fine-tuned models without retraining. +- **Why:** Instead of just refining prompts, the system can merge a "Security Expert" LoRA with a "Creative Writer" LoRA to create a custom daily driver. +- **Integration:** Monthly script that checks HuggingFace for compatible LoRAs and merges them. +- **Frequency:** Once a month + +--- + +## 5. Contrastive Value Alignment (Taste Oracle++) + +> **Source:** [gemini-prd.md](../gemini-prd.md) + +- **What:** Uses a learned Reward Model based on user-specific "Taste" vectors (Contrastive Learning). +- **Why:** Simple vector distance is a crude proxy for "Good." A trained Reward Model can learn the nuance of why you like "Brutalist" code but dislike "Spaghetti" code, even if they look vectorially similar. 
+- **Integration:** Train a small classifier (e.g., DeBERTa) on "Accepted" vs. "Rejected" tribunal outcomes to act as a highly accurate pre-filter for the Creation loop. + +--- + +## Implementation Priority + +These are listed in suggested implementation order (after core system is built): + +1. **Hypergraph Knowledge** — Enhances existing Graphiti layer (medium complexity) +2. **Active Inference Curiosity** — Replaces Madlib generator (medium complexity) +3. **Contrastive Value Alignment** — Improves Taste Oracle (medium complexity) +4. **Formal Verification** — Adds verification hook (small complexity) +5. **Model Merging** — Monthly automation (large complexity, requires ML expertise) + +--- + +## Implementation Requirements + +1. Research and select hypergraph library compatible with FalkorDB +2. Implement Free Energy calculation for knowledge gap detection +3. Integrate CrossHair or similar lightweight formal verifier +4. Build LoRA merging pipeline with HuggingFace integration +5. Train DeBERTa classifier on accepted/rejected outcomes +6. Create monthly automation for model merging + +--- + +## Conflicts / Ambiguities + +- **⚠️ These are aspirational:** Only gemini-prd.md describes these integrations. No other document references them. They should be treated as Phase 2+ features, not core requirements. +- **⚠️ Model merging feasibility:** Merging LoRAs requires access to model weights and significant ML infrastructure. May not be practical for a local-first system on M3 Max. +- **⚠️ Formal verification scope:** CrossHair (Python) is limited compared to Coq/Lean. The scope of what can be formally verified needs to be realistic. diff --git a/scratch/TASK_INDEX.md b/scratch/TASK_INDEX.md new file mode 100644 index 0000000..e5ff1f4 --- /dev/null +++ b/scratch/TASK_INDEX.md @@ -0,0 +1,122 @@ +# Task Index: RAG v3.0 Implementation Topics + +## Overview + +This index maps 13 topic-focused scratch files decomposed from the RAG v3.0 project documentation. 
Each file consolidates all requirements for a single topic from across 8 source documents. + +### Source Documents +| Document | Focus | +|----------|-------| +| `chatgpt5.2-prd.md` | Original requirements, cost analysis, agentic deployment | +| `gemini-prd.md` | Autodidactic Omni-Loop, memory hierarchy, sleep-time compute | +| `opus-prd1-v3.md` | RAG v3.0 architecture, embedding models, metadata schema | +| `opus-prd2-v3.md` | YAML configuration for all components | +| `opus-prd3-v3.md` | Foundational theory, multimodal revolution, chunking theory | +| `docs/UNIFIED_PRD.md` | Consolidated specification | +| `docs/SCHEMA_REFERENCE.md` | Database schema, TypeScript interfaces, config schema | +| `docs/AGGREGATION_PLAN.md` | Overlap/conflict analysis between documents | + +--- + +## Topic Index + +| # | File | Topic | Complexity | Dependencies | +|---|------|-------|-----------|-------------| +| 01 | [01-embedding-model-stack.md](./01-embedding-model-stack.md) | Embedding Model Stack | Large | None | +| 02 | [02-chunking-strategies.md](./02-chunking-strategies.md) | Chunking Strategies | Large | 01 | +| 03 | [03-metadata-schema.md](./03-metadata-schema.md) | Metadata Schema - 12 Dimensions | Large | None | +| 04 | [04-database-schema.md](./04-database-schema.md) | SQLite Database Schema | Medium | 03 | +| 05 | [05-memvid-storage.md](./05-memvid-storage.md) | MemVid Video-Encoded Storage | Large | 01, 04 | +| 06 | [06-memory-hierarchy.md](./06-memory-hierarchy.md) | Three-Tiered Memory Hierarchy | Medium | 05 | +| 07 | [07-retrieval-pipeline.md](./07-retrieval-pipeline.md) | Retrieval Pipeline | Large | 01, 04, 05 | +| 08 | [08-orchestration-concurrency.md](./08-orchestration-concurrency.md) | Orchestration & Concurrency | Medium | 02, 07 | +| 09 | [09-proxy-shim.md](./09-proxy-shim.md) | Proxy / Shim - The Gatekeeper | Medium | 06, 07 | +| 10 | [10-sleep-time-compute.md](./10-sleep-time-compute.md) | Sleep-Time Compute & Self-Improvement | Large | 06, 09 | +| 11 | 
[11-quality-assurance.md](./11-quality-assurance.md) | Quality Assurance & Error Handling | Small | 03, 04 | +| 12 | [12-domain-configuration.md](./12-domain-configuration.md) | Domain Configuration & Content Routing | Small | 02, 05 | +| 13 | [13-strategic-integrations.md](./13-strategic-integrations.md) | Strategic Integrations - Advanced | Large | 06, 10 | + +--- + +## Suggested Implementation Order + +### Stage 1: Foundation (No dependencies) +1. **03 - Metadata Schema** — Define all TypeScript interfaces and Pydantic models for the 12-dimension schema. This is the data contract everything else depends on. +2. **01 - Embedding Model Stack** — Set up Qwen3-VL-Embedding-8B, reranker, and boundary detection model. Core capability needed by all pipelines. +3. **04 - Database Schema** — Create SQLite tables, indexes, and data access layer. Depends on metadata schema being defined. + +### Stage 2: Chunking Pipeline +4. **02 - Chunking Strategies** — Implement all 7 chunking methods. Depends on embedding models for semantic chunking. +5. **12 - Domain Configuration** — Configure content routing (file type → domain → chunking methods). Depends on chunking strategies. +6. **11 - Quality Assurance** — Implement chunk validation, coherence scoring, error handling. Can run in parallel with chunking. + +### Stage 3: Storage & Retrieval +7. **05 - MemVid Storage** — Implement H.265 video encoding, QR frames, quad-encoding, FAISS indices. Depends on embedding models and database schema. +8. **07 - Retrieval Pipeline** — Two-stage retrieval with hybrid search and reranking. Depends on MemVid and embedding models. +9. **08 - Orchestration** — Wire up the agentic swarm, concurrency, MCP servers. Depends on chunking and retrieval being implemented. + +### Stage 4: Autonomous System +10. **06 - Memory Hierarchy** — Implement ByteRover (Hot), Graphiti (Warm), transition daemons. Depends on MemVid for Cold tier. +11. 
**09 - Proxy / Shim** — Build the Claude wrapper with context injection. Depends on memory hierarchy and retrieval. +12. **10 - Sleep-Time Compute** — Implement autonomous refinement loops. Depends on proxy and memory hierarchy. + +### Stage 5: Advanced Features +13. **13 - Strategic Integrations** — Hypergraph, Active Inference, Formal Verification, Model Merging, Contrastive Alignment. Only after core system is stable. + +--- + +## Dependency Graph + +``` +Stage 1: [03 Metadata] ──→ [04 Database] + [01 Embedding] ─┐ + │ +Stage 2: [02 Chunking] ←─┘──→ [12 Domains] + [11 Quality] ←── [03] + [04] + +Stage 3: [05 MemVid] ←── [01] + [04] + [07 Retrieval] ←── [01] + [05] + [08 Orchestration] ←── [02] + [07] + +Stage 4: [06 Memory Hierarchy] ←── [05] + [09 Proxy] ←── [06] + [07] + [10 Sleep-Time] ←── [06] + [09] + +Stage 5: [13 Strategic] ←── [06] + [10] +``` + +--- + +## Cross-Document Conflicts Summary + +| Conflict | Documents | Resolution | +|----------|-----------|------------| +| Embedding dimensions | chatgpt5.2-prd vs opus-prd2 | Use opus-prd2 values: 4096 native, MRL options [256,512,1024,2048,4096] | +| Chunk sizes | chatgpt5.2-prd vs opus-prd2 vs AGGREGATION_PLAN | Use opus-prd2 YAML config as authoritative | +| Number of chunking methods | Various (4, 6, or 7) | 7 methods is the complete list | +| Retention periods (hours vs days) | gemini-prd | "h" is a typo; use days | +| Agentic vs algorithmic chunking | chatgpt5.2-prd vs opus-prd2 | Semantic chunking is algorithmic (0.6B model), not full LLM agent | +| QR frames vs text frames | gemini-prd Appendix I vs Appendix II | QR approach is production intent; text rendering is simplified example | +| Gemini embedding alternative | chatgpt5.2-prd only | Not addressed elsewhere; treat as optional future consideration | +| Corpus size | chatgpt5.2-prd (35MB) vs gemini-prd (75K pages) | Different corpora or time horizons; design for the larger estimate | + +--- + +## Tech Stack Summary + +| Component | Technology | 
+|-----------|-----------| +| Primary Embedding | Qwen3-VL-Embedding-8B | +| Reranker | Qwen3-VL-Reranker-8B | +| Boundary Detection | Qwen3-Embedding-0.6B | +| Text Fallback | Qwen3-Embedding-8B | +| Metadata Store | SQLite | +| Vector Search | FAISS (HNSW) | +| Graph Database | FalkorDB (Bolt Protocol) | +| Video Encoding | FFmpeg (H.265/HEVC) | +| AST Parsing | tree-sitter | +| Languages | Python (primary), TypeScript (interfaces) | +| Orchestration | Headless Claude Code + MCP | +| Local LLM | Ollama | +| Remote Models | OpenRouter (free tier for sleep-time) | +| Target Hardware | M3 Max MacBook Pro |
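The Topic Index dependencies and the Suggested Implementation Order can be cross-checked mechanically. The sketch below is illustrative only: `DEPENDENCIES` and `SUGGESTED_ORDER` are transcribed by hand from the two sections of this index, and the helper names `order_violations` and `valid_order` are hypothetical. It uses the standard-library `graphlib` module (Python 3.9+).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Topic -> prerequisite topics, transcribed from the Topic Index table.
DEPENDENCIES = {
    "01": [], "02": ["01"], "03": [], "04": ["03"],
    "05": ["01", "04"], "06": ["05"], "07": ["01", "04", "05"],
    "08": ["02", "07"], "09": ["06", "07"], "10": ["06", "09"],
    "11": ["03", "04"], "12": ["02", "05"], "13": ["06", "10"],
}

# Topic order transcribed from "Suggested Implementation Order" (Stages 1-5).
SUGGESTED_ORDER = [
    "03", "01", "04", "02", "12", "11", "05",
    "07", "08", "06", "09", "10", "13",
]


def order_violations(order, deps):
    """Return (topic, dependency) pairs where the dependency is scheduled after the topic."""
    position = {topic: i for i, topic in enumerate(order)}
    return [
        (topic, dep)
        for topic in order
        for dep in deps[topic]
        if position[dep] > position[topic]
    ]


def valid_order(deps):
    """Return one dependency-respecting order (prerequisites first)."""
    return list(TopologicalSorter(deps).static_order())


if __name__ == "__main__":
    print(order_violations(SUGGESTED_ORDER, DEPENDENCIES))
    print(valid_order(DEPENDENCIES))
```

With the table as written, the check reports a single pair, `("12", "05")`: topic 12 lists 05 as a dependency in the Topic Index but is scheduled in Stage 2, ahead of 05 in Stage 3. The source does not say whether the table entry or the stage placement is the one to change.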