Insights graph similarity engine + non-gated transformers LLM defaults #69
Open
tony wants to merge 9 commits into
why: ADR 0005 defines insights as a staged report over the same local record stream as search -- deterministic first, with an opt-in ladder of model-backed enrichers -- but the branch carried only the ADR. This lands the engine so a report can be built independently of any frontend.
what:
- Add the agentgrep.insights package: a typed report model, ADR-0005 cache-directory precedence, a lazy backend loader with an injectable import seam, the deterministic builtin (L0) activity analysis, and the report orchestrator that resolves the effective level and records diagnostics. Probing uses importlib.util.find_spec so a builtin report never imports a heavy backend just to populate the levels field.
- Add a curated model registry with a urllib artifact downloader and a manifest sidecar, reused for a torch-free model2vec embedding model so a sentence-embedding model provisions the same way local LLM artifacts do.
- Add the L1-L5 enrichers behind capability probes: jinja2 HTML, sklearn TF-IDF/KMeans topics, sentence-transformers|model2vec embeddings with semantic clustering and dedupe, a pluggable tantivy+sqlite-vec or LanceDB persistent index, and an Ollama summary grounded in compact facts that streams tokens through the progress sink.
- Declare the insights-* optional-dependency extras.
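The find_spec probing described in this commit can be sketched roughly as follows. This is a minimal illustration, not the package's code: `probe_backends` and the level-to-module mapping are hypothetical names, and the sketch assumes top-level module names (for a dotted name, `find_spec` imports the parent package).

```python
from importlib.util import find_spec


def probe_backends(candidates: dict[str, str]) -> dict[str, bool]:
    """Map each level name to whether its backend module can be located.

    For a top-level name, find_spec only locates the module and does not
    execute it, so a heavy backend is not imported just to report status.
    """
    availability: dict[str, bool] = {}
    for level, module_name in candidates.items():
        try:
            availability[level] = find_spec(module_name) is not None
        except (ImportError, ValueError):
            # A missing parent package or a module without a spec counts as absent.
            availability[level] = False
    return availability


if __name__ == "__main__":
    # Hypothetical mapping; the real level and module names may differ.
    print(probe_backends({"html": "jinja2", "topics": "sklearn"}))
```

The result is a plain dictionary, so a caller can fill a levels field from it without triggering any optional import.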
why: Expose the report pipeline through the public CLI so the same surface is reachable from a terminal, with progress that streams to stderr without polluting machine-readable output on stdout.
what:
- Add the `agentgrep insights` subcommand tree -- report, levels, doctor, setup, models, and cache -- with typed argument dataclasses and a text/markdown/html/json/ndjson renderer.
- Add a console progress sink that streams phase lines, download bytes, and live LLM token deltas to stderr.
- Dispatch the new argument types from main(); keep every insights import function-local so the root --help path stays cold.
why: The base package must pass with no optional extras installed, so the enrichers are exercised through the loader's import_module seam with fake backend modules rather than real scikit-learn, tantivy, or PyTorch.
what:
- Add unit tests for the deterministic activity analysis, the report orchestrator's level resolution and status, the lazy loader and typed errors, the model registry and urllib downloader, every enricher level, and the CLI argument parsing and dispatchers.
- Extend the import-time guard so importing agentgrep never loads the insights package or any optional backend.
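The import seam used by these tests might look something like this. It is a sketch under assumptions: `load_backend`, `BackendUnavailableError`, and `fake_importer` are hypothetical names, and the real loader's signature may differ.

```python
import importlib
from collections.abc import Callable
from types import ModuleType


class BackendUnavailableError(RuntimeError):
    """Hypothetical typed error raised when an optional backend is missing."""


def load_backend(
    module_name: str,
    import_module: Callable[[str], ModuleType] = importlib.import_module,
) -> ModuleType:
    """Import a backend lazily through an injectable import function."""
    try:
        return import_module(module_name)
    except ImportError as exc:
        message = f"optional backend {module_name!r} is not installed"
        raise BackendUnavailableError(message) from exc


def fake_importer(modules: dict[str, ModuleType]) -> Callable[[str], ModuleType]:
    """Build an import function that serves only the given fake modules."""

    def _import(name: str) -> ModuleType:
        try:
            return modules[name]
        except KeyError:
            raise ImportError(name) from None

    return _import
```

A test can then pass `fake_importer({"sklearn": ModuleType("sklearn")})` as the seam and exercise both the success path and the typed-error path without installing anything.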
what:
- Add the insights CLI guide covering the report, the level ladder, model provisioning, and cache diagnostics.
- Register the page in the CLI index toctree and card grid.
- Render the model-download and Ollama examples as text fences because they reach the network or a local daemon and cannot run as documentation tests.
… summaries
why: The MVP could fetch a Gemma .litertlm artifact but the llm level only ran Ollama, so a downloaded model could not actually produce a summary. LiteRT-LM has an installable in-process runtime, so the llm level can run a local Gemma model end-to-end without a daemon.
what:
- Add a litert-lm backend to the llm level, selected by --backend, that loads the cached .litertlm via litert_lm.Engine, streams the reply through the progress sink, and provisions the model on demand.
- Default the LiteRT token budget to 2048: the budget is the total prompt+output KV-cache size, and undersizing it surfaces as an opaque tensor-allocation failure rather than a clear message.
- Quiet the LiteRT-LM C++ runtime to ERROR so streamed output is not buried under model-metadata logging on stderr.
- Order llm backends by the requested --backend, declare the insights-llm-litert extra, keep litert_lm out of the import path, and cover the runtime and the not-provisioned path with injected-fake tests.
…ted transformers LLM defaults
why: The insights ladder ended at the L5 narrative summary over a single gated Gemma model. Users had no way to see which prompts they repeat, which past conversations resemble the one in front of them, or which workflows are worth saving as Skills -- and the only GPU summary path required an HF token plus an accepted license. This adds the `graph` enrichment level and makes the local-LLM backend work out of the box.
what:
- Add the `graph` level: a prompt/reply/conversation similarity network (sentence-transformers or model2vec embeddings, optional HDBSCAN archetype clustering, sqlite-vec or LanceDB IVF-PQ vector store) that surfaces recurring asks, forgotten-but-similar conversations, and mined workflows, persisted incrementally in a content-hash-keyed graph store.
- Draft reusable Skills (SKILL.md) from mined workflows, print-by-default, also exposed over an `insights_skills` MCP tool.
- Add a transformers/CUDA LLM backend with a non-gated default chain that needs no HF token: Phi-4-mini (4-bit, native phi3), SmolLM2-1.7B (fp16), Granite-3.3-2b (4-bit), tried in order until one loads. gemma-3-1b-it stays curated but gated and is no longer the default. 4-bit weights load through bitsandbytes NF4 behind the insights-llm-transformers-quant extra.
- Add optional conversation-summary vectors: each conversation is embedded by a cached LLM one-line summary instead of a prompt mean, sharpening forgotten-but-similar.
- Bundle the #68 human-typed prompt detection (human: query field, Claude/Codex authored-turn tagging) so the branch hand-tests self-contained; it overlaps with that PR and should be reconciled at merge time.
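The "tried in order until one loads" behaviour of the default chain can be sketched as a first-working loader. Everything named here (`Candidate`, `DEFAULT_CHAIN`, `first_working`) is hypothetical, the candidate names are display labels rather than verified model ids, and the real loader would call into transformers instead of an injected function.

```python
from collections.abc import Callable, Sequence
from dataclasses import dataclass
from typing import TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class Candidate:
    name: str
    quantized: bool  # True for the 4-bit candidates


# Order taken from the commit message; names are labels, not model ids.
DEFAULT_CHAIN = (
    Candidate("Phi-4-mini", quantized=True),
    Candidate("SmolLM2-1.7B", quantized=False),
    Candidate("Granite-3.3-2b", quantized=True),
)


def first_working(
    chain: Sequence[Candidate],
    load: Callable[[Candidate], T],
) -> tuple[Candidate, T, list[str]]:
    """Return the first candidate that loads, plus the failures before it."""
    failures: list[str] = []
    for candidate in chain:
        try:
            return candidate, load(candidate), failures
        except Exception as exc:  # any load failure moves to the next candidate
            failures.append(f"{candidate.name}: {exc}")
    raise RuntimeError("no candidate loaded: " + "; ".join(failures))
```

Keeping the failure list alongside the result lets a caller report why earlier candidates were skipped, for example when the quantization extra is not installed and only the fp16 model loads.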
…own backends, list rerankers
why: An end-to-end audit of the insights workflow surfaced three defects. The non-gated transformers backend was never connected to the skill namer, so `skills --llm --backend transformers` silently produced deterministic names. An invalid `--backend` was silently mapped to the first available backend instead of being rejected. And the curated reranker model kind could not be listed or installed from the CLI.
what:
- Wire the transformers backend into `_build_skill_namer`, reusing the default-chain + first-working loader so `skills --llm --backend transformers` names skills with the local model, falling back to deterministic naming only when no model loads.
- Reject an unknown `--backend` for the llm level up front with a clear message listing the valid backends, instead of silently running the first available one.
- Expose the `reranker` model kind in `models list/available/install --level reranker` (parser choices, InsightsModelsArgs.kind, and the listing/install render branches).
why: The optional LanceDB vector backend was only seam-tested, never run against a live lancedb, and its create_index call had drifted out of date -- the config=IvfPq(...) keyword no longer exists, so building the index raised TypeError on any real corpus.
what:
- Call create_index with the current metric / index_type / num_partitions / num_sub_vectors signature.
- Derive num_sub_vectors from the embedding dimension, stepping down to a divisor so PQ accepts it, and size num_partitions as the integer sqrt of the row count.
…rompts
why: insights skills was hardcoded to conversation scope -- the 200k-record
transcript pile that is ~98% tool and assistant output -- and hard-capped
at 8 unranked suggestions. It surfaced a handful of noisy picks instead of
the user's real recurring asks. The typed-prompt corpus (claude.history)
is the clean signal, and dense clustering over it is seconds, not a wall.
what:
- Add --scope {prompts,conversations,all} to insights skills, defaulting
to prompts so the clean typed asks drive the mining; conversation scope
stays available for sequence-based macro workflows.
- Lift the hard cap from 8 to 50 and rank recurring-ask templates by reuse
value (support times distinct conversations) so the most broadly-repeated
asks lead.
- Require a macro chain to recur at least three times before it leads the
list, so a barely-recurring sequence no longer outranks a broadly
repeated template.
Summary
Adds the `graph` insights enrichment level, a prompt/reply/conversation similarity network that surfaces recurring asks, forgotten-but-similar past conversations, and mined workflows drafted as reusable Skills, and makes the local-LLM backend usable with no Hugging Face token by defaulting to non-gated, brand-name models.
- `graph` level: sentence-transformers or model2vec embeddings, optional HDBSCAN archetype clustering, and a sqlite-vec or LanceDB IVF-PQ vector store, persisted incrementally in a content-hash-keyed graph store.
- Transformers/CUDA LLM backend with a non-gated default chain: Phi-4-mini (4-bit, native `phi3`) → SmolLM2-1.7B (fp16) → Granite-3.3-2b (4-bit), tried in order until one loads. `gemma-3-1b-it` stays curated but gated and is no longer the default.
- Reusable Skills (SKILL.md) drafted from mined workflows, print-by-default, also exposed over an `insights_skills` MCP tool.
- Bundles the #68 human-typed prompt detection (`human:` query field, Claude/Codex authored-turn tagging) so the branch hand-tests self-contained; it overlaps with that PR and should be reconciled at merge.
Design decisions
- … `--model gemma-3-1b-it`.
- `phi3`, not `trust_remote_code`: Phi-4-mini ships a vendored `modeling_phi3.py` that imports `LossKwargs`, which was removed in transformers 5.x. Loading via the built-in `Phi3ForCausalLM` (`trust_remote_code=False`) tracks the installed transformers and removes the custom-code trust boundary.
- … `bitsandbytes` NF4 (the `insights-llm-transformers-quant` extra) only gates the two quantized candidates.
Test plan
- `uv run ruff check .` and `uv run ty check`
- `uv run pytest --reruns 0` (full suite green), including offline seam tests for the fallback chain and the 4-bit path with fake `torch`/`transformers`/`bitsandbytes`
- `just build-docs`
- `bitsandbytes` 0.49.2 works on `torch 2.12+cu130`; Phi-4-mini 4-bit loads at ~3.2 GB VRAM and produced a real L5 narrative summary plus real per-conversation summaries on the GPU.
Sample report output
agentgrep insights: sample reports
Real output captured while validating Increment F against your local Claude history (RTX 3050 Ti, 4 GB). Every report has the same shape: a deterministic facts block (top terms, work areas, timeline, repeated instructions, open threads) followed by a level-specific `Enrichment:` block. The default renderer is text; `--format json|ndjson|html|markdown` is also available.
`--level llm --backend transformers`: Phi-4-mini narrative summary
The final paragraph is what Phi-4-mini generates on the GPU (`@cuda`, 4-bit, ~3.2 GB VRAM); everything above it is deterministic. It is grounded: it only restates facts from the block above and invents no specifics.
`--level graph --conversation-summaries`: similarity engine
Here `--conversation-summaries` represents each conversation by an embedding of its Phi-generated one-line summary (cached by content hash in the store's `summaries` table) instead of a mean of raw prompt vectors. That is what drives forgotten-but-similar: cosine scores between the latest conversation and semantically near past ones you may have forgotten. Sample cached summaries:
To reproduce these live, note that the graph engine and transformers backend currently live in the v9 stash, not the working tree; run `git stash apply stash@{0}` first.