Insights graph similarity engine + non-gated transformers LLM defaults by tony · Pull Request #69 · tony/agentgrep

tony · 2026-06-14T18:36:43Z

Summary

Adds the graph insights enrichment level — a prompt/reply/conversation similarity network that surfaces recurring asks, forgotten-but-similar past conversations, and mined workflows drafted as reusable Skills — and makes the local-LLM backend usable with no Hugging Face token by defaulting to non-gated, brand-name models.

- Add the graph level: sentence-transformers or model2vec embeddings, optional HDBSCAN archetype clustering, and a sqlite-vec or LanceDB IVF-PQ vector store, persisted incrementally in a content-hash-keyed graph store.
- Add a transformers/CUDA LLM backend with a non-gated default fallback chain, tried in order until one loads: Phi-4-mini (4-bit, native phi3) → SmolLM2-1.7B (fp16) → Granite-3.3-2b (4-bit). gemma-3-1b-it stays curated but gated and is no longer the default.
- Add optional conversation-summary vectors: each conversation is embedded via a cached one-line LLM summary instead of a mean of its prompt vectors, which sharpens forgotten-but-similar.
- Add SKILL.md drafting from mined workflows (print-by-default) plus an insights_skills MCP tool.
- Bundle the human-typed prompt detection from #68, "Distinguish human-typed prompts from tool results in Claude history" (human: query field, Claude/Codex authored-turn tagging), so the branch can be hand-tested self-contained; it overlaps with that PR and should be reconciled at merge.
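As a rough illustration of what "persisted incrementally in a content-hash-keyed graph store" could mean in practice, the sketch below keys each vector by a hash of its text so that unchanged prompts are never re-embedded. The names `content_key` and `embed_incrementally`, the use of SHA-256, and the in-memory `dict` standing in for the store are all assumptions for illustration, not the PR's actual code.

```python
import hashlib
from collections.abc import Callable, Sequence

Vector = list[float]


def content_key(text: str) -> str:
    """Key derived only from the text, so identical content always maps to one row."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def embed_incrementally(
    texts: Sequence[str],
    store: dict[str, Vector],  # stand-in for a persistent table (assumption)
    embed: Callable[[list[str]], list[Vector]],
) -> tuple[list[Vector], int]:
    """Return one vector per text, plus how many texts were served from the store."""
    keys = [content_key(text) for text in texts]
    reused = sum(1 for key in keys if key in store)

    # Collect each missing text once, even if it appears several times.
    missing: dict[str, str] = {}
    for key, text in zip(keys, texts):
        if key not in store:
            missing.setdefault(key, text)

    if missing:
        vectors = embed(list(missing.values()))
        store.update(zip(missing.keys(), vectors))

    return [store[key] for key in keys], reused
```

A second run over a mostly unchanged corpus would then call `embed` only for new texts, which is the kind of behavior a line such as "reused 9 cached prompt vectors" in a report would reflect.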

Design decisions

- Non-gated by default, gated by opt-in: the curated defaults are recognizable-brand, permissively-licensed, ungated models that fit a 4 GB GPU. The gated Gemma remains reachable via --model gemma-3-1b-it.
- Native phi3, not trust_remote_code: Phi-4-mini ships a vendored modeling_phi3.py that imports LossKwargs, removed in transformers 5.x. Loading via the built-in Phi3ForCausalLM (trust_remote_code=False) tracks the installed transformers and removes the custom-code trust boundary.
- fp16 SmolLM2 as the quant-free safety net: the chain is ordered so a host without a working 4-bit library still gets a token-free default; bitsandbytes NF4 (the insights-llm-transformers-quant extra) only gates the two quantized candidates.
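The "tried in order until one loads" behavior can be sketched as a first-working loader over an ordered chain. This is a minimal sketch under stated assumptions: `Candidate`, `DEFAULT_CHAIN`, `first_working`, and the short model identifiers are hypothetical, and the `load` callable is injected so the sketch never touches torch or transformers.

```python
from collections.abc import Callable, Sequence
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Candidate:
    name: str
    quantized: bool  # True when loading needs a working 4-bit library


# Order mirrors the chain in the description; identifiers are illustrative only.
DEFAULT_CHAIN = (
    Candidate("phi-4-mini", quantized=True),
    Candidate("smollm2-1.7b", quantized=False),
    Candidate("granite-3.3-2b", quantized=True),
)


def first_working(
    chain: Sequence[Candidate],
    load: Callable[[Candidate], Any],
) -> tuple[Candidate, Any]:
    """Try each candidate in order and return the first one that loads."""
    errors: list[str] = []
    for candidate in chain:
        try:
            return candidate, load(candidate)
        except Exception as exc:  # broad on purpose: any load failure moves on
            errors.append(f"{candidate.name}: {exc}")
    raise RuntimeError("no default model loaded: " + "; ".join(errors))
```

With a loader that fails for quantized candidates (for example, because no 4-bit library is installed), the chain falls through to the fp16 entry, which is the safety-net ordering the design note describes.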

Test plan

- `uv run ruff check .` and `uv run ty check`
- `uv run pytest --reruns 0` (full suite green), including offline seam tests for the fallback chain and the 4-bit path with fake torch/transformers/bitsandbytes
- `just build-docs`
- Validated on an RTX 3050 Ti (4 GB): bitsandbytes 0.49.2 works on torch 2.12+cu130; Phi-4-mini 4-bit loads at ~3.2 GB VRAM and produced a real L5 narrative summary plus real per-conversation summaries on the GPU.

Sample report output

agentgrep insights — sample reports

Real output captured while validating Increment F against your local Claude
history (RTX 3050 Ti, 4 GB). Every report has the same shape: a deterministic
facts block (top terms, work areas, timeline, repeated instructions, open
threads) followed by a level-specific Enrichment: block. The default renderer
is text; json, ndjson, html, and markdown are available via --format.

`--level llm --backend transformers` — Phi-4-mini narrative summary

$ agentgrep insights report \
    --level llm --backend transformers \
    --auto-download-models --yes \
    --agent claude --since 21d --limit 400

Analyzed 400 records across 83 work areas and 8 days.
level: llm   status: ok   records: 400  (sampled)
agents: claude (400)
range: 2026-06-07T23:32:09.659000Z → 2026-06-14T17:29:08.019000Z

Top terms: study (238), graph (235), real (169), claude (159), stash (154), prompt (150), conversation (146), torch (139), commit (130), amd64 (129)

Work areas:
  - session 6787d58f-d33  (34 records)  graph, real, stash, prompt, conversation
  - session c2b51ae5-39e  (32 records)  study, docs, adr, python, agentgrep
  - session 538b449b-2c4  (23 records)  study, commit, branch, merge, python
  - session 5c76f2fc-cd6  (15 records)  agy, review, plugins, gpt, study
  - session 2e19a646-607  (12 records)  pnpm, mobx, project, study, instead
  - session 719ce2c1-f9a  (12 records)  changelog, merge, commit, push, changes
  - session 44535a73-90f  (11 records)  commit, review, truncation, changelog, merge
  - session 500d8a2a-2aa  (11 records)  github, com, https, nvidia, nousresearch

Timeline:
  2026-06-07    1
  2026-06-08   30
  2026-06-09   14
  2026-06-10    5
  2026-06-11   20
  2026-06-12   27
  2026-06-13   84
  2026-06-14  219

Repeated instructions:
  - commit *your* files (leave the others alone)
  - /resume
  - /pr:merge-commit
  - /code-review:code-review
  - /new
  - merge in a merge commit with the above via gh, not git merge
  - study the branch, gain situational awareness
  - study ~/work/notes/ and ~/study/<domain,language>/<project> for details in

Open threads:
  - [claude] You're using websearch, you can't search hugging face? O_o isn't there ways you can search by the most popular in these parameters?
  - [claude] /weave:ask Which demos here look most promising in terms of DX, typing, testability, maintainability, expressiveness?
  - [claude] what would ~/work/python/libtmux-mcp need to clue it in to catch chainable oppurtunities that could be batched?
  - [claude] study ~/study/c/tmux - would split-window be possible on a window in a single chained command?
  - [claude] why didn't you start them in bulk? don't you have bulk tools?
  - [claude] /changelog:changelog do we have a hgih level changelog for what this branch does?
  - [claude] look at gh, is it passing?
  - [claude] study the codebase, i was feeding some agent instructions that may have created dumb doc-based tests where we test that documentation is mentioned (and created…
  - [claude] "agentgrep now treats Google Antigravity as two separate backends: `--agent antigravity-cli` for CLI prompt history and `--agent antigravity-ide` for IDE-local…
  - [claude] is there a way you could test this in a sandbox and see if it works?

Enrichment: llm (transformers) — ok
  summarized via transformers:phi-4-mini-instruct

    The developer worked on a series of AI-assistant prompts, analyzing 400 records from June 7, 2026, to June 14, 2026, using the agent "claude." The top terms in the records included study, graph, real, claude, stash, prompt, conversation, torch, commit, and amd64. The busiest day was June 14, 2026, with 219 records analyzed. Several open threads were identified, including discussions about websearch, hugging face, demos, typing, testability, tmux, bulk tools, changelogs, GitHub, and studying the codebase. Unresolved issues included the possibility of splitting windows in tmux, the use of bulk tools, the creation of a high-level changelog, and the passing of GitHub.

Next:
  $ agentgrep insights levels

The narrative paragraph under Enrichment is what Phi-4-mini generates on the
GPU (@cuda, 4-bit, ~3.2 GB VRAM); everything above it is deterministic. The
summary is grounded: it only restates facts from the block above and invents
no specifics.

`--level graph --conversation-summaries` — similarity engine

$ agentgrep insights report \
    --level graph --conversation-summaries \
    --backend transformers \
    --auto-download-models --yes \
    --agent claude --since 4d --limit 140

Analyzed 140 records across 59 work areas.
level: graph   status: ok   records: 140  (sampled)
agents: claude (140)

Top terms: vcspull (426), git (308), github (178), com (172), commit (168), repos (162), repo (150), study (136), python (133), yaml (127)

Work areas:
  - conversation 4375e0c3-e0a  (53 records)  vcspull, git, github, com, repos
  - conversation eff87fc8-7b9  (19 records)  xfail, fix, bug, commit, pytest
  - conversation fe899c77-75c  (11 records)  tmux, set, keys, csi, xterm
  - conversation 98447f5c-0e5  (2 records)  command, local, caveat, messages, user
  - session msg_011FRDRy  (1 records)  rebase, succeeded, autostash, reapplied, let
  - session msg_011FwFdJ  (1 records)  great, research, results, let, read
  - session msg_012nNH1V  (1 records)  let, verify, state, clean, vcspull
  - session msg_0137gWHz  (1 records)  let, read, current, tdd, fix

Open threads:
  - [claude] `git diff` - i want to update tis to the latest and greatest terminal wise **Yes**, the line works for both of your terminals. ### Quick Answers | Question | A…

Enrichment: graph (sentence-transformers) — ok
  networked 29 prompts / 144 replies across 4 conversations; 0 workflows; reused 9 cached prompt vectors
    network: 29 prompts, 144 replies, 26 exchanges, 12 conversations, 48 edges
    store: ~/.cache/agentgrep/index/graph/graph.db
    similar prompts (recurring asks, clustered):
      [2x across 2 convos] study the commit style for .vcspull.yaml changes
      [2x across 1 convos] for each directory in ~/study/, do `vcspull discover ~/study/<directory>` and import all, 
      [2x across 1 convos] for tdd fix, do we have assurances / guards that ensure: wait wait. I'm not asking if you 
    forgotten-but-similar (nearest past conversations to the latest):
      0.84  8356621f-c05f-4b3f-b91e-6b02066b7c30
      0.84  a5225987-da5b-46e1-bdc0-1729aec33d2f
      0.83  9b9bc461-20e6-4c9e-a6db-30f3def03872
      0.82  fe899c77-75cf-4f13-b487-72f3203fe392
      0.82  eff87fc8-7b9b-449d-aed6-575185fd684a

Next:
  $ agentgrep insights levels
  $ agentgrep insights report --level llm

Here --conversation-summaries represents each conversation by an embedding of
its Phi-generated one-line summary (cached by content hash in the store's
summaries table) instead of a mean of raw prompt vectors. That is what drives
forgotten-but-similar: cosine scores between the latest conversation and
semantically near past ones you may have forgotten. Sample cached summaries:

The user encountered a TOML parsing error while updating the UV library.
I want to learn about leveraging marimo notebooks and best practices.
The user wants to understand the security risks of installing the OMP project via curl and a potential review of the project itself.

To reproduce these live, first run git stash apply stash@{0}: the graph engine
and transformers backend currently live in the v9 stash, not the working tree.
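The forgotten-but-similar ranking described above is, at its core, a nearest-neighbor lookup by cosine similarity. The following is a minimal pure-Python sketch of that idea; `cosine` and `nearest_past_conversations` are hypothetical names, and the real engine presumably delegates the search to its vector store rather than scanning in Python.

```python
import math

Vector = list[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_past_conversations(
    latest: Vector,
    past: dict[str, Vector],  # conversation id -> summary embedding
    top_k: int = 5,
) -> list[tuple[float, str]]:
    """Rank past conversations by similarity to the latest one, best first."""
    scored = [(cosine(latest, vector), conv_id) for conv_id, vector in past.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

Each returned pair corresponds to one line of the forgotten-but-similar block: a score followed by a conversation id.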

why: ADR 0005 defines insights as a staged report over the same local record stream as search (deterministic first, with an opt-in ladder of model-backed enrichers), but the branch carried only the ADR. This lands the engine so a report can be built independently of any frontend.
what:
- Add the agentgrep.insights package: a typed report model, ADR-0005 cache-directory precedence, a lazy backend loader with an injectable import seam, the deterministic builtin (L0) activity analysis, and the report orchestrator that resolves the effective level and records diagnostics. Probing uses importlib.util.find_spec so a builtin report never imports a heavy backend just to populate the levels field.
- Add a curated model registry with a urllib artifact downloader and a manifest sidecar, reused for a torch-free model2vec embedding model so a sentence-embedding model provisions the same way local LLM artifacts do.
- Add the L1-L5 enrichers behind capability probes: jinja2 HTML, sklearn TF-IDF/KMeans topics, sentence-transformers|model2vec embeddings with semantic clustering and dedupe, a pluggable tantivy+sqlite-vec or LanceDB persistent index, and an Ollama summary grounded in compact facts that streams tokens through the progress sink.
- Declare the insights-* optional-dependency extras.
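The find_spec probing mentioned in this commit can be sketched as follows. `importlib.util.find_spec` is the standard-library call; the helper names `backend_available` and `probe_levels` and the level-to-module mapping are assumptions for illustration only.

```python
import importlib.util


def backend_available(module_name: str) -> bool:
    """Report whether a top-level module is installed without importing it.

    Note: for dotted names find_spec imports the parent package, so a probe
    that must stay cold should pass top-level names only.
    """
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        return False


def probe_levels(requirements: dict[str, tuple[str, ...]]) -> dict[str, bool]:
    """Map each level to True when all of its (hypothetical) modules resolve."""
    return {
        level: all(backend_available(name) for name in modules)
        for level, modules in requirements.items()
    }
```

A level with no requirements is always available, which matches the idea that the deterministic builtin level needs no optional backend.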

why: Expose the report pipeline through the public CLI so the same surface is reachable from a terminal, with progress that streams to stderr without polluting machine-readable output on stdout.
what:
- Add the `agentgrep insights` subcommand tree (report, levels, doctor, setup, models, and cache) with typed argument dataclasses and a text/markdown/html/json/ndjson renderer.
- Add a console progress sink that streams phase lines, download bytes, and live LLM token deltas to stderr.
- Dispatch the new argument types from main(); keep every insights import function-local so the root --help path stays cold.
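The stderr/stdout split described in this commit might look roughly like the sketch below. `ConsoleProgressSink` and `emit_json_report` are hypothetical names and the phase-line format is invented; the point is only that progress and token deltas go to one stream while the machine-readable report goes to another.

```python
import json
import sys
from typing import Optional, TextIO


class ConsoleProgressSink:
    """Hypothetical sink: progress goes to stderr so stdout stays parseable."""

    def __init__(self, stream: Optional[TextIO] = None) -> None:
        self._stream = stream if stream is not None else sys.stderr

    def phase(self, name: str) -> None:
        self._stream.write(f"[{name}]\n")
        self._stream.flush()

    def token(self, delta: str) -> None:
        # Token deltas are written without a newline so they read as live text.
        self._stream.write(delta)
        self._stream.flush()


def emit_json_report(
    report: dict,
    sink: ConsoleProgressSink,
    out: Optional[TextIO] = None,
) -> None:
    """Write the report to stdout (or `out`) and progress to the sink only."""
    out = out if out is not None else sys.stdout
    sink.phase("render")
    json.dump(report, out)
    out.write("\n")
```

Because the two streams never mix, a pipeline consuming the JSON output does not need to filter out progress lines.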

why: The base package must pass with no optional extras installed, so the enrichers are exercised through the loader's import_module seam with fake backend modules rather than real scikit-learn, tantivy, or PyTorch.
what:
- Add unit tests for the deterministic activity analysis, the report orchestrator's level resolution and status, the lazy loader and typed errors, the model registry and urllib downloader, every enricher level, and the CLI argument parsing and dispatchers.
- Extend the import-time guard so importing agentgrep never loads the insights package or any optional backend.
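An injectable import seam of the kind this commit relies on can be sketched as a loader that accepts its import function as a parameter. `load_backend` and `BackendUnavailable` are assumed names, not the package's real API; the fake module is a plain `types.SimpleNamespace`.

```python
import importlib
import types
from collections.abc import Callable
from typing import Any


class BackendUnavailable(RuntimeError):
    """Typed error raised when an optional backend cannot be imported."""


def load_backend(
    name: str,
    import_module: Callable[[str], Any] = importlib.import_module,
) -> Any:
    """Import an optional backend lazily; tests pass a fake import_module."""
    try:
        return import_module(name)
    except ImportError as exc:
        raise BackendUnavailable(f"{name} is not installed") from exc


# In a test, a fake module stands in for the real dependency:
fake_sklearn = types.SimpleNamespace(__name__="sklearn.cluster")
loaded = load_backend("sklearn.cluster", import_module=lambda _name: fake_sklearn)
```

The design choice is that production code never needs to know whether it received the real module or a fake one, so the suite can run with no extras installed.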

what:
- Add the insights CLI guide covering the report, the level ladder, model provisioning, and cache diagnostics.
- Register the page in the CLI index toctree and card grid.
- Render the model-download and Ollama examples as text fences because they reach the network or a local daemon and cannot run as documentation tests.

… summaries
why: The MVP could fetch a Gemma .litertlm artifact but the llm level only ran Ollama, so a downloaded model could not actually produce a summary. LiteRT-LM has an installable in-process runtime, so the llm level can run a local Gemma model end-to-end without a daemon.
what:
- Add a litert-lm backend to the llm level, selected by --backend, that loads the cached .litertlm via litert_lm.Engine, streams the reply through the progress sink, and provisions the model on demand.
- Default the LiteRT token budget to 2048: the budget is the total prompt+output KV-cache size, and undersizing it surfaces as an opaque tensor-allocation failure rather than a clear message.
- Quiet the LiteRT-LM C++ runtime to ERROR so streamed output is not buried under model-metadata logging on stderr.
- Order llm backends by the requested --backend, declare the insights-llm-litert extra, keep litert_lm out of the import path, and cover the runtime and the not-provisioned path with injected-fake tests.

…ted transformers LLM defaults
why: The insights ladder ended at the L5 narrative summary over a single gated Gemma model. Users had no way to see which prompts they repeat, which past conversations resemble the one in front of them, or which workflows are worth saving as Skills, and the only GPU summary path required an HF token plus an accepted license. This adds the `graph` enrichment level and makes the local-LLM backend work out of the box.
what:
- Add the `graph` level: a prompt/reply/conversation similarity network (sentence-transformers or model2vec embeddings, optional HDBSCAN archetype clustering, sqlite-vec or LanceDB IVF-PQ vector store) that surfaces recurring asks, forgotten-but-similar conversations, and mined workflows, persisted incrementally in a content-hash-keyed graph store.
- Draft reusable Skills (SKILL.md) from mined workflows, print-by-default, also exposed over an `insights_skills` MCP tool.
- Add a transformers/CUDA LLM backend with a non-gated default chain that needs no HF token: Phi-4-mini (4-bit, native phi3), SmolLM2-1.7B (fp16), Granite-3.3-2b (4-bit), tried in order until one loads. gemma-3-1b-it stays curated but gated and is no longer the default. 4-bit weights load through bitsandbytes NF4 behind the insights-llm-transformers-quant extra.
- Add optional conversation-summary vectors: each conversation is embedded by a cached LLM one-line summary instead of a prompt mean, sharpening forgotten-but-similar.
- Bundle the #68 human-typed prompt detection (human: query field, Claude/Codex authored-turn tagging) so the branch hand-tests self-contained; it overlaps with that PR and should be reconciled at merge time.

…own backends, list rerankers
why: An end-to-end audit of the insights workflow surfaced three defects. The non-gated transformers backend was never connected to the skill namer, so `skills --llm --backend transformers` silently produced deterministic names. An invalid `--backend` was silently mapped to the first available backend instead of being rejected. And the curated reranker model kind could not be listed or installed from the CLI.
what:
- Wire the transformers backend into `_build_skill_namer`, reusing the default-chain + first-working loader so `skills --llm --backend transformers` names skills with the local model, falling back to deterministic naming only when no model loads.
- Reject an unknown `--backend` for the llm level up front with a clear message listing the valid backends, instead of silently running the first available one.
- Expose the `reranker` model kind in `models list/available/install --level reranker` (parser choices, InsightsModelsArgs.kind, and the listing/install render branches).
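The up-front rejection of an unknown `--backend` could be sketched as below. The tuple of valid names is an assumption pieced together from the backends this PR mentions (only `transformers` is confirmed as a flag value here), and `resolve_backend` and `UnknownBackendError` are hypothetical names.

```python
from collections.abc import Sequence
from typing import Optional

# Assumed flag values; the source confirms only "--backend transformers".
VALID_LLM_BACKENDS = ("ollama", "litert-lm", "transformers")


class UnknownBackendError(ValueError):
    """Raised when --backend names something the llm level does not know."""


def resolve_backend(
    requested: Optional[str],
    available: Sequence[str],
    valid: Sequence[str] = VALID_LLM_BACKENDS,
) -> str:
    """Validate an explicit request; auto-pick only when nothing was requested."""
    if requested is None:
        for name in valid:
            if name in available:
                return name
        raise LookupError("no llm backend is available")
    if requested not in valid:
        raise UnknownBackendError(
            f"unknown --backend {requested!r}; valid backends: {', '.join(valid)}"
        )
    return requested
```

The key difference from the defect described above is that an explicit but invalid value now fails loudly instead of being mapped to whichever backend happens to be installed.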

why: The optional LanceDB vector backend was only seam-tested, never run against a live lancedb, and its create_index call had drifted out of date -- the config=IvfPq(...) keyword no longer exists, so building the index raised TypeError on any real corpus.
what:
- Call create_index with the current metric / index_type / num_partitions / num_sub_vectors signature.
- Derive num_sub_vectors from the embedding dimension, stepping down to a divisor so PQ accepts it, and size num_partitions as the integer sqrt of the row count.
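The parameter derivation in the second bullet is plain arithmetic and can be sketched without touching lancedb at all. The helper names and the preferred starting value of 16 sub-vectors are assumptions; only the "step down to a divisor" and "integer sqrt of the row count" rules come from the commit message.

```python
import math


def pq_sub_vectors(dim: int, preferred: int = 16) -> int:
    """Largest count at or below `preferred` that divides the dimension evenly.

    Product quantization splits each vector into equal-sized chunks, so the
    number of sub-vectors must be a divisor of the embedding dimension.
    """
    count = max(1, min(preferred, dim))
    while dim % count != 0:
        count -= 1
    return count


def ivf_partitions(row_count: int) -> int:
    """Integer square root of the row count, never below one partition."""
    return max(1, math.isqrt(row_count))
```

For a 384-dimensional embedding the preferred value already divides evenly, while an awkward dimension such as 100 steps down until it reaches 10.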

…rompts
why: insights skills was hardcoded to conversation scope -- the 200k-record transcript pile that is ~98% tool and assistant output -- and hard-capped at 8 unranked suggestions. It surfaced a handful of noisy picks instead of the user's real recurring asks. The typed-prompt corpus (claude.history) is the clean signal, and dense clustering over it is seconds, not a wall.
what:
- Add --scope {prompts,conversations,all} to insights skills, defaulting to prompts so the clean typed asks drive the mining; conversation scope stays available for sequence-based macro workflows.
- Lift the hard cap from 8 to 50 and rank recurring-ask templates by reuse value (support times distinct conversations) so the most broadly-repeated asks lead.
- Require a macro chain to recur at least three times before it leads the list, so a barely-recurring sequence no longer outranks a broadly repeated template.
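One possible reading of the ranking rules in this commit is sketched below: reuse value is support multiplied by distinct conversations, a macro chain leads only once it has recurred at least three times, and the list is capped at 50. The `Suggestion` type, the `kind` values, and the exact tie-breaking are assumptions, since the commit message does not spell them out.

```python
from collections.abc import Iterable
from dataclasses import dataclass


@dataclass(frozen=True)
class Suggestion:
    label: str
    kind: str  # "template" (recurring ask) or "macro" (sequence chain); assumed
    support: int  # how many times it recurred
    conversations: int  # distinct conversations it appeared in

    @property
    def reuse_value(self) -> int:
        return self.support * self.conversations


def rank_suggestions(
    items: Iterable[Suggestion],
    limit: int = 50,
    macro_min_support: int = 3,
) -> list[Suggestion]:
    """Well-supported macros lead; everything else sorts by reuse value."""

    def sort_key(item: Suggestion) -> tuple[int, int, str]:
        leads = item.kind == "macro" and item.support >= macro_min_support
        return (0 if leads else 1, -item.reuse_value, item.label)

    return sorted(items, key=sort_key)[:limit]
```

Under this reading, a macro seen only twice is ranked alongside templates by its (small) reuse value rather than being placed first.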

tony temporarily deployed to docs June 14, 2026 18:36 — with GitHub Actions Inactive

tony temporarily deployed to docs June 14, 2026 19:50 — with GitHub Actions Inactive

tony force-pushed the insights-graph-engine branch from 7433eff to 534d7af Compare June 14, 2026 21:02

tony changed the base branch from workflow-00 to agentgrep-human-typed-prompts June 15, 2026 00:22

tony force-pushed the agentgrep-human-typed-prompts branch from 22f79f3 to ca905ec Compare June 22, 2026 11:41

tony force-pushed the insights-graph-engine branch from f01d7dc to 0b7599f Compare June 22, 2026 11:59

tony temporarily deployed to docs June 22, 2026 11:59 — with GitHub Actions Inactive

tony added 9 commits June 27, 2026 12:32

tony force-pushed the agentgrep-human-typed-prompts branch from ca905ec to ea37d76 Compare June 27, 2026 18:47

tony force-pushed the insights-graph-engine branch from 0b7599f to 85155ae Compare June 27, 2026 18:47

tony temporarily deployed to docs June 27, 2026 18:47 — with GitHub Actions Inactive

tony wants to merge 9 commits into agentgrep-human-typed-prompts from insights-graph-engine
