
Insights graph similarity engine + non-gated transformers LLM defaults #69

Open
tony wants to merge 9 commits into agentgrep-human-typed-prompts from insights-graph-engine

tony commented Jun 14, 2026

Summary

Adds the graph insights enrichment level — a prompt/reply/conversation similarity network that surfaces recurring asks, forgotten-but-similar past conversations, and mined workflows drafted as reusable Skills — and makes the local-LLM backend usable with no Hugging Face token by defaulting to non-gated, brand-name models.

  • Add the graph level: sentence-transformers or model2vec embeddings, optional HDBSCAN archetype clustering, and a sqlite-vec or LanceDB IVF-PQ vector store, persisted incrementally in a content-hash-keyed graph store.
  • Add a transformers/CUDA LLM backend with a non-gated default fallback chain — Phi-4-mini (4-bit, native phi3) → SmolLM2-1.7B (fp16) → Granite-3.3-2b (4-bit), tried in order until one loads. gemma-3-1b-it stays curated but gated and is no longer the default.
  • Add optional conversation-summary vectors: each conversation is embedded by a cached LLM one-line summary instead of a prompt mean, sharpening forgotten-but-similar.
  • Add SKILL.md drafting from mined workflows (print-by-default) plus an insights_skills MCP tool.
  • Bundle the human-typed prompt detection from #68 (Distinguish human-typed prompts from tool results in Claude history): the human: query field and Claude/Codex authored-turn tagging, so the branch can be hand-tested self-contained. It overlaps with that PR and should be reconciled at merge.
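
The incremental, content-hash-keyed persistence described in the first bullet can be illustrated with a small sketch. This is not the branch's graph store: `content_key`, `embed_incrementally`, the in-memory `dict` store, and the SHA-256 key are all assumptions made for illustration. It only shows the idea that a text whose hash is already stored is never embedded again.

```python
import hashlib
from typing import Callable, Dict, List, Sequence, Tuple

Vector = List[float]


def content_key(text: str) -> str:
    """Stable key derived only from the text content (SHA-256 is an assumption)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def embed_incrementally(
    texts: Sequence[str],
    store: Dict[str, Vector],
    embed: Callable[[List[str]], List[Vector]],
) -> Tuple[List[Vector], int]:
    """Return one vector per text plus the number of vectors reused from ``store``."""
    keys = [content_key(text) for text in texts]
    # Count hits before the store is updated.
    reused = sum(1 for key in keys if key in store)
    # Collect each missing text once, preserving first-seen order.
    missing: Dict[str, str] = {}
    for key, text in zip(keys, texts):
        if key not in store and key not in missing:
            missing[key] = text
    if missing:
        # Only the misses reach the (expensive) embedding call.
        for key, vector in zip(missing, embed(list(missing.values()))):
            store[key] = vector
    return [store[key] for key in keys], reused
```

A second run over mostly unchanged prompts then only pays for the new ones, which matches the "reused 9 cached prompt vectors" line in the sample report.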

Design decisions

  • Non-gated by default, gated by opt-in: the curated defaults are recognizable-brand, permissively-licensed, ungated models that fit a 4 GB GPU. The gated Gemma remains reachable via --model gemma-3-1b-it.
  • Native phi3, not trust_remote_code: Phi-4-mini ships a vendored modeling_phi3.py that imports LossKwargs, removed in transformers 5.x. Loading via the built-in Phi3ForCausalLM (trust_remote_code=False) tracks the installed transformers and removes the custom-code trust boundary.
  • fp16 SmolLM2 as the quant-free safety net: the chain is ordered so a host without a working 4-bit library still gets a token-free default; bitsandbytes NF4 (the insights-llm-transformers-quant extra) only gates the two quantized candidates.
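
The "tried in order until one loads" behaviour can be sketched as follows. The `Candidate` type, the `first_working` helper, and the loader callables are hypothetical; the real loaders would call into transformers and bitsandbytes, which this sketch deliberately avoids so that it runs anywhere.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class Candidate:
    """One entry in the fallback chain (hypothetical shape)."""

    name: str
    load: Callable[[], object]


def first_working(candidates: Sequence[Candidate]) -> Tuple[str, object, List[str]]:
    """Try each candidate in order; return the first that loads plus skipped errors."""
    errors: List[str] = []
    for candidate in candidates:
        try:
            return candidate.name, candidate.load(), errors
        except Exception as exc:  # any load failure moves on to the next candidate
            errors.append(f"{candidate.name}: {exc}")
    raise RuntimeError("no candidate model loaded: " + "; ".join(errors))
```

On a host without a working 4-bit library, the two quantized loaders would raise and the fp16 candidate in the middle of the chain would be the one returned, which is the safety-net ordering described above.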

Test plan

  • uv run ruff check . and uv run ty check
  • uv run pytest --reruns 0 (full suite green), including offline seam tests for the fallback chain and the 4-bit path with fake torch/transformers/bitsandbytes
  • just build-docs
  • Validated on an RTX 3050 Ti (4 GB): bitsandbytes 0.49.2 works on torch 2.12+cu130; Phi-4-mini 4-bit loads at ~3.2 GB VRAM and produced a real L5 narrative summary plus real per-conversation summaries on the GPU.

Sample report output

agentgrep insights — sample reports

Real output captured while validating Increment F against your local Claude
history (RTX 3050 Ti, 4 GB). Every report has the same shape: a deterministic
facts block (top terms, work areas, timeline, repeated instructions, open
threads) followed by a level-specific Enrichment: block. Default renderer
is text; --format json|ndjson|html|markdown are also available.


--level llm --backend transformers — Phi-4-mini narrative summary

$ agentgrep insights report \
    --level llm --backend transformers \
    --auto-download-models --yes \
    --agent claude --since 21d --limit 400
Analyzed 400 records across 83 work areas and 8 days.
level: llm   status: ok   records: 400  (sampled)
agents: claude (400)
range: 2026-06-07T23:32:09.659000Z → 2026-06-14T17:29:08.019000Z

Top terms: study (238), graph (235), real (169), claude (159), stash (154), prompt (150), conversation (146), torch (139), commit (130), amd64 (129)

Work areas:
  - session 6787d58f-d33  (34 records)  graph, real, stash, prompt, conversation
  - session c2b51ae5-39e  (32 records)  study, docs, adr, python, agentgrep
  - session 538b449b-2c4  (23 records)  study, commit, branch, merge, python
  - session 5c76f2fc-cd6  (15 records)  agy, review, plugins, gpt, study
  - session 2e19a646-607  (12 records)  pnpm, mobx, project, study, instead
  - session 719ce2c1-f9a  (12 records)  changelog, merge, commit, push, changes
  - session 44535a73-90f  (11 records)  commit, review, truncation, changelog, merge
  - session 500d8a2a-2aa  (11 records)  github, com, https, nvidia, nousresearch

Timeline:
  2026-06-07    1
  2026-06-08   30
  2026-06-09   14
  2026-06-10    5
  2026-06-11   20
  2026-06-12   27
  2026-06-13   84
  2026-06-14  219

Repeated instructions:
  - commit *your* files (leave the others alone)
  - /resume
  - /pr:merge-commit
  - /code-review:code-review
  - /new
  - merge in a merge commit with the above via gh, not git merge
  - study the branch, gain situational awareness
  - study ~/work/notes/ and ~/study/<domain,language>/<project> for details in

Open threads:
  - [claude] You're using websearch, you can't search hugging face? O_o isn't there ways you can search by the most popular in these parameters?
  - [claude] /weave:ask Which demos here look most promising in terms of DX, typing, testability, maintainability, expressiveness?
  - [claude] what would ~/work/python/libtmux-mcp need to clue it in to catch chainable oppurtunities that could be batched?
  - [claude] study ~/study/c/tmux - would split-window be possible on a window in a single chained command?
  - [claude] why didn't you start them in bulk? don't you have bulk tools?
  - [claude] /changelog:changelog do we have a hgih level changelog for what this branch does?
  - [claude] look at gh, is it passing?
  - [claude] study the codebase, i was feeding some agent instructions that may have created dumb doc-based tests where we test that documentation is mentioned (and created…
  - [claude] "agentgrep now treats Google Antigravity as two separate backends: `--agent antigravity-cli` for CLI prompt history and `--agent antigravity-ide` for IDE-local…
  - [claude] is there a way you could test this in a sandbox and see if it works?

Enrichment: llm (transformers) — ok
  summarized via transformers:phi-4-mini-instruct

    The developer worked on a series of AI-assistant prompts, analyzing 400 records from June 7, 2026, to June 14, 2026, using the agent "claude." The top terms in the records included study, graph, real, claude, stash, prompt, conversation, torch, commit, and amd64. The busiest day was June 14, 2026, with 219 records analyzed. Several open threads were identified, including discussions about websearch, hugging face, demos, typing, testability, tmux, bulk tools, changelogs, GitHub, and studying the codebase. Unresolved issues included the possibility of splitting windows in tmux, the use of bulk tools, the creation of a high-level changelog, and the passing of GitHub.

Next:
  $ agentgrep insights levels

The final paragraph is what Phi-4-mini generates on the GPU (@cuda, 4-bit,
~3.2 GB VRAM); everything above it is deterministic. It is grounded — it only
restates facts from the block above and invents no specifics.


--level graph --conversation-summaries — similarity engine

$ agentgrep insights report \
    --level graph --conversation-summaries \
    --backend transformers \
    --auto-download-models --yes \
    --agent claude --since 4d --limit 140
Analyzed 140 records across 59 work areas.
level: graph   status: ok   records: 140  (sampled)
agents: claude (140)

Top terms: vcspull (426), git (308), github (178), com (172), commit (168), repos (162), repo (150), study (136), python (133), yaml (127)

Work areas:
  - conversation 4375e0c3-e0a  (53 records)  vcspull, git, github, com, repos
  - conversation eff87fc8-7b9  (19 records)  xfail, fix, bug, commit, pytest
  - conversation fe899c77-75c  (11 records)  tmux, set, keys, csi, xterm
  - conversation 98447f5c-0e5  (2 records)  command, local, caveat, messages, user
  - session msg_011FRDRy  (1 records)  rebase, succeeded, autostash, reapplied, let
  - session msg_011FwFdJ  (1 records)  great, research, results, let, read
  - session msg_012nNH1V  (1 records)  let, verify, state, clean, vcspull
  - session msg_0137gWHz  (1 records)  let, read, current, tdd, fix

Open threads:
  - [claude] `git diff` - i want to update tis to the latest and greatest terminal wise **Yes**, the line works for both of your terminals. ### Quick Answers | Question | A…

Enrichment: graph (sentence-transformers) — ok
  networked 29 prompts / 144 replies across 4 conversations; 0 workflows; reused 9 cached prompt vectors
    network: 29 prompts, 144 replies, 26 exchanges, 12 conversations, 48 edges
    store: ~/.cache/agentgrep/index/graph/graph.db
    similar prompts (recurring asks, clustered):
      [2x across 2 convos] study the commit style for .vcspull.yaml changes
      [2x across 1 convos] for each directory in ~/study/, do `vcspull discover ~/study/<directory>` and import all, 
      [2x across 1 convos] for tdd fix, do we have assurances / guards that ensure: wait wait. I'm not asking if you 
    forgotten-but-similar (nearest past conversations to the latest):
      0.84  8356621f-c05f-4b3f-b91e-6b02066b7c30
      0.84  a5225987-da5b-46e1-bdc0-1729aec33d2f
      0.83  9b9bc461-20e6-4c9e-a6db-30f3def03872
      0.82  fe899c77-75cf-4f13-b487-72f3203fe392
      0.82  eff87fc8-7b9b-449d-aed6-575185fd684a

Next:
  $ agentgrep insights levels
  $ agentgrep insights report --level llm

Here --conversation-summaries represents each conversation by an embedding of
its Phi-generated one-line summary (cached by content hash in the store's
summaries table) instead of a mean of raw prompt vectors. That is what drives
forgotten-but-similar: cosine scores between the latest conversation and
semantically near past ones you may have forgotten. Sample cached summaries:

The user encountered a TOML parsing error while updating the UV library.
I want to learn about leveraging marimo notebooks and best practices.
The user wants to understand the security risks of installing the OMP project via curl and a potential review of the project itself.
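
For readers unfamiliar with the scoring, here is a minimal sketch of the cosine ranking behind a forgotten-but-similar list. The function names and the plain-list vectors are assumptions; the branch itself uses a vector store rather than a brute-force scan.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_past(
    latest: Sequence[float],
    past: Dict[str, Sequence[float]],
    k: int = 5,
) -> List[Tuple[float, str]]:
    """Return the ``k`` past conversations most similar to the latest one."""
    scored = [(cosine(latest, vector), conv_id) for conv_id, vector in past.items()]
    # Highest similarity first; ties keep insertion order because sort is stable.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

Each conversation vector here would be the embedding of its one-line summary rather than the mean of its prompt vectors.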

To reproduce these live, run git stash apply stash@{0} first: the graph engine
and transformers backend currently live in the v9 stash, not the working tree.

tony force-pushed the insights-graph-engine branch from 7433eff to 534d7af on June 14, 2026 21:02
tony changed the base branch from workflow-00 to agentgrep-human-typed-prompts on June 15, 2026 00:22
tony force-pushed the agentgrep-human-typed-prompts branch from 22f79f3 to ca905ec on June 22, 2026 11:41
tony force-pushed the insights-graph-engine branch from f01d7dc to 0b7599f on June 22, 2026 11:59
tony added 9 commits June 27, 2026 12:32
why: ADR 0005 defines insights as a staged report over the same local
record stream as search — deterministic first, with an opt-in ladder of
model-backed enrichers — but the branch carried only the ADR. This lands
the engine so a report can be built independently of any frontend.

what:
- Add the agentgrep.insights package: a typed report model, ADR-0005
  cache-directory precedence, a lazy backend loader with an injectable
  import seam, the deterministic builtin (L0) activity analysis, and the
  report orchestrator that resolves the effective level and records
  diagnostics. Probing uses importlib.util.find_spec so a builtin report
  never imports a heavy backend just to populate the levels field.
- Add a curated model registry with a urllib artifact downloader and a
  manifest sidecar, reused for a torch-free model2vec embedding model so
  a sentence-embedding model provisions the same way local LLM artifacts
  do.
- Add the L1-L5 enrichers behind capability probes: jinja2 HTML, sklearn
  TF-IDF/KMeans topics, sentence-transformers|model2vec embeddings with
  semantic clustering and dedupe, a pluggable tantivy+sqlite-vec or
  LanceDB persistent index, and an Ollama summary grounded in compact
  facts that streams tokens through the progress sink.
- Declare the insights-* optional-dependency extras.
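
The find_spec probing mentioned in the first bullet can be sketched with the standard library alone. `backend_available` and `probe_levels` are hypothetical names, and the level-to-module mapping in the usage note is illustrative only.

```python
import importlib.util
from typing import Dict


def backend_available(module_name: str) -> bool:
    """Report whether a top-level module is installed, without importing it.

    Note: for a dotted name, find_spec imports the parent package, so this
    sketch is meant for top-level names such as "sklearn" or "jinja2".
    """
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # ValueError: a module already in sys.modules has no __spec__.
        return False


def probe_levels(requirements: Dict[str, str]) -> Dict[str, bool]:
    """Map each level name to whether its backing module can be found."""
    return {level: backend_available(module) for level, module in requirements.items()}
```

For example, `probe_levels({"html": "jinja2", "topics": "sklearn"})` would fill in a levels field without paying the import cost of either package.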
why: Expose the report pipeline through the public CLI so the same
surface is reachable from a terminal, with progress that streams to
stderr without polluting machine-readable output on stdout.

what:
- Add the `agentgrep insights` subcommand tree — report, levels, doctor,
  setup, models, and cache — with typed argument dataclasses and a
  text/markdown/html/json/ndjson renderer.
- Add a console progress sink that streams phase lines, download bytes,
  and live LLM token deltas to stderr.
- Dispatch the new argument types from main(); keep every insights
  import function-local so the root --help path stays cold.
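
A minimal sketch of the stderr/stdout split described above, assuming a hypothetical `ConsoleProgressSink` with `phase`, `token`, and `end_tokens` methods (the actual sink's interface is not shown in this PR). Progress goes to one stream and the machine-readable report to another.

```python
import json
import sys
from typing import Optional, TextIO


class ConsoleProgressSink:
    """Write human-facing progress to stderr so stdout stays machine-readable."""

    def __init__(self, stream: Optional[TextIO] = None) -> None:
        self._stream = stream if stream is not None else sys.stderr

    def phase(self, message: str) -> None:
        self._stream.write(f"[insights] {message}\n")
        self._stream.flush()

    def token(self, delta: str) -> None:
        # Token deltas are written as they arrive, without a newline.
        self._stream.write(delta)
        self._stream.flush()

    def end_tokens(self) -> None:
        self._stream.write("\n")
        self._stream.flush()


def run_report(sink: ConsoleProgressSink, out: Optional[TextIO] = None) -> None:
    """Toy driver: stream progress through the sink, then emit JSON on ``out``."""
    target = out if out is not None else sys.stdout
    sink.phase("building report")
    for delta in ("The ", "developer"):
        sink.token(delta)
    sink.end_tokens()
    json.dump({"status": "ok"}, target)
    target.write("\n")
```

With this split, piping the command's stdout into a JSON parser keeps working while token deltas still appear in the terminal.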
why: The base package must pass with no optional extras installed, so the
enrichers are exercised through the loader's import_module seam with fake
backend modules rather than real scikit-learn, tantivy, or PyTorch.

what:
- Add unit tests for the deterministic activity analysis, the report
  orchestrator's level resolution and status, the lazy loader and typed
  errors, the model registry and urllib downloader, every enricher level,
  and the CLI argument parsing and dispatchers.
- Extend the import-time guard so importing agentgrep never loads the
  insights package or any optional backend.
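
One way to express an import-time guard like the one in the last bullet is to import the package in a fresh interpreter and inspect sys.modules. This is a sketch under that assumption, not the test in the branch; `loaded_after_import` is a hypothetical helper.

```python
import json
import subprocess
import sys
from typing import List


def loaded_after_import(statement: str, watched: List[str]) -> List[str]:
    """Run ``statement`` in a fresh interpreter and report which watched
    modules (or their submodules) ended up in sys.modules."""
    probe = (
        "import json, sys\n"
        f"{statement}\n"
        f"watched = {watched!r}\n"
        "hits = sorted(w for w in watched\n"
        "              if any(n == w or n.startswith(w + '.') for n in sys.modules))\n"
        "print(json.dumps(hits))\n"
    )
    result = subprocess.run(
        [sys.executable, "-c", probe],
        check=True,
        capture_output=True,
        text=True,
    )
    # Read only the last line in case site hooks print something earlier.
    return json.loads(result.stdout.strip().splitlines()[-1])
```

A guard test could then assert that `loaded_after_import("import agentgrep", ["agentgrep.insights", "torch"])` returns an empty list.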
what:
- Add the insights CLI guide covering the report, the level ladder,
  model provisioning, and cache diagnostics.
- Register the page in the CLI index toctree and card grid.
- Render the model-download and Ollama examples as text fences because
  they reach the network or a local daemon and cannot run as
  documentation tests.
… summaries

why: The MVP could fetch a Gemma .litertlm artifact but the llm level only
ran Ollama, so a downloaded model could not actually produce a summary.
LiteRT-LM has an installable in-process runtime, so the llm level can run
a local Gemma model end-to-end without a daemon.

what:
- Add a litert-lm backend to the llm level, selected by --backend, that
  loads the cached .litertlm via litert_lm.Engine, streams the reply
  through the progress sink, and provisions the model on demand.
- Default the LiteRT token budget to 2048: the budget is the total
  prompt+output KV-cache size, and undersizing it surfaces as an opaque
  tensor-allocation failure rather than a clear message.
- Quiet the LiteRT-LM C++ runtime to ERROR so streamed output is not
  buried under model-metadata logging on stderr.
- Order llm backends by the requested --backend, declare the
  insights-llm-litert extra, keep litert_lm out of the import path, and
  cover the runtime and the not-provisioned path with injected-fake tests.
…ted transformers LLM defaults

why: The insights ladder ended at the L5 narrative summary over a single
gated Gemma model. Users had no way to see which prompts they repeat,
which past conversations resemble the one in front of them, or which
workflows are worth saving as Skills — and the only GPU summary path
required an HF token plus an accepted license. This adds the `graph`
enrichment level and makes the local-LLM backend work out of the box.

what:
- Add the `graph` level: a prompt/reply/conversation similarity network
  (sentence-transformers or model2vec embeddings, optional HDBSCAN
  archetype clustering, sqlite-vec or LanceDB IVF-PQ vector store) that
  surfaces recurring asks, forgotten-but-similar conversations, and mined
  workflows, persisted incrementally in a content-hash-keyed graph store.
- Draft reusable Skills (SKILL.md) from mined workflows, print-by-default,
  also exposed over an `insights_skills` MCP tool.
- Add a transformers/CUDA LLM backend with a non-gated default chain that
  needs no HF token: Phi-4-mini (4-bit, native phi3), SmolLM2-1.7B (fp16),
  Granite-3.3-2b (4-bit), tried in order until one loads. gemma-3-1b-it
  stays curated but gated and is no longer the default. 4-bit weights load
  through bitsandbytes NF4 behind the insights-llm-transformers-quant extra.
- Add optional conversation-summary vectors: each conversation is embedded
  by a cached LLM one-line summary instead of a prompt mean, sharpening
  forgotten-but-similar.
- Bundle the #68 human-typed prompt detection (human: query field,
  Claude/Codex authored-turn tagging) so the branch hand-tests
  self-contained; it overlaps with that PR and should be reconciled at
  merge time.
…own backends, list rerankers

why: An end-to-end audit of the insights workflow surfaced three defects.
The non-gated transformers backend was never connected to the skill
namer, so `skills --llm --backend transformers` silently produced
deterministic names. An invalid `--backend` was silently mapped to the
first available backend instead of being rejected. And the curated
reranker model kind could not be listed or installed from the CLI.

what:
- Wire the transformers backend into `_build_skill_namer`, reusing the
  default-chain + first-working loader so `skills --llm --backend
  transformers` names skills with the local model, falling back to
  deterministic naming only when no model loads.
- Reject an unknown `--backend` for the llm level up front with a clear
  message listing the valid backends, instead of silently running the
  first available one.
- Expose the `reranker` model kind in `models list/available/install
  --level reranker` (parser choices, InsightsModelsArgs.kind, and the
  listing/install render branches).
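
The up-front rejection of an unknown `--backend` can be sketched as below. The backend name strings, the `UnknownBackendError` type, and the choice to keep the remaining backends as fallbacks after the requested one are assumptions for illustration.

```python
from typing import List, Optional, Sequence

# Assumed identifiers for the three llm backends named in this PR.
VALID_LLM_BACKENDS = ("ollama", "litert-lm", "transformers")


class UnknownBackendError(ValueError):
    """Raised when --backend names a backend the llm level does not know."""


def order_backends(
    requested: Optional[str],
    available: Sequence[str] = VALID_LLM_BACKENDS,
) -> List[str]:
    """Put the requested backend first, or fail fast if it is not valid."""
    if requested is None:
        return list(available)
    if requested not in available:
        raise UnknownBackendError(
            f"unknown --backend {requested!r}; valid backends: {', '.join(available)}"
        )
    return [requested] + [name for name in available if name != requested]
```

Failing before any backend is probed keeps a typo from silently running a different model than the one the user asked for.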
why: The optional LanceDB vector backend was only seam-tested, never run
against a live lancedb, and its create_index call had drifted out of date
-- the config=IvfPq(...) keyword no longer exists, so building the index
raised TypeError on any real corpus.

what:
- Call create_index with the current metric / index_type /
  num_partitions / num_sub_vectors signature.
- Derive num_sub_vectors from the embedding dimension, stepping down to a
  divisor so PQ accepts it, and size num_partitions as the integer sqrt
  of the row count.
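
The parameter derivation in the last bullet is plain arithmetic and can be sketched without lancedb. The function names and the starting target of 16 sub-vectors are assumptions; only the "step down to a divisor" and "integer sqrt of the row count" rules come from the commit message.

```python
import math


def pick_num_sub_vectors(dim: int, target: int = 16) -> int:
    """Largest value <= target that divides the embedding dimension evenly.

    Product quantization splits each vector into equal-width chunks, so the
    sub-vector count must divide ``dim``. The default target is an assumption.
    """
    if dim < 1:
        raise ValueError("embedding dimension must be positive")
    candidate = max(1, min(target, dim))
    while dim % candidate != 0:
        candidate -= 1  # always terminates: 1 divides every dimension
    return candidate


def pick_num_partitions(row_count: int) -> int:
    """Integer square root of the row count, with a floor of one partition."""
    return max(1, math.isqrt(max(0, row_count)))
```

For a 384-dimensional embedding the target of 16 already divides evenly, while a 100-dimensional one steps down to 10.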
…rompts

why: insights skills was hardcoded to conversation scope -- the 200k-record
transcript pile that is ~98% tool and assistant output -- and hard-capped
at 8 unranked suggestions. It surfaced a handful of noisy picks instead of
the user's real recurring asks. The typed-prompt corpus (claude.history)
is the clean signal, and dense clustering over it is seconds, not a wall.

what:
- Add --scope {prompts,conversations,all} to insights skills, defaulting
  to prompts so the clean typed asks drive the mining; conversation scope
  stays available for sequence-based macro workflows.
- Lift the hard cap from 8 to 50 and rank recurring-ask templates by reuse
  value (support times distinct conversations) so the most broadly-repeated
  asks lead.
- Require a macro chain to recur at least three times before it leads the
  list, so a barely-recurring sequence no longer outranks a broadly
  repeated template.
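
The ranking rules above can be sketched as a sort key. The `Suggestion` shape and the way macro chains are interleaved with templates are assumptions: the commit message only states the reuse-value score, the cap of 50, and the three-recurrence threshold for a macro chain to lead.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

MIN_MACRO_SUPPORT = 3  # a macro chain must recur this often before it may lead
DEFAULT_CAP = 50


@dataclass(frozen=True)
class Suggestion:
    title: str
    kind: str  # "template" (recurring ask) or "macro" (sequence-based chain)
    support: int  # how many times it recurred
    conversations: int  # distinct conversations it appeared in

    @property
    def reuse_value(self) -> int:
        return self.support * self.conversations


def rank(suggestions: Sequence[Suggestion], cap: int = DEFAULT_CAP) -> List[Suggestion]:
    """Order by reuse value, demoting macro chains below the support threshold."""

    def sort_key(item: Suggestion) -> Tuple[bool, int]:
        demoted = item.kind == "macro" and item.support < MIN_MACRO_SUPPORT
        # False sorts before True, so eligible items lead; then highest value first.
        return (demoted, -item.reuse_value)

    return sorted(suggestions, key=sort_key)[:cap]
```

In this sketch a macro chain seen twice sinks below every eligible suggestion regardless of its score, so it cannot outrank a broadly repeated template.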
tony force-pushed the agentgrep-human-typed-prompts branch from ca905ec to ea37d76 on June 27, 2026 18:47
tony force-pushed the insights-graph-engine branch from 0b7599f to 85155ae on June 27, 2026 18:47