Merged
103 changes: 103 additions & 0 deletions docs/reference/cli.md
@@ -66,6 +66,7 @@ datasight [OPTIONS] COMMAND [ARGS]...
- `measures`: Surface likely measures and default aggregations.
- `quality`: Audit data quality - nulls, suspicious ranges, and date coverage.
- `tidy`: Detect untidy column shapes and reshape into long form.
- `grounding`: Detect and repair drift between grounding files and the live schema.
- `integrity`: Audit cross-table referential integrity - keys, orphans, and join risks.
- `distribution`: Profile value distributions - percentiles, outliers, and measure flags.
- `validate`: Run declarative validation rules against the database.
@@ -437,10 +438,16 @@ Runs each question from queries.yaml through the full LLM pipeline,
executes the generated SQL, and compares results against expected values.
Use this to validate correctness across different models and providers.

Before the LLM phase, `verify` runs a static schema-drift check that
flags references to columns or tables that no longer exist in the live
database. `--static-only` skips the LLM phase entirely;
`--skip-grounding-check` skips the static check.

Examples:

```
datasight verify
datasight verify --static-only
datasight verify --queries verification.yaml
datasight verify --model gpt-4o
```
@@ -470,6 +477,8 @@ datasight verify [OPTIONS]
| `--project-dir` | Project directory containing .env and queries.yaml. Default: `.`. |
| `--model` | Model name (overrides .env). |
| `--queries` | Path to queries YAML file (default: queries.yaml in project dir). |
| `--static-only` | Run only the cheap schema-drift check (no LLM, no query execution). Reports unresolved column/table references in queries.yaml, schema_description.md, and time_series.yaml against the live DB. |
| `--skip-grounding-check` | Skip the static drift check that normally runs before the LLM phase. |

### `datasight ask`

@@ -772,6 +781,100 @@ datasight tidy review [OPTIONS]
| `--replace-source` | Drop the source after a successful reshape and rename the long-form table to take the source's old name. Downstream code that referenced the source keeps working without edits. Requires '--as table' — a view's body references its source by name. |
| `--drop-source` | Drop the source after a successful reshape; the long form keeps its target name. Pick this when the new shape is the canonical one going forward and you don't need to preserve the source's name. Requires '--as table'. NOTE: previously this flag carried the semantics now moved to '--replace-source'; scripts depending on the old behavior should switch to '--replace-source'. |
| `--sample` | Send N sample rows per candidate to the configured LLM provider (default 0). Sample values get sent over the network — opt in only when the LLM seeing the values is acceptable. |
| `--model` | LLM model name to use for the propose-reshapes call and the post-apply grounding-repair call (overrides .env). Useful when different models suit each workload — see docs/use/concepts/choosing-an-llm.md. |
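The source-disposal flags compose with the rest of the options; a sketch of the two dispositions under the new semantics (the flag combinations are taken from the descriptions above, not from a tested invocation):

```shell
# Long form takes over the source's old name, so downstream queries
# keep working unchanged (--replace-source requires --as table).
datasight tidy review --as table --replace-source

# Long form keeps its target name; the source is simply dropped.
# Scripts that used --drop-source for the rename behavior should
# switch to --replace-source.
datasight tidy review --as table --drop-source
```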

### `datasight grounding`

Detect and repair drift between grounding files and the live schema.

Grounding files (`queries.yaml`, `schema_description.md`,
`time_series.yaml`) describe the database to the LLM. When the
schema changes (typically after `datasight tidy review`), these
files fall out of sync and the agent silently hallucinates against
columns that no longer exist.

- `check` reports drift without changing anything.
- `repair` asks the configured LLM to rewrite the stale files
  against the current schema, validates each proposed query, and
  writes atomically after you confirm the diff.

Examples:

```
datasight grounding check
datasight grounding repair
datasight grounding repair --model qwen3.6
datasight grounding repair --from-csv load_data.csv
datasight grounding repair --dry-run
```

```bash
datasight grounding [OPTIONS] COMMAND [ARGS]...
```

**Subcommands**

- `check`: Report stale references in grounding files against the live schema.
- `repair`: Run the LLM grounding repair against an existing drift.

#### `datasight grounding check`

Report stale references in grounding files against the live schema.

Static — no LLM, no query execution. Exits 0 when grounding is
clean, 1 when drift is detected. Use `datasight grounding
repair` to fix what this command finds.

Examples:

```
datasight grounding check
datasight grounding check --project-dir /path/to/project
```

```bash
datasight grounding check [OPTIONS]
```

**Parameters**

| Name | Details |
| --- | --- |
| `--project-dir` | Project directory containing .env and grounding files. Default: `.`. |
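Because `check` is static and exit-code driven, it can gate automation; a minimal pre-commit sketch (the hook wiring is illustrative, not part of datasight):

```shell
#!/bin/sh
# Block the commit when grounding files drifted from the live schema.
# `datasight grounding check` exits 0 when clean, 1 on drift (see above).
datasight grounding check || {
  echo "grounding files are stale; run: datasight grounding repair" >&2
  exit 1
}
```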

#### `datasight grounding repair`

Run the LLM grounding repair against an existing drift.

Reads the pre-tidy schema snapshot persisted by the most recent
apply (`.datasight/grounding_snapshot.json`). When no snapshot
is on file, `--from-csv` lets you supply the wide-form schema
by pointing at the source CSV(s).

Shows the unified diff and prompts for confirmation before writing.
Use `--dry-run` to skip the write entirely.

Examples:

```
datasight grounding repair
datasight grounding repair --model qwen3.6
datasight grounding repair --from-csv load_data.csv
datasight grounding repair --dry-run
```

```bash
datasight grounding repair [OPTIONS]
```

**Parameters**

| Name | Details |
| --- | --- |
| `--project-dir` | Project directory containing .env and grounding files. Default: `.`. |
| `--model` | LLM model name to use for the repair (overrides .env). Useful for retrying with a different model after a timeout. |
| `--from-csv` | Derive the pre-tidy schema from CSV headers when no snapshot is available. Pass once per source file (e.g. the wide-format input the apply consumed). Each CSV becomes a single table named after the file stem. Combinable with the snapshot — snapshot tables win on conflict. |
| `--dry-run` | Show drift + LLM proposal + diff, but don't write any files. |

### `datasight integrity`

2 changes: 1 addition & 1 deletion docs/reference/configuration.md
@@ -74,7 +74,7 @@ For help picking a provider, see [Choosing an LLM](../use/concepts/choosing-an-l

| Variable | Default | Description |
|----------|---------|-------------|
| `OLLAMA_MODEL` | `qwen2.5:7b` | Ollama model name (must support tool calling). `qwen2.5:7b` works well for CLI queries; for the web UI with visualizations, try `qwen2.5:14b`. |
| `OLLAMA_MODEL` | `qwen2.5:7b` | Ollama model name (must support tool calling). `qwen2.5:7b` is the safest cross-platform default (~2 GB resident, fits on 16 GB Macs). For Apple Silicon with 48 GB+ unified memory, `qwen3.6:35b-a3b-coding-mxfp8` gives richer answers at comparable decode speed. See [Choosing an AI provider](../use/concepts/choosing-an-llm.md#apple-silicon-mlx-native-models). |
| `OLLAMA_BASE_URL` | `http://localhost:11434/v1` | Ollama API endpoint |

### Database settings
69 changes: 68 additions & 1 deletion docs/use/concepts/choosing-an-llm.md
@@ -97,7 +97,7 @@ So a Llama 3.1 8B model fits in ~5 GB VRAM at 4-bit, a 70B model needs
|---|---|
| Apple Silicon with 16 GB unified memory | 7–8B models at 4-bit |
| Apple Silicon with 32 GB | 13B at 4-bit, or 8B at 8-bit |
| Apple Silicon with 64 GB+ | 34–70B at 4-bit |
| Apple Silicon with 64 GB+ | 34–70B at 4-bit, or sparse-MoE models like Qwen3.6 35B-A3B |
| NVIDIA laptop GPU, 8 GB VRAM | 7–8B at 4-bit |
| NVIDIA laptop GPU, 16 GB VRAM | 13B at 4-bit |

@@ -107,6 +107,73 @@ visualizations, step up to `qwen2.5:14b` — the 7B model struggles with
the more complex multi-step agent interactions required for chart
generation. Smaller models often struggle with realistic schemas.

### Apple Silicon: MLX-native models

If you're on Apple Silicon, models tagged `-mlx-*` use Apple's MLX
runtime and Metal compute. They typically decode 10–30% faster than the
equivalent GGUF model, but the *resident memory* can be much larger than
the weight size alone suggests because MLX allocates a large KV-cache
buffer for the model's default context window (often 256K tokens).
Measure before you commit to a model — the model card's parameter
count is not a reliable predictor of laptop fit.

Measured on a single benchmark dataset (5 questions, agent loop with
tool calls, Ollama server keep-alive at default 5 min) on a Mac with
unified memory:

| Model | Decode (tok/s) | Resident memory (incl. KV cache) | Answer style |
|---|---|---|---|
| `qwen2.5:7b` (q4_K_M, GGUF) | ~85 | **~2 GB** | Middle: substantive but can hit `max_tokens` |
| `gemma4:e2b-mlx-bf16` | ~95 | ~11 GB | Tersest: dumps data tables, minimal analysis |
| `qwen3.6:35b-a3b-coding-mxfp8` | ~90 | ~38 GB | Richest: includes slopes, R², regional context |

The headline surprise: **`gemma4:e2b-mlx-bf16` is not a low-memory
option**, despite the "e2b" (effective 2B) naming. Its weights are
small but the default 256K-token context allocation dominates resident
memory. Use it on 32 GB+ Macs only.
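Resident memory is easy to verify locally rather than trusting the model card; with Ollama, for example:

```shell
# Load the model with a trivial prompt, then report what's actually resident.
ollama run gemma4:e2b-mlx-bf16 "ok"
ollama ps   # lists loaded models with their memory use while kept alive
```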

The benchmark above measures `datasight ask` only. The other LLM-using
commands have very different shapes, and observed behavior on the
two qwen3.6 variants splits cleanly along those shapes:

| Workload | Calls a tool? | Output budget (tokens) | Best of the two |
|---|---|---|---|
| `datasight ask` | yes (multi-turn agent) | small per turn | either; coding MoE for richer prose |
| `datasight tidy review` (LLM advisor) | yes (single `propose_reshapes` call) | 4K | **general `qwen3.6`** |
| `datasight grounding repair` | no (long-form file rewrite) | 16K | **`qwen3.6:35b-a3b-coding-mxfp8`** |

The split is consistent with what code-specialized fine-tunes are known
to trade: better long-form structured generation (winning grounding
repair, where the prompt and the output are both large) at the cost of
weaker tool-call adherence (losing `tidy review`, where the model has
to emit a structured tool call instead of free text). Observed in
practice: the coding variant silently emitted zero proposals on
`tidy review`'s `propose_reshapes`, while the general variant timed out
on grounding repair against the same database.

**Practical setup**: pull both models. Use `qwen3.6` as your default
`OLLAMA_MODEL`, and override per-call where the coding variant wins:

```bash
datasight grounding repair --model qwen3.6:35b-a3b-coding-mxfp8
datasight tidy review --model qwen3.6 # explicit default; useful in scripts
```

Both `tidy review` and `grounding repair` accept `--model`, as do
`ask`, `verify`, and `run`.
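One way to wire the split in is via your project's `.env` (the variable name comes from the configuration reference; the choice of default is the recommendation above, not a requirement):

```shell
# .env — general variant as the everyday default; override per call
# with --model where the coding variant wins (e.g. grounding repair).
OLLAMA_MODEL=qwen3.6
```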

Apple Silicon recommendations by RAM tier:

| Unified memory | Recommended model | Why |
|---|---|---|
| 16 GB | `qwen2.5:7b` (GGUF) | Only option that fits with headroom for the OS, browser, and IDE. |
| 32 GB | `qwen2.5:7b` or `gemma4:e2b-mlx-bf16` | Either fits. Gemma is faster but its answers are tersest; pick based on whether you want interpretation or just raw data. |
| 48 GB+ | Both `qwen3.6` and `qwen3.6:35b-a3b-coding-mxfp8` (switch per command) | Sparse MoE (3B active params) — properly leverages Apple Silicon's unified memory + Metal. The two variants are complementary, not interchangeable: general for tool-use commands (`ask`, `tidy review`), coding for long-form generation (`grounding repair`, `ask` when you want richer prose). See workload table above. |

If you have an Apple Silicon machine but aren't sure which tag to use,
start with `qwen2.5:7b` (the cross-platform recommendation above). It
works on every backend and has the smallest memory footprint by far.

### On an HPC GPU node

If your HPC has GPU nodes, they typically unlock much larger models.
95 changes: 95 additions & 0 deletions docs/use/how-to/curate-with-tidy-review.md
@@ -419,6 +419,101 @@ whose values would duplicate or drop rows — the transaction rolls back
in step 2, before the source is touched. The error message names the
table and the count it expected.

## Repair grounding after a reshape

A successful Tidy review changes the database schema — long-form
column and table names replace the wide-form ones. That breaks any
reference to the old column names in your grounding files
(`queries.yaml`, `schema_description.md`, `time_series.yaml`), which
the LLM agent reads on every turn. Stale grounding silently teaches
the agent to hallucinate against columns that no longer exist.

### How drift is detected

Every Tidy review apply (CLI or web) does two things in addition to
the reshape itself:

1. Writes a snapshot of the pre-apply schema to
`.datasight/grounding_snapshot.json`. This is the "before" picture
the LLM needs to rewrite the grounding files in context.
2. Runs a fast static check against the new schema. The CLI
surfaces drift in an interactive prompt (see the `--apply-all`
sections above); the web UI shows an orange "Grounding may be
stale" banner with a **Repair grounding** button.
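The snapshot carries a `.json` extension; assuming it is plain JSON, you can sanity-check that step 1 actually ran before attempting a repair (the file's internal field layout is not documented here, so don't script against it):

```shell
# Confirm the pre-apply snapshot exists and parses (path from step 1 above).
python -m json.tool .datasight/grounding_snapshot.json > /dev/null \
  && echo "snapshot OK"
```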

### Run the repair

Web UI: click **Repair grounding** in the banner. The agent rewrites
the affected files, validates every SQL example against the live DB,
and applies the changes if validation passes.

CLI:

```bash
datasight grounding repair
```

This reads the snapshot, runs the LLM repair, prints a unified diff
of the proposed rewrites, and asks `Apply this diff? [y/N]` before
writing.

### Retry with a different model

Local LLMs sometimes time out on the repair (the prompt includes
your full grounding files plus both schemas, which can be large).
Retry with a different model — no need to re-run the reshape:

```bash
datasight grounding repair --model qwen3.6:35b-a3b-coding-mxfp8
```

`--model` overrides the configured `OLLAMA_MODEL` (or
`ANTHROPIC_MODEL`, etc.) for this one call.

The two LLM-using steps in the Tidy review flow have very different
shapes and reward different model variants — `tidy review`'s
proposal step is a tool call (favors general-purpose models),
`grounding repair` is a long-form file rewrite (favors
coding-specialized models). Both `tidy review` and `grounding
repair` accept `--model` so you can pick per call. See
[Choosing an LLM](../concepts/choosing-an-llm.md) for the per-workload
recommendation table.

### When the snapshot is missing

If you applied a reshape before snapshotting was wired in, or you
deleted `.datasight/grounding_snapshot.json`, the repair has nothing
to compare against. Pass `--from-csv` pointing at the original
wide-form source so the CLI can derive the pre-tidy schema from the
header row:

```bash
datasight grounding repair --from-csv generation_fuel_wide.csv
```

You can pass `--from-csv` multiple times for multi-file inputs;
each CSV becomes a single table named after the file stem.
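For example, an apply that consumed two wide-form CSVs would be repaired with (the second file name is hypothetical):

```shell
# Each --from-csv becomes one pre-tidy table named after its file stem:
# generation_fuel_wide and capacity_wide.
datasight grounding repair \
  --from-csv generation_fuel_wide.csv \
  --from-csv capacity_wide.csv
```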

### Preview without writing

Add `--dry-run` to skip the confirmation and the write. The diff
prints as usual, then the command exits — useful for inspecting the
LLM's proposal in CI or before committing the result:

```bash
datasight grounding repair --dry-run
```

### Check drift without repairing

```bash
datasight grounding check
```

Static, no LLM. Exits 0 when grounding is clean, 1 when drift
exists. Same logic as `datasight verify --static-only`, exposed
under a more discoverable name.
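The exit code makes this a natural CI guard; a sketch, assuming your runner can reach an LLM provider for the `--dry-run` preview (the surrounding step logic is illustrative):

```shell
# Fail the pipeline on drift, but leave the proposed repair in the logs first.
datasight grounding check || {
  datasight grounding repair --dry-run   # prints the diff, writes nothing
  exit 1
}
```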

## Recipes

### Curate one wide table from the web UI
11 changes: 8 additions & 3 deletions docs/use/how-to/install.md
@@ -112,9 +112,14 @@ Alternatively, paste the key directly into a project's `.env` file instead.
OLLAMA_MODEL=qwen2.5:7b
```

`qwen2.5:7b` works well for CLI queries (`datasight ask`). For the web UI with
chart generation, `qwen2.5:14b` handles the more complex interactions better.
See [Choosing an AI provider](../concepts/choosing-an-llm.md) for hardware sizing guidance.
`qwen2.5:7b` works well for CLI queries (`datasight ask`) and is the safest
cross-platform default — it uses ~2 GB resident on Apple Silicon, so it fits
even on 16 GB Macs. For the web UI with chart generation, `qwen2.5:14b` handles
the more complex interactions better. On Apple Silicon with 48 GB+ of unified
memory, `qwen3.6:35b-a3b-coding-mxfp8` produces noticeably richer answers with
comparable decode speed (sparse MoE). See
[Choosing an AI provider](../concepts/choosing-an-llm.md#apple-silicon-mlx-native-models)
for measured per-model memory footprints and Apple-Silicon-specific guidance.
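Whichever tag you choose, pull it before first use so the initial `datasight ask` isn't blocked on a multi-gigabyte download:

```shell
# Standard Ollama pulls; swap in the tag that matches your hardware tier.
ollama pull qwen2.5:7b
ollama pull qwen2.5:14b   # optional: for web-UI chart generation
```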

See the [Configuration reference](../../reference/configuration.md) for every supported
variable.
8 changes: 8 additions & 0 deletions frontend/src/App.svelte
@@ -26,6 +26,7 @@
import { sqlEditorStore } from "$lib/stores/sql_editor.svelte";
import { paletteStore } from "$lib/stores/palette.svelte";
import { tidyStore } from "$lib/stores/tidy.svelte";
import { groundingStore } from "$lib/stores/grounding.svelte";
import { exitExploreSession, getProjectStatus } from "$lib/api/projects";
import { loadSettings, loadLlmConfig } from "$lib/api/settings";
import { loadSchema, loadQueries, loadRecipes } from "$lib/api/schema";
@@ -114,6 +115,12 @@
loadMeasureCatalog(),
]);

// Refresh the always-on grounding pill. Runs after the schema
// load above so the pill reflects the schema the user now sees;
// the static check itself is timing-independent. Cheap call —
// no LLM.
groundingStore.check();

// Run pending starter if one was selected on landing page
if (fromLanding) {
await maybeRunPendingStarter();
@@ -189,6 +196,7 @@
dashboardStore.clear();
sqlEditorStore.clearAll();
sessionStore.reset();
groundingStore.reset();
dashboardStore.currentView = "chat";
exportMode = false;
exportExcludeIndices = new Set();