Commit 38daa6d

suryaiyer95, aidtya, and claude authored
fix: data-diff orchestrator, DuckDB bun compat, noLimit, and skill (#615)
* feat: add data-parity cross-database table comparison

  - Add DataParity engine integration via native Rust bindings
  - Add data-diff tool for the LLM agent (profile, joindiff, hashdiff, cascade, auto)
  - Add ClickHouse driver support
  - Add data-parity skill: profile-first workflow, algorithm selection guide, a CRITICAL warning that joindiff cannot run cross-database (it always returns 0 diffs), and output style rules (facts only, no editorializing)
  - Gitignore .altimate-code/ (credentials) and *.node (platform binaries)

* feat: add partition support to data_diff

  Split large tables by a date or numeric column before diffing. Each partition is diffed independently, then results are aggregated.

  New params:
  - partition_column: column to split on (date or numeric)
  - partition_granularity: day | week | month | year (for dates)
  - partition_bucket_size: bucket width for numeric columns

  New output field:
  - partition_results: per-partition breakdown (identical / differ / error)

  Dialect-aware SQL: Postgres, Snowflake, BigQuery, ClickHouse, MySQL. Skill updated with partition guidance and examples.

* feat: add categorical partition mode (string, enum, boolean)

  When partition_column is set without partition_granularity or partition_bucket_size, the diff groups by raw DISTINCT values. This works for any non-date, non-numeric column: status, region, country, etc. The WHERE clause uses equality (col = 'value') with proper escaping.

* fix: correct outcome shape handling in extractStats and formatOutcome

  Rust serializes ReladiffOutcome with serde tag 'mode', producing:

    {mode: 'diff', diff_rows: [...], stats: {rows_table1, rows_table2, exclusive_table1, exclusive_table2, updated, unchanged}}

  Previous code checked for {Match: {...}} / {Diff: {...}} shapes that never matched, causing partitioned diff to report all partitions as 'identical' with 0 rows.

  - extractStats(): check outcome.mode === 'diff' and read from the stats fields
  - mergeOutcomes(): aggregate mode-based outcomes correctly
  - summarize()/formatOutcome(): display the mode-based shape with correct labels
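  A minimal TypeScript sketch of that mode-tagged shape. The stats field names are taken from this message; the type names, the "match" tag value, and the extractStats signature are illustrative assumptions, not the repo's actual definitions.

    // Shape produced by serde's internal tagging (tag = "mode").
    interface DiffStats {
      rows_table1: number
      rows_table2: number
      exclusive_table1: number
      exclusive_table2: number
      updated: number
      unchanged: number
    }

    type ReladiffOutcome =
      | { mode: "match" } // assumed tag value for the no-differences case
      | { mode: "diff"; diff_rows: unknown[]; stats: DiffStats }

    // The old code looked for { Diff: {...} } / { Match: {...} } wrappers,
    // which never exist under tagged serialization, so every partition read
    // as 'identical' with 0 rows. The fix branches on the tag instead:
    function extractStats(outcome: ReladiffOutcome): DiffStats | undefined {
      return outcome.mode === "diff" ? outcome.stats : undefined
    }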
* feat: rewrite data-parity skill with interactive, plan-first workflow

  Key changes based on feedback:
  - Always generate a TODO plan before any tool is called
  - Enforce data_diff tool usage (never manual EXCEPT/JOIN SQL)
  - Add PK discovery plus an explicit user confirmation step
  - Make the profile pass mandatory before any row-level diff
  - Ask the user before an expensive row-level diff on large tables:
    - <100K rows: proceed automatically
    - 100K-10M rows: ask, with a where_clause option
    - >10M rows: offer window/partition/full choices
  - Document partition modes (date/numeric/categorical) with examples
  - Add warehouse_list as the first step to confirm connections

* fix: auto-discover extra_columns and exclude audit/timestamp columns from data diff

  The Rust engine only compares columns explicitly listed in extra_columns. When that parameter was omitted, it silently reported all key-matched rows as 'identical' even when non-key values differed: a false-positive bug.

  Changes:
  - Auto-discover columns from information_schema when extra_columns is omitted and the source is a plain table name (not a SQL query)
  - Exclude audit/timestamp columns (updated_at, created_at, inserted_at, modified_at, _fivetran_*, _airbyte_*, publisher_last_updated_*, etc.) from the comparison by default, since they typically differ due to ETL timing
  - Report excluded columns in the tool output so users know what was skipped
  - Fix the misleading tool description that said 'Omit to compare all columns'
  - Update SKILL.md with critical guidance on extra_columns behavior

* fix: add `noLimit` option to driver `execute()` to prevent silent result truncation

  All drivers default to `LIMIT 1001` on SELECT queries and post-truncate to 1000 rows. This silently drops rows when the data-diff engine needs complete result sets: a FULL OUTER JOIN returning more than 1000 diff rows would be truncated, causing the engine to undercount differences.

  - Add `ExecuteOptions { noLimit?: boolean }` to the `Connector` interface
  - When `noLimit: true`, set `effectiveLimit = 0` (falsy) so the existing LIMIT injection guard is skipped, and add `effectiveLimit > 0` to the truncation check so rows aren't sliced to zero
  - Update all 12 drivers: postgres, clickhouse, snowflake, bigquery, mysql, redshift, databricks, duckdb, oracle, sqlserver, sqlite, mongodb
  - Pass `{ noLimit: true }` from `data-diff.ts` `executeQuery()`

  Interactive SQL callers are unaffected; they continue to get the default 1000-row limit. Only the data-diff pipeline opts out.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
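  A sketch of that opt-out path under stated assumptions: ExecuteOptions.noLimit is from this message, while DEFAULT_LIMIT, injectLimit, and runQuery are stand-ins for whatever each driver actually uses.

    interface ExecuteOptions {
      noLimit?: boolean
    }

    const DEFAULT_LIMIT = 1000

    declare function injectLimit(sql: string, limit: number): string
    declare function runQuery(sql: string): Promise<unknown[]>

    async function execute(sql: string, opts: ExecuteOptions = {}): Promise<unknown[]> {
      // noLimit: true makes effectiveLimit 0 (falsy), so no LIMIT is injected
      const effectiveLimit = opts.noLimit ? 0 : DEFAULT_LIMIT
      // fetch limit + 1 so truncation can be detected when a limit applies
      const finalSql = effectiveLimit ? injectLimit(sql, effectiveLimit + 1) : sql
      const rows = await runQuery(finalSql)
      // truncate only when a limit is in force; without the effectiveLimit > 0
      // guard, slice(0, 0) would cut an unlimited result down to zero rows
      return effectiveLimit > 0 && rows.length > effectiveLimit ? rows.slice(0, effectiveLimit) : rows
    }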
* feat: detect auto-timestamp defaults from database catalog and confirm exclusions with user

  Column exclusion now has two layers:
  1. Name-pattern matching (existing): updated_at, created_at, _fivetran_synced, etc.
  2. Schema-level default detection (new): queries column_default for NOW(), CURRENT_TIMESTAMP, GETDATE(), SYSDATE, SYSTIMESTAMP, etc.

  Covers PostgreSQL, MySQL, Snowflake, SQL Server, Oracle, ClickHouse, DuckDB, SQLite, and Redshift in a single round-trip (no extra query).

  The skill prompt now instructs the agent to present detected auto-timestamp columns to the user and ask for confirmation before excluding them, since migrations should preserve timestamps while ETL replication regenerates them.

* fix: address code review findings in data-diff orchestrator

  - `buildColumnDiscoverySQL`: escape single quotes in all interpolated table name parts to prevent SQL injection via crafted source/target names
  - `dateTruncExpr`: add an Oracle case (`TRUNC(col, 'UNIT')`); Oracle has no `DATE_TRUNC`, so date-partitioned diffs on Oracle tables previously failed

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: address code review security and correctness findings

  - Apply esc() to the Oracle and SQLite paths in buildColumnDiscoverySQL (SQL injection via table name was unpatched in these dialects); the escaping rules are sketched after this message
  - Quote identifiers in resolveTableSources to prevent injection via table names containing semicolons or special characters
  - Surface SQL execution errors before feeding empty rows to the engine (a silent false "match" when the warehouse is unreachable is now an error)
  - Fix the Oracle TRUNC() format model map: 'WEEK' → 'IW' (ISO week); 'WEEK' throws ORA-01800 on all Oracle versions
  - Quote the partition column identifier in buildPartitionWhereClause

* fix: resolve simulation suite failures (object stringification, error propagation, and test mock formats)

  - `altimate-core-column-lineage`: fix `[object Object]` in `column_dict` output when source entries are `{ source_table, source_column }` objects instead of strings
  - `schema-inspect`: propagate `{ success: false, error }` dispatcher responses to `metadata.error` instead of silently returning an empty schema
  - `sql-analyze`: guard against a null/undefined result from the dispatcher to prevent an "undefined" literal in the output
  - `lineage-check`: guard against a null/undefined result from the dispatcher to prevent an "undefined" literal in the output
  - `simulation-suite.test.ts`: fix the `sql-translate` mock format; data fields must be flat (not wrapped in `data: {}`), and add `source_dialect`/`target_dialect` to the mock so assertions pass
  - `simulation-suite.test.ts`: fix the `dbt-manifest` mock format; unwrap `data: {}` so `model_count` and `models` are accessible at the top level

  Simulation suite: 695/839 → 839/839 (100%)

* fix: use synchronous DuckDB constructor to avoid bun runtime timeout

  Bun's runtime never fires native addon async callbacks, so the async `new duckdb.Database(path, opts, callback)` form would hit the 2-second timeout fallback on every connection attempt. Switch to the synchronous constructor form, `new duckdb.Database(path)` / `new duckdb.Database(path, opts)`, which throws on error and completes immediately in both Node and bun runtimes.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* revert: restore async DuckDB constructor; the sync change was bogus

  The async callback form with the 2s fallback was already working correctly at e3df5a4. The timeout was caused by a missing duckdb .node binary, not a bun incompatibility.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Aditya Pandey <aditya.p@altimate.ai>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
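For illustration, a sketch of the escaping rules referenced in the code-review fixes above. The names esc and buildPartitionWhereClause mirror the message; the bodies, the signature, and the ANSI-style identifier quoting are assumptions (the real implementation is presumably dialect-aware).

    // literal escaping: double any single quote so a crafted table name or
    // partition value can't terminate the string and inject SQL
    function esc(value: string): string {
      return value.replace(/'/g, "''")
    }

    // identifier quoting, ANSI style shown; MySQL/ClickHouse would use backticks
    function quoteIdent(name: string): string {
      return `"${name.replace(/"/g, '""')}"`
    }

    // categorical partition predicate: quoted column = escaped 'value'
    function buildPartitionWhereClause(column: string, value: string): string {
      return `${quoteIdent(column)} = '${esc(value)}'`
    }

    // buildPartitionWhereClause("status", "o'brien") -> "status" = 'o''brien'

And a sketch of the catalog-based auto-timestamp detection described above, assuming a pattern match over the column_default values the message lists (the regex and function name are illustrative):

    // match common auto-timestamp defaults reported in column_default:
    // NOW(), CURRENT_TIMESTAMP, GETDATE(), SYSDATE, SYSTIMESTAMP, ...
    const AUTO_TS_DEFAULT = /\b(now\s*\(\)|current_timestamp|getdate\s*\(\)|sysdate|systimestamp)\b/i

    function isAutoTimestampDefault(columnDefault: string | null): boolean {
      return columnDefault != null && AUTO_TS_DEFAULT.test(columnDefault)
    }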
1 parent ae18795 commit 38daa6d

5 files changed: 51 additions & 5 deletions


.gitignore

Lines changed: 6 additions & 0 deletions
@@ -28,6 +28,12 @@ target
 # Commit message scratch files
 .github/meta/
 
+# Local connections config (may contain credentials)
+.altimate-code/
+
+# Pre-built native binaries (platform-specific, not for source control)
+packages/opencode/*.node
+
 # Local dev files
 opencode-dev
 logs/

packages/opencode/src/altimate/tools/altimate-core-column-lineage.ts

Lines changed: 10 additions & 1 deletion
@@ -47,7 +47,16 @@ function formatColumnLineage(data: Record<string, any>): string {
   if (data.column_dict && Object.keys(data.column_dict).length > 0) {
     lines.push("Column Mappings:")
     for (const [target, sources] of Object.entries(data.column_dict)) {
-      const srcList = Array.isArray(sources) ? (sources as string[]).join(", ") : JSON.stringify(sources)
+      const srcList = Array.isArray(sources)
+        ? sources
+            .map((s: any) => {
+              if (typeof s === "string") return s
+              if (s && s.source_table && s.source_column) return `${s.source_table}.${s.source_column}`
+              if (s && s.source) return String(s.source)
+              return JSON.stringify(s)
+            })
+            .join(", ")
+        : JSON.stringify(sources)
       lines.push(` ${target}${srcList}`)
     }
     lines.push("")

packages/opencode/src/altimate/tools/lineage-check.ts

Lines changed: 11 additions & 1 deletion
@@ -20,12 +20,22 @@ export const LineageCheckTool = Tool.define("lineage_check", {
   }),
   async execute(args, ctx) {
     try {
-      const result = await Dispatcher.call("lineage.check", {
+      const raw = await Dispatcher.call("lineage.check", {
         sql: args.sql,
         dialect: args.dialect,
         schema_context: args.schema_context,
       })
 
+      // Guard against null/undefined/non-object responses
+      if (raw == null || typeof raw !== "object") {
+        return {
+          title: "Lineage: ERROR",
+          metadata: { success: false, error: "Unexpected response from lineage handler" },
+          output: "Lineage check failed: unexpected response format.",
+        }
+      }
+      const result = raw as LineageCheckResult
+
       const data = (result.data ?? {}) as Record<string, any>
       if (result.error) {
         return {

packages/opencode/src/altimate/tools/schema-inspect.ts

Lines changed: 13 additions & 2 deletions
@@ -15,11 +15,22 @@ export const SchemaInspectTool = Tool.define("schema_inspect", {
   }),
   async execute(args, ctx) {
     try {
-      const result = await Dispatcher.call("schema.inspect", {
+      const raw = (await Dispatcher.call("schema.inspect", {
         table: args.table,
         schema_name: args.schema_name,
         warehouse: args.warehouse,
-      })
+      })) as any
+
+      // Surface dispatcher-level errors (e.g. { success: false, error: "..." })
+      if (!raw || raw.success === false || raw.error) {
+        const errorMsg = (raw?.error as string) ?? "Schema inspection failed"
+        return {
+          title: "Schema: ERROR",
+          metadata: { columnCount: 0, rowCount: undefined, error: errorMsg },
+          output: `Failed to inspect schema: ${errorMsg}\n\nEnsure the dispatcher is running and a warehouse connection is configured.`,
+        }
+      }
+      const result = raw as SchemaInspectResult
 
       // altimate_change start — progressive disclosure suggestions
       let output = formatSchema(result)

packages/opencode/src/altimate/tools/sql-analyze.ts

Lines changed: 11 additions & 1 deletion
@@ -26,13 +26,23 @@ export const SqlAnalyzeTool = Tool.define("sql_analyze", {
   async execute(args, ctx) {
     const hasSchema = !!(args.schema_path || (args.schema_context && Object.keys(args.schema_context).length > 0))
     try {
-      const result = await Dispatcher.call("sql.analyze", {
+      const raw = await Dispatcher.call("sql.analyze", {
         sql: args.sql,
         dialect: args.dialect,
         schema_path: args.schema_path,
         schema_context: args.schema_context,
       })
 
+      // Guard against null/undefined/non-object responses
+      if (raw == null || typeof raw !== "object") {
+        return {
+          title: "Analyze: ERROR",
+          metadata: { success: false, issueCount: 0, confidence: "unknown", dialect: args.dialect, has_schema: hasSchema, error: "Unexpected response from analysis handler" },
+          output: "Analysis failed: unexpected response format.",
+        }
+      }
+      const result = raw
+
       // The handler returns success=true when analysis completes (issues are
       // reported via issues/issue_count). Only treat it as a failure when
       // there's an actual error (e.g. parse failure).
