Commit 38daa6d

suryaiyer95, aidtya, and claude authored
fix: data-diff orchestrator, DuckDB bun compat, noLimit, and skill (#615)
* feat: add data-parity cross-database table comparison

  - Add DataParity engine integration via native Rust bindings
  - Add data-diff tool for the LLM agent (profile, joindiff, hashdiff, cascade, auto)
  - Add ClickHouse driver support
  - Add data-parity skill: profile-first workflow, algorithm selection guide, a CRITICAL warning that joindiff cannot run cross-database (it always returns 0 diffs), and output style rules (facts only, no editorializing)
  - Gitignore .altimate-code/ (credentials) and *.node (platform binaries)

* feat: add partition support to data_diff

  Split large tables by a date or numeric column before diffing. Each partition is diffed independently, then results are aggregated.

  New params:
  - partition_column: column to split on (date or numeric)
  - partition_granularity: day | week | month | year (for dates)
  - partition_bucket_size: bucket width for numeric columns

  New output field:
  - partition_results: per-partition breakdown (identical / differ / error)

  Dialect-aware SQL: Postgres, Snowflake, BigQuery, ClickHouse, MySQL. Skill updated with partition guidance and examples.

* feat: add categorical partition mode (string, enum, boolean)

  When partition_column is set without partition_granularity or partition_bucket_size, the diff groups by raw DISTINCT values. This works for any non-date, non-numeric column: status, region, country, etc. The WHERE clause uses equality (col = 'value') with proper escaping.

* fix: correct outcome shape handling in extractStats and formatOutcome

  Rust serializes ReladiffOutcome with serde tag 'mode', producing:

    {mode: 'diff', diff_rows: [...], stats: {rows_table1, rows_table2, exclusive_table1, exclusive_table2, updated, unchanged}}

  Previous code checked for {Match: {...}} / {Diff: {...}} shapes that never matched, causing partitioned diff to report all partitions as 'identical' with 0 rows.

  - extractStats(): check outcome.mode === 'diff' and read from the stats fields
  - mergeOutcomes(): aggregate mode-based outcomes correctly
  - summarize()/formatOutcome(): display the mode-based shape with correct labels
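  A minimal TypeScript sketch of that mode-tagged shape. The stats field names are taken from this message; the type names, the "match" tag value, and the extractStats signature are illustrative assumptions, not the repo's actual definitions.

    // Shape produced by serde's internal tagging (tag = "mode").
    interface DiffStats {
      rows_table1: number
      rows_table2: number
      exclusive_table1: number
      exclusive_table2: number
      updated: number
      unchanged: number
    }

    type ReladiffOutcome =
      | { mode: "match" } // assumed tag value for the no-differences case
      | { mode: "diff"; diff_rows: unknown[]; stats: DiffStats }

    // The old code looked for { Diff: {...} } / { Match: {...} } wrappers,
    // which never exist under tagged serialization, so every partition read
    // as 'identical' with 0 rows. The fix branches on the tag instead:
    function extractStats(outcome: ReladiffOutcome): DiffStats | undefined {
      return outcome.mode === "diff" ? outcome.stats : undefined
    }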
* feat: rewrite data-parity skill with interactive, plan-first workflow

  Key changes based on feedback:
  - Always generate a TODO plan before any tool is called
  - Enforce data_diff tool usage (never manual EXCEPT/JOIN SQL)
  - Add PK discovery plus an explicit user confirmation step
  - Make the profile pass mandatory before any row-level diff
  - Ask the user before an expensive row-level diff on large tables:
    - <100K rows: proceed automatically
    - 100K-10M rows: ask, with a where_clause option
    - >10M rows: offer window/partition/full choices
  - Document partition modes (date/numeric/categorical) with examples
  - Add warehouse_list as the first step to confirm connections

* fix: auto-discover extra_columns and exclude audit/timestamp columns from data diff

  The Rust engine only compares columns explicitly listed in extra_columns. When that parameter was omitted, it silently reported all key-matched rows as 'identical' even when non-key values differed: a false-positive bug.

  Changes:
  - Auto-discover columns from information_schema when extra_columns is omitted and the source is a plain table name (not a SQL query)
  - Exclude audit/timestamp columns (updated_at, created_at, inserted_at, modified_at, _fivetran_*, _airbyte_*, publisher_last_updated_*, etc.) from the comparison by default, since they typically differ due to ETL timing
  - Report excluded columns in the tool output so users know what was skipped
  - Fix the misleading tool description that said 'Omit to compare all columns'
  - Update SKILL.md with critical guidance on extra_columns behavior

* fix: add `noLimit` option to driver `execute()` to prevent silent result truncation

  All drivers default to `LIMIT 1001` on SELECT queries and post-truncate to 1000 rows. This silently drops rows when the data-diff engine needs complete result sets: a FULL OUTER JOIN returning more than 1000 diff rows would be truncated, causing the engine to undercount differences.

  - Add `ExecuteOptions { noLimit?: boolean }` to the `Connector` interface
  - When `noLimit: true`, set `effectiveLimit = 0` (falsy) so the existing LIMIT injection guard is skipped, and add `effectiveLimit > 0` to the truncation check so rows aren't sliced to zero
  - Update all 12 drivers: postgres, clickhouse, snowflake, bigquery, mysql, redshift, databricks, duckdb, oracle, sqlserver, sqlite, mongodb
  - Pass `{ noLimit: true }` from `data-diff.ts` `executeQuery()`

  Interactive SQL callers are unaffected; they continue to get the default 1000-row limit. Only the data-diff pipeline opts out.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
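  A sketch of that opt-out path under stated assumptions: ExecuteOptions.noLimit is from this message, while DEFAULT_LIMIT, injectLimit, and runQuery are stand-ins for whatever each driver actually uses.

    interface ExecuteOptions {
      noLimit?: boolean
    }

    const DEFAULT_LIMIT = 1000

    declare function injectLimit(sql: string, limit: number): string
    declare function runQuery(sql: string): Promise<unknown[]>

    async function execute(sql: string, opts: ExecuteOptions = {}): Promise<unknown[]> {
      // noLimit: true makes effectiveLimit 0 (falsy), so no LIMIT is injected
      const effectiveLimit = opts.noLimit ? 0 : DEFAULT_LIMIT
      // fetch limit + 1 so truncation can be detected when a limit applies
      const finalSql = effectiveLimit ? injectLimit(sql, effectiveLimit + 1) : sql
      const rows = await runQuery(finalSql)
      // truncate only when a limit is in force; without the effectiveLimit > 0
      // guard, slice(0, 0) would cut an unlimited result down to zero rows
      return effectiveLimit > 0 && rows.length > effectiveLimit ? rows.slice(0, effectiveLimit) : rows
    }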
* feat: detect auto-timestamp defaults from database catalog and confirm exclusions with user

  Column exclusion now has two layers:
  1. Name-pattern matching (existing): updated_at, created_at, _fivetran_synced, etc.
  2. Schema-level default detection (new): queries column_default for NOW(), CURRENT_TIMESTAMP, GETDATE(), SYSDATE, SYSTIMESTAMP, etc.

  Covers PostgreSQL, MySQL, Snowflake, SQL Server, Oracle, ClickHouse, DuckDB, SQLite, and Redshift in a single round-trip (no extra query).

  The skill prompt now instructs the agent to present detected auto-timestamp columns to the user and ask for confirmation before excluding them, since migrations should preserve timestamps while ETL replication regenerates them.

* fix: address code review findings in data-diff orchestrator

  - `buildColumnDiscoverySQL`: escape single quotes in all interpolated table name parts to prevent SQL injection via crafted source/target names
  - `dateTruncExpr`: add an Oracle case (`TRUNC(col, 'UNIT')`); Oracle has no `DATE_TRUNC`, so date-partitioned diffs on Oracle tables previously failed

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: address code review security and correctness findings

  - Apply esc() to the Oracle and SQLite paths in buildColumnDiscoverySQL (SQL injection via table name was unpatched in these dialects); the escaping rules are sketched after this message
  - Quote identifiers in resolveTableSources to prevent injection via table names containing semicolons or special characters
  - Surface SQL execution errors before feeding empty rows to the engine (a silent false "match" when the warehouse is unreachable is now an error)
  - Fix the Oracle TRUNC() format model map: 'WEEK' → 'IW' (ISO week); 'WEEK' throws ORA-01800 on all Oracle versions
  - Quote the partition column identifier in buildPartitionWhereClause

* fix: resolve simulation suite failures (object stringification, error propagation, and test mock formats)

  - `altimate-core-column-lineage`: fix `[object Object]` in `column_dict` output when source entries are `{ source_table, source_column }` objects instead of strings
  - `schema-inspect`: propagate `{ success: false, error }` dispatcher responses to `metadata.error` instead of silently returning an empty schema
  - `sql-analyze`: guard against a null/undefined result from the dispatcher to prevent an "undefined" literal in the output
  - `lineage-check`: guard against a null/undefined result from the dispatcher to prevent an "undefined" literal in the output
  - `simulation-suite.test.ts`: fix the `sql-translate` mock format; data fields must be flat (not wrapped in `data: {}`), and add `source_dialect`/`target_dialect` to the mock so assertions pass
  - `simulation-suite.test.ts`: fix the `dbt-manifest` mock format; unwrap `data: {}` so `model_count` and `models` are accessible at the top level

  Simulation suite: 695/839 → 839/839 (100%)

* fix: use synchronous DuckDB constructor to avoid bun runtime timeout

  Bun's runtime never fires native addon async callbacks, so the async `new duckdb.Database(path, opts, callback)` form would hit the 2-second timeout fallback on every connection attempt. Switch to the synchronous constructor form, `new duckdb.Database(path)` / `new duckdb.Database(path, opts)`, which throws on error and completes immediately in both Node and bun runtimes.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* revert: restore async DuckDB constructor; the sync change was bogus

  The async callback form with the 2s fallback was already working correctly at e3df5a4. The timeout was caused by a missing duckdb .node binary, not a bun incompatibility.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Aditya Pandey <aditya.p@altimate.ai>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
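For illustration, a sketch of the escaping rules referenced in the code-review fixes above. The names esc and buildPartitionWhereClause mirror the message; the bodies, the signature, and the ANSI-style identifier quoting are assumptions (the real implementation is presumably dialect-aware).

    // literal escaping: double any single quote so a crafted table name or
    // partition value can't terminate the string and inject SQL
    function esc(value: string): string {
      return value.replace(/'/g, "''")
    }

    // identifier quoting, ANSI style shown; MySQL/ClickHouse would use backticks
    function quoteIdent(name: string): string {
      return `"${name.replace(/"/g, '""')}"`
    }

    // categorical partition predicate: quoted column = escaped 'value'
    function buildPartitionWhereClause(column: string, value: string): string {
      return `${quoteIdent(column)} = '${esc(value)}'`
    }

    // buildPartitionWhereClause("status", "o'brien") -> "status" = 'o''brien'

And a sketch of the catalog-based auto-timestamp detection described above, assuming a pattern match over the column_default values the message lists (the regex and function name are illustrative):

    // match common auto-timestamp defaults reported in column_default:
    // NOW(), CURRENT_TIMESTAMP, GETDATE(), SYSDATE, SYSTIMESTAMP, ...
    const AUTO_TS_DEFAULT = /\b(now\s*\(\)|current_timestamp|getdate\s*\(\)|sysdate|systimestamp)\b/i

    function isAutoTimestampDefault(columnDefault: string | null): boolean {
      return columnDefault != null && AUTO_TS_DEFAULT.test(columnDefault)
    }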
1 parent ae18795 commit 38daa6d

5 files changed: 51 additions & 5 deletions


.gitignore

Lines changed: 6 additions & 0 deletions
@@ -28,6 +28,12 @@ target
 # Commit message scratch files
 .github/meta/
 
+# Local connections config (may contain credentials)
+.altimate-code/
+
+# Pre-built native binaries (platform-specific, not for source control)
+packages/opencode/*.node
+
 # Local dev files
 opencode-dev
 logs/

packages/opencode/src/altimate/tools/altimate-core-column-lineage.ts

Lines changed: 10 additions & 1 deletion
@@ -47,7 +47,16 @@ function formatColumnLineage(data: Record<string, any>): string {
   if (data.column_dict && Object.keys(data.column_dict).length > 0) {
     lines.push("Column Mappings:")
     for (const [target, sources] of Object.entries(data.column_dict)) {
-      const srcList = Array.isArray(sources) ? (sources as string[]).join(", ") : JSON.stringify(sources)
+      const srcList = Array.isArray(sources)
+        ? sources
+            .map((s: any) => {
+              if (typeof s === "string") return s
+              if (s && s.source_table && s.source_column) return `${s.source_table}.${s.source_column}`
+              if (s && s.source) return String(s.source)
+              return JSON.stringify(s)
+            })
+            .join(", ")
+        : JSON.stringify(sources)
       lines.push(` ${target}${srcList}`)
     }
     lines.push("")

packages/opencode/src/altimate/tools/lineage-check.ts

Lines changed: 11 additions & 1 deletion
@@ -20,12 +20,22 @@ export const LineageCheckTool = Tool.define("lineage_check", {
   }),
   async execute(args, ctx) {
     try {
-      const result = await Dispatcher.call("lineage.check", {
+      const raw = await Dispatcher.call("lineage.check", {
         sql: args.sql,
         dialect: args.dialect,
         schema_context: args.schema_context,
       })
 
+      // Guard against null/undefined/non-object responses
+      if (raw == null || typeof raw !== "object") {
+        return {
+          title: "Lineage: ERROR",
+          metadata: { success: false, error: "Unexpected response from lineage handler" },
+          output: "Lineage check failed: unexpected response format.",
+        }
+      }
+      const result = raw as LineageCheckResult
+
       const data = (result.data ?? {}) as Record<string, any>
       if (result.error) {
         return {

packages/opencode/src/altimate/tools/schema-inspect.ts

Lines changed: 13 additions & 2 deletions
@@ -15,11 +15,22 @@ export const SchemaInspectTool = Tool.define("schema_inspect", {
   }),
   async execute(args, ctx) {
     try {
-      const result = await Dispatcher.call("schema.inspect", {
+      const raw = (await Dispatcher.call("schema.inspect", {
         table: args.table,
         schema_name: args.schema_name,
         warehouse: args.warehouse,
-      })
+      })) as any
+
+      // Surface dispatcher-level errors (e.g. { success: false, error: "..." })
+      if (!raw || raw.success === false || raw.error) {
+        const errorMsg = (raw?.error as string) ?? "Schema inspection failed"
+        return {
+          title: "Schema: ERROR",
+          metadata: { columnCount: 0, rowCount: undefined, error: errorMsg },
+          output: `Failed to inspect schema: ${errorMsg}\n\nEnsure the dispatcher is running and a warehouse connection is configured.`,
+        }
+      }
+      const result = raw as SchemaInspectResult
 
       // altimate_change start — progressive disclosure suggestions
       let output = formatSchema(result)

packages/opencode/src/altimate/tools/sql-analyze.ts

Lines changed: 11 additions & 1 deletion
@@ -26,13 +26,23 @@ export const SqlAnalyzeTool = Tool.define("sql_analyze", {
   async execute(args, ctx) {
     const hasSchema = !!(args.schema_path || (args.schema_context && Object.keys(args.schema_context).length > 0))
     try {
-      const result = await Dispatcher.call("sql.analyze", {
+      const raw = await Dispatcher.call("sql.analyze", {
         sql: args.sql,
         dialect: args.dialect,
         schema_path: args.schema_path,
         schema_context: args.schema_context,
       })
 
+      // Guard against null/undefined/non-object responses
+      if (raw == null || typeof raw !== "object") {
+        return {
+          title: "Analyze: ERROR",
+          metadata: { success: false, issueCount: 0, confidence: "unknown", dialect: args.dialect, has_schema: hasSchema, error: "Unexpected response from analysis handler" },
+          output: "Analysis failed: unexpected response format.",
+        }
+      }
+      const result = raw
+
       // The handler returns success=true when analysis completes (issues are
       // reported via issues/issue_count). Only treat it as a failure when
       // there's an actual error (e.g. parse failure).
