From ce37d140f1455d36bf06339e894e37deeafa9ff8 Mon Sep 17 00:00:00 2001
From: Aastha Agrrawal
Date: Wed, 27 May 2026 16:51:21 +0530
Subject: [PATCH 1/2] docs: add type mapping manifests with v1

Introduces docs/type-mapping-manifests/ as a home for versioned cross-engine
type contracts. v1 documents observed type behavior across Hive 1.1 (HMS),
Iceberg 1.2 (table format v2), Spark 3.1, and Trino 400, organized by
logical-type family. Each engine cell captures both bidirectional create
paths: native -> engine read, and engine DDL -> native storage.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 docs/type-mapping-manifests/README.md    | 48 +++++++++++++
 docs/type-mapping-manifests/v1/README.md | 88 ++++++++++++++++++++++++
 2 files changed, 136 insertions(+)
 create mode 100644 docs/type-mapping-manifests/README.md
 create mode 100644 docs/type-mapping-manifests/v1/README.md

diff --git a/docs/type-mapping-manifests/README.md b/docs/type-mapping-manifests/README.md
new file mode 100644
index 000000000..79962afc2
--- /dev/null
+++ b/docs/type-mapping-manifests/README.md
@@ -0,0 +1,48 @@
+# Type Mapping Manifests
+
+This directory hosts versioned **type mapping manifests** that codify how
+primitive, complex, and logical types map across the engines and table formats
+supported by OpenHouse (currently Hive and Iceberg as storage formats; Spark
+and Trino as engines, with HiveQL and PyArrow coverage in progress).
+
+Each manifest is a contract: a single source of truth for cross-engine type
+semantics. Consumers (query planners, schema validators, type-system code
+generators such as Calcite `TypeFactory` or Coral `CoralDataType`, and
+authoring agents that inject design-time transformation shims) use it to keep
+behavior consistent as dialects evolve.
+
+## Layout
+
+```
+type-mapping-manifests/
+├── README.md          # this file
+├── v1/
+│   └── README.md      # the v1 manifest — table-presentation form
+├── v2/
+│   └── README.md      # future versions
+```
+
+Each version's manifest lives entirely in its `README.md`. The tables in that
+file are the canonical contract — no separate machine-readable artifact ships
+in v1. If future consumers require structured input (YAML, JSON), a generated
+form will be added alongside the markdown.
+
+## Versioning policy
+
+- Each `vN/` directory is an **immutable release**. Once published, entries
+  are not edited in place; corrections ship as a new version.
+- Versions are additive where possible. A new version may extend coverage to
+  new dialects, new dialect versions, or new types, and may revise the
+  mapping for an existing entry when behavioral evidence warrants it.
+- Each version declares the exact dialect and format versions it covers in
+  its `README.md` header; consumers should pin to a specific manifest version.
+
+## How a version is produced
+
+Manifest entries are grounded in empirically observed engine behavior:
+contract tables are materialized in each dialect via native APIs, and for
+each table the persisted schema is compared against the schema inferred by
+every reading engine. Divergences become manifest entries.
+
+The contract tables themselves are not published in this repository; only the
+resulting manifest is.
diff --git a/docs/type-mapping-manifests/v1/README.md b/docs/type-mapping-manifests/v1/README.md
new file mode 100644
index 000000000..d5b0acb22
--- /dev/null
+++ b/docs/type-mapping-manifests/v1/README.md
@@ -0,0 +1,88 @@
+# Type Mapping Manifest — v1
+
+Covers Hive 1.1 (HMS), Iceberg 1.2 (table format v2), Spark 3.1, Trino 400.
+HiveQL and PyArrow not yet covered.
+
+In each engine cell, the top line is `native → engine` (column created via the
+storage format's native API, read by the engine); the bottom line is
+`engine → native` (column authored via engine DDL, persisted in storage).
+`—` = no equivalent. `⚠` = silent divergence. `❌` = DDL rejected.
+`_TBD_` = not yet observed.
+
+## Boolean
+
+| Logical type | Storage | Spark | Trino |
+|--------------|------------|------------------------------------------------|------------------------------------------------|
+| boolean | Iceberg v2 | `boolean` → `boolean`<br>`BOOLEAN` → `boolean` | `boolean` → `boolean`<br>`BOOLEAN` → `boolean` |
+| boolean | Hive (HMS) | _TBD_ → _TBD_<br>`BOOLEAN` → `boolean` | `boolean` → `boolean`<br>`BOOLEAN` → `boolean` |
+
+## Integers
+
+| Logical type | Storage | Spark | Trino |
+|-------------------------|------------|-------------------------------------|----------------------------------------------------|
+| 8-bit int (`tinyint`) | Iceberg v2 | —<br>— | —<br>`TINYINT` → ❌ rejected |
+| 8-bit int (`tinyint`) | Hive (HMS) | _TBD_<br>_TBD_ | `tinyint` → `tinyint`<br>`TINYINT` → `tinyint` |
+| 16-bit int (`smallint`) | Iceberg v2 | —<br>— | —<br>`SMALLINT` → ❌ rejected |
+| 16-bit int (`smallint`) | Hive (HMS) | _TBD_<br>_TBD_ | `smallint` → `smallint`<br>`SMALLINT` → `smallint` |
+| 32-bit int | Iceberg v2 | `int` → `int`<br>_TBD_ → `int` | `int` → `integer`<br>`INTEGER` → `int` |
+| 32-bit int | Hive (HMS) | _TBD_<br>_TBD_ | `int` → `integer`<br>`INTEGER` → `int` |
+| 64-bit int | Iceberg v2 | `long` → `bigint`<br>_TBD_ → `long` | `long` → `bigint`<br>`BIGINT` → `long` |
+| 64-bit int | Hive (HMS) | _TBD_<br>_TBD_ | `bigint` → `bigint`<br>`BIGINT` → `bigint` |
+
+## Floats
+
+| Logical type | Storage | Spark | Trino |
+|--------------|------------|-----------------------------------------|--------------------------------------------|
+| 32-bit float | Iceberg v2 | `float` → `float`<br>_TBD_ → `float` | `float` → `real`<br>`REAL` → `float` |
+| 32-bit float | Hive (HMS) | _TBD_<br>_TBD_ | `float` → `real`<br>`REAL` → `float` |
+| 64-bit float | Iceberg v2 | `double` → `double`<br>_TBD_ → `double` | `double` → `double`<br>`DOUBLE` → `double` |
+| 64-bit float | Hive (HMS) | _TBD_<br>_TBD_ | `double` → `double`<br>`DOUBLE` → `double` |
+
+## Strings
+
+| Logical type | Storage | Spark | Trino |
+|------------------------|------------|------------------------------------------|------------------------------------------|
+| Variable-length string | Iceberg v2 | `string` → `string`<br>_TBD_ → `string` | `string` → `varchar`<br>_TBD_ |
+| Variable-length string | Hive (HMS) | _TBD_<br>_TBD_ | `string` → `varchar`<br>_TBD_ |
+| `VARCHAR(N)` | Iceberg v2 | —<br>`VARCHAR(100)` → `string` ⚠ | —<br>_TBD_ |
+| `VARCHAR(N)` | Hive (HMS) | _TBD_<br>`VARCHAR(100)` → `varchar(100)` | `varchar(100)` → `varchar(100)`<br>_TBD_ |
+| `CHAR(N)` | Iceberg v2 | —<br>`CHAR(10)` → `string` ⚠ | —<br>_TBD_ |
+| `CHAR(N)` | Hive (HMS) | _TBD_<br>`CHAR(10)` → `char(10)` | `char(10)` → `char(10)`<br>_TBD_ |
+
+## Temporal
+
+| Logical type | Storage | Spark | Trino |
+|------------------------|------------|-------------------------------------------|----------------|
+| `TIMESTAMP` (TZ-aware) | Iceberg v2 | _TBD_<br>`TIMESTAMP` → `timestamptz(6)` ⚠ | _TBD_<br>_TBD_ |
+| `TIMESTAMP` | Hive (HMS) | _TBD_<br>`TIMESTAMP` → `timestamp` | _TBD_<br>_TBD_ |
+
+## Binary and logical overlays
+
+| Logical type | Storage | Spark | Trino |
+|-----------------|------------|-----------------------------------------|--------------------------------------|
+| Variable binary | Iceberg v2 | `binary` → `binary`<br>_TBD_ → `binary` | `binary` → `varbinary`<br>_TBD_ |
+| Variable binary | Hive (HMS) | _TBD_<br>_TBD_ | `binary` → `varbinary`<br>_TBD_ |
+| `fixed(N)` | Iceberg v2 | `fixed(16)` → `binary` ⚠<br>_TBD_ | `fixed(16)` → `varbinary` ⚠<br>_TBD_ |
+| `fixed(N)` | Hive (HMS) | —<br>— | —<br>— |
+| `uuid` | Iceberg v2 | _TBD_<br>_TBD_ | `uuid` → `uuid`<br>_TBD_ |
+| `uuid` | Hive (HMS) | —<br>— | —<br>— |
+
+## Containers
+
+| Logical type | Storage | Spark | Trino |
+|---|---|---|---|
+| `ARRAY` | Iceberg v2 | `list` → `array`<br>`ARRAY` → `list` | _TBD_<br>_TBD_ |
+| `ARRAY` | Hive (HMS) | `array` → `array`<br>`ARRAY` → `array` | _TBD_<br>_TBD_ |
+| `MAP` | Iceberg v2 | `map` → `map`<br>`MAP` → `map` | _TBD_<br>_TBD_ |
+| `MAP` | Hive (HMS) | `map` → `map`<br>`MAP` → `map` | _TBD_<br>_TBD_ |
+| `STRUCT<…>` | Iceberg v2 | `struct` → `struct` ⚠<br>`STRUCT` → `struct` | _TBD_<br>_TBD_ |
+| `STRUCT<…>` | Hive (HMS) | `struct` → `struct`<br>`STRUCT` → `struct` | _TBD_<br>_TBD_ |
+
+## Union
+
+| Logical type | Storage | Spark | Trino |
+|--------------------------|------------|----------------------------------------------------------------------------------------------------------------------------|------------|
+| `UNIONTYPE` | Iceberg v2 | —<br>— | —<br>— |
+| `UNIONTYPE` | Hive (HMS) | `uniontype` → `int` ⚠<br>— | _TBD_<br>— |
+| `UNIONTYPE` | Iceberg v2 | —<br>— | —<br>— |
+| `UNIONTYPE` | Hive (HMS) | `uniontype` → `struct` ⚠<br>— | _TBD_<br>— |

From 242467a33852d6d4dad70e0cf691a5f33899118f Mon Sep 17 00:00:00 2001
From: Aastha Agrrawal
Date: Thu, 28 May 2026 02:17:13 -0700
Subject: [PATCH 2/2] docs: expand v1 manifest with full type coverage

Adds decimal, date, time, timestamp variants, and Trino-only types (json,
ipaddress). Completes string, float, binary, and array rows with
previously-missing Spark and Trino observations. Renames "Temporal" section
to "Date and time".

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 docs/type-mapping-manifests/v1/README.md | 109 ++++++++++++++---------
 1 file changed, 66 insertions(+), 43 deletions(-)

diff --git a/docs/type-mapping-manifests/v1/README.md b/docs/type-mapping-manifests/v1/README.md
index d5b0acb22..62f607e4b 100644
--- a/docs/type-mapping-manifests/v1/README.md
+++ b/docs/type-mapping-manifests/v1/README.md
@@ -14,68 +14,91 @@ storage format's native API, read by the engine); the bottom line is
 | Logical type | Storage | Spark | Trino |
 |--------------|------------|------------------------------------------------|------------------------------------------------|
 | boolean | Iceberg v2 | `boolean` → `boolean`<br>`BOOLEAN` → `boolean` | `boolean` → `boolean`<br>`BOOLEAN` → `boolean` |
-| boolean | Hive (HMS) | _TBD_ → _TBD_<br>`BOOLEAN` → `boolean` | `boolean` → `boolean`<br>`BOOLEAN` → `boolean` |
+| boolean | Hive (HMS) | _TBD_<br>`BOOLEAN` → `boolean` | `boolean` → `boolean`<br>`BOOLEAN` → `boolean` |
 
 ## Integers
 
-| Logical type | Storage | Spark | Trino |
-|-------------------------|------------|-------------------------------------|----------------------------------------------------|
-| 8-bit int (`tinyint`) | Iceberg v2 | —<br>— | —<br>`TINYINT` → ❌ rejected |
-| 8-bit int (`tinyint`) | Hive (HMS) | _TBD_<br>_TBD_ | `tinyint` → `tinyint`<br>`TINYINT` → `tinyint` |
-| 16-bit int (`smallint`) | Iceberg v2 | —<br>— | —<br>`SMALLINT` → ❌ rejected |
-| 16-bit int (`smallint`) | Hive (HMS) | _TBD_<br>_TBD_ | `smallint` → `smallint`<br>`SMALLINT` → `smallint` |
-| 32-bit int | Iceberg v2 | `int` → `int`<br>_TBD_ → `int` | `int` → `integer`<br>`INTEGER` → `int` |
-| 32-bit int | Hive (HMS) | _TBD_<br>_TBD_ | `int` → `integer`<br>`INTEGER` → `int` |
-| 64-bit int | Iceberg v2 | `long` → `bigint`<br>_TBD_ → `long` | `long` → `bigint`<br>`BIGINT` → `long` |
-| 64-bit int | Hive (HMS) | _TBD_<br>_TBD_ | `bigint` → `bigint`<br>`BIGINT` → `bigint` |
+| Logical type | Storage | Spark | Trino |
+|-------------------------|------------|----------------------------|----------------------------------------------------|
+| 8-bit int (`tinyint`) | Iceberg v2 | —<br>_TBD_ | —<br>`TINYINT` → ❌ rejected |
+| 8-bit int (`tinyint`) | Hive (HMS) | _TBD_<br>_TBD_ | `tinyint` → `tinyint`<br>`TINYINT` → `tinyint` |
+| 16-bit int (`smallint`) | Iceberg v2 | —<br>_TBD_ | —<br>`SMALLINT` → ❌ rejected |
+| 16-bit int (`smallint`) | Hive (HMS) | _TBD_<br>_TBD_ | `smallint` → `smallint`<br>`SMALLINT` → `smallint` |
+| 32-bit int | Iceberg v2 | `int` → `int`<br>_TBD_ | `int` → `integer`<br>`INTEGER` → `int` |
+| 32-bit int | Hive (HMS) | _TBD_<br>_TBD_ | `int` → `integer`<br>`INTEGER` → `int` |
+| 64-bit int | Iceberg v2 | `long` → `bigint`<br>_TBD_ | `long` → `bigint`<br>`BIGINT` → `long` |
+| 64-bit int | Hive (HMS) | _TBD_<br>_TBD_ | `bigint` → `bigint`<br>`BIGINT` → `bigint` |
 
 ## Floats
 
-| Logical type | Storage | Spark | Trino |
-|--------------|------------|-----------------------------------------|--------------------------------------------|
-| 32-bit float | Iceberg v2 | `float` → `float`<br>_TBD_ → `float` | `float` → `real`<br>`REAL` → `float` |
-| 32-bit float | Hive (HMS) | _TBD_<br>_TBD_ | `float` → `real`<br>`REAL` → `float` |
-| 64-bit float | Iceberg v2 | `double` → `double`<br>_TBD_ → `double` | `double` → `double`<br>`DOUBLE` → `double` |
-| 64-bit float | Hive (HMS) | _TBD_<br>_TBD_ | `double` → `double`<br>`DOUBLE` → `double` |
+| Logical type | Storage | Spark | Trino |
+|--------------|------------|--------------------------------------------|--------------------------------------------|
+| 32-bit float | Iceberg v2 | `float` → `float`<br>`FLOAT` → `float` | `float` → `real`<br>`REAL` → `float` |
+| 32-bit float | Hive (HMS) | _TBD_<br>`FLOAT` → `float` | `float` → `real`<br>`REAL` → `float` |
+| 64-bit float | Iceberg v2 | `double` → `double`<br>`DOUBLE` → `double` | `double` → `double`<br>`DOUBLE` → `double` |
+| 64-bit float | Hive (HMS) | _TBD_<br>`DOUBLE` → `double` | `double` → `double`<br>`DOUBLE` → `double` |
 
-## Strings
+## Decimal
 
-| Logical type | Storage | Spark | Trino |
-|------------------------|------------|------------------------------------------|------------------------------------------|
-| Variable-length string | Iceberg v2 | `string` → `string`<br>_TBD_ → `string` | `string` → `varchar`<br>_TBD_ |
-| Variable-length string | Hive (HMS) | _TBD_<br>_TBD_ | `string` → `varchar`<br>_TBD_ |
-| `VARCHAR(N)` | Iceberg v2 | —<br>`VARCHAR(100)` → `string` ⚠ | —<br>_TBD_ |
-| `VARCHAR(N)` | Hive (HMS) | _TBD_<br>`VARCHAR(100)` → `varchar(100)` | `varchar(100)` → `varchar(100)`<br>_TBD_ |
-| `CHAR(N)` | Iceberg v2 | —<br>`CHAR(10)` → `string` ⚠ | —<br>_TBD_ |
-| `CHAR(N)` | Hive (HMS) | _TBD_<br>`CHAR(10)` → `char(10)` | `char(10)` → `char(10)`<br>_TBD_ |
+| Logical type | Storage | Spark | Trino |
+|----------------|------------|--------------------------------------------|----------------|
+| `decimal(p,s)` | Iceberg v2 | _TBD_<br>`DECIMAL(10,2)` → `decimal(10,2)` | _TBD_<br>_TBD_ |
+| `decimal(p,s)` | Hive (HMS) | _TBD_<br>`DECIMAL(10,2)` → `decimal(10,2)` | _TBD_<br>_TBD_ |
 
-## Temporal
+## Strings
 
-| Logical type | Storage | Spark | Trino |
-|------------------------|------------|-------------------------------------------|----------------|
-| `TIMESTAMP` (TZ-aware) | Iceberg v2 | _TBD_<br>`TIMESTAMP` → `timestamptz(6)` ⚠ | _TBD_<br>_TBD_ |
-| `TIMESTAMP` | Hive (HMS) | _TBD_<br>`TIMESTAMP` → `timestamp` | _TBD_<br>_TBD_ |
+| Logical type | Storage | Spark | Trino |
+|------------------------|------------|-----------------------------------------------|--------------------------------------------------------------|
+| Variable-length string | Iceberg v2 | `string` → `string`<br>`STRING` → `string` | `string` → `varchar`<br>`VARCHAR` → `string` |
+| Variable-length string | Hive (HMS) | _TBD_<br>`STRING` → `string` | `string` → `varchar`<br>`VARCHAR` → `string` |
+| `VARCHAR(N)` | Iceberg v2 | —<br>`VARCHAR(100)` → `string` ⚠ bound erased | —<br>`VARCHAR(10)` → `string` ⚠ bound erased |
+| `VARCHAR(N)` | Hive (HMS) | _TBD_<br>`VARCHAR(100)` → `varchar(100)` | `varchar(N)` → `varchar(N)`<br>`VARCHAR(10)` → `varchar(10)` |
+| `CHAR(N)` | Iceberg v2 | —<br>`CHAR(10)` → `string` ⚠ bound erased | —<br>`CHAR(10)` → ❌ rejected |
+| `CHAR(N)` | Hive (HMS) | _TBD_<br>`CHAR(10)` → `char(10)` | `char(N)` → `char(N)`<br>`CHAR(10)` → `char(10)` |
 
 ## Binary and logical overlays
 
-| Logical type | Storage | Spark | Trino |
-|-----------------|------------|-----------------------------------------|--------------------------------------|
-| Variable binary | Iceberg v2 | `binary` → `binary`<br>_TBD_ → `binary` | `binary` → `varbinary`<br>_TBD_ |
-| Variable binary | Hive (HMS) | _TBD_<br>_TBD_ | `binary` → `varbinary`<br>_TBD_ |
-| `fixed(N)` | Iceberg v2 | `fixed(16)` → `binary` ⚠<br>_TBD_ | `fixed(16)` → `varbinary` ⚠<br>_TBD_ |
-| `fixed(N)` | Hive (HMS) | —<br>— | —<br>— |
-| `uuid` | Iceberg v2 | _TBD_<br>_TBD_ | `uuid` → `uuid`<br>_TBD_ |
-| `uuid` | Hive (HMS) | —<br>— | —<br>— |
+| Logical type | Storage | Spark | Trino |
+|-----------------|------------|-------------------------------------------------|----------------------------------------------------|
+| Variable binary | Iceberg v2 | `binary` → `binary`<br>`BINARY` → `binary` | `binary` → `varbinary`<br>`VARBINARY` → `binary` |
+| Variable binary | Hive (HMS) | _TBD_<br>`BINARY` → `binary` | `binary` → `varbinary`<br>`VARBINARY` → `binary` |
+| `fixed(N)` | Iceberg v2 | `fixed(16)` → `binary` ⚠ length erased<br>_TBD_ | `fixed(16)` → `varbinary` ⚠ length erased<br>_TBD_ |
+| `fixed(N)` | Hive (HMS) | —<br>— | —<br>— |
+| `uuid` | Iceberg v2 | _TBD_<br>_TBD_ | `uuid` → `uuid`<br>_TBD_ |
+| `uuid` | Hive (HMS) | —<br>— | —<br>— |
+
+## Date and time
+
+| Logical type | Storage | Spark | Trino |
+|---------------------|------------|-----------------------------------------------------------------------------|-----------------------------------------------------------|
+| `date` | Iceberg v2 | _TBD_<br>`DATE` → `date` | _TBD_<br>`DATE` → `date` |
+| `date` | Hive (HMS) | _TBD_<br>`DATE` → `date` | _TBD_<br>`DATE` → `date` |
+| `time` | Iceberg v2 | _TBD_<br>_TBD_ | `time` → `time(6)`<br>_TBD_ |
+| `time` | Hive (HMS) | —<br>— | —<br>— |
+| `timestamp` (no TZ) | Iceberg v2 | _TBD_<br>⚠ no Spark DDL path (TIMESTAMP silently produces `timestamptz(6)`) | `timestamp(6)` → `timestamp(6)`<br>_TBD_ |
+| `timestamp` (no TZ) | Hive (HMS) | _TBD_<br>`TIMESTAMP` → `timestamp` | `timestamp` → `timestamp(3)`<br>_TBD_ |
+| `timestamp` with TZ | Iceberg v2 | _TBD_<br>`TIMESTAMP` → `timestamptz(6)` ⚠ silent TZ injection | `timestamptz(6)` → `timestamp(6) with time zone`<br>_TBD_ |
+| `timestamp` with TZ | Hive (HMS) | —<br>— | —<br>— |
+
+## Trino-only types
+
+These types exist in Trino's type system but have no equivalent in either Hive
+or Iceberg storage; DDL targeting either format is rejected at creation time.
+
+| Logical type | Storage | Spark | Trino |
+|--------------|---------|-------|-------------------------------|
+| `json` | — | — | —<br>`JSON` → ❌ rejected |
+| `ipaddress` | — | — | —<br>`IPADDRESS` → ❌ rejected |
 
 ## Containers
 
 | Logical type | Storage | Spark | Trino |
 |---|---|---|---|
-| `ARRAY` | Iceberg v2 | `list` → `array`<br>`ARRAY` → `list` | _TBD_<br>_TBD_ |
-| `ARRAY` | Hive (HMS) | `array` → `array`<br>`ARRAY` → `array` | _TBD_<br>_TBD_ |
+| `ARRAY` | Iceberg v2 | `list` → `array`<br>`ARRAY` → `list` | `list` → `array(varchar)`<br>_TBD_ |
+| `ARRAY` | Hive (HMS) | `array` → `array`<br>`ARRAY` → `array` | `array` → `array(varchar)`<br>_TBD_ |
 | `MAP` | Iceberg v2 | `map` → `map`<br>`MAP` → `map` | _TBD_<br>_TBD_ |
 | `MAP` | Hive (HMS) | `map` → `map`<br>`MAP` → `map` | _TBD_<br>_TBD_ |
-| `STRUCT<…>` | Iceberg v2 | `struct` → `struct` ⚠<br>`STRUCT` → `struct` | _TBD_<br>_TBD_ |
+| `STRUCT<…>` | Iceberg v2 | `struct` → `struct` ⚠ inner required dropped<br>`STRUCT` → `struct` | _TBD_<br>_TBD_ |
 | `STRUCT<…>` | Hive (HMS) | `struct` → `struct`<br>`STRUCT` → `struct` | _TBD_<br>_TBD_ |
 
 ## Union
@@ -83,6 +106,6 @@ storage format's native API, read by the engine); the bottom line is
 | Logical type | Storage | Spark | Trino |
 |--------------------------|------------|----------------------------------------------------------------------------------------------------------------------------|------------|
 | `UNIONTYPE` | Iceberg v2 | —<br>— | —<br>— |
-| `UNIONTYPE` | Hive (HMS) | `uniontype` → `int` ⚠<br>— | _TBD_<br>— |
+| `UNIONTYPE` | Hive (HMS) | `uniontype` → `int` ⚠ degenerate flattening<br>— | _TBD_<br>— |
 | `UNIONTYPE` | Iceberg v2 | —<br>— | —<br>— |
 | `UNIONTYPE` | Hive (HMS) | `uniontype` → `struct` ⚠<br>— | _TBD_<br>— |
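
A note on the "How a version is produced" section of PATCH 1/2: the comparison it describes (the persisted schema versus the schema each reading engine infers, with divergences recorded as manifest entries) might be sketched roughly as below. This is a minimal illustration, not the tooling behind the manifest. `EXPECTED_TRINO`, `Entry`, and `compare` are hypothetical names, and treating the listed Iceberg-to-Trino pairs (copied from the unflagged v1 rows) as the only benign renames is an assumption of the sketch.

```python
from dataclasses import dataclass

# Benign renames copied from the unflagged v1 rows (Iceberg native -> Trino).
# Assumption: any pair not listed here counts as a silent divergence.
EXPECTED_TRINO = {
    "boolean": "boolean",
    "int": "integer",
    "long": "bigint",
    "float": "real",
    "double": "double",
    "string": "varchar",
    "binary": "varbinary",
    "uuid": "uuid",
}


@dataclass(frozen=True)
class Entry:
    """One `native -> engine` observation for a single column."""

    column: str
    native: str
    observed: str
    marker: str  # "" = expected, "⚠" = silent divergence, "_TBD_" = not observed

    def cell(self) -> str:
        """Render the observation the way the manifest tables write it."""
        if self.marker == "_TBD_":
            return "_TBD_"
        text = f"`{self.native}` → `{self.observed}`"
        return f"{text} {self.marker}".strip()


def compare(persisted, inferred, expected):
    """Compare a persisted schema with the schema one engine inferred.

    Both schemas are plain dicts mapping column name to type name.
    """
    entries = []
    for column, native in persisted.items():
        observed = inferred.get(column)
        if observed is None:
            # The engine reported nothing for this column yet.
            entries.append(Entry(column, native, "", "_TBD_"))
        elif expected.get(native) == observed:
            entries.append(Entry(column, native, observed, ""))
        else:
            entries.append(Entry(column, native, observed, "⚠"))
    return entries


if __name__ == "__main__":
    persisted = {"id": "long", "digest": "fixed(16)", "note": "string"}
    inferred = {"id": "bigint", "digest": "varbinary"}
    for entry in compare(persisted, inferred, EXPECTED_TRINO):
        print(entry.column, "|", entry.cell())
```

Keeping the expected-rename table separate from the comparison keeps a pair such as `int` → `integer` unflagged, while `fixed(16)` → `varbinary` still surfaces as `⚠`, which matches how the v1 tables mark those two cases.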