diff --git a/docs/type-mapping-manifests/README.md b/docs/type-mapping-manifests/README.md
new file mode 100644
index 000000000..79962afc2
--- /dev/null
+++ b/docs/type-mapping-manifests/README.md
@@ -0,0 +1,48 @@
+# Type Mapping Manifests
+
+This directory hosts versioned **type mapping manifests** that codify how
+primitive, complex, and logical types map across the engines and table formats
+supported by OpenHouse (currently Hive and Iceberg as storage formats; Spark
+and Trino as engines, with HiveQL and PyArrow coverage in progress).
+
+Each manifest is a contract: a single source of truth for cross-engine type
+semantics. Consumers (query planners, schema validators, type-system code
+generators such as Calcite `TypeFactory` or Coral `CoralDataType`, and
+authoring agents that inject design-time transformation shims) use it to keep
+behavior consistent as dialects evolve.
+
+## Layout
+
+```
+type-mapping-manifests/
+├── README.md          # this file
+├── v1/
+│   └── README.md      # the v1 manifest — table-presentation form
+└── v2/
+    └── README.md      # future versions
+```
+
+Each version's manifest lives entirely in its `README.md`. The tables in that
+file are the canonical contract — no separate machine-readable artifact ships
+in v1. If future consumers require structured input (YAML, JSON), a generated
+form will be added alongside the markdown.
+
+## Versioning policy
+
+- Each `vN/` directory is an **immutable release**. Once published, entries
+  are not edited in place; corrections ship as a new version.
+- Versions are additive where possible. A new version may extend coverage to
+  new dialects, new dialect versions, or new types, and may revise the
+  mapping for an existing entry when behavioral evidence warrants it.
+- Each version declares the exact dialect and format versions it covers in
+  its `README.md` header; consumers should pin to a specific manifest version.
+
+## How a version is produced
+
+Manifest entries are grounded in empirically observed engine behavior:
+contract tables are materialized in each dialect via native APIs, and for
+each table the persisted schema is compared against the schema inferred by
+every reading engine. Divergences become manifest entries.
+
+The contract tables themselves are not published in this repository; only the
+resulting manifest is.
diff --git a/docs/type-mapping-manifests/v1/README.md b/docs/type-mapping-manifests/v1/README.md
new file mode 100644
index 000000000..62f607e4b
--- /dev/null
+++ b/docs/type-mapping-manifests/v1/README.md
@@ -0,0 +1,111 @@
+# Type Mapping Manifest — v1
+
+Covers Hive 1.1 (HMS), Iceberg 1.2 (table format v2), Spark 3.1, Trino 400.
+HiveQL and PyArrow not yet covered.
+
+In each engine cell, the top line is `native → engine` (column created via the
+storage format's native API, read by the engine); the bottom line is
+`engine → native` (column authored via engine DDL, persisted in storage).
+`—` = no equivalent. `⚠` = silent divergence. `❌` = DDL rejected.
+`_TBD_` = not yet observed.
+
+## Boolean
+
+| Logical type | Storage    | Spark                                          | Trino                                          |
+|--------------|------------|------------------------------------------------|------------------------------------------------|
+| boolean      | Iceberg v2 | `boolean` → `boolean`<br>`BOOLEAN` → `boolean` | `boolean` → `boolean`<br>`BOOLEAN` → `boolean` |
+| boolean      | Hive (HMS) | _TBD_<br>`BOOLEAN` → `boolean`                 | `boolean` → `boolean`<br>`BOOLEAN` → `boolean` |
+
+## Integers
+
+| Logical type            | Storage    | Spark                      | Trino                                              |
+|-------------------------|------------|----------------------------|----------------------------------------------------|
+| 8-bit int (`tinyint`)   | Iceberg v2 | —<br>_TBD_                 | —<br>`TINYINT` → ❌ rejected                        |
+| 8-bit int (`tinyint`)   | Hive (HMS) | _TBD_<br>_TBD_             | `tinyint` → `tinyint`<br>`TINYINT` → `tinyint`     |
+| 16-bit int (`smallint`) | Iceberg v2 | —<br>_TBD_                 | —<br>`SMALLINT` → ❌ rejected                       |
+| 16-bit int (`smallint`) | Hive (HMS) | _TBD_<br>_TBD_             | `smallint` → `smallint`<br>`SMALLINT` → `smallint` |
+| 32-bit int              | Iceberg v2 | `int` → `int`<br>_TBD_     | `int` → `integer`<br>`INTEGER` → `int`             |
+| 32-bit int              | Hive (HMS) | _TBD_<br>_TBD_             | `int` → `integer`<br>`INTEGER` → `int`             |
+| 64-bit int              | Iceberg v2 | `long` → `bigint`<br>_TBD_ | `long` → `bigint`<br>`BIGINT` → `long`             |
+| 64-bit int              | Hive (HMS) | _TBD_<br>_TBD_             | `bigint` → `bigint`<br>`BIGINT` → `bigint`         |
+
+## Floats
+
+| Logical type | Storage    | Spark                                      | Trino                                      |
+|--------------|------------|--------------------------------------------|--------------------------------------------|
+| 32-bit float | Iceberg v2 | `float` → `float`<br>`FLOAT` → `float`     | `float` → `real`<br>`REAL` → `float`       |
+| 32-bit float | Hive (HMS) | _TBD_<br>`FLOAT` → `float`                 | `float` → `real`<br>`REAL` → `float`       |
+| 64-bit float | Iceberg v2 | `double` → `double`<br>`DOUBLE` → `double` | `double` → `double`<br>`DOUBLE` → `double` |
+| 64-bit float | Hive (HMS) | _TBD_<br>`DOUBLE` → `double`               | `double` → `double`<br>`DOUBLE` → `double` |
+
+## Decimal
+
+| Logical type   | Storage    | Spark                                      | Trino          |
+|----------------|------------|--------------------------------------------|----------------|
+| `decimal(p,s)` | Iceberg v2 | _TBD_<br>`DECIMAL(10,2)` → `decimal(10,2)` | _TBD_<br>_TBD_ |
+| `decimal(p,s)` | Hive (HMS) | _TBD_<br>`DECIMAL(10,2)` → `decimal(10,2)` | _TBD_<br>_TBD_ |
+
+## Strings
+
+| Logical type           | Storage    | Spark                                         | Trino                                                        |
+|------------------------|------------|-----------------------------------------------|--------------------------------------------------------------|
+| Variable-length string | Iceberg v2 | `string` → `string`<br>`STRING` → `string`    | `string` → `varchar`<br>`VARCHAR` → `string`                 |
+| Variable-length string | Hive (HMS) | _TBD_<br>`STRING` → `string`                  | `string` → `varchar`<br>`VARCHAR` → `string`                 |
+| `VARCHAR(N)`           | Iceberg v2 | —<br>`VARCHAR(100)` → `string` ⚠ bound erased | —<br>`VARCHAR(10)` → `string` ⚠ bound erased                 |
+| `VARCHAR(N)`           | Hive (HMS) | _TBD_<br>`VARCHAR(100)` → `varchar(100)`      | `varchar(N)` → `varchar(N)`<br>`VARCHAR(10)` → `varchar(10)` |
+| `CHAR(N)`              | Iceberg v2 | —<br>`CHAR(10)` → `string` ⚠ bound erased     | —<br>`CHAR(10)` → ❌ rejected                                 |
+| `CHAR(N)`              | Hive (HMS) | _TBD_<br>`CHAR(10)` → `char(10)`              | `char(N)` → `char(N)`<br>`CHAR(10)` → `char(10)`             |
+
+## Binary and logical overlays
+
+| Logical type    | Storage    | Spark                                           | Trino                                              |
+|-----------------|------------|-------------------------------------------------|----------------------------------------------------|
+| Variable binary | Iceberg v2 | `binary` → `binary`<br>`BINARY` → `binary`      | `binary` → `varbinary`<br>`VARBINARY` → `binary`   |
+| Variable binary | Hive (HMS) | _TBD_<br>`BINARY` → `binary`                    | `binary` → `varbinary`<br>`VARBINARY` → `binary`   |
+| `fixed(N)`      | Iceberg v2 | `fixed(16)` → `binary` ⚠ length erased<br>_TBD_ | `fixed(16)` → `varbinary` ⚠ length erased<br>_TBD_ |
+| `fixed(N)`      | Hive (HMS) | —<br>—                                          | —<br>—                                             |
+| `uuid`          | Iceberg v2 | _TBD_<br>_TBD_                                  | `uuid` → `uuid`<br>_TBD_                           |
+| `uuid`          | Hive (HMS) | —<br>—                                          | —<br>—                                             |
+
+## Date and time
+
+| Logical type        | Storage    | Spark                                                                       | Trino                                                     |
+|---------------------|------------|-----------------------------------------------------------------------------|-----------------------------------------------------------|
+| `date`              | Iceberg v2 | _TBD_<br>`DATE` → `date`                                                    | _TBD_<br>`DATE` → `date`                                  |
+| `date`              | Hive (HMS) | _TBD_<br>`DATE` → `date`                                                    | _TBD_<br>`DATE` → `date`                                  |
+| `time`              | Iceberg v2 | _TBD_<br>_TBD_                                                              | `time` → `time(6)`<br>_TBD_                               |
+| `time`              | Hive (HMS) | —<br>—                                                                      | —<br>—                                                    |
+| `timestamp` (no TZ) | Iceberg v2 | _TBD_<br>⚠ no Spark DDL path (TIMESTAMP silently produces `timestamptz(6)`) | `timestamp(6)` → `timestamp(6)`<br>_TBD_                  |
+| `timestamp` (no TZ) | Hive (HMS) | _TBD_<br>`TIMESTAMP` → `timestamp`                                          | `timestamp` → `timestamp(3)`<br>_TBD_                     |
+| `timestamp` with TZ | Iceberg v2 | _TBD_<br>`TIMESTAMP` → `timestamptz(6)` ⚠ silent TZ injection               | `timestamptz(6)` → `timestamp(6) with time zone`<br>_TBD_ |
+| `timestamp` with TZ | Hive (HMS) | —<br>—                                                                      | —<br>—                                                    |
+
+## Trino-only types
+
+These types exist in Trino's type system but have no equivalent in either Hive
+or Iceberg storage; DDL targeting either format is rejected at creation time.
+
+| Logical type | Storage | Spark | Trino                         |
+|--------------|---------|-------|-------------------------------|
+| `json`       | —       | —     | —<br>`JSON` → ❌ rejected      |
+| `ipaddress`  | —       | —     | —<br>`IPADDRESS` → ❌ rejected |
+
+## Containers
+
+| Logical type | Storage | Spark | Trino |
+|---|---|---|---|
+| `ARRAY` | Iceberg v2 | `list` → `array`<br>`ARRAY` → `list` | `list` → `array(varchar)`<br>_TBD_ |
+| `ARRAY` | Hive (HMS) | `array` → `array`<br>`ARRAY` → `array` | `array` → `array(varchar)`<br>_TBD_ |
+| `MAP` | Iceberg v2 | `map` → `map`<br>`MAP` → `map` | _TBD_<br>_TBD_ |
+| `MAP` | Hive (HMS) | `map` → `map`<br>`MAP` → `map` | _TBD_<br>_TBD_ |
+| `STRUCT<…>` | Iceberg v2 | `struct` → `struct` ⚠ inner required dropped<br>`STRUCT` → `struct` | _TBD_<br>_TBD_ |
+| `STRUCT<…>` | Hive (HMS) | `struct` → `struct`<br>`STRUCT` → `struct` | _TBD_<br>_TBD_ |
+
+## Union
+
+| Logical type | Storage | Spark | Trino |
+|--------------------------|------------|----------------------------------------------------------------------------------------------------------------------------|------------|
+| `UNIONTYPE` | Iceberg v2 | —<br>— | —<br>— |
+| `UNIONTYPE` | Hive (HMS) | `uniontype` → `int` ⚠ degenerate flattening<br>— | _TBD_<br>— |
+| `UNIONTYPE` | Iceberg v2 | —<br>— | —<br>— |
+| `UNIONTYPE` | Hive (HMS) | `uniontype` → `struct` ⚠<br>— | _TBD_<br>— |
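
Note on the comparison step described under "How a version is produced": the Python sketch below illustrates one way a persisted schema could be diffed against the schema a single reading engine infers. It is a minimal sketch under stated assumptions, not the process used to build v1. The `Divergence` record, the `find_divergences` helper, and the column names `id` and `digest` are hypothetical. Types are compared as raw strings, so pure spelling differences between dialects would also be reported unless names are normalized first. The sample type pairs come from the v1 tables (Iceberg `uuid` and `fixed(16)` as read by Trino).

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Divergence:
    """One column whose engine-inferred type differs from the persisted type."""
    column: str
    persisted: str
    inferred: str


def find_divergences(persisted: Dict[str, str],
                     inferred: Dict[str, str]) -> List[Divergence]:
    """Compare two column-name -> type-name mappings.

    Hypothetical helper: returns one Divergence per persisted column whose
    inferred type string is different, or which the engine did not report.
    """
    found = []
    for column, persisted_type in persisted.items():
        inferred_type = inferred.get(column, "<missing>")
        if inferred_type != persisted_type:
            found.append(Divergence(column, persisted_type, inferred_type))
    return found


# Hypothetical contract table: column names are invented for illustration.
# The type pairs mirror the Iceberg v2 rows for Trino in the v1 manifest.
persisted_schema = {"id": "uuid", "digest": "fixed(16)"}
trino_inferred = {"id": "uuid", "digest": "varbinary"}

divergences = find_divergences(persisted_schema, trino_inferred)
for d in divergences:
    print(f"{d.column}: {d.persisted} -> {d.inferred}")
```

Each reported pair would correspond to a candidate `native → engine` cell; whether it is marked `⚠` remains a judgment the manifest author makes, which this sketch does not attempt.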