translators: add Apache Iceberg schema translator #100

@gummiorri

Problem

Iceberg is the dominant open table format adjacent to every existing Spark/Databricks target in daco, but there is no translator for it. Iceberg uses its own JSON schema serialization with explicit, mandatory field IDs (monotonic, deterministic) and v3-only types (variant, geometry, geography, timestamp_ns, timestamptz_ns, unknown) — none of which are emitted by the existing databrickssql / sparksql / databrickspyspark translators.

A user authoring an OpenDPI port today cannot generate an Iceberg schema; they have to translate to Spark SQL DDL and lose Iceberg-specific information (field IDs, v3 types).

Proposed change

New package internal/translate/iceberg/ following the avro pattern (resolver + JSON marshal in Translate, no text/template — Iceberg schemas are structured JSON):

  • translator.go — implements translate.Translator. FileExtension returns .json. Translate calls translate.Prepare(...) then marshals to the Iceberg schema JSON shape, assigning field IDs sequentially in property order (Prepare already preserves that order, so output is deterministic across runs).
  • resolver.go — implements translate.TypeResolver:
    • PrimitiveType: string → string, integer → long (narrowed in EnrichField via Constraints.Minimum/Maximum to int where it fits), number → double (or decimal(P,S) when Constraints.MultipleOf is a decimal fraction), boolean → boolean, format:date → date, format:date-time → timestamptz, format:time → time, format:uuid → uuid.
    • ArrayType(elem) → list<elem> (marker form, materialized in Translate).
    • MapType(k,v) → map<k,v> (marker form).
    • RefType/FormatDefName → PascalCase(defName); the two must agree (per .claude/rules/translators.md).
  • EnrichField: integer narrowing (lift inferIntegerType from databrickspyspark/resolver.go into shared internal/translate so it isn't duplicated); decimal precision/scale from MultipleOf (lift computeDecimalScale/computeDecimalPrecision similarly).
  • Field IDs are assigned in Translate via a counter threaded through marshal. IDs could go in data.Extra if needed, but are simplest to compute inline at marshal time. Prepare/SchemaData shape unchanged.
  • Register in cmd/daco/internal/app.go registerTranslators as iceberg.
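
The field ID counter described above can be sketched roughly as follows. This is a minimal illustration, not the proposed implementation: the property type is a hypothetical stand-in for whatever translate.Prepare returns, and the depth-first ordering (a nested struct's fields are numbered right after their parent) is an assumption that the issue does not pin down.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// property is a hypothetical stand-in for one prepared schema property.
// The real SchemaData shape produced by translate.Prepare may differ.
type property struct {
	Name     string
	Type     string // already-resolved Iceberg primitive, e.g. "string"
	Required bool
	Children []property // non-empty for a nested object
}

// icebergField mirrors one entry of the "fields" array in Iceberg schema JSON.
type icebergField struct {
	ID       int    `json:"id"`
	Name     string `json:"name"`
	Required bool   `json:"required"`
	Type     any    `json:"type"`
}

// nestedStruct is the JSON shape of a struct used as a field type.
type nestedStruct struct {
	Type   string         `json:"type"`
	Fields []icebergField `json:"fields"`
}

// icebergSchema is the top-level schema object.
type icebergSchema struct {
	Type     string         `json:"type"`
	SchemaID int            `json:"schema-id"`
	Fields   []icebergField `json:"fields"`
}

// idCounter hands out field IDs starting at 1.
type idCounter struct{ last int }

func (c *idCounter) next() int { c.last++; return c.last }

// buildFields walks properties in their given order and threads the same
// counter into nested structs, so IDs continue instead of restarting.
func buildFields(props []property, ids *idCounter) []icebergField {
	fields := make([]icebergField, 0, len(props))
	for _, p := range props {
		f := icebergField{ID: ids.next(), Name: p.Name, Required: p.Required, Type: p.Type}
		if len(p.Children) > 0 {
			f.Type = nestedStruct{Type: "struct", Fields: buildFields(p.Children, ids)}
		}
		fields = append(fields, f)
	}
	return fields
}

// marshalSchema renders the properties as an Iceberg schema JSON document.
func marshalSchema(props []property) ([]byte, error) {
	schema := icebergSchema{Type: "struct", SchemaID: 0, Fields: buildFields(props, &idCounter{})}
	return json.MarshalIndent(schema, "", "  ")
}

func main() {
	out, err := marshalSchema([]property{
		{Name: "name", Type: "string", Required: true},
		{Name: "age", Type: "long"},
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Because the counter is created fresh per call and properties are visited in slice order, the same input always yields the same IDs, which is the determinism the proposal relies on.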

V3-only types (variant, geometry, geography, timestamp_ns) are out of scope for the initial PR — JSON Schema doesn't natively express them, so they need a daco-side hint mechanism that should be designed separately. The translator should emit v2-compatible output by default.

Test cases

Following the shape in internal/translate/pyspark/translator_test.go:

  1. Simple object — sequential field IDs and root naming

    Input: {type:object, properties:{name:{type:string}, age:{type:integer}}}

    Expected (substring asserts):

    {
      "type": "struct",
      "schema-id": 0,
      "fields": [
        {"id": 1, "name": "name", "required": false, "type": "string"},
        {"id": 2, "name": "age", "required": false, "type": "long"}
      ]
    }
  2. Required vs optional: required: ["name"] → "required": true for name, false for age.

  3. Decimal from multipleOf: {type:number, multipleOf:0.01, minimum:0, maximum:99999.99} → "type": "decimal(7, 2)".

  4. Integer narrowing: {type:integer, minimum:-128, maximum:127} → "type": "int" (Iceberg has no smaller int; narrows from long).

  5. Date/time/uuid formats: format:date → "type": "date"; format:date-time → "type": "timestamptz"; format:uuid → "type": "uuid".

  6. Arrays: {type:array, items:{type:string}} → "type": {"type": "list", "element-id": N, "element": "string", "element-required": ...}.

  7. $ref + $defs — verifies RefType/FormatDefName agreement: a referenced def is emitted as a nested struct with its own field IDs continuing the global counter.

  8. Inline nested object — auto-extracted by Prepare to a synthetic def named after the field in PascalCase; emitted as a nested struct with continued IDs.
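
Test cases 3 and 4 depend on the narrowing and decimal rules, which might look roughly like the sketch below. The names here (constraints, narrowInteger, resolveNumber) are hypothetical and are not the existing inferIntegerType / computeDecimalScale / computeDecimalPrecision helpers, whose exact behavior is not shown in this issue. Falling back to double when bounds are missing is likewise an assumption.

```go
package main

import (
	"fmt"
	"math"
	"strconv"
	"strings"
)

// constraints is a hypothetical subset of the Constraints fields named in
// the issue; nil means the keyword was absent from the JSON Schema.
type constraints struct {
	Minimum, Maximum, MultipleOf *float64
}

// narrowInteger returns "int" when both bounds fit in 32 bits, else "long".
func narrowInteger(c constraints) string {
	if c.Minimum == nil || c.Maximum == nil {
		return "long"
	}
	if *c.Minimum >= math.MinInt32 && *c.Maximum <= math.MaxInt32 {
		return "int"
	}
	return "long"
}

// decimalScale counts the fractional digits of multipleOf (0.01 gives 2).
func decimalScale(multipleOf float64) int {
	s := strconv.FormatFloat(multipleOf, 'f', -1, 64)
	if i := strings.IndexByte(s, '.'); i >= 0 {
		return len(s) - i - 1
	}
	return 0
}

// integerDigits counts the digits left of the decimal point, ignoring sign.
func integerDigits(v float64) int {
	return len(strconv.FormatFloat(math.Trunc(math.Abs(v)), 'f', 0, 64))
}

// resolveNumber returns decimal(P, S) when multipleOf is a decimal fraction
// and both bounds are known; otherwise it falls back to double (assumption).
func resolveNumber(c constraints) string {
	if c.MultipleOf == nil || c.Minimum == nil || c.Maximum == nil {
		return "double"
	}
	scale := decimalScale(*c.MultipleOf)
	if scale == 0 {
		return "double"
	}
	digits := integerDigits(*c.Minimum)
	if d := integerDigits(*c.Maximum); d > digits {
		digits = d
	}
	return fmt.Sprintf("decimal(%d, %d)", digits+scale, scale)
}

// ptr is a small helper for building optional constraint values.
func ptr(v float64) *float64 { return &v }

func main() {
	fmt.Println(narrowInteger(constraints{Minimum: ptr(-128), Maximum: ptr(127)}))
	fmt.Println(resolveNumber(constraints{Minimum: ptr(0), Maximum: ptr(99999.99), MultipleOf: ptr(0.01)}))
}
```

For the inputs in test cases 4 and 3 this prints int and decimal(7, 2): five integer digits from 99999.99 plus a scale of two.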
