OH case-insensitive writes via OHSparkCatalog and OHWriteSchemaNormalizationRule by pandaamit91 · Pull Request #586 · linkedin/openhouse

pandaamit91 · 2026-05-14T21:57:40Z

Summary

Writers (Spark SQL, DataFrame writeTo, Trino DML) may submit column names with different casing than what the OH table stores (e.g. "id" vs "ID"). With spark.sql.caseSensitive=true, Spark's ResolveOutputRelation rejects such writes with "Cannot find data for output column" before the OH server is reached.
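For reviewers less familiar with the failure mode, here is a minimal plain-Java sketch of why a case-sensitive lookup cannot pair an incoming "id" with a stored "ID". It is an illustration only: ColumnLookupDemo and resolve are hypothetical names, and this is not how Spark's ResolveOutputRelation is actually implemented.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical stand-in for a column-resolution check; not Spark's implementation.
final class ColumnLookupDemo {

  // Returns the table column that the incoming name resolves to, if any.
  static Optional<String> resolve(
      List<String> tableColumns, String incoming, boolean caseSensitive) {
    for (String column : tableColumns) {
      boolean matches =
          caseSensitive ? column.equals(incoming) : column.equalsIgnoreCase(incoming);
      if (matches) {
        return Optional.of(column);
      }
    }
    return Optional.empty();
  }

  public static void main(String[] args) {
    List<String> stored = List.of("ID", "name");
    // Case-sensitive: "id" does not resolve to the stored "ID" column.
    System.out.println(resolve(stored, "id", true).isPresent()); // prints false
    // Case-insensitive: "id" resolves to the stored "ID" column.
    System.out.println(resolve(stored, "id", false).orElse("<none>")); // prints ID
  }
}
```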

Changes

Fix (two-part):

1. OHSparkCatalog extends SparkCatalog and annotates every loaded OH table with TableCapability.ACCEPT_ANY_SCHEMA. This causes DataSourceV2Relation.skipSchemaResolution to return true, making V2WriteCommand.outputResolved true and causing ResolveOutputRelation to skip schema validation for OH write commands.
2. OHWriteSchemaNormalizationRule (injectPostHocResolutionRule) runs after all standard resolution rules. For each resolved V2WriteCommand targeting an OH relation, it inserts a Project node that renames source columns to match the stored column casing (matched by field ID). This ensures Iceberg sees the correct stored casing without mutating spark.sql.caseSensitive.

Tables with case-duplicate columns (e.g. both "id" and "ID") are excluded from normalization: the rename target is ambiguous, so writes to such tables must use exact casing.
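The renaming and the case-duplicate exclusion can be sketched in plain Java without any Spark types. This is a simplified illustration under stated assumptions, not the rule in this PR: StoredCasingNormalizer and normalize are hypothetical names, columns are matched by lower-cased name rather than by field ID, and passing unknown columns through unchanged is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper illustrating the normalization idea; not the PR's rule.
final class StoredCasingNormalizer {

  // Returns the incoming names rewritten to stored casing, or empty when the
  // stored schema has case-duplicate columns and normalization is skipped.
  static Optional<List<String>> normalize(
      List<String> storedColumns, List<String> incomingColumns) {
    Map<String, String> byLowerCase = new HashMap<>();
    for (String stored : storedColumns) {
      String key = stored.toLowerCase(Locale.ROOT);
      if (byLowerCase.put(key, stored) != null) {
        // Two stored names differ only by case (e.g. "id" and "ID"):
        // the rename target is ambiguous, so do not normalize.
        return Optional.empty();
      }
    }
    List<String> normalized = new ArrayList<>();
    for (String incoming : incomingColumns) {
      // Assumption: names with no stored counterpart pass through unchanged.
      String key = incoming.toLowerCase(Locale.ROOT);
      normalized.add(byLowerCase.getOrDefault(key, incoming));
    }
    return Optional.of(normalized);
  }

  public static void main(String[] args) {
    // Incoming casing differs from stored casing; names are rewritten.
    System.out.println(
        normalize(List.of("ID", "Name"), List.of("id", "NAME", "extra")).get());
    // Case-duplicate stored columns: normalization is skipped.
    System.out.println(normalize(List.of("id", "ID"), List.of("Id")).isPresent());
  }
}
```

In the actual rule the equivalent of the returned list would drive the aliases in the inserted Project node; an empty result corresponds to leaving the write plan untouched.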

Testing Done

Manually Tested on local docker setup. Please include commands ran, and their output.
Added new tests for the changes made.
Updated existing tests to reflect the changes made.
No tests added or updated. Please explain why. If unsure, please feel free to ask for help.
Some other form of testing like staging or soak time in production. Please explain.

TestSparkSessionUtil and SparkTestBase are updated to use OHSparkCatalog instead of the bare SparkCatalog so all integration test sessions pick up the ACCEPT_ANY_SCHEMA capability.

Additional Information

Breaking Changes
Deprecations
Large PR broken into smaller PRs, and PR plan linked in the description.

For all the boxes checked, include additional details of the changes made in this pull request.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

aastha25 · 2026-05-14T22:20:28Z

+ */
+public class OHSparkCatalog extends SparkCatalog {
+
+  @Override
+  public SparkTable loadTable(Identifier ident) throws NoSuchTableException {


I think we already have the V2 SparkCatalog interface implemented in the internal fork, or we did and had to deramp it because of an unrelated failure.
We would need to coordinate the two sets of changes and make them compatible.

aastha25 · 2026-05-14T22:26:25Z

+ */
+class OHWriteSchemaNormalizationRule(spark: SparkSession) extends Rule[LogicalPlan] {
+


We should add instrumentation here to get observability into where the casing differences occur.

aastha25 · 2026-05-14T22:26:53Z

If we set ACCEPT_ANY_SCHEMA = true, do we still need the defensive schema normalization in the OH server-side code changes?

pandaamit91 · 2026-05-14T22:35:39Z

If we set ACCEPT_ANY_SCHEMA = true, do we still need the defensive schema normalization in the OH server-side code changes?

We would still need it for non-Spark writes, right? For example, writes through the Iceberg Java API.

pandaamit91 force-pushed the ampanda/oh-case-insensitive-writes branch from 6ba4711 to ed81f9a · May 14, 2026 22:07

aastha25 reviewed May 14, 2026

pandaamit91 wants to merge 1 commit into linkedin:main from pandaamit91:ampanda/oh-case-insensitive-writes
