OH case-insensitive writes via OHSparkCatalog and OHWriteSchemaNormalizationRule#586
Open
pandaamit91 wants to merge 1 commit into
Open
OH case-insensitive writes via OHSparkCatalog and OHWriteSchemaNormalizationRule#586pandaamit91 wants to merge 1 commit into
pandaamit91 wants to merge 1 commit into
Conversation
…hemaNormalizationRule Writers (Spark SQL, DataFrame writeTo, Trino DML) may submit column names with different casing than what the OH table stores (e.g. "id" vs "ID"). With spark.sql.caseSensitive=true, Spark's ResolveOutputRelation rejects such writes with "Cannot find data for output column" before the OH server is reached. Fix (two-part): 1. OHSparkCatalog extends SparkCatalog and annotates every loaded OH table with TableCapability.ACCEPT_ANY_SCHEMA. This causes DataSourceV2Relation.skipSchemaResolution to return true, making V2WriteCommand.outputResolved true and causing ResolveOutputRelation to skip schema validation for OH write commands. 2. OHWriteSchemaNormalizationRule (injectPostHocResolutionRule) runs after all standard resolution rules. For each resolved V2WriteCommand targeting an OH relation, it inserts a Project node that renames source columns to match the stored column casing (matched by field ID). This ensures Iceberg sees the correct stored casing without mutating spark.sql.caseSensitive. Tables with case-duplicate columns (e.g. both "id" and "ID") are excluded from normalization — the target is ambiguous and writes must use exact casing. TestSparkSessionUtil and SparkTestBase are updated to use OHSparkCatalog instead of the bare SparkCatalog so all integration test sessions pick up the ACCEPT_ANY_SCHEMA capability. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
6ba4711 to
ed81f9a
Compare
aastha25
reviewed
May 14, 2026
Comment on lines
+36
to
+40
| */ | ||
| public class OHSparkCatalog extends SparkCatalog { | ||
|
|
||
| @Override | ||
| public SparkTable loadTable(Identifier ident) throws NoSuchTableException { |
There was a problem hiding this comment.
We already have V2 sparkCatalog interface implemented in the internal fork i think, or we did and needed to deramp it for an unrelated failure.
We would need to co-ordinate the the two set of changes / make them compatible
aastha25
reviewed
May 14, 2026
Comment on lines
+36
to
+38
| */ | ||
| class OHWriteSchemaNormalizationRule(spark: SparkSession) extends Rule[LogicalPlan] { | ||
|
|
There was a problem hiding this comment.
we should have instrumentation here, to get observability into where are the casing differences.
|
if we set ACCEPT_ANY_SCHEMA = true, do we still need defensive approach in the OH server side code changes to normalize schema? |
Contributor
Author
We would still need them for non-spark writes right? Like Iceberg Java API. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Writers (Spark SQL, DataFrame writeTo, Trino DML) may submit column names with different casing than what the OH table stores (e.g. "id" vs "ID"). With spark.sql.caseSensitive=true, Spark's ResolveOutputRelation rejects such writes with "Cannot find data for output column" before the OH server is reached.
Changes
Fix (two-part):
OHSparkCatalog extends SparkCatalog and annotates every loaded OH table with TableCapability.ACCEPT_ANY_SCHEMA. This causes DataSourceV2Relation.skipSchemaResolution to return true, making V2WriteCommand.outputResolved true and causing ResolveOutputRelation to skip schema validation for OH write commands.
OHWriteSchemaNormalizationRule (injectPostHocResolutionRule) runs after all standard resolution rules. For each resolved V2WriteCommand targeting an OH relation, it inserts a Project node that renames source columns to match the stored column casing (matched by field ID). This ensures Iceberg sees the correct stored casing without mutating spark.sql.caseSensitive.
Tables with case-duplicate columns (e.g. both "id" and "ID") are excluded from normalization — the target is ambiguous and writes must use exact casing.
Testing Done
TestSparkSessionUtil and SparkTestBase are updated to use OHSparkCatalog instead of the bare SparkCatalog so all integration test sessions pick up the ACCEPT_ANY_SCHEMA capability.
Additional Information
For all the boxes checked, include additional details of the changes made in this pull request.