OH case-insensitive writes via OHSparkCatalog and OHWriteSchemaNormalizationRule #586

Open
pandaamit91 wants to merge 1 commit into linkedin:main from pandaamit91:ampanda/oh-case-insensitive-writes

@pandaamit91
Contributor

Summary

Writers (Spark SQL, DataFrame writeTo, Trino DML) may submit column names whose casing differs from what the OH table stores (e.g. "id" vs "ID"). With spark.sql.caseSensitive=true, Spark's ResolveOutputRelation rejects such writes with "Cannot find data for output column" before the write reaches the OH server.

Changes

  • Client-facing API Changes
  • Internal API Changes
  • Bug Fixes
  • New Features
  • Performance Improvements
  • Code Style
  • Refactoring
  • Documentation
  • [] Tests

Fix (two-part):

  1. OHSparkCatalog extends SparkCatalog and annotates every loaded OH table with TableCapability.ACCEPT_ANY_SCHEMA. This causes DataSourceV2Relation.skipSchemaResolution to return true, making V2WriteCommand.outputResolved true and causing ResolveOutputRelation to skip schema validation for OH write commands.

  2. OHWriteSchemaNormalizationRule (registered through injectPostHocResolutionRule) runs after all standard resolution rules. For each resolved V2WriteCommand targeting an OH relation, it inserts a Project node that renames source columns to match the stored column casing (matched by field ID). This ensures Iceberg sees the correct stored casing without mutating spark.sql.caseSensitive.
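
To illustrate part 1, here is a minimal, library-free sketch of the "wrap the loaded table and add one capability" idea. It is not the PR's code: `Capability` and `Table` are hypothetical stand-ins for Spark's `org.apache.spark.sql.connector.catalog.TableCapability` and `Table`, reduced to what the example needs so that it compiles without Spark on the classpath. `withAcceptAnySchema` is likewise an illustrative name.

```java
import java.util.Collections;
import java.util.EnumSet;
import java.util.Set;

/** Illustrative model only; the real OHSparkCatalog wraps Spark/Iceberg types. */
public final class AcceptAnySchemaSketch {

  private AcceptAnySchemaSketch() {}

  /** Hypothetical stand-in for Spark's TableCapability (subset of values). */
  public enum Capability { BATCH_READ, BATCH_WRITE, ACCEPT_ANY_SCHEMA }

  /** Hypothetical stand-in for Spark's Table interface. */
  public interface Table {
    String name();
    Set<Capability> capabilities();
  }

  /**
   * Returns a view of the delegate that reports every capability the delegate
   * has, plus ACCEPT_ANY_SCHEMA. The delegate itself is not modified.
   */
  public static Table withAcceptAnySchema(Table delegate) {
    Set<Capability> caps = EnumSet.noneOf(Capability.class);
    caps.addAll(delegate.capabilities());
    caps.add(Capability.ACCEPT_ANY_SCHEMA);
    Set<Capability> frozen = Collections.unmodifiableSet(caps);
    return new Table() {
      @Override
      public String name() {
        return delegate.name();
      }

      @Override
      public Set<Capability> capabilities() {
        return frozen;
      }
    };
  }
}
```

The decorator shape keeps the original table untouched, so any code path that still holds the unwrapped table sees its original capabilities.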

Tables with case-duplicate columns (e.g. both "id" and "ID") are excluded from normalization: the target is ambiguous, so writes to such tables must use exact casing.
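
The normalization and the case-duplicate exclusion can be sketched on plain column-name lists. This is a simplified model under stated assumptions, not the rule itself: the PR matches columns by field ID and rewrites a Spark logical plan, whereas this sketch substitutes a lowercase-name lookup and works on strings only. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

/** Illustrative model only; names and matching strategy are assumptions. */
public final class CasingNormalizationSketch {

  private CasingNormalizationSketch() {}

  /** True if two stored columns differ only by case, e.g. "id" and "ID". */
  public static boolean hasCaseDuplicates(List<String> storedColumns) {
    Set<String> seen = new HashSet<>();
    for (String name : storedColumns) {
      if (!seen.add(name.toLowerCase(Locale.ROOT))) {
        return true;
      }
    }
    return false;
  }

  /**
   * Rewrites incoming column names to the stored casing. Returns
   * Optional.empty() when the table has case-duplicate columns, meaning
   * normalization is skipped and the caller must use exact casing.
   * Incoming names with no case-insensitive match pass through unchanged.
   */
  public static Optional<List<String>> normalize(
      List<String> storedColumns, List<String> incomingColumns) {
    if (hasCaseDuplicates(storedColumns)) {
      return Optional.empty();
    }
    Map<String, String> storedByLower = new HashMap<>();
    for (String name : storedColumns) {
      storedByLower.put(name.toLowerCase(Locale.ROOT), name);
    }
    List<String> result = new ArrayList<>();
    for (String name : incomingColumns) {
      result.add(storedByLower.getOrDefault(name.toLowerCase(Locale.ROOT), name));
    }
    return Optional.of(result);
  }
}
```

With stored columns `["ID", "Name"]` and incoming columns `["id", "NAME"]`, `normalize` returns `["ID", "Name"]`. With stored columns `["id", "ID"]` it returns an empty `Optional`, mirroring the exclusion described above.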

Testing Done

  • Manually tested on a local Docker setup. Please include the commands run and their output.
  • Added new tests for the changes made.
  • Updated existing tests to reflect the changes made.
  • No tests added or updated. Please explain why. If unsure, please feel free to ask for help.
  • Some other form of testing like staging or soak time in production. Please explain.

TestSparkSessionUtil and SparkTestBase are updated to use OHSparkCatalog instead of the bare SparkCatalog so all integration test sessions pick up the ACCEPT_ANY_SCHEMA capability.

Additional Information

  • Breaking Changes
  • Deprecations
  • Large PR broken into smaller PRs, and PR plan linked in the description.

For all the boxes checked, include additional details of the changes made in this pull request.

…hemaNormalizationRule

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@pandaamit91 force-pushed the ampanda/oh-case-insensitive-writes branch from 6ba4711 to ed81f9a on May 14, 2026 22:07
Comment on lines +36 to +40
*/
public class OHSparkCatalog extends SparkCatalog {

  @Override
  public SparkTable loadTable(Identifier ident) throws NoSuchTableException {

I think we already have the V2 SparkCatalog interface implemented in the internal fork, or we did and had to deramp it because of an unrelated failure.
We would need to coordinate the two sets of changes or make them compatible.

Comment on lines +36 to +38
*/
class OHWriteSchemaNormalizationRule(spark: SparkSession) extends Rule[LogicalPlan] {

We should add instrumentation here to get observability into where the casing differences occur.
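
One possible shape for such instrumentation, as a rough sketch: count case-only mismatches per table and stored column so they can be exported to whatever metrics system is in use. Everything here is hypothetical (class name, key format, in-memory counters); the PR does not define an instrumentation API, and a real version would hook into the rule at the point where it renames a column.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/** Illustrative in-memory counter; not part of the PR. */
public final class CasingMismatchCounterSketch {

  // Key format "<table>.<storedColumn>" is an assumption made for this sketch.
  private final ConcurrentHashMap<String, LongAdder> counts = new ConcurrentHashMap<>();

  /**
   * Records one event when the incoming name differs from the stored name
   * only by case. Returns true if an event was recorded.
   */
  public boolean record(String table, String storedName, String incomingName) {
    boolean caseOnlyMismatch =
        !storedName.equals(incomingName) && storedName.equalsIgnoreCase(incomingName);
    if (caseOnlyMismatch) {
      counts.computeIfAbsent(table + "." + storedName, key -> new LongAdder()).increment();
    }
    return caseOnlyMismatch;
  }

  /** Returns a sorted point-in-time copy of the counters. */
  public Map<String, Long> snapshot() {
    Map<String, Long> copy = new TreeMap<>();
    counts.forEach((key, adder) -> copy.put(key, adder.sum()));
    return copy;
  }
}
```

`LongAdder` is used so that concurrent planner threads can increment the same counter cheaply; exact-case writes and genuinely different names are not counted.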

@aastha25

If we set ACCEPT_ANY_SCHEMA = true, do we still need the defensive schema normalization in the OH server-side code changes?

@pandaamit91
Contributor Author

> If we set ACCEPT_ANY_SCHEMA = true, do we still need the defensive schema normalization in the OH server-side code changes?

We would still need them for non-Spark writes, right? For example, writes through the Iceberg Java API.
