Allow custom check messages #1092

Open
ghanse wants to merge 8 commits into main from ghanse/issue-958-custom-check-messages

Conversation

Collaborator

@ghanse ghanse commented Mar 20, 2026

Changes

Add an optional message callable parameter to DQRule, DQRowRule, DQDatasetRule, and DQForEachColRule that allows users to define custom check failure messages. The callable receives rule context (rule_name, check_func_name, check_func_args, column_value) and returns a Spark Column expression, enabling dynamic messages that can include column values.

When message is None (the default), the existing auto-generated message behavior is preserved. When provided, the custom message replaces the default message for failed rows while maintaining null/non-null semantics for passing rows.

Linked issues

Resolves #958

Tests

  • manually tested
  • added unit tests
  • added integration tests

github-actions bot commented Mar 20, 2026

❌ no tests were run

Running from acceptance #4350

@ghanse ghanse changed the title from "feat: add message callable to DQRule for custom check messages" to "Allow custom check messages" Mar 21, 2026
@mwojtyczka mwojtyczka self-requested a review March 30, 2026 19:10
Contributor

@mwojtyczka mwojtyczka left a comment

Review: Allow custom check messages

Clean feature addition. The inspect-based argument dispatching is flexible and the null-safety handling is correct. A few items below.

Comment thread src/databricks/labs/dqx/manager.py Outdated

* *rule_name* is the name of the DQX rule
* *check_func_name* is the name of the DQX check function
* *check_func_args* is a dictionary of DQX check function arguments as key-value pairs

MEDIUM - inspect.signature called at execution time per row-group: _build_message_col calls inspect.signature twice (on check_func and message) every time a result struct is built. While this runs once per check (not per row), it's still a hot path during DataFrame construction. Consider caching the signatures on the DQRule or DQRuleManager instance.

Also, bind_partial can raise TypeError if args don't match — should this be wrapped with error handling that gives a clear message about the check function signature mismatch?
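
A minimal sketch of both suggestions, assuming plain module-level helpers rather than DQX's actual _build_message_col internals. Signatures are cached with functools.lru_cache, and the TypeError from bind_partial is re-raised with the check function's name. The helper names cached_signature, supported_kwargs, and bind_check_args are hypothetical.

```python
import inspect
from functools import lru_cache
from typing import Any, Callable


@lru_cache(maxsize=None)
def cached_signature(func: Callable[..., Any]) -> inspect.Signature:
    """Inspect a callable once; later calls reuse the cached Signature."""
    return inspect.signature(func)


def supported_kwargs(func: Callable[..., Any], context: dict[str, Any]) -> dict[str, Any]:
    """Keep only the context entries that func can accept as keyword arguments."""
    params = cached_signature(func).parameters
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return dict(context)  # func takes **kwargs, so pass everything
    by_keyword = (inspect.Parameter.POSITIONAL_OR_KEYWORD, inspect.Parameter.KEYWORD_ONLY)
    return {k: v for k, v in context.items() if k in params and params[k].kind in by_keyword}


def bind_check_args(check_func: Callable[..., Any], args: tuple, kwargs: dict[str, Any]) -> dict[str, Any]:
    """Map check arguments to parameter names, with a readable error on mismatch."""
    try:
        bound = cached_signature(check_func).bind_partial(*args, **kwargs)
    except TypeError as exc:
        name = getattr(check_func, "__name__", repr(check_func))
        raise TypeError(
            f"Arguments do not match the signature of check function '{name}': {exc}"
        ) from exc
    return dict(bound.arguments)
```

Note that lru_cache keeps a reference to every function it has seen; caching on the rule instance, as suggested above, would avoid that.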

* *column_value* is the column value that failed the DQX check

Args:
condition: Default DQX condition message returned by evaluating the DQX check function

@mwojtyczka mwojtyczka Mar 31, 2026

When self.check.column is None but self.check.columns is set (e.g., a multi-column row rule), column_value falls back to F.lit(None), so the message function receives no useful value to display.
Either document this or pass all column values as a struct. If we go with a struct, always pass one, even for single-column rules:

  if self.check.column:
      column_value = F.struct(F.col(self.check.column).cast("string").alias(self.check.column))
  elif self.check.columns:
      column_value = F.struct(*[F.col(c).cast("string").alias(c) for c in self.check.columns])
  else:
      column_value = F.lit(None)

Passing a struct would be more flexible, but it would add significant complexity for users defining the message function, so maybe just document this.

Comment thread src/databricks/labs/dqx/manager.py Outdated
message: Callable[..., Column] | None = None

def __post_init__(self):
self._validate_rule_type(self.check_func)

MEDIUM - Type annotation: The field is message: Callable[..., Column] | None = None here on DQRule, but on DQForEachColRule (line 441) it's message: Callable | None = None without the return type. These should be consistent — use Callable[..., Column] | None in both places.

Comment thread src/databricks/labs/dqx/rule.py Outdated
* *user_metadata* (optional) - User-defined key-value pairs added to metadata generated by the check.
* *custom_message_func* - A user-defined function that returns a message when the check fails. The function should
return a Spark Column and can optionally accept the following keyword arguments:
- *rule_name* is the name of the DQX rule

@mwojtyczka mwojtyczka Mar 31, 2026

LOW - Docstring says custom_message_func but field is named message: The docstring says custom_message_func but the actual field added is message. These should match to avoid confusion.

Comment thread tests/unit/test_custom_messages.py Outdated
assert rule.message is None


def test_dq_dataset_rule_accepts_message_callable():

LOW - Misleading test name: test_dq_dataset_rule_accepts_message_callable creates a DQRowRule, not a DQDatasetRule. Either rename the test or actually test DQDatasetRule.

Comment thread tests/integration/test_custom_messages.py Outdated

from typing import Any

import pyspark.sql.functions as F

Missing test: No test for DQDatasetRule with a custom message. Dataset-level rules have different column semantics (no single column attribute), so the column_value fallback to F.lit(None) should be verified.

@@ -167,6 +170,52 @@ def _build_result_struct(self, condition: Column, skipped: bool = False) -> Colu
F.lit(skipped or None).alias("skipped"),

@mwojtyczka mwojtyczka Mar 31, 2026

We should support custom messages for both programmatic and metadata definitions.

Option 1: Template strings (simplest)

Support a message field in YAML as a template string with placeholders:

  - name: email_not_null
    criticality: error
    check:
      function: is_not_null
      arguments:
        column: email
    message: "Rule '{rule_name}' failed: {check_func_name} on column '{column}'"

DQX resolves placeholders at runtime using Python's str.format() with the same context args (rule_name, check_func_name, check_func_args, etc.) and wraps the result in F.lit(...).

This covers static/semi-dynamic messages but can't include column_value (which is a Spark Column expression, not a Python string at template time).
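
A rough sketch of Option 1, assuming the context is a plain dict and that check_func_args is flattened into it so that {column} resolves. The function name resolve_message_template is hypothetical; in DQX the resulting string would then be wrapped in F.lit(...).

```python
from typing import Any


def resolve_message_template(template: str, context: dict[str, Any]) -> str:
    """Fill {placeholders} in a message template from the rule context."""
    # Assumption: check function arguments (e.g. column) are exposed as
    # top-level placeholders alongside rule_name and check_func_name.
    values = {**context.get("check_func_args", {}), **context}
    try:
        return template.format(**values)
    except KeyError as exc:
        raise ValueError(
            f"Unknown placeholder {exc} in message template: {template!r}"
        ) from exc


context = {
    "rule_name": "email_not_null",
    "check_func_name": "is_not_null",
    "check_func_args": {"column": "email"},
}
message = resolve_message_template(
    "Rule '{rule_name}' failed: {check_func_name} on column '{column}'", context
)
print(message)  # Rule 'email_not_null' failed: is_not_null on column 'email'
```

Raising on an unknown placeholder when the rule is loaded, rather than when the check runs, would surface typos in YAML early.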

Option 2: Template string + column_value via Spark concat

To support column_value in templates, use a special placeholder that gets expanded into a Spark concat expression: message: "Rule '{rule_name}' failed for value=${column_value}"

DQX parses the template, splits on ${column_value}, and builds:

  F.concat(
      F.lit("Rule 'email_not_null' failed for value="),
      F.coalesce(F.col("email").cast("string"), F.lit("null"))
  )

This gives dynamic per-row messages from YAML.
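
The parsing step of Option 2 could look roughly like this. The part representation, the function names, and the lazy pyspark import are assumptions for illustration. The split on ${column_value} happens before str.format, because str.format would otherwise treat {column_value} as an ordinary placeholder.

```python
from typing import Any

COLUMN_VALUE = "${column_value}"


def parse_message_template(template: str, context: dict[str, Any]) -> list[tuple[str, str]]:
    """Split a template into ("literal", text) and ("column_value", "") parts."""
    parts: list[tuple[str, str]] = []
    for i, segment in enumerate(template.split(COLUMN_VALUE)):
        if i > 0:
            parts.append(("column_value", ""))  # one marker per placeholder occurrence
        if segment:
            parts.append(("literal", segment.format(**context)))
    return parts


def to_message_column(parts: list[tuple[str, str]], column: str):
    """Turn parsed parts into a Spark concat expression (requires pyspark)."""
    import pyspark.sql.functions as F  # imported lazily so parsing works without Spark

    value = F.coalesce(F.col(column).cast("string"), F.lit("null"))
    return F.concat(*[value if kind == "column_value" else F.lit(text) for kind, text in parts])


parts = parse_message_template(
    "Rule '{rule_name}' failed for value=${column_value}",
    {"rule_name": "email_not_null"},
)
print(parts)
```

Keeping the parse step free of Spark objects means a template can be validated when the YAML is loaded, before any DataFrame exists.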

Option 3: SQL expression - I would prefer this one

Allow the message to be a SQL expression string: message_expr: "concat('Failed: ', coalesce(cast(email as string), 'null'))"

DQX wraps it with F.expr(...). This is the most flexible option, and it could unify the programmatic and YAML approaches:

Programmatic

  DQRowRule(
      check_func=check_funcs.is_not_null,
      column="email",
      message="concat('Rule failed for value=', coalesce(cast(email as string), 'null'))"
  )

YAML

  - name: email_not_null
    check:
      function: is_not_null
      arguments:
        column: email
    message: "concat('Rule failed for value=', coalesce(cast(email as string), 'null'))"

Both resolve to F.expr(message_str) internally. No inspect.signature, no argument dispatching, no callable serialization problem.

column="email",
message=custom_message,
)
]

@mwojtyczka mwojtyczka Mar 31, 2026

It may be worth showing what the exact message defined in the example looks like when the check fails.

assert_df_equality(checked_df, expected_df)


def test_apply_checks_without_custom_message_unchanged(ws, spark):

Do we need this test? We already have a lot of regression tests.

@mwojtyczka mwojtyczka added the under-review label Mar 31, 2026

@mwojtyczka mwojtyczka left a comment

The main point is about supporting this in metadata definitions as well.

github-actions bot commented Apr 15, 2026

❌ no tests were run

Running from anomaly #464

@mwojtyczka mwojtyczka added the needs-review and needs-changes labels and removed the needs-review and under-review labels Apr 15, 2026

Labels

needs-changes Changes required after review

Development

Successfully merging this pull request may close these issues.

[FEATURE]: Enable custom message for checks

2 participants