Conversation
Co-authored-by: Isaac
mwojtyczka
left a comment
Review: Allow custom check messages
Clean feature addition. The inspect-based argument dispatching is flexible and the null-safety handling is correct. A few items below.
> * *rule_name* is the name of the DQX rule
> * *check_func_name* is the name of the DQX check function
> * *check_func_args* is a dictionary of DQX check function arguments as key-value pairs
MEDIUM - `inspect.signature` called at execution time: `_build_message_col` calls `inspect.signature` twice (on `check_func` and on `message`) every time a result struct is built. While this runs once per check (not per row), it is still a hot path during DataFrame construction. Consider caching the signatures on the `DQRule` or `DQRuleManager` instance.
Also, `bind_partial` can raise `TypeError` if the arguments don't match the signature. Should this be wrapped with error handling that gives a clear message about the check function signature mismatch?
> * *column_value* is the column value that failed the DQX check

> Args:
>     condition: Default DQX condition message returned by evaluating the DQX check function
When `self.check.column` is `None` but `self.check.columns` is set (e.g., a multi-column row rule), `column_value` falls back to `F.lit(None)`, so the message function receives no useful value to display.
Either document this or pass all column values as a struct. Then always pass a struct, even for single-column rules:

```python
if self.check.column:
    column_value = F.struct(F.col(self.check.column).cast("string").alias(self.check.column))
elif self.check.columns:
    column_value = F.struct(*[F.col(c).cast("string").alias(c) for c in self.check.columns])
else:
    column_value = F.lit(None)
```

Passing a struct would be more flexible, but it adds significant complexity for users when defining the message function, so maybe just documenting the fallback is enough.
> message: Callable[..., Column] | None = None

> def __post_init__(self):
>     self._validate_rule_type(self.check_func)
MEDIUM - Inconsistent type annotation: the field is `message: Callable[..., Column] | None = None` here on `DQRule`, but on `DQForEachColRule` (line 441) it is `message: Callable | None = None` without the return type. These should be consistent; use `Callable[..., Column] | None` in both places.
> * *user_metadata* (optional) - User-defined key-value pairs added to metadata generated by the check.
> * *custom_message_func* - A user-defined function that returns a message when the check fails. The function should
>   return a Spark Column and can optionally accept the following keyword arguments:
>     - *rule_name* is the name of the DQX rule
LOW - Docstring/field mismatch: the docstring says `custom_message_func` but the field added is named `message`. These should match to avoid confusion.
> assert rule.message is None

> def test_dq_dataset_rule_accepts_message_callable():
LOW - Misleading test name: `test_dq_dataset_rule_accepts_message_callable` creates a `DQRowRule`, not a `DQDatasetRule`. Either rename the test or actually test `DQDatasetRule`.
> from typing import Any

> import pyspark.sql.functions as F
Missing test: there is no test for `DQDatasetRule` with a custom message. Dataset-level rules have different column semantics (no single `column` attribute), so the `column_value` fallback to `F.lit(None)` should be verified.
> @@ -167,6 +170,52 @@ def _build_result_struct(self, condition: Column, skipped: bool = False) -> Column:
>     F.lit(skipped or None).alias("skipped"),
We should support custom messages for both programmatic and metadata definitions.

Option 1: Template strings (simplest)

Support a `message` field in YAML as a template string with placeholders:

```yaml
- name: email_not_null
  criticality: error
  check:
    function: is_not_null
    arguments:
      column: email
  message: "Rule '{rule_name}' failed: {check_func_name} on column '{column}'"
```

DQX resolves the placeholders at runtime using Python's `str.format()` with the same context args (`rule_name`, `check_func_name`, `check_func_args`, etc.) and wraps the result in `F.lit(...)`.
This covers static and semi-dynamic messages but can't include `column_value`, which is a Spark Column expression rather than a Python string at template time.
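A rough sketch of the resolution step for Option 1, assuming plain `str.format()` semantics (the function name is illustrative, not DQX API):

```python
def resolve_message_template(template: str, **context) -> str:
    """Fill {placeholder} slots from the rule context; fail loudly on unknowns."""
    try:
        return template.format(**context)
    except KeyError as e:
        raise ValueError(
            f"unknown placeholder {e.args[0]!r} in message template"
        ) from e


msg = resolve_message_template(
    "Rule '{rule_name}' failed: {check_func_name} on column '{column}'",
    rule_name="email_not_null",
    check_func_name="is_not_null",
    column="email",
)
# DQX would then wrap the resolved string in F.lit(msg)
```

Failing on unknown placeholders (rather than leaving them unresolved) surfaces YAML typos at check-definition time instead of in the output messages.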
Option 2: Template string + `column_value` via Spark `concat`

To support `column_value` in templates, use a special placeholder that gets expanded into a Spark `concat` expression: `message: "Rule '{rule_name}' failed for value=${column_value}"`

DQX parses the template, splits on `${column_value}`, and builds:

```python
F.concat(
    F.lit("Rule 'email_not_null' failed for value="),
    F.coalesce(F.col("email").cast("string"), F.lit("null")),
)
```

This gives dynamic per-row messages from YAML.
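The parsing step could look roughly like this (a stdlib-only sketch; `split_template` is a made-up name, and in DQX each segment would then be mapped to an `F.lit(...)` or coalesced-cast column part of the final `F.concat(...)`):

```python
PLACEHOLDER = "${column_value}"


def split_template(template: str) -> list[tuple[str, str]]:
    """Split a message template into ('lit', text) and ('col', marker) segments."""
    segments = []
    parts = template.split(PLACEHOLDER)
    for i, part in enumerate(parts):
        if part:
            # Literal text between placeholders becomes F.lit(part)
            segments.append(("lit", part))
        if i < len(parts) - 1:
            # Each placeholder occurrence becomes the coalesced column expression
            segments.append(("col", PLACEHOLDER))
    return segments
```

Keeping the split as a plain segment list means the same parse can back both the `F.lit()` fast path (no placeholder present) and the `F.concat()` path.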
Option 3: SQL expression (I would prefer this one)

Allow the message to be a SQL expression string: `message_expr: "concat('Failed: ', coalesce(cast(email as string), 'null'))"`

DQX wraps it with `F.expr(...)`. This is the most flexible option and could unify both definition styles.

Programmatic:

```python
DQRowRule(
    check_func=check_funcs.is_not_null,
    column="email",
    message="concat('Rule failed for value=', coalesce(cast(email as string), 'null'))",
)
```

YAML:

```yaml
- name: email_not_null
  check:
    function: is_not_null
    arguments:
      column: email
  message: "concat('Rule failed for value=', coalesce(cast(email as string), 'null'))"
```

Both resolve to `F.expr(message_str)` internally. No `inspect.signature`, no argument dispatching, no callable serialization problem.
> column="email",
> message=custom_message,
> )
> ]
It may be worth showcasing how the exact message defined in the example looks when the check fails.
> assert_df_equality(checked_df, expected_df)

> def test_apply_checks_without_custom_message_unchanged(ws, spark):
Do we need this test? We already have a lot of regression tests.
mwojtyczka
left a comment
The main point is around supporting this with a metadata definition.
Changes
Add an optional `message` callable parameter to `DQRule`, `DQRowRule`, `DQDatasetRule`, and `DQForEachColRule` that allows users to define custom check failure messages. The callable receives rule context (`rule_name`, `check_func_name`, `check_func_args`, `column_value`) and returns a Spark Column expression, enabling dynamic messages that can include column values.

When `message` is `None` (the default), the existing auto-generated message behavior is preserved. When provided, the custom message replaces the default message for failed rows while maintaining null/non-null semantics for passing rows.

Linked issues
Resolves #958
Tests