Releases: databrickslabs/dqx

v0.13.0

09 Feb 17:18
99734cd

What's Changed

  • New DQX Data Quality Dashboard (#1019). The data quality dashboard has been significantly enhanced to provide a centralized view of data quality metrics across all tables, allowing users to monitor and track data quality issues with greater ease. The dashboard now consists of three tabs - Data Quality Summary, Data Quality by Table (Time Series), and Data Quality by Table (Full Snapshot) - each catering to different monitoring scenarios, and offers customizable parameters for reporting column names and filtering tables with data quality issues. Additionally, the installation process for the dashboard has been simplified, with options to import it directly to a Workspace or deploy it automatically using the Databricks CLI.
  • DQX App Skeleton (#982). The DQX application (frontend and backend) has been built with a core set of features, including configuration management and AI-assisted rule generation based on natural-language input from users. A comprehensive README documents the application architecture as well as development and deployment workflows. Future versions of DQX will introduce additional functionality (loading/saving rules, rules authoring in graphical form) and provide a streamlined, user-friendly way to deploy the application directly into a Databricks workspace.
  • Added Decimal support to check functions and to min_max generator (#1013) (#1017). The data quality checks have been enhanced to support Python's Decimal type, in addition to int and float, for min/max validation checks, enabling proper data quality checks for decimal-precise financial and scientific data where floating-point precision issues would cause false positives.
  • Added DQX production best practices and fixed datetime limit handling (#997). Practical guidance and best practices for using DQX in production have been added, covering aspects such as storing checks in Delta tables, enforcing access controls, and optimizing rules for performance and scalability. Fixes have also been implemented to address issues related to handling date and datetime limits, particularly when provided as strings.
  • Added new row-level check functions: is_null, is_empty, and is_null_or_empty (#1015). DQX now includes three new check functions, is_null, is_empty, and is_null_or_empty, which enable verification of column values as null, empty strings, or both, complementing existing checks like is_not_null, is_not_empty, and is_not_null_and_not_empty. The functions also support optional arguments, like trim_strings to trim spaces from strings.
  • Added tolerance to equality and non-equality check functions (#1011). The library's quality check functionality has been enhanced to support absolute and relative tolerance parameters for numeric value comparisons in is_equal_to, is_not_equal_to, is_aggr_equal and is_aggr_not_equal checks, allowing for more flexible and precise control over data validation. The introduction of tolerance logic, which checks for absolute and relative differences within specified thresholds via abs_tolerance and rel_tolerance parameters, provides more nuanced comparisons for numeric data.
  • Allow new lines in sql expression checks (#1009). The SQL expression check function (sql_expression) has been updated to support new lines in its expression argument, allowing for more complex and formatted SQL expressions.
  • Allow summary metrics with SparkConnect sessions (#1000). The library now supports writing summary metrics directly to a table with SparkConnect sessions, eliminating the need for a classic compute cluster in Dedicated access mode. This change lifts the previous restriction and enables generating summary metrics using Serverless and all standard clusters with Databricks Runtime 17.3 LTS or higher.
  • Fixed loading checks from a delta table with special characters (#992). Loading checks from a Delta table has been fixed to handle special characters in the fully qualified table name.
  • Fixed resolution of pii detection check function (#1003). The PII detection check function resolution has been enhanced to support the application of checks defined as metadata (YAML).
  • Fixed serialization/deserialization of row filter parameter for dataset-level rules (#1021). The filter field in checks definition now correctly pushes down the filter condition defined at the check-level as row_filter to the check function, allowing checks to operate on the relevant subset of rows before aggregation. The documentation has been updated to advise users to use the top-level filter condition for consistency instead of the row_filter parameter.
  • Improved Lakeflow Declarative Pipeline tests (#1010). The Lakeflow Declarative Pipeline (LDP) tests have been enhanced to utilize full Unity Catalog mode, enabling support for writing to arbitrary catalogs and schemas, and performing additional checks to prevent certain operations.
  • Updated Lakebase authentication method (#975). The Lakebase authentication method has been updated to utilize a client ID instead of a username, simplifying its use in the context of a Databricks App. The lakebase_user parameter has been replaced with lakebase_client_id, an optional service principal client ID used to connect to Lakebase, defaulting to the caller's identity if not provided. This change enhances the security and reliability of the authentication process, making it easier to work with Lakebase as a checks storage.
  • Updated handling of metadata columns during schema validation (#1002). The has_valid_schema check has been enhanced to provide more flexibility in schema validation by introducing an optional exclude_columns parameter, allowing users to specify columns to ignore during validation. This parameter can be used to exclude metadata columns or other columns not relevant to schema validation, and it takes precedence over the columns list.
  • Updated product info when missing in config while verifying workspace client (#987). The workspace client configuration has been enhanced to default product information to dqx with the current version when it is missing, ensuring that product information is always set for telemetry purposes.
  • Updated profiler and generator documentation (#1026). The data profiling and quality checks generation feature has been enhanced with updated documentation, providing reference information for data quality profile types and associated rules.
  • Added filter attribute in rules generated from ODCS (#978). The rules generation process has been enhanced with the introduction of a filter attribute in rules generated from Open Data Contract Standard (ODCS), allowing for more flexible and targeted rules creation.
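
The tolerance behaviour described for the equality checks (#1011) can be illustrated with a small, library-independent sketch. It is not the DQX implementation: the helper name within_tolerance is hypothetical, and the rules for combining the two thresholds (a value passes if either one is satisfied, and the relative tolerance is measured against the expected value) are assumptions made for illustration only.

```python
def within_tolerance(actual, expected, abs_tolerance=None, rel_tolerance=None):
    """Return True when `actual` counts as equal to `expected`.

    Assumption (not taken from DQX): the comparison passes if EITHER the
    absolute OR the relative threshold is satisfied, and the relative
    threshold is scaled by the magnitude of `expected`.
    """
    diff = abs(actual - expected)
    if abs_tolerance is None and rel_tolerance is None:
        # No tolerance configured: fall back to exact equality.
        return diff == 0
    if abs_tolerance is not None and diff <= abs_tolerance:
        return True
    if rel_tolerance is not None and diff <= rel_tolerance * abs(expected):
        return True
    return False


# Exact comparison fails, but a 2% relative tolerance accepts the value.
print(within_tolerance(101, 100))                      # False
print(within_tolerance(101, 100, rel_tolerance=0.02))  # True
```

With such a rule, an absolute tolerance suits values of a known scale (for example currency amounts), while a relative tolerance adapts to values whose magnitude varies widely.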

Contributors

@mwojtyczka @ghanse @alexott @nehamilak-db @cornzyblack @laurencewells @renardeinside @tlgnr @pierre-monnet @sheeluvikas @ashwin-911 @dwanneruchi @bpm1993 @Jgprog117

Full Changelog: v0.12.0...v0.13.0

v0.12.0

20 Dec 00:13
fff89a7

What's Changed

  • AI-Assisted rules generation from data profiles (#963). AI-assisted data quality rule generation was added, leveraging summary statistics from a profiler to create rules. The DQGenerator class includes a generate_dq_rules_ai_assisted method that can generate rules with or without user-provided input, using summary statistics to inform the rule creation process. This method offers flexibility in rule generation, allowing for both automated and user-guided creation of data quality rules.
  • Added new checks for JSON validation (#616). DQX now includes three new quality checks for JSON data validation, especially useful for validating data coming from streaming systems such as Kafka: is_valid_json, has_json_keys, and has_valid_json_schema. The is_valid_json check verifies whether values in a specified column are valid JSON strings, while the has_json_keys check confirms the presence of specific keys in the outermost JSON object, allowing for optional parameters to require all keys to be present. The has_valid_json_schema check ensures that JSON strings conform to an expected schema, ignoring extra fields not defined in the schema.
  • Added geometry row-level checks (#636). The library has been enhanced with new row-level checks for geometry columns, including checks for area and number of points, such as is_area_not_less_than, is_area_not_greater_than, is_area_equal_to, is_area_not_equal_to, is_num_points_not_less_than, is_num_points_not_greater_than, is_num_points_equal_to, and is_num_points_not_equal_to. These checks allow users to validate geometric data based on specific criteria, with options to specify the spatial reference system (SRID) and use geodesic area calculations. These changes enable more effective validation and quality control of geometric data, and are supported in Databricks serverless compute or runtime versions 17.1 and later.
  • Added support to write using delta table path (#594). The quality check results saving functionality has been enhanced to support saving to Unity Catalog Volume paths, S3, ADLS, or GCS in addition to tables, providing more flexibility in storing and managing results. The save_results_in_table method now accepts output configurations with volume paths, and the OutputConfig object has been updated to support table names with 2 or 3-level namespace, storage paths including Volume paths, S3, ADLS, or GCS, and optional trigger settings for streaming output. Furthermore, the code now supports saving DataFrames to both Delta tables and storage paths, with the save_dataframe_as_table function taking an output_config object that determines whether to save the DataFrame to a table or a path. The functionality includes support for batch and streaming writes, input validation, and error handling, with the existing functionality of saving to Delta tables preserved and new functionality added for saving to storage paths.
  • Extended aggregation check function to support more aggregation types (#951). The aggregation check function has been significantly enhanced to support a wide range of aggregate functions, including 20 curated statistical and percentile-based functions, as well as any Databricks built-in aggregate function, with runtime validation to ensure compatibility and trigger warnings for non-curated functions. The function now accepts an aggr_params parameter to pass parameters to aggregate functions, such as percentile calculations, and supports two-stage aggregation for window-incompatible aggregates like count_distinct. Additionally, the function includes improved error handling, human-readable violation messages, and performance benchmarks for various aggregation scenarios, enabling advanced data quality monitoring and validation capabilities for data engineers and analysts.
  • Added new is_not_in_list check function (#969). A new check function, is_not_in_list, has been added to verify that values in a specified column are not present in a given list of forbidden values, allowing for null values and optional case-insensitive comparisons. This function is suitable for columns that are not of type MapType or StructType, and for optimal performance with large lists of forbidden values, it is recommended to use the foreign_key dataset-level check with the negate argument set to True. The function takes the column to check, the list of forbidden values, and optionally the case sensitivity of the comparison. Its implementation includes input validation and custom error messages, with additional benchmark tests to measure its performance.
  • Improve Generator to emit temporal checks for min/max date & datetime (#624). The data quality generator has been enhanced to support temporal checks for columns with datetime and date types, in addition to numeric types. The generator now creates rules with is_in_range, is_not_less_than, and is_not_greater_than functions based on the provided minimum and maximum limits, ensuring correct comparison by verifying that both limit values are of the same type. This update preserves the existing numeric behavior and introduces support for timestamp and date checks, while maintaining the ability to handle Python numeric types without stringification.
  • Improved sql query check function to make merge columns parameter optional (#945). The sql_query check has been enhanced to support both row-level and dataset-level validation, allowing for more flexible data validation scenarios. In row-level validation, the check joins query results back to the input data to mark specific rows, whereas in dataset-level validation, the check result applies to all rows, making it suitable for aggregate validations with custom metrics. The merge_columns parameter is now optional, and when not provided, the check performs a dataset-level validation, providing a convenient way to validate entire datasets without requiring specific column mappings. Additionally, the check has been made more robust with input validation and error handling, ensuring that users can perform checks at both the row and dataset levels while preventing incorrect usage with informative error messages.
  • Outlier detection numerical values (#944). The has_no_outliers function has been introduced to detect outliers in numeric columns using the Median Absolute Deviation (MAD) method, which calculates the lower and upper limits as median - 3.5 * MAD and median + 3.5 * MAD, respectively, and considers values outside these limits as outliers. The function is designed to work with numeric columns of type int, float, long, and decimal, and it raises an error if the specified column is not of numeric type. The addition of this function enables the detection of outlier numeric values, enhancing the overall data validation capabilities.
  • Library improvements (#966). The library has undergone updates to improve its functionality, performance, and documentation. The has_json_keys function has been updated to treat NULL values as valid, ensuring consistent behavior across ANSI and non-ANSI modes. Additionally, the functionality of saving DataFrames as tables has been improved, with updated regular expression patterns for table names and enhanced handling of streaming and non-streaming DataFrames.
  • Updated has_valid_schema check to accept a reference dataframe or table (#960). The has_valid_schema check has been enhanced to support validation against a reference dataframe or table, in addition to the existing expected schema. This allows users to verify the schema of their input dataframe against a reference dataframe or table by specifying either the ref_df_name or ref_table parameter, with exactly one of expected_schema, ref_df_name, or ref_table required. The check can be performed in strict mode for exact schema matching or in non-strict mode, which permits extra columns, and users can also specify particular columns to validate using the columns parameter. The function's update includes improved parameter validation, ensuring that only one valid schema source is specified, and new test cases have been added to cover various scenarios, including the use of reference tables and dataframes for schema validation, as well as parameter validation logic.
  • Updated dashboards deployment to use standard lakeview dashboard definitions (#950). The dashboard installer has been updated to use standard Lakeview dashboard definitions.
  • Added null island geometry check function (#613). A new quality check called is_not_null_island has been introduced to verify that values in a specified column are not null island geometries, such as POINT(0 0), POINTZ(0 0 0), or POINTZM(0 0 0 0). The is_not_null_island function requires Databricks serverless compute or runtime version 17.1 or higher.
  • Added float support for range and compare functions (#962). The comparison and validation functions have been enhanced to support float values, in addition to existing support for integers, dates, timestamps, and strings. This update allows for more flexible and nuanced comparisons and range checks, enabling precise and robust validation operations, particularly in scenarios involving decimal or fractional values. The...
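
The outlier limits described for has_no_outliers (#944), median - 3.5 * MAD and median + 3.5 * MAD, can be worked through in plain Python. This is a minimal sketch of the arithmetic only, not the DQX function: the helper names mad_limits and find_outliers are hypothetical, and the sample data is made up.

```python
from statistics import median


def mad_limits(values, k=3.5):
    """Return (lower, upper) limits as median -/+ k * MAD."""
    med = median(values)
    # Median Absolute Deviation: median of absolute distances to the median.
    mad = median([abs(v - med) for v in values])
    return med - k * mad, med + k * mad


def find_outliers(values, k=3.5):
    """Return the values that fall outside the MAD-based limits."""
    lower, upper = mad_limits(values, k)
    return [v for v in values if v < lower or v > upper]


readings = [10, 12, 11, 13, 12, 11, 95]
print(mad_limits(readings))     # (8.5, 15.5)
print(find_outliers(readings))  # [95]
```

Here the median is 12 and the MAD is 1, so anything outside 8.5 to 15.5 is flagged. Because both statistics are medians, the single extreme value does not shift the limits the way it would shift a mean and standard deviation.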

v0.11.1

02 Dec 12:13
d200468

What's Changed

  • Hotfix to update log level for spark connect to suppress dlt telemetry warnings in non-dlt serverless clusters.

Contributors: @mwojtyczka

v0.11.0

01 Dec 23:40
70c00fe

  • Generation of DQX rules from ODCS Data Contracts (#932). The Data Contract Quality Rules Generation feature has been introduced, enabling users to generate data quality rules directly from data contracts following the Open Data Contract Standard (ODCS). This feature supports three types of rule generation: predefined rules derived from schema properties and constraints, explicit DQX rules embedded in the contract, and text-based rules defined in natural language and processed by a Large Language Model (LLM) to generate appropriate checks. The feature provides rich metadata tracing generated rules back to the source contract for lineage and governance, and it can be used to implement federated data governance, standardize data contracts, and maintain version-controlled quality rules alongside schema definitions.
  • AI-Assisted Primary Key Detection and Uniqueness Rules Generation (#934). Introduced AI-assisted primary key detection and uniqueness rules generation capabilities, leveraging Large Language Models (LLMs) to analyze table schema and metadata. This feature analyzes table schemas and metadata to intelligently detect single or composite primary keys, and performs validation by checking for duplicate values. The DQProfiler class now includes a detect_primary_keys_with_llm method, which returns a dictionary containing the primary key detection result, including the table name, success status, detected primary key columns, confidence level, reasoning, and error message if any. The DQGenerator class has been extended to utilize uniqueness profiles from the profiler for AI-assisted uniqueness rules generation. Various updates have been made to the configuration options, including the addition of an llm_primary_key_detection option, which allows users to control whether AI-assisted primary key detection is enabled or disabled.
  • AI-Assisted Rules Generation Improvements (#925). The AI-Assisted Rules Generation feature has been enhanced to handle input as a path in addition to a table, and to generate rules with a filter. The generate_dq_rules_ai_assisted method now accepts an InputConfig object, which allows users to specify the location and format of the input data, enabling more flexible input handling and filtering capabilities. The feature includes test cases to verify its functionality, including manual tests, unit tests, and integration tests, and the documentation has been updated with minor changes to reflect the new functionality. Additionally, the code has been modified to capitalize keywords to stabilize integration tests, and the DQGenerator class has been updated to accommodate the changes, allowing users to generate data quality rules from a variety of input sources. The InputConfig class provides a flexible way to configure the input data, including its location and format, and the get_column_metadata function has been introduced to retrieve column metadata from a given location. Overall, these updates aim to enhance the functionality and usability of the AI-assisted rules generation feature, providing more flexibility and accuracy in generating data quality rules.
  • Added case-insensitive comparison support to is_in_list and is_not_null_and_is_in_list checks (#673). The is_in_list and is_not_null_and_is_in_list check functions have been enhanced to support case-insensitive comparison, allowing users to choose between case-sensitive and case-insensitive comparisons via an optional case_sensitive boolean flag that defaults to True. These checks verify if values in a specified column are present in a list of allowed values, with the is_not_null_and_is_in_list check also requiring the values to be non-null. The updated checks provide more flexibility in data validation, enabling users to configure parameters such as the column to check, the list of allowed values, and the case sensitivity flag. However, it is recommended to use the foreign_key dataset-level check for large lists of allowed values or for columns of type MapType or StructType, as these checks are not suitable for such scenarios.
  • Added documentation for using DQX in streaming scenarios with foreach batch (#948). Documentation and example code snippets were added to demonstrate how to apply checks in the foreachBatch Structured Streaming function.
  • Added telemetry to track count of input tables (#954). Added additional telemetry for better tracking of DQX usage to help improve the product.
  • Added support for installing DQX from private PYPI repositories (#930). The DQX library has been enhanced with support for installing DQX using a company-hosted PyPI mirror, which is necessary for enterprises that block the public PyPI index. Documentation has been added to describe the feature. The tool installation code has been modified to automatically upload dependencies to a workspace when internet access is blocked.
  • Support Custom Folder Installation for CLI Commands (#942). The command-line interface (CLI) has been enhanced to support custom installation folders, providing users with greater flexibility when working with the library. A new --install-folder argument has been introduced, allowing users to specify a custom installation folder when running various CLI commands, such as opening dashboards, workflows, logs, and profiles. This argument overrides the default installation location to support scenarios where the user installs DQX in a custom location. The library's dependency on sqlalchemy has also been updated to require a version greater than or equal to 2.0 and less than 3.0 to avoid dependency issues in older DBRs.
  • Enhancement to end to end tests (#921). The e2e tests have been enhanced to test integration with the dbt transformation framework. Additionally, the documentation for contributing to the project and testing has been updated to simplify the setup process for running tests locally.
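
The case_sensitive flag described for is_in_list and is_not_null_and_is_in_list (#673) amounts to normalizing both sides before the membership test. The sketch below shows that idea in plain Python and is not the DQX implementation: the helper passes_in_list and its require_not_null parameter are hypothetical, and treating a null as passing the plain list check is an assumption inferred from the wording above.

```python
def passes_in_list(value, allowed, case_sensitive=True, require_not_null=False):
    """Return True when `value` is accepted by an allowed-values check.

    Assumption: a null (None) passes unless `require_not_null` is set,
    mirroring the difference between the two checks described above.
    """
    if value is None:
        return not require_not_null
    if case_sensitive:
        return value in allowed

    def normalize(item):
        # Only strings have a notion of case; other types compare as-is.
        return item.lower() if isinstance(item, str) else item

    return normalize(value) in {normalize(item) for item in allowed}


countries = ["us", "de"]
print(passes_in_list("US", countries))                        # False
print(passes_in_list("US", countries, case_sensitive=False))  # True
print(passes_in_list(None, countries, require_not_null=True)) # False
```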

BREAKING CHANGES!

  • Renamed level parameter to criticality in generate_dq_rules method of DQGenerator for consistency.
  • Replaced table: str parameter with input_config: InputConfig in profile_table method of DQProfiler for greater flexibility.
  • Replaced table_name: str parameter with input_config: InputConfig in generate_dq_rules_ai_assisted method of DQGenerator for greater flexibility.

Contributors: @dinbab1984, @mwojtyczka, @ghanse, @vb-dbrks, @jominjohny, @AdityaMandiwal

v0.10.0

07 Nov 08:11
6432e45

  • Added Data Quality Summary Metrics (#553). The data quality engine has been enhanced with the ability to track and manage summary metrics for data quality validation, leveraging Spark's Observation feature. A new DQMetricsObserver class has been introduced to manage Spark observations and track summary metrics on datasets checked with the engine. The DQEngine class has been updated to optionally return the Spark observation associated with a given run, allowing users to access and save summary metrics. The engine now also supports writing summary metrics to a table using the metrics_config parameter, and a new save_summary_metrics method has been added to save data quality summary metrics to a table. Additionally, the engine has been updated to include a unique run_id field in the detailed per-row quality results, enabling cross-referencing with summary metrics. The changes also include updates to the configuration file to support the storage of summary metrics.
  • LLM assisted rules generation (#577). This release introduces a significant enhancement to the data quality rules generation process with the integration of AI-assisted rules generation using large language models (LLMs). The DQGenerator class now includes a generate_dq_rules_ai_assisted method, which takes user input in natural language and optionally a schema from an input table to generate data quality rules. These rules are then validated for correctness. The AI-assisted rules generation feature supports both programmatic and no-code approaches. Additionally, the feature enables the use of different LLM models and gives the possibility to use custom check functions. The release also includes various updates to the documentation, configuration files, and testing framework to support the new AI-assisted rules generation feature, ensuring a more streamlined and efficient process for defining and applying data quality rules.
  • Added Lakebase checks storage backend (#550). A Lakebase checks storage backend was added, allowing users to store and manage their data quality rules in a centralized Lakebase table, in addition to the existing Delta table storage. The checks_location resolution has been updated to accommodate Lakebase, supporting both table and file storage, with flexible formatting options, including "catalog.schema.table" and "database.schema.table". The Lakebase checks storage backend is configurable through the LakebaseChecksStorageConfig class, which includes fields for instance name, user, location, port, run configuration name, and write mode. This update provides users with more flexibility in storing and loading quality checks, ensuring that checks are saved correctly regardless of the specified location format.
  • Added runtime validation of sql expressions (#625). The data quality check functionality has been enhanced with runtime validation of SQL expressions, ensuring that specified fields can be resolved in the input DataFrame and that SQL expressions are valid before evaluation. If an SQL expression is invalid, the check evaluation is skipped and the results include a check failure with a descriptive message. Additionally, the configuration validation for Unity Catalog volume file paths has been improved to enforce a specific format, preventing invalid configurations and providing more informative error messages.
  • Fixed docs (#598). The documentation build process has undergone significant improvements to enhance efficiency and maintainability.
  • Improved Config Serialization (#676). Several updates have been made to improve the functionality, consistency, and maintainability of the codebase. The configuration loading functionality has been refactored to utilize the ConfigSerializer class, which handles the serialization and deserialization of workspace and run configurations.
  • Restore use of hatch-fancy-pypi-readme to fix images in PyPi (#601). The image source path for the logo in the README has been modified to correctly display the logo image when rendered, particularly on PyPi.
  • Skip check evaluation if columns or filter cannot be resolved in the input DataFrame (#609). DQX now skips check evaluation if columns or filters are incorrect, allowing other checks to proceed even if one rule fails. The DQX engine validates the specified column, columns and filter fields against the input DataFrame before applying checks, skipping evaluation and providing informative error messages if any fields are invalid.
  • Updated user guide docs (#607). The documentation for quality checking and integration options has been updated to provide accurate and detailed information on supported types and approaches. Quality checking can be performed in-transit (pre-commit), validating data on the fly during processing, or at-rest, checking existing data stored in tables.
  • Improved build process (#618). The hatch version has been updated to 1.15.0 to avoid compatibility issues with click version 8.3 and later, which introduced a bug affecting hatch. Additionally, the project's dependencies have been updated, including bumping the databricks-labs-pytester version from 0.7.2 to 0.7.4, and code refactoring has been done to use a single Lakebase instance for all integration tests, with retry logic added to handle cases where the workspace quota limit for the number of Lakebase instances is exceeded, enhancing the testing infrastructure and improving test reliability. Furthermore, documentation updates have been made to clarify the application of quality checks to data using DQX. These changes aim to improve the efficiency, reliability, and clarity of the project's testing and documentation infrastructure.
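
The skip-on-unresolved-column behaviour (#609) can be pictured as a validation step that runs before each check. The sketch below models it over plain dictionaries and is only an illustration of the idea, not the DQX engine: apply_checks, the check dictionaries, and the message formats are all hypothetical.

```python
def apply_checks(rows, available_columns, checks):
    """Evaluate each check, skipping those that reference unknown columns.

    `checks` is a list of dicts with hypothetical keys: "name", "columns"
    (column names the check reads) and "predicate" (row -> bool).
    """
    results = {}
    for check in checks:
        missing = [c for c in check["columns"] if c not in available_columns]
        if missing:
            # Record a descriptive message and move on to the next check.
            results[check["name"]] = f"skipped: unresolved columns {missing}"
            continue
        failed = sum(1 for row in rows if not check["predicate"](row))
        results[check["name"]] = f"evaluated: {failed} failing row(s)"
    return results


rows = [{"id": 1, "amount": 5}, {"id": None, "amount": -1}]
checks = [
    {"name": "id_not_null", "columns": ["id"],
     "predicate": lambda r: r["id"] is not None},
    {"name": "country_set", "columns": ["country"],
     "predicate": lambda r: bool(r["country"])},
    {"name": "amount_positive", "columns": ["amount"],
     "predicate": lambda r: r["amount"] > 0},
]
for name, outcome in apply_checks(rows, {"id", "amount"}, checks).items():
    print(name, "->", outcome)
```

The check on the missing country column is reported as skipped, while the two checks around it still run.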

BREAKING CHANGES!

  • Added new field run_id to the detailed per-row quality results. This may or may not be a breaking change for you depending on how you leverage the results today. This is a unique run ID recorded in the summary metrics as well as detailed quality checking results to enable cross-referencing. When reusing the same DQEngine instance, the run ID stays the same: each apply checks execution does not generate a new run ID for the same instance. It only changes when a new engine (and observer, if using one) is created.

LIMITATIONS

  • Saving metrics to a table requires using a classic compute cluster in Dedicated Access Mode. This limitation will be lifted once the observations issue is fixed in Spark Connect.

Contributors: @mwojtyczka, @ghanse, @souravg-db2, @vb-dbrks, @alexott, @tlgnr

v0.9.3

03 Oct 16:53
d9d47a9

  • Added support for running checks on multiple tables (#566). Added more flexibility and functionality in running data quality checks, allowing users to run checks on multiple tables in a single method call and as part of Workflows execution. Provided options to run checks for all configured run configs or for a specific run config, or for tables/views matching wildcard patterns. The CLI commands for running workflows have been updated to reflect and support these new functionalities. Additionally, new parameters have been added to the configuration file to control the level of parallelism for these operations, such as profiler_max_parallelism and quality_checker_max_parallelism. A new demo has been added to showcase how to use the profiler and apply checks across multiple tables. The changes aim to improve scalability of DQX.
  • Added New Row-level Checks: IPv6 Address Validation (#578). DQX now includes 2 new row-level checks: validation of IPv6 address (is_valid_ipv6_address check function), and validation if IPv6 address is within provided CIDR block (is_ipv6_address_in_cidr check function).
  • Added New Dataset-level Check: Schema Validation check (#568). The has_valid_schema check function has been introduced to validate whether a DataFrame conforms to a specified schema, with results reported at the row level for consistency with other checks. This function can operate in non-strict mode, where it verifies the existence of expected columns with compatible types, or in strict mode, where it enforces an exact schema match, including column order and types. It accepts parameters such as the expected schema, which can be defined as a DDL string or a StructType object, and optional arguments to specify columns to validate and strict mode.
  • Added New Row-level Checks: Spatial data validations (#581). Specialized data validation checks for geospatial data have been introduced, enabling verification of valid latitude and longitude values, various geometry and geography types, such as points, linestrings, polygons, multipoints, multilinestrings, and multipolygons, as well as checks for Open Geospatial Consortium (OGC) validity, non-empty geometries, and specific dimensions or coordinate ranges. These checks are implemented as check functions, including is_latitude, is_longitude, is_geometry, is_geography, is_point, is_linestring, is_polygon, is_multipoint, is_multilinestring, is_multipolygon, is_ogc_valid, is_non_empty_geometry, has_dimension, has_x_coordinate_between, and has_y_coordinate_between. The addition of these geospatial data validation checks enhances the overall data quality capabilities, allowing for more accurate and reliable geospatial data processing and analysis. Running these checks requires Databricks serverless or cluster with runtime 17.1 or above.
  • Added absolute and relative tolerance to comparison of datasets (#574). The compare_datasets check has been enhanced with the introduction of absolute and relative tolerance parameters, enabling more flexible comparisons of decimal values. These tolerances can be applied to numeric columns.
  • Added detailed telemetry (#561). Telemetry has been enhanced across multiple functionalities to provide better visibility into DQX usage, including which features and checks are used most frequently. This will help us focus development efforts on the areas that matter most to our users.
  • Allow installation in a custom folder (#575). The installation process for the library has been enhanced to offer flexible installation options, allowing users to install the library in a custom workspace folder, in addition to the default user home directory or a global folder. When installing DQX as a workspace tool using the Databricks CLI, users are prompted to optionally specify a custom workspace path for the installation. Allowing a custom installation folder makes it possible to use DQX on group-assigned clusters.
  • Profile subset dataframe (#589). The data profiling feature has been enhanced to allow users to profile and generate rules on a subset of the input data by introducing a filter option, which is a string SQL expression that can be used to filter the input data. This filter can be specified in the configuration file or when using the profiler, providing more flexibility in analyzing subsets of data. The profiler supports extensive configuration options to customize the profiling process, including sampling, limiting, and computing statistics on the sampled data. The new filter option enables users to generate more targeted and relevant rules, and it can be used to focus on particular segments of the data, such as rows that match certain conditions.
  • Added custom exceptions (#582). The codebase now utilizes custom exceptions to handle various error scenarios, providing more specific and informative error messages compared to generic exceptions.
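
To illustrate what absolute and relative tolerances mean when comparing numeric values (#574), here is a minimal sketch in plain Python. It is not the DQX implementation: the function name values_match and the parameter names abs_tolerance and rel_tolerance are hypothetical, and the exact formula used by compare_datasets may differ from the one in math.isclose.

```python
import math


def values_match(
    actual: float,
    expected: float,
    abs_tolerance: float = 0.0,  # hypothetical parameter name
    rel_tolerance: float = 0.0,  # hypothetical parameter name
) -> bool:
    """Return True when two numbers are equal within either tolerance.

    math.isclose treats the values as equal when
    abs(a - b) <= max(rel_tol * max(abs(a), abs(b)), abs_tol).
    With both tolerances at 0.0 this is an exact equality test.
    """
    return math.isclose(
        actual, expected, rel_tol=rel_tolerance, abs_tol=abs_tolerance
    )


# A difference of about 0.4 passes with an absolute tolerance of 0.5.
within_abs = values_match(100.0, 100.4, abs_tolerance=0.5)
# A difference of 2.0 on values near 100 fails a 1% relative tolerance.
within_rel = values_match(100.0, 102.0, rel_tolerance=0.01)
```

An absolute tolerance suits columns with a known fixed precision (for example, currency rounded to cents), while a relative tolerance scales with the magnitude of the values being compared.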

BREAKING CHANGES!

  • Workflows run by default for all run configs from configuration file. Previously, the default behaviour was to run them for a specific run config only.
  • The following deprecated methods are removed from the DQEngine: load_checks_from_local_file, load_checks_from_workspace_file, load_checks_from_table, load_checks_from_installation, save_checks_in_local_file, save_checks_in_workspace_file, save_checks_in_table, save_checks_in_installation, load_run_config. For loading and saving checks, users are advised to use load_checks and save_checks of the DQEngine described here, which support various storage types.

Contributors: @mwojtyczka, @ghanse, @tdikland, @Divya-Kovvuru-0802, @cornzyblack, @STEFANOVIVAS

v0.9.2

05 Sep 19:59
a22ab79

  • Added performance benchmarks (#548). Performance tests are run to ensure that no change degrades performance by more than 25%. Benchmark results are published in the documentation in the reference section. The benchmark covers all check functions, running all functions at once and applying the same functions at once for multiple columns using foreach column. A new performance GitHub workflow has been introduced to automate performance benchmarking, generating a new benchmark baseline, updating the existing baseline, and running performance tests to compare with the baseline.
  • Declare readme in the project (#547). The project configuration has been updated to include the README file in the released package so that it is visible on PyPI.
  • Fixed deserializing to DataFrame to assign columns properly (#559). The deserialize_checks_to_dataframe function has been enhanced to correctly handle columns for sql_expression by removing the unnecessary check for DQDatasetRule instance and directly verifying if dq_rule_check.columns is not None.
  • Fixed lsql dependency (#564). The lsql dependency has been updated to address a sqlglot dependency issue that arises when imported in artifacts repositories.
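
The 25% degradation rule from the performance benchmarks entry (#548) can be sketched as a simple comparison against a baseline. This is an assumption-laden illustration, not the project's benchmark code: the function name and the use of elapsed seconds as the metric are hypothetical.

```python
def exceeds_regression_threshold(
    baseline_seconds: float,
    current_seconds: float,
    max_degradation: float = 0.25,  # 25%, as stated in the release note
) -> bool:
    """Return True when the current run is slower than the baseline
    by more than the allowed fraction."""
    if baseline_seconds <= 0:
        raise ValueError("baseline_seconds must be positive")
    slowdown = (current_seconds - baseline_seconds) / baseline_seconds
    return slowdown > max_degradation


# 10s -> 12s is a 20% slowdown and stays within the limit;
# 10s -> 13s is a 30% slowdown and exceeds it.
ok_run = exceeds_regression_threshold(10.0, 12.0)
slow_run = exceeds_regression_threshold(10.0, 13.0)
```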

Contributors: @mwojtyczka @ghanse @cornzyblack @gchandra10

v0.9.1

25 Aug 10:56
ee802c4

  • Added quality checker and end to end workflows (#519). This release introduces a no-code solution for applying checks. The following workflows were added: quality-checker (apply checks and save results to tables) and end-to-end (e2e) workflows (profile input data, generate quality checks, apply the checks, save results to tables). The workflows enable quality checking for data at-rest without the need for code-level integration. They support reference data for checks using tables (e.g., required by foreign key or compare datasets checks) as well as custom Python check functions (mapping of a custom check function to the module path in the workspace or Unity Catalog volume containing the function definition). The workflows handle one run config for each job run. A future release will introduce functionality to execute this across multiple tables. In addition, CLI commands have been added to execute the workflows. Additionally, DQX workflows are now configured to execute using serverless clusters, with an option to use standard clusters as well. InstallationChecksStorageHandler now supports absolute workspace path locations.
  • Added built-in row-level check for PII detection (#486). Introduced a new built-in check for Personally Identifiable Information (PII) detection, which utilizes the Presidio framework and can be configured using various parameters, such as NLP entity recognition configuration. This check can be defined using the does_not_contain_pii check function and can be customized to suit specific use cases. The check requires pii extras to be installed: pip install databricks-labs-dqx[pii]. Furthermore, a new enum class NLPEngineConfig has been introduced to define various NLP engine configurations for PII detection. Overall, these updates aim to provide more robust and customizable quality checking capabilities for detecting PII data.
  • Added equality row-level checks (#535). Two new row-level checks, is_equal_to and is_not_equal_to, have been introduced to enable equality checks on column values, allowing users to verify whether the values in a specified column are equal to or not equal to a given value, which can be a numeric literal, column expression, string literal, date literal, or timestamp literal.
  • Added demo for Spark Structured Streaming (#518). Added a demo to showcase usage of DQX with Spark Structured Streaming for in-transit data quality checking. The demo is available as a Databricks notebook and can be run on any Databricks workspace.
  • Added clarification to profiler summary statistics (#523). Added new section on understanding summary statistics, which explains how these statistics are computed on a sampled subset of the data and provides a reference for the various summary statistics fields.
  • Fixed rounding datetimes in the checks generator (#517). The generator has been enhanced to correctly handle midnight values when rounding "up", ensuring that datetime values already at midnight remain unchanged, whereas previously they were rounded to the next day.
  • Added API Docs (#520). The DQX API documentation is generated automatically using docstrings. As part of this change the library's documentation has been updated to follow Google style.
  • Improved test automation by adding end-to-end test for the asset bundles demo (#533).
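
The rounding fix described in #517 can be illustrated with a small plain-Python sketch. It shows the intended behaviour only (a value already at midnight stays unchanged, anything later rounds up to the next midnight); the function name is hypothetical and this is not the generator's actual code.

```python
from datetime import datetime, timedelta


def round_up_to_midnight(value: datetime) -> datetime:
    """Round a datetime up to the next midnight, leaving values
    that are already exactly at midnight unchanged."""
    midnight = value.replace(hour=0, minute=0, second=0, microsecond=0)
    if value == midnight:
        # Already on a day boundary: do not push it to the next day.
        return value
    return midnight + timedelta(days=1)


at_midnight = round_up_to_midnight(datetime(2025, 8, 25, 0, 0))
mid_morning = round_up_to_midnight(datetime(2025, 8, 25, 10, 56))
```

Here at_midnight stays on 25 August, while mid_morning moves to midnight on 26 August.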

BREAKING CHANGES!

  • ExtraParams was moved from databricks.labs.dqx.rule module to databricks.labs.dqx.config

Contributors: @mwojtyczka @ghanse @renardeinside @cornzyblack @bsr-the-mngrm @dinbab1984 @AdityaMandiwal

v0.8.0

06 Aug 22:18
fe5d6a3

  • Added new row-level freshness check (#495). A new data quality check function, is_data_fresh, has been introduced to identify stale data resulting from delayed pipelines, enabling early detection of upstream issues. This function assesses whether the values in a specified timestamp column are within a specified number of minutes from a base timestamp column. The function takes three parameters: the column to check, the maximum age in minutes before data is considered stale, and an optional base timestamp column, defaulting to the current timestamp if not provided.
  • Added new dataset-level freshness check (#499). A new dataset-level check function, is_data_fresh_per_time_window, has been added to validate whether at least a specified minimum number of records arrive within every specified time window, ensuring data freshness. This function is customizable, allowing users to define the time window, minimum records per window, and lookback period.
  • Improvements have been made to the performance of aggregation check functions, and the check message format has been updated for better readability.
  • Created llm util function to get check functions details (#469). A new utility function has been introduced to provide definitions of all check functions, enabling the generation of prompts for Large Language Models (LLMs) to create check functions.
  • Added equality safe row and column matching in compare datasets check (#473). The compare datasets check functionality has been enhanced to handle null values during row matching and column value comparisons, improving its robustness and flexibility. Two new optional parameters, null_safe_row_matching and null_safe_column_value_matching, have been introduced to control how null values are handled, both defaulting to True. These parameters allow for null-safe primary key matching and column value matching, ensuring accurate comparison results even when null values are present in the data. The check now excludes specific columns from value comparison using the exclude_columns parameter while still considering them for row matching.
  • Fixed datetime rounding logic in profiler (#483). The datetime rounding logic has been improved in profiler to respect the round=False option, which was previously ignored. The code now handles the OverflowError that occurs when rounding up the maximum datetime value by capping the result and logging a warning.
  • Added loading and saving checks from file in Unity Catalog Volume (#512). This change introduces support for storing quality checks in a Unity Catalog Volume, in addition to existing storage types such as tables, files, and workspace files. The storage location of quality checks has been unified into a single configuration field called checks_location, replacing the previous checks_file and checks_table fields, to simplify the configuration and remove ambiguity by ensuring only one storage location can be defined per run configuration. The checks_location field can point to a file in the local path, workspace, installation folder, or Unity Catalog Volume, providing users with more flexibility and clarity when managing their quality checks.
  • Refactored methods for loading and saving checks (#487). The DQEngine class has undergone significant changes to improve modularity and maintainability, including the unification of methods for loading and saving checks under the load_checks and save_checks methods, which take a config parameter to determine the storage type, such as FileChecksStorageConfig, WorkspaceFileChecksStorageConfig, TableChecksStorageConfig, or InstallationChecksStorageConfig.
  • Storing checks using dqx classes (#474). The data quality engine has been enhanced with methods to convert quality checks between DQRule objects and Python dictionaries, allowing for flexibility in check definition and usage. The serialize_checks method converts a list of DQRule instances into a dictionary representation, while the deserialize_checks method performs the reverse operation, converting a dictionary representation back into a list of DQRule instances. Additionally, the DQRule class now includes a to_dict method to convert a DQRule instance into a structured dictionary, providing a standardized representation of the rule's metadata. These changes enable users to work with checks in both formats, store and retrieve checks easily, and improve the overall management and storage of data quality checks. The conversion process supports local execution and handles non-complex column expressions, although complex PySpark expressions or Python functions may not be fully reconstructable when converting from class to metadata format.
  • Added llm utility function to extract checks examples in yaml from docs (#506). This is achieved through a new Python script that extracts YAML examples from MDX documentation files and creates a combined YAML file with all the extracted examples. The script utilizes regular expressions to extract YAML code blocks from MDX content, validates each YAML block, and combines all valid blocks into a single list. The combined YAML file is then created in the LLM resources directory for use in language model processing.
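
The row-level freshness logic described for is_data_fresh (#495) can be sketched in plain Python as follows. This is only an illustration of the rule (a timestamp is fresh when it is no older than a maximum age, measured from a base timestamp that defaults to the current time); the function and parameter names are hypothetical, and the real check operates on Spark columns rather than Python datetimes.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_fresh(
    event_time: datetime,
    max_age_minutes: int,
    base_time: Optional[datetime] = None,
) -> bool:
    """Return True when event_time is at most max_age_minutes older
    than base_time (or than the current UTC time if base_time is omitted).

    Pass timezone-aware datetimes when relying on the default base time,
    since Python cannot subtract naive and aware datetimes.
    """
    base = base_time if base_time is not None else datetime.now(timezone.utc)
    return base - event_time <= timedelta(minutes=max_age_minutes)


base = datetime(2025, 8, 6, 22, 18)
recent = is_fresh(datetime(2025, 8, 6, 22, 0), 30, base)  # 18 minutes old
stale = is_fresh(datetime(2025, 8, 6, 21, 0), 30, base)   # 78 minutes old
```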

BREAKING CHANGES!

  • The checks_file and checks_table fields have been removed from the installation run configuration. They are now consolidated into the single checks_location field. This change simplifies the configuration and clearly defines where checks are stored.
  • The load_run_config method has been moved to config_loader.RunConfigLoader, as it is not intended for direct use and falls outside the DQEngine core responsibilities.

DEPRECATION CHANGES!

If you are loading or saving checks from a storage (file, workspace file, table, installation), you are affected. We are deprecating the below methods. We are keeping the methods in the DQEngine but you should update your code as these methods will be removed in future versions.

  • Loading checks from storage has been unified under the load_checks method. The following methods have been deprecated in the DQEngine:
    load_checks_from_local_file, load_checks_from_workspace_file, load_checks_from_installation, load_checks_from_table.
  • Saving checks in storage has been unified under the save_checks method. The following methods have been deprecated in the DQEngine:
    save_checks_in_local_file, save_checks_in_workspace_file, save_checks_in_installation, save_checks_in_table.

The save_checks and load_checks take config as a parameter, which determines the storage types used. The following storage configs are currently supported:

  • FileChecksStorageConfig: file in the local filesystem (YAML or JSON)
  • WorkspaceFileChecksStorageConfig: file in the workspace (YAML or JSON)
  • TableChecksStorageConfig: a table
  • InstallationChecksStorageConfig: storage defined in the installation context, using the checks_location field from the run configuration.

Contributors: @mwojtyczka, @karthik-ballullaya-db, @bsr-the-mngrm, @ajinkya441, @cornzyblack, @ghanse, @jominjohny, @dinbab1984

v0.7.1

23 Jul 16:41
6607b36

  • Added type validation for apply checks method (#465). The library now enforces stricter type validation for data quality rules, ensuring all elements in the checks list are instances of DQRule. If invalid types are encountered, a TypeError is raised with a descriptive error message, suggesting alternative methods for passing checks as dictionaries. Additionally, input attribute validation has been enhanced to verify the criticality value, which must be either "warn" or "error", and raises a ValueError for invalid values.
  • Databricks Asset Bundle (DAB) demo (#443). A new demo showcasing the usage of DQX with DAB has been added.
  • Check to compare datasets (#463). A new dataset-level check, compare_datasets, has been introduced to compare two DataFrames at both row and column levels, providing detailed information about differences, including new or missing rows and column-level changes. This check compares only columns present in both DataFrames, excludes map type columns, and can be customized to exclude specific columns or perform a FULL OUTER JOIN to identify missing records. The compare_datasets check can be used with a reference DataFrame or table name, and its results include information about missing and extra rows, as well as a map of changed columns and their differences.
  • Demo on how to use DQX with dbt projects (#460). A new demo has been added to showcase how to use DQX with the dbt transformation framework.
  • IP V4 address validation (#464). The library has been enhanced with new checks to validate IPv4 addresses. Two new row checks, is_valid_ipv4_address and is_ipv4_address_in_cidr, have been introduced to verify whether values in a specified column are valid IPv4 addresses and whether they fall within a given CIDR block, respectively.
  • Improved loading checks from Delta table (#462). Loading checks from Delta tables has been improved to eliminate the need to escape string arguments, providing a more robust and user-friendly experience for working with quality checks defined in Delta tables.
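
To make the two IPv4 conditions from #464 concrete, the sketch below expresses them with Python's standard ipaddress module. It is not how DQX implements is_valid_ipv4_address or is_ipv4_address_in_cidr (those run as Spark column expressions); the helper names here are hypothetical, and edge cases such as leading zeros may be treated differently by the library.

```python
import ipaddress


def looks_like_ipv4(value: str) -> bool:
    """Return True when value parses as a dotted-quad IPv4 address."""
    try:
        ipaddress.IPv4Address(value)
        return True
    except ValueError:
        return False


def ipv4_in_cidr(value: str, cidr: str) -> bool:
    """Return True when value is a valid IPv4 address inside the CIDR block.

    strict=False accepts blocks written with host bits set,
    such as "192.168.1.5/24".
    """
    try:
        network = ipaddress.IPv4Network(cidr, strict=False)
        return ipaddress.IPv4Address(value) in network
    except ValueError:
        # Invalid address or invalid CIDR block.
        return False


valid = looks_like_ipv4("192.168.1.10")
invalid = looks_like_ipv4("256.0.0.1")
inside = ipv4_in_cidr("192.168.1.10", "192.168.1.0/24")
outside = ipv4_in_cidr("10.0.0.1", "192.168.1.0/24")
```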

Contributors: @mwojtyczka, @cornzyblack, @ghanse, @grusin-db