Releases: databrickslabs/dqx
v0.13.0
What's Changed
- New DQX Data Quality Dashboard (#1019). The data quality dashboard has been significantly enhanced to provide a centralized view of data quality metrics across all tables, allowing users to monitor and track data quality issues with greater ease. The dashboard now consists of three tabs - Data Quality Summary, Data Quality by Table (Time Series), and Data Quality by Table (Full Snapshot) - each catering to different monitoring scenarios, and offers customizable parameters for reporting column names and filtering tables with data quality issues. Additionally, the installation process for the dashboard has been simplified, with options to import it directly to a Workspace or deploy it automatically using the Databricks CLI.
- DQX App Skeleton (#982). The DQX application (frontend and backend) has been built with a core set of features, including configuration management and AI-assisted rule generation based on natural-language input from users. A comprehensive README documents the application architecture as well as development and deployment workflows. Future versions of DQX will introduce additional functionality (loading/saving rules, rules authoring in graphical form) and provide a streamlined, user-friendly way to deploy the application directly into a Databricks workspace.
- Added Decimal support to check functions and to min_max generator (#1013) (#1017). The data quality checks have been enhanced to support Python's Decimal type, in addition to int and float, for min/max validation checks, enabling proper data quality checks for decimal-precise financial and scientific data where floating-point precision issues would cause false positives.
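To make the precision point concrete, here is a minimal pure-Python sketch of a min/max range check over `Decimal` values (illustrative only, not DQX's Spark implementation; the function name is invented):

```python
from decimal import Decimal

def in_range(value, min_limit, max_limit):
    # Works uniformly for int, float, and Decimal limits; Decimal
    # represents values like 100.10 exactly, so boundary comparisons
    # do not suffer floating-point rounding.
    return min_limit <= value <= max_limit

# 100.10 has no exact float representation, but compares exactly as Decimal:
print(in_range(Decimal("100.10"), Decimal("0.00"), Decimal("100.10")))  # True
```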
- Added DQX production best practices and fix datetime limit handling (#997). Practical guidance and best practices for using DQX in production have been added, covering aspects such as storing checks in Delta tables, enforcing access controls, and optimizing rules for performance and scalability. Fixes have also been implemented to address issues related to handling date and datetime limits, particularly when provided as strings.
- Added new row-level check functions: is_null, is_empty, and is_null_or_empty (#1015). DQX now includes three new check functions, `is_null`, `is_empty`, and `is_null_or_empty`, which verify that column values are null, empty strings, or either, complementing the existing `is_not_null`, `is_not_empty`, and `is_not_null_and_not_empty` checks. The functions also support optional arguments, such as `trim_strings` to trim spaces from strings before checking.
- Added tolerance to equality and non-equality check functions (#1011). The `is_equal_to`, `is_not_equal_to`, `is_aggr_equal`, and `is_aggr_not_equal` checks now support absolute and relative tolerance parameters for numeric value comparisons, allowing more flexible and precise control over data validation. The tolerance logic checks for absolute and relative differences within thresholds specified via the `abs_tolerance` and `rel_tolerance` parameters, providing more nuanced comparisons of numeric data.
- Allow new lines in sql expression checks (#1009). The SQL expression check function (`sql_expression`) has been updated to support new lines in its expression argument, allowing for more complex and formatted SQL expressions.
- Allow summary metrics with SparkConnect sessions (#1000). The library now supports writing summary metrics directly to a table with SparkConnect sessions, eliminating the need for a classic compute cluster in Dedicated access mode. This change lifts the previous restriction and enables generating summary metrics using Serverless and all standard clusters with Databricks Runtime 17.3 LTS or higher.
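The tolerance semantics can be sketched in plain Python (an illustration of the idea, not DQX's exact rule — whether the two thresholds combine with OR, and how nulls are treated, is an assumption here):

```python
def within_tolerance(actual, expected, abs_tolerance=0.0, rel_tolerance=0.0):
    # Treat values as equal when their difference falls inside either
    # the absolute or the relative threshold (assumed OR semantics).
    diff = abs(actual - expected)
    return diff <= abs_tolerance or diff <= rel_tolerance * abs(expected)

print(within_tolerance(100.5, 100.0, abs_tolerance=1.0))    # True  (diff 0.5 <= 1.0)
print(within_tolerance(101.0, 100.0, rel_tolerance=0.005))  # False (diff 1.0 > 0.5)
```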
- Fixed loading checks from a delta table with special characters (#992). Loading checks from a delta table has been fixed to handle special characters in the fully qualified table name.
- Fixed resolution of pii detection check function (#1003). The PII detection check function resolution has been enhanced to support the application of checks defined as metadata (YAML).
- Fixed serialization/deserialization of row filter parameter for dataset-level rules (#1021). The `filter` field in a check definition is now correctly pushed down as `row_filter` to the check function, allowing checks to operate on the relevant subset of rows before aggregation. The documentation has been updated to advise users to use the top-level `filter` condition for consistency instead of the `row_filter` parameter.
- Improved Lakeflow Declarative Pipeline tests (#1010). The Lakeflow Declarative Pipeline (LDP) tests now run in full Unity Catalog mode, enabling support for writing to arbitrary catalogs and schemas, and perform additional checks to prevent certain operations.
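As a hypothetical illustration of the recommended top-level `filter`, a check defined as metadata might look like the following Python dict (the YAML form is analogous; the column, filter expression, and argument names are invented for illustration and are not taken from the DQX docs):

```python
# Hypothetical dataset-level check with a top-level filter; the filter
# is pushed down so the aggregation sees only the matching rows.
check = {
    "criticality": "error",
    "filter": "region = 'EU'",  # preferred over the row_filter argument
    "check": {
        "function": "is_aggr_equal",
        # Argument names below are assumptions for illustration.
        "arguments": {"column": "amount", "aggr_type": "count", "limit": 1000},
    },
}
print(check["filter"])  # region = 'EU'
```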
- Updated Lakebase authentication method (#975). The Lakebase authentication method now uses a client ID instead of a username, simplifying its use in the context of a Databricks App. The `lakebase_user` parameter has been replaced with `lakebase_client_id`, an optional service principal client ID used to connect to Lakebase, defaulting to the caller's identity if not provided. This change enhances the security and reliability of the authentication process, making it easier to use Lakebase as a checks storage backend.
- Updated handling of metadata columns during schema validation (#1002). The `has_valid_schema` check now offers more flexibility through an optional `exclude_columns` parameter, allowing users to specify columns to ignore during validation. This parameter can be used to exclude metadata columns or other columns not relevant to schema validation, and it takes precedence over the `columns` list.
- Updated product info when missing in config while verifying workspace client (#987). The workspace client configuration now defaults product information to `dqx` with the current version when it is missing, ensuring that product information is always set for telemetry purposes.
- Updated profiler and generator documentation (#1026). The data profiling and quality checks generation documentation has been updated, providing reference information for data quality profile types and associated rules.
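A simplified sketch of what `exclude_columns` buys you during non-strict validation (schemas shown as plain name-to-type dicts; this is not the real `has_valid_schema` implementation):

```python
def validate_schema(actual, expected, exclude_columns=()):
    """Report missing columns and type mismatches, skipping excluded ones."""
    excluded = set(exclude_columns)
    problems = []
    for name, dtype in expected.items():
        if name in excluded:
            continue  # e.g. metadata columns irrelevant to validation
        if name not in actual:
            problems.append(f"missing column: {name}")
        elif actual[name] != dtype:
            problems.append(f"type mismatch for {name}: {actual[name]} != {dtype}")
    return problems

actual = {"id": "bigint", "amount": "decimal(10,2)", "_ingested_at": "timestamp"}
expected = {"id": "bigint", "amount": "decimal(10,2)", "_ingested_at": "date"}
print(validate_schema(actual, expected, exclude_columns=["_ingested_at"]))  # []
```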
- Added filter attribute in rules generated from ODCS (#978). The rules generation process has been enhanced with the introduction of a filter attribute in rules generated from Open Data Contract Standard (ODCS), allowing for more flexible and targeted rules creation.
Contributors
@mwojtyczka @ghanse @alexott @nehamilak-db @cornzyblack @laurencewells @renardeinside @tlgnr @pierre-monnet @sheeluvikas @ashwin-911 @dwanneruchi @bpm1993 @Jgprog117
Full Changelog: v0.12.0...v0.13.0
v0.12.0
What's Changed
- AI-Assisted rules generation from data profiles (#963). AI-assisted data quality rule generation was added, leveraging summary statistics from a profiler to create rules. The `DQGenerator` class includes a `generate_dq_rules_ai_assisted` method that can generate rules with or without user-provided input, using summary statistics to inform rule creation. This offers flexibility, allowing both automated and user-guided creation of data quality rules.
- Added new checks for JSON validation (#616). DQX now includes three new quality checks for JSON data validation, especially useful for validating data coming from streaming systems such as Kafka: `is_valid_json`, `has_json_keys`, and `has_valid_json_schema`. The `is_valid_json` check verifies whether values in a specified column are valid JSON strings, while the `has_json_keys` check confirms the presence of specific keys in the outermost JSON object, with an optional parameter to require all keys to be present. The `has_valid_json_schema` check ensures that JSON strings conform to an expected schema, ignoring extra fields not defined in the schema.
- Added geometry row-level checks (#636). The library has been enhanced with new row-level checks for geometry columns, covering area and number of points: `is_area_not_less_than`, `is_area_not_greater_than`, `is_area_equal_to`, `is_area_not_equal_to`, `is_num_points_not_less_than`, `is_num_points_not_greater_than`, `is_num_points_equal_to`, and `is_num_points_not_equal_to`. These checks allow users to validate geometric data against specific criteria, with options to specify the spatial reference system (SRID) and use geodesic area calculations. They are supported on Databricks serverless compute or runtime versions 17.1 and later.
- Added support to write using delta table path (#594). The quality check results saving functionality now supports saving to Unity Catalog Volume paths, S3, ADLS, or GCS in addition to tables, providing more flexibility in storing and managing results. The `save_results_in_table` method accepts output configurations with volume paths, and the `OutputConfig` object supports table names with a 2- or 3-level namespace, storage paths (including Volume paths, S3, ADLS, or GCS), and optional trigger settings for streaming output. The `save_dataframe_as_table` function takes an `output_config` object that determines whether to save the DataFrame to a table or a path, with support for batch and streaming writes, input validation, and error handling; the existing behavior of saving to Delta tables is preserved.
- Extended aggregation check function to support more aggregation types (#951). The aggregation check function has been significantly enhanced to support a wide range of aggregate functions, including 20 curated statistical and percentile-based functions, as well as any Databricks built-in aggregate function, with runtime validation to ensure compatibility and warnings for non-curated functions. The function now accepts an
`aggr_params` parameter to pass parameters to aggregate functions, such as percentile calculations, and supports two-stage aggregation for window-incompatible aggregates like `count_distinct`. Additionally, the function includes improved error handling, human-readable violation messages, and performance benchmarks for various aggregation scenarios, enabling advanced data quality monitoring and validation for data engineers and analysts.
- Added new is_not_in_list check function (#969). A new check function, `is_not_in_list`, verifies that values in a specified column are not present in a given list of forbidden values, allowing null values and optional case-insensitive comparison. It is not suitable for columns of type `MapType` or `StructType`, and for optimal performance with large lists of forbidden values it is recommended to use the `foreign_key` dataset-level check with the `negate` argument set to `True`. The function accepts the column to check, the list of forbidden values, and optionally the case sensitivity of the comparison; its implementation includes input validation and custom error messages, with benchmark tests to measure performance.
- Improve Generator to emit temporal checks for min/max date & datetime (#624). The data quality generator now supports temporal checks for columns with date and datetime types, in addition to numeric types. The generator creates rules with the `is_in_range`, `is_not_less_than`, and `is_not_greater_than` functions based on the provided minimum and maximum limits, verifying that both limit values are of the same type to ensure correct comparison. This update preserves the existing numeric behavior and introduces support for timestamp and date checks, while maintaining the ability to handle Python numeric types without stringification.
- Improved sql query check function to make merge columns parameter optional (#945). The `sql_query` check now supports both row-level and dataset-level validation, allowing for more flexible data validation scenarios. In row-level validation, the check joins query results back to the input data to mark specific rows, whereas in dataset-level validation the check result applies to all rows, making it suitable for aggregate validations with custom metrics. The `merge_columns` parameter is now optional; when not provided, the check performs dataset-level validation, providing a convenient way to validate entire datasets without requiring specific column mappings. The check is also more robust, with input validation and informative error messages that prevent incorrect usage at both the row and dataset levels.
- Outlier detection numerical values (#944). The `has_no_outliers` function has been introduced to detect outliers in numeric columns using the Median Absolute Deviation (MAD) method, which calculates the lower and upper limits as median - 3.5 * MAD and median + 3.5 * MAD, respectively, and considers values outside these limits as outliers. The function works with numeric columns of type int, float, long, and decimal, and raises an error if the specified column is not numeric.
- Library improvements (#966). The library has received updates to its functionality, performance, and documentation. The `has_json_keys` function now treats NULL values as valid, ensuring consistent behavior across ANSI and non-ANSI modes. Saving DataFrames as tables has also been improved, with updated regular expression patterns for table names and enhanced handling of streaming and non-streaming DataFrames.
- Updated `has_valid_schema` check to accept a reference dataframe or table (#960). The `has_valid_schema` check now supports validation against a reference dataframe or table, in addition to an explicit expected schema. Users can verify the schema of their input dataframe against a reference by specifying either the `ref_df_name` or `ref_table` parameter, with exactly one of `expected_schema`, `ref_df_name`, or `ref_table` required. The check can run in strict mode for exact schema matching or in non-strict mode, which permits extra columns, and users can restrict validation to particular columns via the `columns` parameter. The update includes improved parameter validation, ensuring that only one schema source is specified, and new test cases covering reference tables and dataframes as well as the parameter validation logic.
- Updated dashboards deployment to use standard lakeview dashboard definitions (#950). The dashboard installer has been updated to use standard Lakeview dashboard definitions.
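The row-level JSON checks above (from #616, with the NULL-as-valid behavior of #966) can be approximated in plain Python with the standard `json` module — a sketch of the semantics, not DQX's Spark implementation, and the `require_all` parameter name is an assumption:

```python
import json

def is_valid_json(value):
    # has_json_keys treats NULL as valid per #966; we assume the same
    # here for consistency.
    if value is None:
        return True
    try:
        json.loads(value)
        return True
    except (ValueError, TypeError):
        return False

def has_json_keys(value, keys, require_all=True):
    # Only the outermost JSON object is inspected.
    if value is None:
        return True
    try:
        obj = json.loads(value)
    except (ValueError, TypeError):
        return False
    if not isinstance(obj, dict):
        return False
    hits = [k in obj for k in keys]
    return all(hits) if require_all else any(hits)

print(is_valid_json('{"a": 1}'), is_valid_json("not json"))      # True False
print(has_json_keys('{"a": 1}', ["a", "b"], require_all=False))  # True
```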
- Added null island geometry check function (#613). A new quality check, `is_not_null_island`, verifies that values in a specified column are not NULL island geometries, such as POINT(0 0), POINTZ(0 0 0), or POINTZM(0 0 0 0). The `is_not_null_island` function requires Databricks serverless compute or runtime version 17.1 or higher.
- Added float support for range and compare functions (#962). The comparison and validation functions now support float values, in addition to existing support for integers, dates, timestamps, and strings. This allows for more flexible and nuanced comparisons and range checks, enabling precise and robust validation, particularly in scenarios involving decimal or fractional values. The...
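The MAD limits that `has_no_outliers` is documented to use (median ± 3.5 × MAD) are easy to sketch with the standard library — an illustration of the math only, not the distributed implementation:

```python
from statistics import median

def mad_limits(values, k=3.5):
    """Lower/upper outlier limits via the Median Absolute Deviation."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    return med - k * mad, med + k * mad

values = [10, 11, 9, 10, 12, 10, 95]
low, high = mad_limits(values)                      # (6.5, 13.5): median 10, MAD 1
print([v for v in values if not low <= v <= high])  # [95]
```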
v0.11.1
What's Changed
- Hotfix to update the log level for Spark Connect to suppress DLT telemetry warnings on non-DLT serverless clusters.
Contributors: @mwojtyczka
v0.11.0
- Generation of DQX rules from ODCS Data Contracts (#932). The Data Contract Quality Rules Generation feature has been introduced, enabling users to generate data quality rules directly from data contracts following the Open Data Contract Standard (ODCS). This feature supports three types of rule generation: predefined rules derived from schema properties and constraints, explicit DQX rules embedded in the contract, and text-based rules defined in natural language and processed by a Large Language Model (LLM) to generate appropriate checks. The feature provides rich metadata tracing generated rules back to the source contract for lineage and governance, and it can be used to implement federated data governance, standardize data contracts, and maintain version-controlled quality rules alongside schema definitions.
- AI-Assisted Primary Key Detection and Uniqueness Rules Generation (#934). Introduced AI-assisted primary key detection and uniqueness rules generation, leveraging Large Language Models (LLMs) to analyze table schemas and metadata, intelligently detect single or composite primary keys, and validate them by checking for duplicate values. The `DQProfiler` class now includes a `detect_primary_keys_with_llm` method, which returns a dictionary containing the primary key detection result: the table name, success status, detected primary key columns, confidence level, reasoning, and an error message if any. The `DQGenerator` class has been extended to utilize uniqueness profiles from the profiler for AI-assisted uniqueness rules generation. Configuration options have also been updated, including a new `llm_primary_key_detection` option that controls whether AI-assisted primary key detection is enabled.
- AI-Assisted Rules Generation Improvements (#925). The AI-assisted rules generation feature now handles input as a path in addition to a table, and can generate rules with a filter. The `generate_dq_rules_ai_assisted` method accepts an `InputConfig` object, which specifies the location and format of the input data, enabling more flexible input handling and filtering. The feature includes manual, unit, and integration tests, and the documentation has been updated to reflect the new functionality. Additionally, keywords are now capitalized to stabilize integration tests, and the `DQGenerator` class has been updated accordingly, allowing rules to be generated from a variety of input sources. The `InputConfig` class provides a flexible way to configure the input data, including its location and format, and the new `get_column_metadata` function retrieves column metadata from a given location.
- Added case-insensitive comparison support to is_in_list and is_not_null_and_is_in_list checks (#673). The `is_in_list` and `is_not_null_and_is_in_list` check functions now support case-insensitive comparison via an optional `case_sensitive` boolean flag that defaults to `True`. These checks verify that values in a specified column are present in a list of allowed values, with `is_not_null_and_is_in_list` also requiring the values to be non-null. For large lists of allowed values, or for columns of type `MapType` or `StructType`, the `foreign_key` dataset-level check is recommended instead, as these checks are not suitable for such scenarios.
- Added documentation for using DQX in streaming scenarios with foreach batch (#948). Documentation and example code snippets were added to demonstrate how to apply checks in a `foreachBatch` structured streaming function.
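The case-insensitive comparison added in #673 amounts to the following per-value logic (a pure-Python sketch; the null handling follows the release note, where only `is_not_null_and_is_in_list` rejects nulls):

```python
def is_in_list(value, allowed, case_sensitive=True):
    # Nulls are not flagged by this check; use is_not_null_and_is_in_list
    # to reject them as well.
    if value is None:
        return True
    if case_sensitive:
        return value in allowed
    return value.lower() in {a.lower() for a in allowed}

print(is_in_list("EUR", ["eur", "usd"], case_sensitive=False))  # True
print(is_in_list("EUR", ["eur", "usd"]))                        # False
```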
- Added telemetry to track count of input tables (#954). Added additional telemetry for better tracking of DQX usage to help improve the product.
- Added support for installing DQX from private PYPI repositories (#930). The DQX library now supports installation from a company-hosted PyPI mirror, which is necessary for enterprises that block the public PyPI index. Documentation describing the feature has been added. The tool installation code now also includes functionality for automatically uploading dependencies to a workspace when internet access is blocked.
- Support Custom Folder Installation for CLI Commands (#942). The command-line interface (CLI) now supports custom installation folders, providing users with greater flexibility when working with the library. A new `--install-folder` argument allows users to specify a custom installation folder when running various CLI commands, such as opening dashboards, workflows, logs, and profiles. This argument overrides the default installation location to support scenarios where DQX is installed in a custom location. The library's dependency on sqlalchemy has also been updated to require a version greater than or equal to 2.0 and less than 3.0, avoiding dependency issues on older DBRs.
- Enhancement to end to end tests (#921). The e2e tests have been enhanced to test integration with the dbt transformation framework. Additionally, the contributing and testing documentation has been updated to simplify the setup for running tests locally.
BREAKING CHANGES!
- Renamed the `level` parameter to `criticality` in the `generate_dq_rules` method of `DQGenerator` for consistency.
- Replaced the `table: str` parameter with `input_config: InputConfig` in the `profile_table` method of `DQProfiler` for greater flexibility.
- Replaced the `table_name: str` parameter with `input_config: InputConfig` in the `generate_dq_rules_ai_assisted` method of `DQGenerator` for greater flexibility.
Contributors: @dinbab1984, @mwojtyczka, @ghanse, @vb-dbrks, @jominjohny, @AdityaMandiwal
v0.10.0
- Added Data Quality Summary Metrics (#553). The data quality engine can now track and manage summary metrics for data quality validation, leveraging Spark's Observation feature. A new `DQMetricsObserver` class manages Spark observations and tracks summary metrics on datasets checked with the engine. The `DQEngine` class can optionally return the Spark observation associated with a given run, allowing users to access and save summary metrics. The engine also supports writing summary metrics to a table via the `metrics_config` parameter, and a new `save_summary_metrics` method saves data quality summary metrics to a table. Additionally, the detailed per-row quality results now include a unique `run_id` field, enabling cross-referencing with summary metrics, and the configuration file supports storage of summary metrics.
- LLM assisted rules generation (#577). This release introduces AI-assisted rules generation using large language models (LLMs). The `DQGenerator` class now includes a `generate_dq_rules_ai_assisted` method, which takes user input in natural language and optionally a schema from an input table to generate data quality rules, which are then validated for correctness. The feature supports both programmatic and no-code approaches, enables the use of different LLM models, and allows custom check functions. The release also includes updates to the documentation, configuration files, and testing framework to support the new feature.
- Added Lakebase checks storage backend (#550). A Lakebase checks storage backend was added, allowing users to store and manage their data quality rules in a centralized Lakebase table, in addition to the existing Delta table storage. The `checks_location` resolution now accommodates Lakebase, supporting both table and file storage with flexible formatting options, including "catalog.schema.table" and "database.schema.table". The backend is configurable through the `LakebaseChecksStorageConfig` class, which includes fields for instance name, user, location, port, run configuration name, and write mode. This gives users more flexibility in storing and loading quality checks, ensuring that checks are saved correctly regardless of the specified location format.
- Added runtime validation of sql expressions (#625). Data quality checks now perform runtime validation of SQL expressions, ensuring that specified fields can be resolved in the input DataFrame and that SQL expressions are valid before evaluation. If an SQL expression is invalid, the check evaluation is skipped and the results include a check failure with a descriptive message. Additionally, configuration validation for Unity Catalog volume file paths now enforces a specific format, preventing invalid configurations and providing more informative error messages.
- Fixed docs (#598). The documentation build process has undergone significant improvements to enhance efficiency and maintainability.
- Improved Config Serialization (#676). Several updates improve the functionality, consistency, and maintainability of the codebase. Configuration loading has been refactored to use the `ConfigSerializer` class, which handles serialization and deserialization of workspace and run configurations.
- Restore use of `hatch-fancy-pypi-readme` to fix images in PyPI (#601). The image source path for the logo in the README has been modified so the logo displays correctly when rendered, particularly on PyPI.
- Skip check evaluation if columns or filter cannot be resolved in the input DataFrame (#609). DQX now skips check evaluation if columns or filters are incorrect, allowing other checks to proceed even if one rule fails. The DQX engine validates the specified `column`, `columns`, and `filter` fields against the input DataFrame before applying checks, skipping evaluation and providing informative error messages if any fields are invalid.
- Updated user guide docs (#607). The documentation for quality checking and integration options has been updated to provide accurate and detailed information on supported types and approaches. Quality checking can be performed in-transit (pre-commit), validating data on the fly during processing, or at-rest, checking existing data stored in tables.
- Improved build process (#618). The hatch version has been updated to 1.15.0 to avoid compatibility issues with click version 8.3 and later, which introduced a bug affecting hatch. The project's dependencies have also been updated, including bumping `databricks-labs-pytester` from 0.7.2 to 0.7.4, and the code has been refactored to use a single Lakebase instance for all integration tests, with retry logic added to handle cases where the workspace quota for Lakebase instances is exceeded, improving test reliability. Documentation has also been updated to clarify how quality checks are applied to data using DQX.
BREAKING CHANGES!
- Added a new field, `run_id`, to the detailed per-row quality results. This may or may not be a breaking change for you, depending on how you consume the results today. The run ID is recorded in both the summary metrics and the detailed quality checking results to enable cross-referencing. When reusing the same `DQEngine` instance, the run ID stays the same across apply-checks executions; a new run ID is generated only when a new engine (and observer, if one is used) is created.
LIMITATIONS
- Saving metrics to a table requires using a classic compute cluster in Dedicated Access Mode. This limitation will be lifted once the observations issue is fixed in Spark Connect.
Contributors: @mwojtyczka, @ghanse, @souravg-db2, @vb-dbrks, @alexott, @tlgnr
v0.9.3
- Added support for running checks on multiple tables (#566). Added more flexibility and functionality in running data quality checks, allowing users to run checks on multiple tables in a single method call and as part of Workflows execution. Provided options to run checks for all configured run configs or for a specific run config, or for tables/views matching wildcard patterns. The CLI commands for running workflows have been updated to reflect and support these new functionalities. Additionally, new parameters have been added to configuration file to control the level of parallelism for these operations, such as
`profiler_max_parallelism` and `quality_checker_max_parallelism`. A new demo has been added to showcase how to use the profiler and apply checks across multiple tables. The changes aim to improve the scalability of DQX.
- Added New Row-level Checks: IPv6 Address Validation (#578). DQX now includes 2 new row-level checks: validation of IPv6 addresses (`is_valid_ipv6_address` check function) and validation that an IPv6 address is within a provided CIDR block (`is_ipv6_address_in_cidr` check function).
- Added New Dataset-level Check: Schema Validation check (#568). The `has_valid_schema` check function has been introduced to validate whether a DataFrame conforms to a specified schema, with results reported at the row level for consistency with other checks. This function can operate in non-strict mode, where it verifies the existence of expected columns with compatible types, or in strict mode, where it enforces an exact schema match, including column order and types. It accepts parameters such as the expected schema, which can be defined as a DDL string or a `StructType` object, and optional arguments to specify columns to validate and strict mode.
- Added New Row-level Checks: Spatial data validations (#581). Specialized data validation checks for geospatial data have been introduced, enabling verification of valid latitude and longitude values; various geometry and geography types such as points, linestrings, polygons, multipoints, multilinestrings, and multipolygons; as well as checks for Open Geospatial Consortium (OGC) validity, non-empty geometries, and specific dimensions or coordinate ranges. These checks are implemented as check functions, including `is_latitude`, `is_longitude`, `is_geometry`, `is_geography`, `is_point`, `is_linestring`, `is_polygon`, `is_multipoint`, `is_multilinestring`, `is_multipolygon`, `is_ogc_valid`, `is_non_empty_geometry`, `has_dimension`, `has_x_coordinate_between`, and `has_y_coordinate_between`. These geospatial checks enhance the overall data quality capabilities, allowing for more accurate and reliable geospatial data processing and analysis. Running these checks requires Databricks serverless or a cluster with Databricks Runtime 17.1 or above.
- Added absolute and relative tolerance to comparison of datasets (#574). The `compare_datasets` check has been enhanced with absolute and relative tolerance parameters, enabling more flexible comparisons of decimal values. These tolerances can be applied to numeric columns.
- Added detailed telemetry (#561). Telemetry has been enhanced across multiple functionalities to provide better visibility into DQX usage, including which features and checks are used most frequently. This will help us focus development efforts on the areas that matter most to our users.
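The absolute/relative tolerance semantics used by `compare_datasets` can be sketched in plain Python. This is a minimal illustration, not the DQX implementation; the parameter names `abs_tolerance` and `rel_tolerance` are assumptions:

```python
from decimal import Decimal

def within_tolerance(actual, expected,
                     abs_tolerance=Decimal("0"), rel_tolerance=Decimal("0")):
    """Return True when two numeric values are considered equal under the tolerances.

    A pair matches when the difference is within the absolute tolerance, OR
    within the relative tolerance scaled by the expected value.
    """
    diff = abs(actual - expected)
    if diff <= abs_tolerance:
        return True
    if expected != 0 and diff <= rel_tolerance * abs(expected):
        return True
    return False
```

For example, `within_tolerance(Decimal("100.05"), Decimal("100.00"), abs_tolerance=Decimal("0.1"))` passes, while the same pair with zero tolerances fails. Using `Decimal` avoids the floating-point precision issues that tolerances are meant to paper over.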
- Allow installation in a custom folder (#575). The installation process for the library has been enhanced to offer flexible installation options, allowing users to install the library in a custom workspace folder, in addition to the default user home directory or a global folder. When installing DQX as a workspace tool using the Databricks CLI, users are prompted to optionally specify a custom workspace path for the installation. Allowing a custom installation folder makes it possible to use DQX on group-assigned clusters.
- Profile subset dataframe (#589). The data profiling feature has been enhanced to allow users to profile and generate rules on a subset of the input data by introducing a filter option, which is a string SQL expression that can be used to filter the input data. This filter can be specified in the configuration file or when using the profiler, providing more flexibility in analyzing subsets of data. The profiler supports extensive configuration options to customize the profiling process, including sampling, limiting, and computing statistics on the sampled data. The new filter option enables users to generate more targeted and relevant rules, and it can be used to focus on particular segments of the data, such as rows that match certain conditions.
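The effect of the profiler's `filter` option can be illustrated with a minimal plain-Python sketch, where a predicate stands in for the SQL expression; this is not the profiler's implementation:

```python
# Rows standing in for the input data; the predicate below is analogous to
# passing filter="region = 'EU'" to the profiler.
rows = [
    {"region": "EU", "amount": 120.0},
    {"region": "US", "amount": 35.5},
    {"region": "EU", "amount": 87.0},
]

subset = [r for r in rows if r["region"] == "EU"]

# Summary statistics (and hence generated rules) only reflect the filtered subset.
stats = {
    "count": len(subset),
    "min": min(r["amount"] for r in subset),
    "max": max(r["amount"] for r in subset),
}
print(stats)  # {'count': 2, 'min': 87.0, 'max': 120.0}
```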
- Added custom exceptions (#582). The codebase now utilizes custom exceptions to handle various error scenarios, providing more specific and informative error messages compared to generic exceptions.
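The pattern can be sketched as below. The exception names and the parsing helper are hypothetical, chosen only to illustrate why library-specific exceptions beat generic ones:

```python
# Hypothetical exception hierarchy (the actual DQX exception names may differ).
class DQXError(Exception):
    """Base class for library-specific errors."""

class InvalidCheckError(DQXError):
    """Raised when a check definition is malformed."""

def parse_check(check: dict) -> dict:
    """Hypothetical helper: validate a metadata-style check definition."""
    if "check" not in check:
        # A specific, catchable error with context, instead of a bare ValueError.
        raise InvalidCheckError(f"missing 'check' field in: {check!r}")
    return check

try:
    parse_check({"criticality": "error"})
except InvalidCheckError as e:
    print(f"caught: {e}")
```

Callers can catch `DQXError` to handle any library failure, or a specific subclass to handle one scenario.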
BREAKING CHANGES!
- Workflows run by default for all run configs from the configuration file. Previously, the default behaviour was to run them for a specific run config only.
- The following deprecated methods have been removed from the `DQEngine`: `load_checks_from_local_file`, `load_checks_from_workspace_file`, `load_checks_from_table`, `load_checks_from_installation`, `save_checks_in_local_file`, `save_checks_in_workspace_file`, `save_checks_in_table`, `save_checks_in_installation`, `load_run_config`. For loading and saving checks, users are advised to use the `load_checks` and `save_checks` methods of the `DQEngine` described here, which support various storage types.
Contributors: @mwojtyczka, @ghanse, @tdikland, @Divya-Kovvuru-0802, @cornzyblack, @STEFANOVIVAS
v0.9.2
- Added performance benchmarks (#548). Performance tests are run to ensure performance does not degrade by more than 25% from any change. Benchmark results are published in the documentation in the reference section. The benchmark covers all check functions, running all functions at once, and applying the same functions at once to multiple columns using for-each column. A new performance GitHub workflow has been introduced to automate performance benchmarking, generating a new benchmark baseline, updating the existing baseline, and running performance tests to compare with the baseline.
- Declare readme in the project (#547). The project configuration has been updated to include the README file in the released package so that it is visible on PyPI.
- Fixed deserializing to DataFrame to assign columns properly (#559). The `deserialize_checks_to_dataframe` function has been enhanced to correctly handle columns for `sql_expression` by removing the unnecessary check for a `DQDatasetRule` instance and directly verifying whether `dq_rule_check.columns` is not `None`.
- Fixed lsql dependency (#564). The lsql dependency has been updated to address a sqlglot dependency issue that arises when imported in artifact repositories.
Contributors: @mwojtyczka @ghanse @cornzyblack @gchandra10
v0.9.1
- Added quality checker and end-to-end workflows (#519). This release introduces a no-code solution for applying checks. The following workflows were added: quality-checker (apply checks and save results to tables) and end-to-end (e2e) workflows (profile input data, generate quality checks, apply the checks, and save results to tables). The workflows enable quality checking for data at rest without the need for code-level integration. They support reference data for checks using tables (e.g., required by foreign key or compare datasets checks) as well as custom Python check functions (mapping a custom check function to the module path in the workspace or a Unity Catalog volume containing the function definition). The workflows handle one run config for each job run. A future release will introduce functionality to execute this across multiple tables. In addition, CLI commands have been added to execute the workflows. Additionally, DQX workflows are now configured to execute using serverless clusters, with an option to use standard clusters as well. `InstallationChecksStorageHandler` now supports absolute workspace path locations.
- Added built-in row-level check for PII detection (#486). Introduced a new built-in check for Personally Identifiable Information (PII) detection, which utilizes the Presidio framework and can be configured using various parameters, such as the NLP entity recognition configuration. This check can be defined using the `does_not_contain_pii` check function and can be customized to suit specific use cases. The check requires the `pii` extras to be installed: `pip install databricks-labs-dqx[pii]`. Furthermore, a new enum class `NLPEngineConfig` has been introduced to define various NLP engine configurations for PII detection. Overall, these updates aim to provide more robust and customizable quality checking capabilities for detecting PII data.
- Added equality row-level checks (#535). Two new row-level checks, `is_equal_to` and `is_not_equal_to`, have been introduced to enable equality checks on column values, allowing users to verify whether the values in a specified column are equal to or not equal to a given value, which can be a numeric literal, column expression, string literal, date literal, or timestamp literal.
- Added demo for Spark Structured Streaming (#518). Added a demo to showcase usage of DQX with Spark Structured Streaming for in-transit data quality checking. The demo is available as a Databricks notebook and can be run on any Databricks workspace.
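The row-level semantics of the equality checks (`is_equal_to` / `is_not_equal_to`) can be sketched as follows. This is an illustration only; DQX implements checks as PySpark column expressions, and the message wording here is made up:

```python
from typing import Any, Optional

def is_equal_to(value: Any, expected: Any) -> Optional[str]:
    """Sketch: return None when the row passes, or an error message when it fails."""
    if value != expected:
        return f"value {value!r} is not equal to expected {expected!r}"
    return None

def is_not_equal_to(value: Any, disallowed: Any) -> Optional[str]:
    """Sketch: the inverse check, failing when the value matches the disallowed one."""
    if value == disallowed:
        return f"value {value!r} is equal to disallowed {disallowed!r}"
    return None
```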
- Added clarification to profiler summary statistics (#523). A new section on understanding summary statistics has been added, which explains how these statistics are computed on a sampled subset of the data and provides a reference for the various summary statistics fields.
- Fixed rounding datetimes in the checks generator (#517). The generator has been enhanced to correctly handle midnight values when rounding "up", ensuring that datetime values already at midnight remain unchanged, whereas previously they were rounded to the next day.
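The corrected rounding behaviour can be sketched as below, with a simplified stand-in for the generator's logic:

```python
from datetime import datetime, timedelta

def round_up_to_midnight(dt: datetime) -> datetime:
    """Round a datetime up to the next midnight, leaving values already at midnight unchanged."""
    floored = dt.replace(hour=0, minute=0, second=0, microsecond=0)
    if dt == floored:
        return dt  # already midnight: do not bump to the next day (this was the bug)
    return floored + timedelta(days=1)
```

With the fix, `round_up_to_midnight(datetime(2025, 1, 1))` stays on January 1 instead of jumping to January 2.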
- Added API Docs (#520). The DQX API documentation is generated automatically using docstrings. As part of this change the library's documentation has been updated to follow Google style.
- Improved test automation by adding end-to-end test for the asset bundles demo (#533).
BREAKING CHANGES!
`ExtraParams` was moved from the `databricks.labs.dqx.rule` module to `databricks.labs.dqx.config`.
Contributors: @mwojtyczka @ghanse @renardeinside @cornzyblack @bsr-the-mngrm @dinbab1984 @AdityaMandiwal
v0.8.0
- Added new row-level freshness check (#495). A new data quality check function, `is_data_fresh`, has been introduced to identify stale data resulting from delayed pipelines, enabling early detection of upstream issues. This function assesses whether the values in a specified timestamp column are within a specified number of minutes from a base timestamp column. The function takes three parameters: the column to check, the maximum age in minutes before data is considered stale, and an optional base timestamp column, defaulting to the current timestamp if not provided.
- Added new dataset-level freshness check (#499). A new dataset-level check function, `is_data_fresh_per_time_window`, has been added to validate whether at least a specified minimum number of records arrives within every specified time window, ensuring data freshness. This function is customizable, allowing users to define the time window, minimum records per window, and lookback period.
- Improvements have been made to the performance of aggregation check functions, and the check message format has been updated for better readability.
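The row-level freshness semantics of `is_data_fresh` can be sketched in plain Python. This is a simplified illustration of the three parameters described above, not the DQX implementation:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

def is_data_fresh(ts: datetime, max_age_minutes: int,
                  base_ts: Optional[datetime] = None) -> bool:
    """True when ts is within max_age_minutes of the base timestamp.

    base_ts defaults to the current time, mirroring the check's optional
    base timestamp column.
    """
    base = base_ts if base_ts is not None else datetime.now(timezone.utc)
    return (base - ts) <= timedelta(minutes=max_age_minutes)
```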
- Created llm util function to get check functions details (#469). A new utility function has been introduced to provide definitions of all check functions, enabling the generation of prompts for Large Language Models (LLMs) to create check functions.
- Added equality safe row and column matching in compare datasets check (#473). The compare datasets check functionality has been enhanced to handle null values during row matching and column value comparisons, improving its robustness and flexibility. Two new optional parameters, `null_safe_row_matching` and `null_safe_column_value_matching`, have been introduced to control how null values are handled, both defaulting to `True`. These parameters allow for null-safe primary key matching and column value matching, ensuring accurate comparison results even when null values are present in the data. The check now excludes specific columns from value comparison using the `exclude_columns` parameter while still considering them for row matching.
- Fixed datetime rounding logic in profiler (#483). The datetime rounding logic in the profiler has been improved to respect the `round=False` option, which was previously ignored. The code now handles the `OverflowError` that occurs when rounding up the maximum datetime value by capping the result and logging a warning.
- Added loading and saving checks from file in Unity Catalog Volume (#512). This change introduces support for storing quality checks in a Unity Catalog volume, in addition to existing storage types such as tables, files, and workspace files. The storage location of quality checks has been unified into a single configuration field called `checks_location`, replacing the previous `checks_file` and `checks_table` fields, to simplify the configuration and remove ambiguity by ensuring only one storage location can be defined per run configuration. The `checks_location` field can point to a file in the local path, workspace, installation folder, or Unity Catalog volume, providing users with more flexibility and clarity when managing their quality checks.
- Refactored methods for loading and saving checks (#487). The `DQEngine` class has undergone significant changes to improve modularity and maintainability, including the unification of methods for loading and saving checks under the `load_checks` and `save_checks` methods, which take a `config` parameter to determine the storage type, such as `FileChecksStorageConfig`, `WorkspaceFileChecksStorageConfig`, `TableChecksStorageConfig`, or `InstallationChecksStorageConfig`.
- Storing checks using dqx classes (#474). The data quality engine has been enhanced with methods to convert quality checks between `DQRule` objects and Python dictionaries, allowing for flexibility in check definition and usage. The `serialize_checks` method converts a list of `DQRule` instances into a dictionary representation, while the `deserialize_checks` method performs the reverse operation, converting a dictionary representation back into a list of `DQRule` instances. Additionally, the `DQRule` class now includes a `to_dict` method to convert a `DQRule` instance into a structured dictionary, providing a standardized representation of the rule's metadata. These changes enable users to work with checks in both formats, store and retrieve checks easily, and improve the overall management and storage of data quality checks. The conversion process supports local execution and handles non-complex column expressions, although complex PySpark expressions or Python functions may not be fully reconstructable when converting from class to metadata format.
- Added llm utility function to extract checks examples in YAML from docs (#506). This is achieved through a new Python script that extracts YAML examples from MDX documentation files and creates a combined YAML file with all the extracted examples. The script utilizes regular expressions to extract YAML code blocks from MDX content, validates each YAML block, and combines all valid blocks into a single list. The combined YAML file is then created in the LLM resources directory for use in language model processing.
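The `DQRule`-to-dictionary round trip can be sketched with a simplified stand-in class. The field layout below (name, criticality, nested check with function and arguments) mirrors the documented metadata shape, but is an assumption rather than the exact DQX schema:

```python
from dataclasses import dataclass, field

@dataclass
class Rule:
    """Simplified stand-in for DQRule."""
    name: str
    criticality: str
    check_function: str
    arguments: dict = field(default_factory=dict)

    def to_dict(self) -> dict:
        return {
            "name": self.name,
            "criticality": self.criticality,
            "check": {"function": self.check_function, "arguments": self.arguments},
        }

def serialize_checks(rules: list) -> list:
    return [r.to_dict() for r in rules]

def deserialize_checks(dicts: list) -> list:
    return [
        Rule(d["name"], d["criticality"], d["check"]["function"], d["check"].get("arguments", {}))
        for d in dicts
    ]

rules = [Rule("age_not_null", "error", "is_not_null", {"column": "age"})]
assert deserialize_checks(serialize_checks(rules)) == rules  # lossless round trip for simple rules
```

As the release notes say, the round trip is lossless only for simple rules; complex PySpark expressions or Python functions may not survive the class-to-metadata conversion.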
BREAKING CHANGES!
- The `checks_file` and `checks_table` fields have been removed from the installation run configuration. They are now consolidated into the single `checks_location` field. This change simplifies the configuration and clearly defines where checks are stored.
- The `load_run_config` method has been moved to `config_loader.RunConfigLoader`, as it is not intended for direct use and falls outside the `DQEngine` core responsibilities.
DEPRECATION CHANGES!
If you are loading or saving checks from storage (file, workspace file, table, installation), you are affected. We are deprecating the methods below. We are keeping these methods in the `DQEngine` for now, but you should update your code, as they will be removed in future versions.
- Loading checks from storage has been unified under the `load_checks` method. The following methods have been deprecated in the `DQEngine`: `load_checks_from_local_file`, `load_checks_from_workspace_file`, `load_checks_from_installation`, `load_checks_from_table`.
- Saving checks in storage has been unified under the `save_checks` method. The following methods have been deprecated in the `DQEngine`: `save_checks_in_local_file`, `save_checks_in_workspace_file`, `save_checks_in_installation`, `save_checks_in_table`.
The `save_checks` and `load_checks` methods take a config as a parameter, which determines the storage type used. The following storage configs are currently supported:
- `FileChecksStorageConfig`: file in the local filesystem (YAML or JSON)
- `WorkspaceFileChecksStorageConfig`: file in the workspace (YAML or JSON)
- `TableChecksStorageConfig`: a table
- `InstallationChecksStorageConfig`: storage defined in the installation context, using either the `checks_table` or `checks_file` field from the run configuration
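The dispatch-on-config design can be sketched with `functools.singledispatch`. This illustrates the pattern only, not DQX's internals: the dataclasses are simplified stand-ins, the `location` field name is an assumption, and the loader bodies are placeholders:

```python
from dataclasses import dataclass
from functools import singledispatch

@dataclass
class FileChecksStorageConfig:
    location: str

@dataclass
class TableChecksStorageConfig:
    location: str

@singledispatch
def load_checks(config) -> list:
    raise TypeError(f"unsupported storage config: {type(config).__name__}")

@load_checks.register
def _(config: FileChecksStorageConfig) -> list:
    # Placeholder: a real loader would parse YAML/JSON from config.location.
    return [f"checks from file {config.location}"]

@load_checks.register
def _(config: TableChecksStorageConfig) -> list:
    # Placeholder: a real loader would read rows from the table.
    return [f"checks from table {config.location}"]
```

The caller picks the storage type simply by constructing the matching config object, e.g. `load_checks(FileChecksStorageConfig("checks.yml"))`.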
Contributors: @mwojtyczka, @karthik-ballullaya-db, @bsr-the-mngrm, @ajinkya441, @cornzyblack, @ghanse, @jominjohny, @dinbab1984
v0.7.1
- Added type validation for apply checks method (#465). The library now enforces stricter type validation for data quality rules, ensuring all elements in the checks list are instances of `DQRule`. If invalid types are encountered, a `TypeError` is raised with a descriptive error message, suggesting alternative methods for passing checks as dictionaries. Additionally, input attribute validation has been enhanced to verify the criticality value, which must be either `warn` or `error`, and raises a `ValueError` for invalid values.
- Databricks Asset Bundle (DAB) demo (#443). A new demo showcasing the usage of DQX with DAB has been added.
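The stricter type and criticality validation can be sketched as follows. `DQRule` here is a stand-in class and the message wording is illustrative, not the library's actual text:

```python
class DQRule:
    """Stand-in for the library's rule class."""

VALID_CRITICALITY = {"warn", "error"}

def validate_checks(checks: list) -> None:
    """Reject any element that is not a DQRule instance."""
    for check in checks:
        if not isinstance(check, DQRule):
            raise TypeError(
                f"expected DQRule, got {type(check).__name__}; "
                "use the metadata API to pass checks as dictionaries"
            )

def validate_criticality(criticality: str) -> None:
    """Criticality must be 'warn' or 'error'."""
    if criticality not in VALID_CRITICALITY:
        raise ValueError(f"criticality must be 'warn' or 'error', got {criticality!r}")
```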
- Check to compare datasets (#463). A new dataset-level check, `compare_datasets`, has been introduced to compare two DataFrames at both row and column levels, providing detailed information about differences, including new or missing rows and column-level changes. This check compares only columns present in both DataFrames, excludes map type columns, and can be customized to exclude specific columns or perform a FULL OUTER JOIN to identify missing records. The `compare_datasets` check can be used with a reference DataFrame or table name, and its results include information about missing and extra rows, as well as a map of changed columns and their differences.
- Demo on how to use DQX with dbt projects (#460). A new demo has been added to showcase how to use DQX with the dbt transformation framework.
- IPv4 address validation (#464). The library has been enhanced with new checks to validate IPv4 addresses. Two new row checks, `is_valid_ipv4_address` and `is_ipv4_address_in_cidr`, have been introduced to verify whether values in a specified column are valid IPv4 addresses and whether they fall within a given CIDR block, respectively.
- Improved loading checks from Delta table (#462). Loading checks from Delta tables has been improved to eliminate the need to escape string arguments, providing a more robust and user-friendly experience for working with quality checks defined in Delta tables.
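The semantics of the two IPv4 checks can be sketched with the standard `ipaddress` module. This is a plain-Python illustration of what the checks verify per row; DQX implements them as Spark column expressions:

```python
import ipaddress

def is_valid_ipv4_address(value: str) -> bool:
    """True when the value parses as an IPv4 address."""
    try:
        ipaddress.IPv4Address(value)
        return True
    except ValueError:
        return False

def is_ipv4_address_in_cidr(value: str, cidr: str) -> bool:
    """True when the value is a valid IPv4 address inside the given CIDR block."""
    try:
        return ipaddress.IPv4Address(value) in ipaddress.IPv4Network(cidr)
    except ValueError:
        return False
```

For example, `is_ipv4_address_in_cidr("10.0.0.5", "10.0.0.0/24")` passes, while an address outside the block, such as `"10.0.1.5"`, fails.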
Contributors: @mwojtyczka, @cornzyblack, @ghanse, @grusin-db