Skip to content

data-mie/dbt-profiler

Repository files navigation

CircleCI License dbt Hub

dbt-profiler

Add data profiling to your dbt project. Get null rates, distinct counts, min/max and more for any model or source.

Contents

Installation

dbt-profiler requires dbt >=1.1.0. Check dbt Hub for the latest installation instructions.

Supported adapters

Adapter Supported
AWS Athena
BigQuery
Databricks
PostgreSQL
Redshift
Snowflake
Oracle
SQL Server
Apache Spark
Presto

dbt-profiler may work with unsupported adapters but they haven't been tested. If you've used dbt-profiler with an unsupported adapter, feedback is very welcome: open an issue, a PR, or reach out in the #tools-dbt-profiler channel on dbt Slack.

Quick start

Profile a relation and print the result to stdout:

dbt run-operation print_profile --args '{"relation_name": "customers"}'

Use a relation profile as a dbt model:

-- models/customers_profile.sql
{{ dbt_profiler.get_profile(relation=ref("customers")) }}

Generate a schema.yml skeleton with profile data embedded in column meta properties:

dbt run-operation print_profile_schema --args '{"relation_name": "customers"}'

Profile measures

A calculated profile contains the following measures for each column:

Measure Description Columns
column_name Name of the column all
data_type Data type of the column all
not_null_proportion Proportion of non-NULL values (e.g. 0.62 = 62% populated) all
distinct_proportion Proportion of unique values all
distinct_count Count of unique values all
is_unique true if all values are unique all
min *† Minimum value numeric, date, time
max *† Maximum value numeric, date, time
avg **† Average value numeric
median **† Median value numeric
std_dev_population **† Population standard deviation numeric
std_dev_sample **† Sample standard deviation numeric
profiled_at Timestamp when the profile was calculated all

* numeric, date and time columns only ** numeric columns only † can be excluded using the exclude_measures argument

Macros

get_profile (source)

Returns a relation profile as a SQL query that can be used in a dbt model. Handy for previewing profiles in dbt Cloud.

Arguments

Argument Required Default Description
relation yes Relation object
exclude_measures no [] List of measures to exclude from the profile
include_columns no [] (all) Columns to include. Cannot be used together with exclude_columns.
exclude_columns no [] Columns to exclude. Cannot be used together with include_columns.
where_clause no SQL WHERE clause to filter records before profiling
group_by no [] SQL GROUP BY columns to aggregate data before profiling

Usage

Use with ref():

{{ dbt_profiler.get_profile(relation=ref("customers"), where_clause="is_active = true") }}

Use with source():

{{ dbt_profiler.get_profile(relation=source("jaffle_shop","customers"), exclude_measures=["std_dev_population", "std_dev_sample"]) }}

To run only in execute mode:

-- depends_on: {{ ref("customers") }}
{% if execute %}
    {{ dbt_profiler.get_profile(relation=ref("customers")) }}
{% endif %}

get_profile_table (source)

Returns a relation profile as an agate.Table. Does not print anything to stdout; intended to be called from another macro or model, not as a standalone operation.

Arguments

Argument Required Default Description
relation either relation or relation_name Relation object
relation_name either relation or relation_name Relation name
schema no target schema Schema where relation_name exists
database no target database Database where relation_name exists
exclude_measures no [] List of measures to exclude from the profile
include_columns no [] (all) Columns to include. Cannot be used together with exclude_columns.
exclude_columns no [] Columns to exclude. Cannot be used together with include_columns.
where_clause no SQL WHERE clause to filter records before profiling

Usage

{% set table = dbt_profiler.get_profile_table(relation_name="customers") %}

print_profile (source)

Does not work in dbt Cloud. The profile doesn't display in the cloud console log because the underlying print_table() method is disabled.

Prints a relation profile as a Markdown table to stdout.

Arguments

Argument Required Default Description
relation either relation or relation_name Relation object
relation_name either relation or relation_name Relation name
schema no target schema Schema where relation_name exists
database no target database Database where relation_name exists
exclude_measures no [] List of measures to exclude from the profile
include_columns no [] (all) Columns to include. Cannot be used together with exclude_columns.
exclude_columns no [] Columns to exclude. Cannot be used together with include_columns.
max_rows no none (not truncated) Maximum number of rows to display
max_columns no 7 Maximum number of columns to display
max_column_width no 30 Truncate all columns to at most this width
max_precision no none (not limited) Maximum precision for number types
where_clause no SQL WHERE clause to filter records before profiling

Usage

dbt run-operation print_profile --args '{"relation_name": "customers"}'

For dbt Cloud, an alternative that logs to the console (without Markdown formatting):

{% set profile = dbt_profiler.get_profile(relation=ref("customers")) %}
{% for row in profile.rows %}
  {% do log(row.values(), info=True) %}
{% endfor %}

Example output

column_name data_type not_null_proportion distinct_proportion distinct_count is_unique min max avg std_dev_population std_dev_sample profiled_at
customer_id int64 1.00 1.00 100 1 1 100 50.50 28.87 29.01 2022-01-13 10:14:48+00
first_order date 0.62 0.46 46 0 2018-01-01 2018-04-07 2022-01-13 10:14:48+00
most_recent_order date 0.62 0.52 52 0 2018-01-09 2018-04-09 2022-01-13 10:14:48+00
number_of_orders int64 0.62 0.04 4 0 1 5 1.60 0.77 0.78 2022-01-13 10:14:48+00
customer_lifetime_value float64 0.62 0.35 35 0 1 99 26.97 18.66 18.81 2022-01-13 10:14:48+00

print_profile_schema (source)

Prints a schema.yml to stdout with all columns and their profile data embedded as meta properties.

Arguments

Argument Required Default Description
relation either relation or relation_name Relation object
relation_name either relation or relation_name Relation name
schema no target schema Schema where relation_name exists
database no target database Database where relation_name exists
exclude_measures no [] List of measures to exclude from the profile
include_columns no [] (all) Columns to include. Cannot be used together with exclude_columns.
exclude_columns no [] Columns to exclude. Cannot be used together with include_columns.
model_description no "" Model description to include in the schema
column_description no "" Column description to include for each column
where_clause no SQL WHERE clause to filter records before profiling

Usage

dbt run-operation print_profile_schema --args '{"relation_name": "customers"}'

Example output

version: 2
models:
- name: customers
  description: ''
  columns:
  - name: customer_id
    description: ''
    meta:
      data_type: int64
      row_count: 100.0
      not_null_proportion: 1.0
      distinct_proportion: 1.0
      distinct_count: 100.0
      is_unique: 1.0
      min: '1'
      max: '100'
      avg: 50.5
      std_dev_population: 28.86607004772212
      std_dev_sample: 29.01149197588202
      profiled_at: '2022-01-13 10:08:18.446822+00'
  - name: first_order
    description: ''
    meta:
      data_type: date
      row_count: 100.0
      not_null_proportion: 0.62
      distinct_proportion: 0.46
      distinct_count: 46.0
      is_unique: 0.0
      min: '2018-01-01'
      max: '2018-04-07'
      avg: null
      std_dev_population: null
      std_dev_sample: null
      profiled_at: '2022-01-13 10:08:18.446822+00'
  # ... remaining columns

This is what the profile looks like in the dbt docs UI:

dbt docs example


print_profile_docs (source)

Does not work in dbt Cloud. The profile doesn't display in the cloud console log because the underlying print_table() method is disabled.

Prints a relation profile as a Markdown table wrapped in a Jinja docs block to stdout. Intended to be used as part of the dbt docs workflow.

Arguments

Argument Required Default Description
relation either relation or relation_name Relation object
relation_name either relation or relation_name Relation name
schema no target schema Schema where relation_name exists
database no target database Database where relation_name exists
exclude_measures no [] List of measures to exclude from the profile
include_columns no [] (all) Columns to include. Cannot be used together with exclude_columns.
exclude_columns no [] Columns to exclude. Cannot be used together with include_columns.
docs_name no dbt_profiler__{{ relation_name }} Name of the generated docs block
max_rows no none (not truncated) Maximum number of rows to display
max_columns no 7 Maximum number of columns to display
max_column_width no 30 Truncate all columns to at most this width
max_precision no none (not limited) Maximum precision for number types
where_clause no SQL WHERE clause to filter records before profiling

Usage

dbt run-operation print_profile_docs --args '{"relation_name": "customers"}'

Example output

{% docs dbt_profiler__customers %}
| column_name             | data_type | not_null_proportion | distinct_proportion | distinct_count | is_unique | min        | max        |
| ----------------------- | --------- | ------------------- | ------------------- | -------------- | --------- | ---------- | ---------- |
| customer_id             | int64     |                1.00 |                1.00 |            100 |         1 | 1          | 100        |
| first_order             | date      |                0.62 |                0.46 |             46 |         0 | 2018-01-01 | 2018-04-07 |
| most_recent_order       | date      |                0.62 |                0.52 |             52 |         0 | 2018-01-09 | 2018-04-09 |
| number_of_orders        | int64     |                0.62 |                0.04 |              4 |         0 | 1          | 5          |
| customer_lifetime_value | float64   |                0.62 |                0.35 |             35 |         0 | 1          | 99         |
{% enddocs %}

Using profiles in dbt docs

There are two ways to embed profiles in dbt docs: via meta properties (see print_profile_schema) or via doc blocks. The doc block approach is recommended because it keeps profile data out of schema.yml and lets profiles be updated independently.

Setup

1. Add a docs folder to dbt_project.yml:

model-paths: ["models", "docs"]

2. Run print_profile_docs and save the output to a file:

# docs/dbt_profiler/customers.md
dbt run-operation print_profile_docs --args '{"relation_name": "customers"}'

Note: store the output in a variable before redirecting to a file. Piping directly (e.g. dbt run-operation ... > customers.md) will empty the file before dbt compiles the project, causing an error if the doc block is already referenced. See the example update-relation-profile.sh script.

3. Reference the doc block in your model description:

version: 2

models:
  - name: customers
    description: |
      Represents a customer.

      `dbt-profiler` results:

      {{ doc("dbt_profiler__customers") }}
    columns:
      - name: customer_id
        tests:
          - not_null
          - unique

Automating profile updates with CI

The doc block approach makes it straightforward to keep profiles up to date via a scheduled CI job:

  1. List the models to profile (e.g. dbt list --output name -m ${node_selection})
  2. For each model, run print_profile_docs and write the output to docs/dbt_profiler/${relation_name}.md
  3. Open a pull request for the updated profiles (e.g. using the create-pull-request GitHub Action)

Contributing

Contributions are welcome! Please read CONTRIBUTING.md before opening a pull request. For significant changes, open an issue first to discuss the approach.

You can also reach the maintainers in the #tools-dbt-profiler channel on dbt Slack.

License

Apache License 2.0

About

Add data profiling to your dbt project.

Topics

Resources

License

Contributing

Stars

Watchers

Forks

Packages

No packages published

Contributors 21

Languages