dbt-profiler

Add data profiling to your dbt project. Get null rates, distinct counts, min/max and more for any model or source.

Installation

dbt-profiler requires dbt >=1.1.0. Check dbt Hub for the latest installation instructions.

Supported adapters

Adapter	Supported
AWS Athena	✅
BigQuery	✅
Databricks	✅
PostgreSQL	✅
Redshift	✅
Snowflake	✅
Oracle	✅
SQL Server	✅
Apache Spark	❌
Presto	❌

dbt-profiler may work with unsupported adapters but they haven't been tested. If you've used dbt-profiler with an unsupported adapter, feedback is very welcome: open an issue, a PR, or reach out in the #tools-dbt-profiler channel on dbt Slack.

Quick start

Profile a relation and print the result to stdout:

dbt run-operation print_profile --args '{"relation_name": "customers"}'

Use a relation profile as a dbt model:

-- models/customers_profile.sql
{{ dbt_profiler.get_profile(relation=ref("customers")) }}

Generate a schema.yml skeleton with profile data embedded in column meta properties:

dbt run-operation print_profile_schema --args '{"relation_name": "customers"}'

Profile measures

A calculated profile contains the following measures for each column:

Measure	Description	Columns
`column_name`	Name of the column	all
`data_type`	Data type of the column	all
`not_null_proportion` †	Proportion of non-`NULL` values (e.g. `0.62` = 62% populated)	all
`distinct_proportion` †	Proportion of unique values	all
`distinct_count` †	Count of unique values	all
`is_unique` †	`true` if all values are unique	all
`min` *†	Minimum value	numeric, date, time
`max` *†	Maximum value	numeric, date, time
`avg` **†	Average value	numeric
`median` **†	Median value	numeric
`std_dev_population` **†	Population standard deviation	numeric
`std_dev_sample` **†	Sample standard deviation	numeric
`profiled_at`	Timestamp when the profile was calculated	all

* numeric, date and time columns only ** numeric columns only † can be excluded using the exclude_measures argument

Macros

get_profile (source)

Returns a relation profile as a SQL query that can be used in a dbt model. Handy for previewing profiles in dbt Cloud.

Arguments

Argument	Required	Default	Description
`relation`	yes		Relation object
`exclude_measures`	no	`[]`	List of measures to exclude from the profile
`include_columns`	no	`[]` (all)	Columns to include. Cannot be used together with `exclude_columns`.
`exclude_columns`	no	`[]`	Columns to exclude. Cannot be used together with `include_columns`.
`where_clause`	no		SQL `WHERE` clause to filter records before profiling
`group_by`	no	`[]`	SQL `GROUP BY` columns to aggregate data before profiling

Usage

Use with ref():

{{ dbt_profiler.get_profile(relation=ref("customers"), where_clause="is_active = true") }}

Use with source():

{{ dbt_profiler.get_profile(relation=source("jaffle_shop","customers"), exclude_measures=["std_dev_population", "std_dev_sample"]) }}

To run only in execute mode:

-- depends_on: {{ ref("customers") }}
{% if execute %}
    {{ dbt_profiler.get_profile(relation=ref("customers")) }}
{% endif %}

get_profile_table (source)

Returns a relation profile as an agate.Table. Does not print anything to stdout; intended to be called from another macro or model, not as a standalone operation.

Arguments

Argument	Required	Default	Description
`relation`	either `relation` or `relation_name`		Relation object
`relation_name`	either `relation` or `relation_name`		Relation name
`schema`	no	target schema	Schema where `relation_name` exists
`database`	no	target database	Database where `relation_name` exists
`exclude_measures`	no	`[]`	List of measures to exclude from the profile
`include_columns`	no	`[]` (all)	Columns to include. Cannot be used together with `exclude_columns`.
`exclude_columns`	no	`[]`	Columns to exclude. Cannot be used together with `include_columns`.
`where_clause`	no		SQL `WHERE` clause to filter records before profiling

Usage

{% set table = dbt_profiler.get_profile_table(relation_name="customers") %}

print_profile (source)

❗ Does not work in dbt Cloud. The profile doesn't display in the cloud console log because the underlying print_table() method is disabled.

Prints a relation profile as a Markdown table to stdout.

Arguments

Argument	Required	Default	Description
`relation`	either `relation` or `relation_name`		Relation object
`relation_name`	either `relation` or `relation_name`		Relation name
`schema`	no	target schema	Schema where `relation_name` exists
`database`	no	target database	Database where `relation_name` exists
`exclude_measures`	no	`[]`	List of measures to exclude from the profile
`include_columns`	no	`[]` (all)	Columns to include. Cannot be used together with `exclude_columns`.
`exclude_columns`	no	`[]`	Columns to exclude. Cannot be used together with `include_columns`.
`max_rows`	no	none (not truncated)	Maximum number of rows to display
`max_columns`	no	`7`	Maximum number of columns to display
`max_column_width`	no	`30`	Truncate all columns to at most this width
`max_precision`	no	none (not limited)	Maximum precision for number types
`where_clause`	no		SQL `WHERE` clause to filter records before profiling

Usage

dbt run-operation print_profile --args '{"relation_name": "customers"}'

For dbt Cloud, an alternative that logs to the console (without Markdown formatting):

{% set profile = dbt_profiler.get_profile(relation=ref("customers")) %}
{% for row in profile.rows %}
  {% do log(row.values(), info=True) %}
{% endfor %}

Example output

column_name	data_type	not_null_proportion	distinct_proportion	distinct_count	is_unique	min	max	avg	std_dev_population	std_dev_sample	profiled_at
customer_id	int64	1.00	1.00	100	1	1	100	50.50	28.87	29.01	2022-01-13 10:14:48+00
first_order	date	0.62	0.46	46	0	2018-01-01	2018-04-07				2022-01-13 10:14:48+00
most_recent_order	date	0.62	0.52	52	0	2018-01-09	2018-04-09				2022-01-13 10:14:48+00
number_of_orders	int64	0.62	0.04	4	0	1	5	1.60	0.77	0.78	2022-01-13 10:14:48+00
customer_lifetime_value	float64	0.62	0.35	35	0	1	99	26.97	18.66	18.81	2022-01-13 10:14:48+00

print_profile_schema (source)

Prints a schema.yml to stdout with all columns and their profile data embedded as meta properties.

Arguments

Argument	Required	Default	Description
`relation`	either `relation` or `relation_name`		Relation object
`relation_name`	either `relation` or `relation_name`		Relation name
`schema`	no	target schema	Schema where `relation_name` exists
`database`	no	target database	Database where `relation_name` exists
`exclude_measures`	no	`[]`	List of measures to exclude from the profile
`include_columns`	no	`[]` (all)	Columns to include. Cannot be used together with `exclude_columns`.
`exclude_columns`	no	`[]`	Columns to exclude. Cannot be used together with `include_columns`.
`model_description`	no	`""`	Model description to include in the schema
`column_description`	no	`""`	Column description to include for each column
`where_clause`	no		SQL `WHERE` clause to filter records before profiling

Usage

dbt run-operation print_profile_schema --args '{"relation_name": "customers"}'

Example output

version: 2
models:
- name: customers
  description: ''
  columns:
  - name: customer_id
    description: ''
    meta:
      data_type: int64
      row_count: 100.0
      not_null_proportion: 1.0
      distinct_proportion: 1.0
      distinct_count: 100.0
      is_unique: 1.0
      min: '1'
      max: '100'
      avg: 50.5
      std_dev_population: 28.86607004772212
      std_dev_sample: 29.01149197588202
      profiled_at: '2022-01-13 10:08:18.446822+00'
  - name: first_order
    description: ''
    meta:
      data_type: date
      row_count: 100.0
      not_null_proportion: 0.62
      distinct_proportion: 0.46
      distinct_count: 46.0
      is_unique: 0.0
      min: '2018-01-01'
      max: '2018-04-07'
      avg: null
      std_dev_population: null
      std_dev_sample: null
      profiled_at: '2022-01-13 10:08:18.446822+00'
  # ... remaining columns

This is what the profile looks like in the dbt docs UI:

print_profile_docs (source)

❗ Does not work in dbt Cloud. The profile doesn't display in the cloud console log because the underlying print_table() method is disabled.

Prints a relation profile as a Markdown table wrapped in a Jinja docs block to stdout. Intended to be used as part of the dbt docs workflow.

Arguments

Argument	Required	Default	Description
`relation`	either `relation` or `relation_name`		Relation object
`relation_name`	either `relation` or `relation_name`		Relation name
`schema`	no	target schema	Schema where `relation_name` exists
`database`	no	target database	Database where `relation_name` exists
`exclude_measures`	no	`[]`	List of measures to exclude from the profile
`include_columns`	no	`[]` (all)	Columns to include. Cannot be used together with `exclude_columns`.
`exclude_columns`	no	`[]`	Columns to exclude. Cannot be used together with `include_columns`.
`docs_name`	no	`dbt_profiler__{{ relation_name }}`	Name of the generated `docs` block
`max_rows`	no	none (not truncated)	Maximum number of rows to display
`max_columns`	no	`7`	Maximum number of columns to display
`max_column_width`	no	`30`	Truncate all columns to at most this width
`max_precision`	no	none (not limited)	Maximum precision for number types
`where_clause`	no		SQL `WHERE` clause to filter records before profiling

Usage

dbt run-operation print_profile_docs --args '{"relation_name": "customers"}'

Example output

{% docs dbt_profiler__customers %}
| column_name             | data_type | not_null_proportion | distinct_proportion | distinct_count | is_unique | min        | max        |
| ----------------------- | --------- | ------------------- | ------------------- | -------------- | --------- | ---------- | ---------- |
| customer_id             | int64     |                1.00 |                1.00 |            100 |         1 | 1          | 100        |
| first_order             | date      |                0.62 |                0.46 |             46 |         0 | 2018-01-01 | 2018-04-07 |
| most_recent_order       | date      |                0.62 |                0.52 |             52 |         0 | 2018-01-09 | 2018-04-09 |
| number_of_orders        | int64     |                0.62 |                0.04 |              4 |         0 | 1          | 5          |
| customer_lifetime_value | float64   |                0.62 |                0.35 |             35 |         0 | 1          | 99         |
{% enddocs %}

Using profiles in dbt docs

There are two ways to embed profiles in dbt docs: via meta properties (see print_profile_schema) or via doc blocks. The doc block approach is recommended because it keeps profile data out of schema.yml and lets profiles be updated independently.

Setup

1. Add a docs folder to dbt_project.yml:

model-paths: ["models", "docs"]

2. Run print_profile_docs and save the output to a file:

# docs/dbt_profiler/customers.md
dbt run-operation print_profile_docs --args '{"relation_name": "customers"}'

Note: store the output in a variable before redirecting to a file. Piping directly (e.g. dbt run-operation ... > customers.md) will empty the file before dbt compiles the project, causing an error if the doc block is already referenced. See the example update-relation-profile.sh script.

3. Reference the doc block in your model description:

version: 2

models:
  - name: customers
    description: |
      Represents a customer.

      `dbt-profiler` results:

      {{ doc("dbt_profiler__customers") }}
    columns:
      - name: customer_id
        tests:
          - not_null
          - unique

Automating profile updates with CI

The doc block approach makes it straightforward to keep profiles up to date via a scheduled CI job:

List the models to profile (e.g. dbt list --output name -m ${node_selection})
For each model, run print_profile_docs and write the output to docs/dbt_profiler/${relation_name}.md
Open a pull request for the updated profiles (e.g. using the create-pull-request GitHub Action)

Contributing

Contributions are welcome! Please read CONTRIBUTING.md before opening a pull request. For significant changes, open an issue first to discuss the approach.

You can also reach the maintainers in the #tools-dbt-profiler channel on dbt Slack.

License

Apache License 2.0

Name		Name	Last commit message	Last commit date
Latest commit History 83 Commits
.circleci		.circleci
.github		.github
integration_tests		integration_tests
macros		macros
.gitignore		.gitignore
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
dbt_project.yml		dbt_project.yml
update-relation-profile.sh		update-relation-profile.sh

License

data-mie/dbt-profiler

Folders and files

Latest commit

History

Repository files navigation

dbt-profiler

Contents

Installation

Supported adapters

Quick start

Profile measures

Macros

get_profile (source)

Arguments

Usage

get_profile_table (source)

Arguments

Usage

print_profile (source)

Arguments

Usage

Example output

print_profile_schema (source)

Arguments

Usage

Example output

print_profile_docs (source)

Arguments

Usage

Example output

Using profiles in dbt docs

Setup

Automating profile updates with CI

Contributing

License

About

Topics

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases 28

Packages 0

Uh oh!

Contributors 21

Uh oh!

Languages

Packages