Skip to content

Discussion Request: Lookup-Data - Repo and Warehouse Storage #34

@brendagutman

Description

@brendagutman

Is your feature request related to a problem? Please describe

Scope: Ensure a logical pipeline and warehouse structure for current and future pipeline data needs, for use in the short term.

Document any results of this discussion into their own GH Issue or ADR, as necessary.

Primary Questions:

  1. When to use dbt-seeds vs source dbt-references
  2. Define the types of data that we may want to represent within the dbt-pipeline.
  3. How will the data organized within the repo? File structure confirmation for each type.
  4. How will the data be stored in the warehouse? Schema confirmation for each type.

Scope Creep:

  1. Seed data-file conventions
  2. Placement specific data files into a Type. Examples are added to the proposal for clarity, but may not be accurate.
  3. Non-Tabular data - This data can be stored in various yaml files and be available in the dbt-docs as Markdown. i.e. Release notes, harmonization notes, etc

Describe the solution you'd like

Proposal

Possible types of lookup data:

  • Type1: Study-specific lookup data - Not for use in dbt-docs.
    -- Data used for harmonization and not necessary for display within the dbt-docs.
    -- i.e. Annotations
  • Type2: Study-specific lookup data - For use in dbt-docs.
    -- i.e. Site information, value-sets
  • **Type3: ** Study-specific lookup data - For use in dbt-docs ONLY.
    -- i.e. Dataset metadata. Harmonization metadata. Links to current documentation or sources, skimmed down version of value-set data, etc.
  • Type4: Cross-project / Pipeline-specific data - Not for use in dbt-docs.
    -- Site information, study identifiers, AccessDM vocabulary tables
  • Type5: Cross-project / Pipeline-specific data - For use in dbt-docs ONLY.
    -- Site information, study identifiers

When to use source dbt-references.

  • This is the process used to bring Raw Study Data into the pipeline, as a dbt-reference, using a sources.yml file.
  • Use dbt-references for Type1, and Type4 data,
    -- Storage - Type1: All study-specific data that is referenced via a sources.yml should be imported into the studies raw schema. i.e. inc_brainpower_raw
    -- Storage - Type4: Cross-project data can be stored together.
    --- Repo: models/cross_project/filename.sql
    --- Schema: cross_project_raw

When to use dbt-seeds

  • Use seeds when a lookup is important to outside users who want to see the csv reference in GH, dbt-docs, GH-Pages. Also, when version control is necessary, and appropriate. Requirement: Data size must fall within the dbt-seed size limitations.
  • Storage and schema naming.
    -- seeds/inc/inc_filename_1 > {environment}_seeds.inc_filename_1 > Type2
    -- seeds/kf/kf_filename_1 > {environment}_seeds.kf_filename_1 > Type2
    -- seeds/docs/filename_1 > {environment}_seeds.docs_filename_1 > Type3
    -- seeds/filename_1 > {environment}_seeds.filename_1 > Type5

Describe alternatives you've considered

No response

Additional Context

No response

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions