Is your feature request related to a problem? Please describe
Scope: Ensure a logical pipeline and warehouse structure for current and future pipeline data needs, for use in the short term.
Document any results of this discussion into their own GH Issue or ADR, as necessary.
Primary Questions:
- When to use
dbt-seeds vs source dbt-references
- Define the types of data that we may want to represent within the dbt-pipeline.
- How will the data organized within the repo? File structure confirmation for each type.
- How will the data be stored in the warehouse? Schema confirmation for each type.
Scope Creep:
- Seed data-file conventions
- Placement specific data files into a Type. Examples are added to the proposal for clarity, but may not be accurate.
- Non-Tabular data - This data can be stored in various yaml files and be available in the dbt-docs as
Markdown. i.e. Release notes, harmonization notes, etc
Describe the solution you'd like
Proposal
Possible types of lookup data:
- Type1: Study-specific lookup data - Not for use in
dbt-docs.
-- Data used for harmonization and not necessary for display within the dbt-docs.
-- i.e. Annotations
- Type2: Study-specific lookup data - For use in
dbt-docs.
-- i.e. Site information, value-sets
- **Type3: ** Study-specific lookup data - For use in
dbt-docs ONLY.
-- i.e. Dataset metadata. Harmonization metadata. Links to current documentation or sources, skimmed down version of value-set data, etc.
- Type4: Cross-project / Pipeline-specific data - Not for use in
dbt-docs.
-- Site information, study identifiers, AccessDM vocabulary tables
- Type5: Cross-project / Pipeline-specific data - For use in
dbt-docs ONLY.
-- Site information, study identifiers
When to use source dbt-references.
- This is the process used to bring
Raw Study Data into the pipeline, as a dbt-reference, using a sources.yml file.
- Use dbt-references for Type1, and Type4 data,
-- Storage - Type1: All study-specific data that is referenced via a sources.yml should be imported into the studies raw schema. i.e. inc_brainpower_raw
-- Storage - Type4: Cross-project data can be stored together.
--- Repo: models/cross_project/filename.sql
--- Schema: cross_project_raw
When to use dbt-seeds
- Use seeds when a lookup is important to outside users who want to see the csv reference in GH, dbt-docs, GH-Pages. Also, when version control is necessary, and appropriate. Requirement: Data size must fall within the dbt-seed size limitations.
- Storage and schema naming.
-- seeds/inc/inc_filename_1 > {environment}_seeds.inc_filename_1 > Type2
-- seeds/kf/kf_filename_1 > {environment}_seeds.kf_filename_1 > Type2
-- seeds/docs/filename_1 > {environment}_seeds.docs_filename_1 > Type3
-- seeds/filename_1 > {environment}_seeds.filename_1 > Type5
Describe alternatives you've considered
No response
Additional Context
No response
Is your feature request related to a problem? Please describe
Scope: Ensure a logical pipeline and warehouse structure for current and future pipeline data needs, for use in the short term.
Document any results of this discussion into their own GH Issue or ADR, as necessary.
Primary Questions:
dbt-seedsvssource dbt-referencesScope Creep:
Markdown. i.e. Release notes, harmonization notes, etcDescribe the solution you'd like
Proposal
Possible types of lookup data:
dbt-docs.-- Data used for harmonization and not necessary for display within the
dbt-docs.-- i.e. Annotations
dbt-docs.-- i.e. Site information, value-sets
dbt-docsONLY.-- i.e. Dataset metadata. Harmonization metadata. Links to current documentation or sources, skimmed down version of value-set data, etc.
dbt-docs.-- Site information, study identifiers, AccessDM vocabulary tables
dbt-docsONLY.-- Site information, study identifiers
When to use source
dbt-references.Raw Study Datainto the pipeline, as a dbt-reference, using a sources.yml file.-- Storage - Type1: All study-specific data that is referenced via a
sources.ymlshould be imported into the studies raw schema. i.e. inc_brainpower_raw-- Storage - Type4: Cross-project data can be stored together.
--- Repo: models/cross_project/filename.sql
--- Schema: cross_project_raw
When to use
dbt-seeds-- seeds/inc/inc_filename_1 > {environment}_seeds.inc_filename_1 > Type2
-- seeds/kf/kf_filename_1 > {environment}_seeds.kf_filename_1 > Type2
-- seeds/docs/filename_1 > {environment}_seeds.docs_filename_1 > Type3
-- seeds/filename_1 > {environment}_seeds.filename_1 > Type5
Describe alternatives you've considered
No response
Additional Context
No response