Olga Paoletti edited this page Jul 19, 2024 · 21 revisions

CreateConceptSetDatasets

Context

A concept set is a set of medical concepts (e.g. the concept set "Diabetes" may contain the two concepts "type 2 diabetes" and "type 1 diabetes"). Each concept can be projected to codes in multiple coding systems (for instance, "ICD10" or "ATC"). Each concept set is associated with a data domain in the medical field (e.g. "diagnosis" or "medication") (Avillach et al, JAMIA 2013). CreateConceptSetDatasets is useful when records corresponding to a collection of concept sets must be retrieved from one or more data sources composed of data tables, each associated with one or more data domains, and each containing at least one column of medical codes. This situation occurs regularly in multi-database studies. Retrieving records corresponding to concept sets is the first step in creating study variables from the data sources (named T2 in Gini et al, eGEMS 2016), and is easier when the data sources have been converted to the same common data model (Gini et al, CPT 2020). However, the function can also be used to support multiple common data models, see below.

Purpose

The function CreateConceptSetDatasets inspects a set of input tables of data and creates a group of datasets, each corresponding to a concept set. Each dataset contains the records of the input tables that match the corresponding concept set, and is named after it. The data model of the input tables, the concept sets, their domains and the associated codes are passed as parameters of the function, in the format of multi-level lists.

Structure of input data

The input is a set of data tables that contain healthcare data, each pertaining to one or more data domains. Each table must have at least one column containing codes in a coding system, such as ICD9 or ATC.

Parameters of the function

  • parameters capturing the data model of the input tables

    • dataset a 2-level list containing, for each domain, the names of the corresponding tables of data
    • codvar a 2-level list containing, for each table of data, the name of the column containing the codes of interest. The character "*" inside a code is taken as a wildcard.
    • datevar (optional): a 2-level list containing, for each input table of data, the name(s) of the column(s) containing dates (only if extension = "csv"), to be saved as dates in the output
    • EAVtables (optional): a 2-level list specifying, for each domain, the tables in an Entity-Attribute-Value structure; each table is listed with the names of two columns: the one containing attributes and the one containing values
    • EAVattributes (optional): a 3-level list specifying, for each domain and table in an Entity-Attribute-Value structure, the attributes whose values should be browsed to retrieve codes belonging to that domain; each attribute is listed along with its coding system
    • dateformat (optional): a string containing the format of the dates in the input tables of data (only if -datevar- is indicated); the string must be one of the following: YYMMDD, yymmdd, DDMMYY, YYYYMMDD
    • vocabulary (optional) a 3-level list containing, for each table of data and data domain, the name of the column containing the vocabulary of the column(s) -codvar-
    • extension the extension of the input tables of data (csv and dta are supported)
  • parameters capturing the concept sets

    • concept_set_domains a 2-level list containing, for each concept set, the corresponding domain
    • concept_set_codes a 3-level list containing, for each concept set and each coding system, the list of codes to be used as inclusion criteria for records: a record is included if its code starts with at least one string in this list; the match ignores dots, unless the option vocabularies_with_dot_wildcard is specified (see below)
    • concept_set_codes_excl (optional) a 3-level list containing, for each concept set and each coding system, the list of codes to be used as exclusion criteria for records: a record is excluded if its code starts with at least one string in this list; the match ignores dots
    • concept_set_names (optional) a vector containing the names of the concept sets to be processed; if this is missing, all the concept sets included in the previous lists are processed
    • vocabularies_with_dot_wildcard (optional) a list containing the coding systems where the dot '.' must be considered a wildcard (by default, dots are ignored)
    • vocabularies_with_keep_dot (optional) a list containing the coding systems where the dot '.' must be considered as itself
    • vocabularies_with_exact_search (optional) a list containing the vocabularies in which the codes must match exactly
  • parameters indicating how the output must be formatted

    • rename_col (optional) this is a list of 3-level lists; each 3-level list contains a column name for each input table of data (associated to a data domain) to be renamed in the output (for instance: the personal identifier, or the date); in the output all the columns will be renamed with the name of the list.
    • addtabcol a logical parameter, by default set to TRUE: if so, the columns "Table_cdm" and "Col" are added to the output, indicating respectively from which original table and column the code is taken.
  • other parameters

    • verbose a logical parameter, by default set to FALSE. If it is TRUE additional intermediate output datasets will be shown in the R environment
    • discard_from_environment (optional) a logical parameter, by default set to FALSE. If it is TRUE, the output datasets are removed from the global environment
    • filter_expression (optional) a 2-level list containing a logical condition on the columns specified in -rename_col-. The condition is used to filter the input datasets before the concept sets are retrieved
    • dirinput (optional) the directory where the input tables of data are stored. If not provided the working directory is considered.
    • diroutput (optional) the directory where the output concept sets datasets will be saved. If not provided the working directory is considered.
    • output_extension (optional) the extension of the output tables of data (RData is the default; rds and qs are also supported). If the chosen extension is qs, the final datasets are compressed and computation time decreases
    • add_conceptset_name (optional) adds a column with the name of the concept set
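
The matching options described above (dots ignored by default, vocabularies_with_dot_wildcard, vocabularies_with_keep_dot, vocabularies_with_exact_search) can be illustrated with a minimal base R sketch. The helper `match_codes` and its `mode` argument are hypothetical names introduced here for illustration; this is not the internal code of CreateConceptSetDatasets, and it assumes codes made only of letters, digits and dots.

```r
# Hypothetical helper for illustration; not part of CreateConceptSetDatasets.
# Returns TRUE for each record code matching at least one concept set code.
# Assumes codes contain only letters, digits and dots.
match_codes <- function(record_codes, set_codes,
                        mode = c("ignore_dots", "dot_wildcard",
                                 "keep_dot", "exact")) {
  mode <- match.arg(mode)
  if (length(set_codes) == 0) {
    return(rep(FALSE, length(record_codes)))
  }
  if (mode == "exact") {
    # The whole code must be equal to one of the concept set codes
    return(record_codes %in% set_codes)
  }
  if (mode == "dot_wildcard") {
    # "." is left as a regular expression metacharacter (any single character)
    pattern <- paste(paste0("^", set_codes), collapse = "|")
    return(grepl(pattern, record_codes))
  }
  if (mode == "ignore_dots") {
    # Dots are dropped on both sides before comparing
    record_codes <- gsub(".", "", record_codes, fixed = TRUE)
    set_codes <- gsub(".", "", set_codes, fixed = TRUE)
  }
  # "ignore_dots" and "keep_dot": literal "starts with" comparison
  vapply(record_codes, function(code) {
    any(substring(code, 1, nchar(set_codes)) == set_codes)
  }, logical(1), USE.NAMES = FALSE)
}

codes <- c("242", "242.0", "153.4", "F27.b0", "F274b")
match_codes(codes, "242", "ignore_dots")     # TRUE TRUE FALSE FALSE FALSE
match_codes(codes, "F27.b.", "dot_wildcard") # FALSE FALSE FALSE TRUE FALSE
match_codes(codes, "242.0", "keep_dot")      # FALSE TRUE FALSE FALSE FALSE
match_codes(codes, "242", "exact")           # TRUE FALSE FALSE FALSE FALSE
```

In this sketch a code such as "F274b" is not matched by "F27.b." because the pattern requires a sixth character; whether the function behaves identically in every corner case is not stated on this page.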

Structure of output

One dataset per concept set. The dataset of a concept set is the union of the selections of the input data tables which match the codes in the concept set. The data model of the output is the union of the data models of the input, except for the name of the columns listed in the option codvar and (optionally) in the option rename_col.
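
As a rough illustration of what "union of the data models" means, the sketch below stacks the selections from two tables with different columns, filling the columns a table lacks with NA. The helper `union_bind`, the second table and its column names are hypothetical, and the renaming of the columns listed in codvar and rename_col is deliberately left out.

```r
# Hypothetical helper for illustration: stack data frames that have
# different columns, filling the missing columns with NA.
union_bind <- function(tables) {
  all_cols <- unique(unlist(lapply(tables, names)))
  aligned <- lapply(tables, function(tab) {
    for (col in setdiff(all_cols, names(tab))) {
      tab[[col]] <- rep(NA, nrow(tab))
    }
    tab[, all_cols, drop = FALSE]
  })
  do.call(rbind, aligned)
}

# Selection from a table shaped like EVENTS in Example 1
selection_events <- data.frame(person_id = "PERSON0001",
                               event_code = "242",
                               stringsAsFactors = FALSE)
# Selection from a second, hypothetical table with other columns
selection_other <- data.frame(person_id = "PERSON0002",
                              procedure_code = "242.0",
                              procedure_date = "20100310",
                              stringsAsFactors = FALSE)

concept_set_A <- union_bind(list(selection_events, selection_other))
names(concept_set_A)
```

The result has one row per selected record and one column for every column that appears in at least one input selection.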

Example 1

A single input table, named EVENTS.csv, contains diagnostic codes in ICD9 and READ, associated with persons:

person_id,start_date_record,event_code,event_record_vocabulary
PERSON0001,20061020,242,ICD9
PERSON0002,20100310,242.0,ICD9
PERSON0003,20130620,153,ICD9
PERSON0004,20130620,153.4,ICD9
PERSON0005,20130731,153.2,ICD9
PERSON0006,20170509,F27sb,READ
PERSON0007,20050817,F27.b0,READ
PERSON0008,20170417,F274b,READ
PERSON0009,20050717,F245.001,READ
PERSON0010,20050510,F2r4,READ
PERSON0010,20050510,F2r4.r5,READ
PERSON0010,20050510,F24..,READ

Two concept sets A and B are specified, both in the domain 'Diagnosis':

concept_sets_of_our_study <- c("A","B")

concept_set_domains <- vector(mode = "list")
concept_set_domains[["A"]] <- "Diagnosis"
concept_set_domains[["B"]] <- "Diagnosis"

concept_set_codes_our_study<- vector(mode = "list")
concept_set_codes_our_study[["A"]][["ICD9"]] <- c("242")
concept_set_codes_our_study[["A"]][["READ"]] <- c("F27sb","F27.b.")

concept_set_codes_our_study[["B"]][["ICD9"]] <- c("1534")
concept_set_codes_our_study[["B"]][["READ"]] <- c("F2.4")

The data model of the input table is stored in the appropriate lists:

INPUT_tables <- vector(mode = "list")
INPUT_tables[["Diagnosis"]] <- c("EVENTS")

INPUT_codvar <- vector(mode = "list")
INPUT_codvar[["Diagnosis"]][["EVENTS"]] <- "event_code"

INPUT_coding_system_cols <- vector(mode = "list")
INPUT_coding_system_cols[["Diagnosis"]][["EVENTS"]] <- "event_record_vocabulary"
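
Before invoking the function, it may be worth checking that the lists agree with each other. The following is a minimal sketch of such a check; `check_parameters` is a hypothetical helper written for this page, not something provided with CreateConceptSetDatasets, and it only covers the domain, table and code column lists.

```r
# Values mirroring Example 1
concept_set_domains <- list(A = "Diagnosis", B = "Diagnosis")
INPUT_tables <- list(Diagnosis = c("EVENTS"))
INPUT_codvar <- list(Diagnosis = list(EVENTS = "event_code"))

# Hypothetical helper: returns a character vector of problems (empty if none)
check_parameters <- function(concept_set_domains, dataset, codvar) {
  problems <- character(0)
  # Every concept set must point to a domain that has input tables
  for (concept in names(concept_set_domains)) {
    dom <- concept_set_domains[[concept]]
    if (!dom %in% names(dataset)) {
      problems <- c(problems,
                    sprintf("concept set '%s': domain '%s' has no tables",
                            concept, dom))
    }
  }
  # Every table of every domain must declare its code column
  for (dom in names(dataset)) {
    for (tab in dataset[[dom]]) {
      if (!tab %in% names(codvar[[dom]])) {
        problems <- c(problems,
                      sprintf("table '%s' in domain '%s' has no code column",
                              tab, dom))
      }
    }
  }
  problems
}

check_parameters(concept_set_domains, INPUT_tables, INPUT_codvar)
```

An empty result means that each concept set has a domain with at least one table, and that each table has a code column declared.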

The function is invoked:

CreateConceptSetDatasets(dataset = INPUT_tables,
                         codvar = INPUT_codvar,
                         vocabulary = INPUT_coding_system_cols,
                         concept_set_domains = concept_set_domains,
                         concept_set_codes = concept_set_codes_our_study,
                         concept_set_names = concept_sets_of_our_study,
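                         dirinput = dirmyinput,   # assumed to be defined earlier: folder containing EVENTS.csv
                         diroutput = dirmyoutput, # assumed to be defined earlier: folder for the output datasets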
                         extension = c("csv"),
                         vocabularies_with_dot_wildcard = c("READ")
)

The output is two datasets:

dataset A...

person_id,start_date_record,event_code,event_record_vocabulary
PERSON0001,20061020,242,ICD9
PERSON0002,20100310,242.0,ICD9
PERSON0006,20170509,F27sb,READ
PERSON0007,20050817,F27.b0,READ

...and dataset B

person_id,start_date_record,event_code,event_record_vocabulary
PERSON0004,20130620,153.4,ICD9
PERSON0010,20050510,F2r4,READ
PERSON0010,20050510,F2r4.r5,READ

Example 2

A more complex example can be found here, where the input consists of multiple tables converted to the ConcePTION Common Data Model.

Note on the use of this function to support multiple common data models in the same script

A script using this function to retrieve concept set datasets from two common data models can be structured as follows:

  • assign once and for all the parameters capturing the concept sets
  • assign two families of parameters capturing the data model of the input tables, one per common data model
  • call the function conditionally on the common data model of the input tables, which can be retrieved from metadata or from the user
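
The structure above can be sketched as follows. The labels "CDM_one" and "CDM_two", the table and column names of the second data model, and the helper `get_data_model_parameters` are hypothetical and only serve the illustration; the first family of parameters mirrors Example 1. The actual call is left commented out because it requires the function to be sourced and the input files to be available.

```r
# 1. Concept sets are assigned once, independently of the data model
concept_set_domains <- list(A = "Diagnosis")
concept_set_codes <- list(A = list(ICD9 = c("242")))

# 2. One family of data model parameters per common data model.
#    "CDM_two" and its table and column names are hypothetical.
data_model_parameters <- list(
  CDM_one = list(
    dataset = list(Diagnosis = c("EVENTS")),
    codvar = list(Diagnosis = list(EVENTS = "event_code")),
    vocabulary = list(Diagnosis = list(EVENTS = "event_record_vocabulary"))
  ),
  CDM_two = list(
    dataset = list(Diagnosis = c("DIAGNOSES")),
    codvar = list(Diagnosis = list(DIAGNOSES = "dx_code")),
    vocabulary = list(Diagnosis = list(DIAGNOSES = "dx_vocabulary"))
  )
)

# Hypothetical helper: pick the family matching the data model in use
get_data_model_parameters <- function(cdm_name) {
  if (!cdm_name %in% names(data_model_parameters)) {
    stop("Unknown common data model: ", cdm_name)
  }
  data_model_parameters[[cdm_name]]
}

# 3. The data model would normally come from metadata or from the user
this_cdm <- "CDM_two"
call_arguments <- c(
  get_data_model_parameters(this_cdm),
  list(concept_set_domains = concept_set_domains,
       concept_set_codes = concept_set_codes,
       extension = "csv")
)

# With the function sourced and the input files in place:
# do.call(CreateConceptSetDatasets, call_arguments)
```

Keeping the concept set lists separate from the data model lists means that adding a further common data model only requires a new entry in `data_model_parameters`.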