Home
The function CreateItemsetDatasets inspects a set of input tables of data and creates a group of datasets, each corresponding to an item set. Each dataset contains the records of the input tables that match the corresponding item set and is named after it.
The main parameters of the function are:
- parameters capturing the data model of the input tables
  - EAVtables: a 2-level list specifying the tables in an Entity-Attribute-Value structure; each table is listed with the names of the columns representing the attribute (currently the number of columns must be 2)
  - datevar (optional): a 2-level list containing, for each input table of data, the name(s) of the column(s) containing dates (only if extension="csv"), to be saved as dates in the output
  - dateformat (optional): a string containing the format of the dates in the input tables of data (only if -datevar- is indicated); the string must be one of the following: YYYYDDMM
  - numericvar (optional): a 2-level list containing, for each input table of data, the name(s) of the column(s) containing numbers (only if extension="csv"), to be saved as numbers in the output
- parameters capturing the item sets
  - study_variable_names (list of strings): list of the study variables of interest
  - itemset (3-level list of lists): a list specifying which itemsets are to be retrieved for each study variable; its 3 levels are:
    - study variable (string): must be one of the strings in the list -study_variable_names-,
    - table to be queried (string): specifies the name of the input table of data where the attributes must be searched for,
    - attribute to be selected (list of strings): attributes to be matched in the table; it can be a single column, or multiple columns
- parameters indicating how the output must be formatted
  - addtabcol: a logical parameter, by default set to TRUE: if so, the columns "Table_cdm" and "Col" are added to the output, indicating respectively from which original table and column the code is taken.
  - rename_col (optional): a list containing the 2-level lists to rename (for instance, id and date)
- other parameters
  - verbose: a logical parameter, by default set to FALSE. If it is TRUE, additional intermediate output datasets will be shown in the R environment
  - discard_from_environment: a logical parameter, by default set to FALSE: when FALSE, the item set datasets are kept in the R environment
  - dirinput (optional): the directory where the input tables of data are stored. If not provided, the working directory is used
  - diroutput (optional): the directory where the output item set datasets will be saved. If not provided, the working directory is used
  - extension: the extension of the input tables of data (csv and dta are supported)
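The optional data-model parameters are described above only in words. As a rough sketch, and assuming they follow the same named 2-level list shape as EAVtables (table name as key, column names as values), they might be built as below. The table name SURVEY_OBSERVATIONS and the columns so_date and so_source_value are used purely for illustration, and the exact shape the function accepts is not confirmed here.

```r
# Assumed shape: one entry per input table, each holding a list of column names.
# This is inferred from the parameter descriptions, not checked against the package.
datevar_sketch <- vector(mode = "list")
datevar_sketch[["SURVEY_OBSERVATIONS"]] <- list("so_date")

numericvar_sketch <- vector(mode = "list")
numericvar_sketch[["SURVEY_OBSERVATIONS"]] <- list("so_source_value")

# Inspect the resulting structures
str(datevar_sketch)
str(numericvar_sketch)
```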
The function produces one dataset per itemset. The dataset of an itemset is the union of the selections from the input tables of data that match the pairs assigned to the itemset. The data model of the output is the union of the data models of the input, except for the names of the columns (optionally) listed in rename_col.
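The matching rule can be illustrated in base R. This is only a sketch of the selection logic for a single table, not the package's implementation: the helper select_itemset and the toy data frame are hypothetical. With several input tables, the per-table selections would then be stacked to form the union.

```r
# Hypothetical helper, not part of the package: keep the rows of `tab` whose
# values in the attribute columns `eav_cols` equal one of the `pairs`.
select_itemset <- function(tab, eav_cols, pairs) {
  keep <- rep(FALSE, nrow(tab))
  for (p in pairs) {
    # A row matches a pair when every attribute column equals its value
    match_p <- rep(TRUE, nrow(tab))
    for (i in seq_along(eav_cols)) {
      match_p <- match_p & tab[[eav_cols[[i]]]] == p[[i]]
    }
    # A row is selected if it matches at least one pair
    keep <- keep | match_p
  }
  tab[keep, , drop = FALSE]
}

# Toy Entity-Attribute-Value table (illustrative values only)
toy <- data.frame(
  id    = c("A", "A", "B"),
  tab   = c("T1", "T1", "T2"),
  col   = c("C1", "C2", "C1"),
  value = c(1, 2, 3),
  stringsAsFactors = FALSE
)

# Item set defined by two pairs: (T1, C1) and (T2, C1)
selected <- select_itemset(
  toy,
  eav_cols = list("tab", "col"),
  pairs = list(list("T1", "C1"), list("T2", "C1"))
)
print(selected)
```

Here the first and third rows are kept, because each of them matches one of the two pairs on both attribute columns.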
A single input table, named SURVEY_OBSERVATIONS.csv, contains gestational age observations (in weeks) associated with persons:
person_id,so_date,so_source_table,so_source_column,so_source_value
P1,20131019,CAP1,GEST_ECO,39
P1,20131019,CAP1,SETTAMEN_ARSNEW,40
P2,20170112,CAP1,GEST_ECO,40
P2,20170112,CAP1,SETTAMEN_ARSNEW,40
P3,20200412,CAP1,GEST_ECO,33
P3,20200412,CAP1,SETTAMEN_ARSNEW,34
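To try the example, the table has to exist on disk and the directories passed to the function have to be defined. A minimal sketch follows, assuming temporary folders are acceptable as input and output directories; the names dirinput and dirtemp match those used in the call further down, but whether the function expects a trailing path separator is not stated on this page.

```r
# Assumption: temporary folders stand in for the input and output directories
dirinput <- file.path(tempdir(), "input")
dirtemp  <- file.path(tempdir(), "output")
dir.create(dirinput, showWarnings = FALSE, recursive = TRUE)
dir.create(dirtemp, showWarnings = FALSE, recursive = TRUE)

# Contents of the example table, copied from the listing above
survey_lines <- c(
  "person_id,so_date,so_source_table,so_source_column,so_source_value",
  "P1,20131019,CAP1,GEST_ECO,39",
  "P1,20131019,CAP1,SETTAMEN_ARSNEW,40",
  "P2,20170112,CAP1,GEST_ECO,40",
  "P2,20170112,CAP1,SETTAMEN_ARSNEW,40",
  "P3,20200412,CAP1,GEST_ECO,33",
  "P3,20200412,CAP1,SETTAMEN_ARSNEW,34"
)
writeLines(survey_lines, file.path(dirinput, "SURVEY_OBSERVATIONS.csv"))

# Read the file back to check that it parses into 6 rows and 5 columns
survey <- read.csv(file.path(dirinput, "SURVEY_OBSERVATIONS.csv"),
                   stringsAsFactors = FALSE)
print(dim(survey))
```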
Two itemsets, GESTAGE_FROM_LMP_WEEKS and GESTAGE_FROM_USOUNDS_WEEKS, are specified:
# the variables of interest are gestational age from last menstrual period and gestational age from ultrasound
variables_of_our_study <- c("GESTAGE_FROM_LMP_WEEKS","GESTAGE_FROM_USOUNDS_WEEKS")
itemsets_of_our_study <- vector(mode = "list")
### specification GESTAGE_FROM_LMP_WEEKS
itemsets_of_our_study[["GESTAGE_FROM_LMP_WEEKS"]][["SURVEY_OBSERVATIONS"]] <- list(list("CAP1","SETTAMEN_ARSNEW"))
### specification GESTAGE_FROM_USOUNDS_WEEKS
itemsets_of_our_study[["GESTAGE_FROM_USOUNDS_WEEKS"]][["SURVEY_OBSERVATIONS"]] <- list(list("CAP1","GEST_ECO"))
### the columns of the input table where to search for the itemsets
input_EAVtables <- vector(mode = "list")
input_EAVtables[["SURVEY_OBSERVATIONS"]] <- list("so_source_table", "so_source_column")
The function is invoked:
CreateItemsetDatasets(EAVtables = input_EAVtables,
                      study_variable_names = variables_of_our_study,
                      itemset = itemsets_of_our_study,
                      dirinput = dirinput,
                      diroutput = dirtemp,
                      discard_from_environment = FALSE,
                      extension = c("csv")
)
The output is two datasets:
dataset GESTAGE_FROM_LMP_WEEKS...
person_id,so_date,so_source_table,so_source_column,so_source_value
P1,20131019,CAP1,SETTAMEN_ARSNEW,40
P2,20170112,CAP1,SETTAMEN_ARSNEW,40
P3,20200412,CAP1,SETTAMEN_ARSNEW,34
...and dataset GESTAGE_FROM_USOUNDS_WEEKS
person_id,so_date,so_source_table,so_source_column,so_source_value
P1,20131019,CAP1,GEST_ECO,39
P2,20170112,CAP1,GEST_ECO,40
P3,20200412,CAP1,GEST_ECO,33
The case of two tables is described here