Home
- Context
- Purpose
- Structure of input data
- Structure of output data
- Arguments of the function
- Subfunctions
- Flow of the function
- How to install and use the package
- Examples
Matching is a common methodology to generate datasets where two groups of study units are balanced with respect to a set of characteristics. It is a crucial step in many observational study designs, as it provides a form of exchangeability, therefore supporting causality assessment.
While many functions are available in R to support matching, including this and this, which provide rich sets of functionalities such as effect estimation, GenerateMatchedDataset focuses on the generation of the matched dataset itself, including the case when time-dependent variables are used to match, and is agnostic with respect to how this dataset will be used.
Bootstrapping is in essence a sequence of steps of matching. This is computationally very intensive.
GenerateMatchedDataset takes as input at least two datasets, exposed and candidate_matches. The former lists each unit of observation just once, while the latter may observe units of observation across time. GenerateMatchedDataset generates pairwise combinations of the units of observation, matched on a number of criteria, which may or may not involve time. The complete set of pairs may be generated, or a sampling and/or bootstrapping strategy may be specified.
This function supports two types of matching, of increasing complexity:
- on variables: this is a type of matching where time is not involved. Units of observation in the exposed dataset are matched to those in candidate_matches simply based on the values of a list of variables
- on variables and date: this is a type of matching where time is involved. Each unit in the exposed dataset is meant to be observed on a specific date, stored in a variable often referred to as 't0'. In the dataset of candidate matches, each record is only valid during a time interval, whose extremes are stored in two variables. This is a particular case of a dataset of time-dependent records, see Dataset of TD variables for an explanation of what these datasets are. Units of observation in the exposed dataset are matched to those in candidate_matches based on a date, and units in candidate_matches may also be included in exposed, just on a different time span.
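As a minimal sketch of what matching on a date means, the base R snippet below pairs every exposed unit with every candidate record and keeps the pairs where t0 falls inside the record's validity interval. The data, the column names (t0, start, end, candidate_id) and the use of inclusive interval extremes are assumptions made for illustration; this is not the package's internal implementation.

```r
# Toy data: all names and values are illustrative only.
exposed <- data.frame(
  person_id = c(1, 2),
  t0 = as.Date(c("2021-03-10", "2021-06-01"))
)
candidate_matches <- data.frame(
  candidate_id = c(3, 3, 4),
  start = as.Date(c("2021-01-01", "2021-04-01", "2021-05-15")),
  end   = as.Date(c("2021-03-31", "2021-12-31", "2021-07-31"))
)

# Cartesian product of the two datasets (merge with by = NULL).
pairs <- merge(exposed, candidate_matches, by = NULL)

# Keep the pairs where t0 lies within the validity interval
# (inclusive extremes are an assumption of this sketch).
matched_on_date <- pairs[pairs$t0 >= pairs$start & pairs$t0 <= pairs$end, ]
print(matched_on_date)
```

A full Cartesian product is only viable on small data; on large datasets a batched approach is needed.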
Additionally, within each of the previous types of matching, the matching on a variable (irrespective of whether it is time-independent, TI, or time-dependent, TD) may fall into one of two types:
- exact: this applies to all variable types and requests that the value of the variable in the unit in exposed is equal to the value in the unit in candidate_matches
- within a range: this only applies to numeric or date types and requests that the value of the variable in the unit in exposed falls within a range (-infvalue, +supvalue) around the value in the unit in candidate_matches
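The two types can be combined, as in the following base R sketch: sex is matched exactly and age within a range. The datasets, the variable names, the values of infvalue and supvalue, and the choice of inclusive bounds are all assumptions made for illustration.

```r
# Toy data: all names and values are illustrative only.
exposed <- data.frame(person_id = c(1, 2), sex = c("F", "M"), age = c(40, 65))
candidate_matches <- data.frame(
  person_id = c(10, 11, 12, 13),
  sex = c("F", "F", "M", "M"),
  age = c(41, 50, 63, 70)
)
infvalue <- 2  # hypothetical amount subtracted from the candidate value
supvalue <- 3  # hypothetical amount added to the candidate value

# Exact matching on sex: an equi-join keeps only the pairs with equal values.
pairs <- merge(exposed, candidate_matches, by = "sex",
               suffixes = c("_exposed", "_candidate"))

# Range matching on age: the exposed value must fall within
# [candidate - infvalue, candidate + supvalue].
in_range <- pairs$age_exposed >= pairs$age_candidate - infvalue &
  pairs$age_exposed <= pairs$age_candidate + supvalue
matched <- pairs[in_range, ]
print(matched)
```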
Matching can be executed naively or, in case the datasets are large or there are memory limitations, in an iterative manner, using a threshold. The use of the threshold is described with an example here.
Finally, as an option, the function executes bootstrapping. Two strategies for bootstrapping are implemented and are described in this page.
The input datasets are the following:
- exposed (mandatory): this is the dataset where the exposed units of observation are listed; the data model of this input dataset depends on how complex the match will be and is described in the subsection Data model of the main_input dataset
- candidate_matches (mandatory): this is a dataset where each of the units of observation that are candidate matches for the units of exposed are listed, with one record per unit; the data model of this input dataset depends on how complex the match will be and is described in the subsection Data model of the candidate_matches dataset; this may itself be a time-dependent dataset (the concept of TD variables is described in the page Datasets of TD variables)
The data model of the exposed dataset is specified via
- unit_of_observation (mandatory): this is the name of a variable that may be string or integer, and must have the same name and type across all input datasets
- date (optional): this is the name of a date variable; it may be specified if the matching is based on a date
...
output <- GenerateMatchedDataset(exposed = ...,
                                 candidate_matches = ...,
                                 unit_of_observation = ...,
                                 type_of_matching = ['on variables', 'on variables and date'],
                                 [
                                 time_variable_in_exposed = ...,
                                 time_variables_in_candidate_matches = ...,
                                 variables_with_exact_matching = ...,
                                 variables_with_range_matching = ...,
                                 range_of_variables_with_range_matching = ...,
                                 additional_matching_rules = ...,
                                 rule_for_matching_on_dates = ...,
                                 output_matching = ...,
                                 seeds_for_sampling = ...,
                                 sample_size_per_exposed = 1,
                                 methodology_for_bootstrapping = ["No bootstrapping", ...],
                                 number_of_bootstrapping_samples = ...,
                                 type_of_sampling = ...,
                                 exclude_sameUoO = TRUE,
                                 algorithm_for_matching = ...,
                                 threshold = ...,
                                 technical_details_of_matching = ...
                                 ]
                                 )
Arguments describing the input datasets
exposed
Name of a data.table. The exposed data table is the dataset where the exposed units are stored. The requirements on its data model are described in this section.
candidate_matches
Name of a data.table. The candidate_matches data table is the dataset where the candidate matches are stored. The requirements on its data model are described in this section.
unit_of_observation
Character. It's the name of the variable included both in exposed and in candidate_matches where the value of the unit of observation is stored.
Example: unit_of_observation = "person_id",
Arguments describing the matching
type_of_matching
Character. The options are 'on variables' and 'on variables and date'.
...
- In the current implementation this argument is not available, and the type of matching is set to 'on variables and date'
time_variable_in_exposed
Specifies the name of the variable in the exposed dataset where the value of t0 is stored. Only specified if type_of_matching is 'on variables and date'.
time_variables_in_candidate_matches
Specifies the pair of variables in the candidate_matches dataset that identify the interval of validity of the record. The first variable of the pair is the start of the interval, and the second the end, therefore the first must always be before or equal to the second. Moreover, two observations with the same value of UoO must have non-overlapping intervals.
Only specified if type_of_matching is 'on variables and date'.
- In the current implementation this is not available. Instead, a specification must be added in range_of_variables_with_range_matching
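The two requirements on the validity intervals (start not after end, and no overlap within the same unit of observation) can be checked before matching. The helper below, find_invalid_units, is hypothetical and not part of the package; it is a base R sketch that treats intervals as closed, so two records of the same unit sharing an extreme count as overlapping.

```r
# Toy data: unit 2 has two overlapping records.
candidate_matches <- data.frame(
  person_id = c(1, 1, 2, 2),
  start = as.Date(c("2020-01-01", "2020-07-01", "2020-01-01", "2020-03-01")),
  end   = as.Date(c("2020-06-30", "2020-12-31", "2020-04-30", "2020-09-30"))
)

# Hypothetical helper: returns the units that violate either requirement.
find_invalid_units <- function(dt, uoo, start, end) {
  # Requirement 1: the start of each interval must not follow its end.
  bad_order <- unique(dt[[uoo]][dt[[start]] > dt[[end]]])
  # Requirement 2: within a unit, consecutive intervals (sorted by start)
  # must not overlap.
  ord <- dt[order(dt[[uoo]], dt[[start]]), ]
  n <- nrow(ord)
  same_unit <- ord[[uoo]][-1] == ord[[uoo]][-n]
  overlaps <- same_unit & ord[[start]][-1] <= ord[[end]][-n]
  bad_overlap <- unique(ord[[uoo]][-1][overlaps])
  sort(unique(c(bad_order, bad_overlap)))
}

invalid_units <- find_invalid_units(candidate_matches, "person_id", "start", "end")
print(invalid_units)
```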
variables_with_exact_matching
Vector of characters. This is the list of variables included both in exposed and in candidate_matches where the matching is requested to be exact. They can be of any type.
variables_with_range_matching
Vector of characters. This is the list of variables included both in exposed and in candidate_matches where the matching is requested to be within a range. The range is further specified in range_of_variables_with_range_matching, which cannot be missing if this argument is specified. The variables must be numeric or dates.
range_of_variables_with_range_matching
List of pairs of numbers, as many as the elements of the vector variables_with_range_matching. If variables_with_range_matching is specified, then range_of_variables_with_range_matching must be specified as well: the first element of each pair is the amount to be subtracted from the corresponding element in variables_with_range_matching to obtain the lower extreme of the range, and the second is the amount to be added to the corresponding element in variables_with_range_matching to obtain the upper extreme.
Example: range_of_variables_with_range_matching = list(c(1,1),c(4,5.5))
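To illustrate how the pairs in the example above line up with the variables, the sketch below computes the lower and upper extremes for one record. The variable names (age, bmi) and the values are hypothetical, and applying the range to the value in candidate_matches is an assumption of this sketch.

```r
# Hypothetical variables paired with the ranges from the example above.
variables_with_range_matching <- c("age", "bmi")
range_of_variables_with_range_matching <- list(c(1, 1), c(4, 5.5))

# One illustrative record.
candidate <- data.frame(age = 50, bmi = 24)

# For each variable, subtract the first element of its pair and add the second.
bounds <- lapply(seq_along(variables_with_range_matching), function(i) {
  v <- variables_with_range_matching[i]
  r <- range_of_variables_with_range_matching[[i]]
  c(lower = candidate[[v]] - r[1], upper = candidate[[v]] + r[2])
})
names(bounds) <- variables_with_range_matching
print(bounds)
```

The pairs are positional: the i-th pair applies to the i-th variable name, so the two arguments must have the same length.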
rule_for_matching_on_dates
Character. The options are the following:
- 'exact': the value of time_variable_in_exposed must be included between the extremes of the interval identified by time_variables_in_candidate_matches
- 'with range': the value of time_variable_in_exposed with a range, specified ..., must overlap the interval identified by time_variables_in_candidate_matches
- 'historical': the value of time_variable_in_exposed minus a time, specified ..., must be included between the extremes of the interval identified by time_variables_in_candidate_matches
- 'historical with range': the value of time_variable_in_exposed with a range, specified ..., minus a time, specified ..., must overlap the interval identified by time_variables_in_candidate_matches
- In the current implementation only 'exact' matching of dates is available
exclude_sameUoO
Boolean, default TRUE. If this is TRUE, an exposed unit cannot be matched to a candidate match with the same value of the variable unit_of_observation.
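The effect of this argument can be pictured as a filter on the candidate pairs, as in the base R sketch below. The data and the column name person_id_candidate are hypothetical, and the package may apply the exclusion at a different stage of the procedure.

```r
# Toy data: units 1 and 2 appear both as exposed and as candidates.
exposed <- data.frame(
  person_id = c(1, 2),
  t0 = as.Date(c("2021-02-01", "2021-02-01"))
)
candidates <- data.frame(person_id_candidate = c(1, 2, 3))

# All exposed/candidate pairs.
pairs <- merge(exposed, candidates, by = NULL)

exclude_sameUoO <- TRUE
if (exclude_sameUoO) {
  # Drop the pairs where a unit would be matched to itself.
  pairs <- pairs[pairs$person_id != pairs$person_id_candidate, ]
}
print(pairs)
```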
algorithm_for_matching
String. It represents the mechanism to produce the matching and/or each step of the bootstrapping. The possible values are:
- 'naive': it means that matching is attempted directly
- 'with threshold': it means that matching is done in batches, attempting to avoid ever creating datasets with a number of rows larger than a specified threshold. If this is specified, threshold needs to be specified as well.
threshold
Integer. Available if algorithm_for_matching is set to 'with threshold'. This represents the maximum number of records that will be produced during a step of the matching/bootstrapping. This maximum will be estimated and the procedure will be split in batches in such a way that this threshold is not attained. The procedure is explained with an example here. If this is not specified, the procedure will not be split in batches.
Example: threshold = 10000000,
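The general idea of batching against a threshold can be sketched as follows. This is not the package's actual procedure (which is documented on the linked page): the worst-case estimate of one row per exposed/candidate pair and all the values are assumptions chosen to keep the example small.

```r
# Illustrative values only.
exposed_ids <- 1:10
n_candidates <- 4   # assumed worst case: every candidate pairs with every exposed
threshold <- 12

# Largest number of exposed per batch that keeps the estimated rows
# strictly below the threshold.
batch_size <- max(1, floor((threshold - 1) / n_candidates))

# Split the exposed units into consecutive batches of that size.
batches <- split(exposed_ids, ceiling(seq_along(exposed_ids) / batch_size))

# Estimated number of rows produced by each batch.
rows_per_batch <- vapply(batches, function(b) length(b) * n_candidates, numeric(1))
print(rows_per_batch)
```

Each batch can then be matched separately and the results appended, so the intermediate dataset never reaches the threshold.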
seeds_for_sampling
This is a vector of numbers that will be used in the random steps of the procedure. The vector must be ... long.
technical_details_of_matching
...
output_matching
...
Arguments describing the sampling
sample_size_per_exposed
If all matched records are to be saved, this must be set to N. If sampling of matched records is requested, this is an integer storing the number of matches that must be sampled for each exposed.
type_of_sampling
'with replacement', 'without replacement', ...
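The difference between the two documented sampling types can be sketched in base R as below. The helper sample_matches is hypothetical, as are the data and column names; in particular, keeping all the matches when an exposed unit has fewer than the requested number is an assumption of this sketch, not documented behaviour.

```r
set.seed(123)  # fixed seed, so the random step is reproducible

# Toy set of matched pairs: exposed 1 has three matches, exposed 2 has two.
matched <- data.frame(
  exposed_id   = c(1, 1, 1, 2, 2),
  candidate_id = c(10, 11, 12, 20, 21)
)

# Hypothetical helper: sample up to `size` matches for each exposed unit.
sample_matches <- function(matched, size, replace = FALSE) {
  groups <- split(matched, matched$exposed_id)
  sampled <- lapply(groups, function(g) {
    # Without replacement, no more than the available matches can be drawn.
    n <- if (replace) size else min(size, nrow(g))
    g[sample.int(nrow(g), n, replace = replace), , drop = FALSE]
  })
  do.call(rbind, sampled)
}

without_replacement <- sample_matches(matched, size = 2, replace = FALSE)
with_replacement <- sample_matches(matched, size = 2, replace = TRUE)
```

Without replacement, the same candidate can appear at most once per exposed unit; with replacement, it may be drawn more than once.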
Arguments describing the bootstrapping
methodology_for_bootstrapping
The default for this argument is "No bootstrapping". Other options are "Sample exposed" and "Sample units of observations", and are described here.
- In the current implementation only "Sample units of observations" is available, and it is the default if number_of_bootstrapping_samples is specified
number_of_bootstrapping_samples
This argument is only allowed if methodology_for_bootstrapping is different from "No bootstrapping".
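Since bootstrapping is in essence a sequence of matching steps, its overall shape can be sketched as a loop that resamples and then matches. The sketch below is generic and does not claim to reproduce either of the two strategies described on the linked page: match_once is a hypothetical stand-in for a matching step, and resampling the rows of exposed with replacement is an assumption.

```r
set.seed(42)

# Toy data: all names and values are illustrative only.
exposed <- data.frame(person_id = 1:5, sex = c("F", "M", "F", "M", "F"))
candidates <- data.frame(
  candidate_id = 11:16,
  sex = c("F", "F", "M", "M", "F", "M")
)

# Hypothetical stand-in for one matching step (exact matching on sex).
match_once <- function(exposed, candidates) {
  merge(exposed, candidates, by = "sex")
}

number_of_bootstrapping_samples <- 3
bootstrap_samples <- lapply(seq_len(number_of_bootstrapping_samples), function(b) {
  # Resample the exposed rows with replacement, then match the resample.
  resampled <- exposed[sample.int(nrow(exposed), replace = TRUE), , drop = FALSE]
  out <- match_once(resampled, candidates)
  out$bootstrap_id <- rep(b, nrow(out))
  out
})
```

Every iteration repeats a full matching step, which is why the cost grows linearly with the number of bootstrap samples.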
Auxiliary arguments
temporary_folder
Name of the folder where temporary files are stored.
...
...
...
...
...
...
...
...