Home

Context

Matching is a common methodology to generate datasets where two groups of study units are balanced with respect to a set of characteristics. It is a crucial step in many observational study designs, as it provides a form of exchangeability, therefore supporting causality assessment.

Many R functions support matching (including this and this) and provide rich sets of functionalities such as effect estimation. GenerateMatchedDataset instead focuses on the generation of the matched dataset itself, including the case when time-dependent variables are used to match, and is agnostic with respect to how this dataset will be used.

Bootstrapping is in essence a sequence of matching steps, which makes it computationally very intensive.

Purpose

GenerateMatchedDataset takes as input at least two datasets, exposed and candidate_matches. The former lists each unit of observation just once, while the latter may observe units of observation across time. GenerateMatchedDataset generates pairwise combinations of the units of observation, matched on a number of criteria, which may or may not involve time. The complete set of pairs may be generated, or a sampling and/or bootstrapping strategy may be specified.

This function supports two types of matching, of increasing complexity:

on variables: this is a type of matching where time is not involved. Units of observation in the exposed dataset are matched to those in candidate_matches based only on the values of a list of variables.
on variables and date: this is a type of matching where time is involved. Each unit in the exposed dataset is meant to be observed on a specific date, stored in a variable often referred to as 't0'. In the candidate_matches dataset, each record is only valid during a time interval, whose extremes are stored in two variables. This is a particular case of a dataset of time-dependent records; see Dataset of TD variables for an explanation of what these datasets are. Units of observation in the exposed dataset are matched to those in candidate_matches based on a date, and units in candidate_matches may also be included in exposed, just on a different time span.
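The two types of matching can be illustrated with a minimal sketch in base R. This is not the package's implementation: the toy data, the column names (person_id, sex, t0, start, end) and the use of merge() are assumptions made only for illustration.

```r
# Hypothetical toy data; all column names are illustrative assumptions.
exposed <- data.frame(
  person_id = c(1, 2),
  sex       = c("F", "M"),
  t0        = as.Date(c("2020-03-01", "2020-06-15"))
)
candidate_matches <- data.frame(
  person_id = c(2, 3, 3, 4),
  sex       = c("M", "F", "F", "M"),
  start     = as.Date(c("2020-01-01", "2020-01-01", "2020-04-01", "2020-05-01")),
  end       = as.Date(c("2020-06-14", "2020-03-31", "2020-12-31", "2020-12-31"))
)

# Matching 'on variables': every exposed is paired with every candidate
# record that has the same value of sex.
pairs <- merge(exposed, candidate_matches, by = "sex",
               suffixes = c("_exposed", "_candidate"))

# Matching 'on variables and date': additionally require that t0 falls
# inside the validity interval [start, end] of the candidate record.
pairs_on_date <- pairs[pairs$t0 >= pairs$start & pairs$t0 <= pairs$end, ]
```

In this toy example person 2 appears both in exposed and in candidate_matches, but on a time span that does not contain that person's own t0.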

Additionally, within each of the previous types of matching, the matching on a variable, irrespective of whether it is time-independent (TI) or time-dependent (TD), may fall into one of two types:

exact: this applies to all variable types and requests that the value of the variable in the unit in exposed is equal to the value in the unit in candidate_matches
within a range: this only applies to numeric or date types and requests that the value of the variable in the unit in exposed falls within a range (-infvalue, supvalue) around the value in the unit in candidate_matches

Matching can be executed naively or, in case the datasets are large or there are memory limitations, in an iterative manner, using a threshold. The use of the threshold is described with an example here.

Finally, as an option, the function executes bootstrapping. Two strategies for bootstrapping are implemented and are described in this page.

Structure of input data

The input datasets are the following:

exposed (mandatory): this is the dataset where the exposed units of observation are listed, with one record per unit; the data model of this input dataset depends on how complex the match will be and is described in the subsection Data model of the exposed dataset
candidate_matches (mandatory): this is the dataset where the units of observation that are candidate matches for the units of exposed are listed; the data model of this input dataset depends on how complex the match will be and is described in the subsection Data model of the candidate_matches dataset; this may itself be a time-dependent dataset, in which case a unit may have several records (the concept of TD variables is described in the page Datasets of TD variables)

Data model of the exposed dataset

The data model of the exposed dataset is specified via

unit_of_observation (mandatory): this is the name of a variable that may be string or integer, and must have the same name and type across all input datasets
date (optional): this is the name of a date variable; it may be specified if the matching is based on a date

Data model of the candidate_matches dataset

...

Back to the Wiki Contents

Arguments of the function

output <- GenerateMatchedDataset(exposed = ...,
                                candidate_matches = ...,
                                unit_of_observation = ...,
                                type_of_matching = ['on variables', 'on variables and date'],
                                [
                                time_variable_in_exposed = ...,
                                time_variables_in_candidate_matches = ...,
                                variables_with_exact_matching = ...,
                                variables_with_range_matching = ...,
                                range_of_variables_with_range_matching = ...,
                                additional_matching_rules = ...,
                                rule_for_matching_on_dates = ...,
                                output_matching = ...,
                                seeds_for_sampling = ...,
                                sample_size_per_exposed = 1,
                                methodology_for_bootstrapping = ["No bootstrapping", ...],
                                number_of_bootstrapping_samples = ...,
                                type_of_sampling = ...,
                                exclude_sameUoO = TRUE,
                                algorithm_for_matching = ...,
                                threshold = ...,
                                technical_details_of_matching = ...
                                ]
                    )

Back to the Wiki Contents

Arguments describing the input datasets

exposed

Name of a data.table. The exposed data table is the dataset where the exposed units are stored. The requirements on its data model are described in this section.

candidate_matches

Name of a data.table. The candidate_matches data table is the dataset where the candidate matches are stored. The requirements on its data model are described in this section.

unit_of_observation

Character. It's the name of the variable included both in exposed and in candidate_matches where the value of the unit of observation is stored.

Example: unit_of_observation = "person_id",

Arguments describing the matching

type_of_matching

Character. The options are 'on variables' and 'on variables and date'.

...

In the current implementation this argument is not available, and the matching is always 'on variables and date'.

time_variable_in_exposed

Specifies the name of the variable in the exposed dataset where the value of t0 is stored. Only specified if type_of_matching is 'on variables and date'.

time_variables_in_candidate_matches

Specifies the pair of variables in the candidate_matches dataset that identify the interval of validity of the record. The first variable of the pair is the start of the interval and the second is the end, so the first must always be earlier than or equal to the second. Moreover, two records with the same value of the unit of observation (UoO) must have non-overlapping intervals.

Only specified if type_of_matching is 'on variables and date'.

In the current implementation this is not available; instead, a specification must be added in range_of_variables_with_range_matching.
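The two requirements on the validity intervals (start not after end, and no overlap within the same unit of observation) can be checked before calling the function. The helper below is a hypothetical sketch in base R, not part of the package; it assumes that intervals are closed, so that two intervals sharing a day count as overlapping.

```r
# Hypothetical helper, not part of the package.
# Assumption: intervals are closed, i.e. [start, end].
check_intervals <- function(data, unit, start, end) {
  # Each interval must start on or before its own end.
  ordered <- all(data[[start]] <= data[[end]])
  # Sort by unit and start, then compare each start with the end of the
  # previous record of the same unit.
  data <- data[order(data[[unit]], data[[start]]), ]
  n <- nrow(data)
  same_unit <- data[[unit]][-1] == data[[unit]][-n]
  overlaps <- same_unit & data[[start]][-1] <= data[[end]][-n]
  list(ordered = ordered, non_overlapping = !any(overlaps))
}

# Toy data: person 2 has two records that overlap during March 2020.
candidate_matches <- data.frame(
  person_id = c(1, 1, 2, 2),
  start = as.Date(c("2020-01-01", "2020-07-01", "2020-01-01", "2020-03-01")),
  end   = as.Date(c("2020-06-30", "2020-12-31", "2020-03-31", "2020-12-31"))
)
result <- check_intervals(candidate_matches, "person_id", "start", "end")
```

Comparing only consecutive records is sufficient once the records of a unit are sorted by start: if any two intervals of a unit overlap, at least one pair of neighbours does too.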

variables_with_exact_matching

Vector of characters. This is the list of variables included both in exposed and in candidate_matches where the matching is requested to be exact. They can be of any type.

variables_with_range_matching

Vector of characters. This is the list of variables included both in exposed and in candidate_matches where the matching is requested to be within a range. The range is further specified in range_of_variables_with_range_matching, which cannot be missing if this argument is specified. The variables must be numeric or dates.

range_of_variables_with_range_matching

List of pairs of numbers, as many as the elements of the vector variables_with_range_matching. If variables_with_range_matching is specified, then range_of_variables_with_range_matching must be specified as well. For the corresponding element of variables_with_range_matching, the first element of the pair is the amount to be subtracted to obtain the lower extreme of the range, and the second is the amount to be added to obtain the upper extreme.

Example: range_of_variables_with_range_matching = list(c(1,1),c(4,5.5))
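How the pairs in the example could translate into a filter is sketched below in base R. The variable names age and bmi are hypothetical. Two further assumptions are made: the range is built around the candidate's value (as in the description of matching within a range), and the extremes are included.

```r
variables_with_range_matching <- c("age", "bmi")  # hypothetical variables
range_of_variables_with_range_matching <- list(c(1, 1), c(4, 5.5))

exposed <- data.frame(person_id = 1, age = 40, bmi = 25)
candidate_matches <- data.frame(person_id = c(2, 3, 4),
                                age = c(41, 39, 43),
                                bmi = c(28, 19, 25))

# Rename the columns so that the two sides stay distinguishable,
# then build all pairs (Cartesian product).
names(exposed) <- paste0(names(exposed), "_exposed")
names(candidate_matches) <- paste0(names(candidate_matches), "_candidate")
pairs <- merge(exposed, candidate_matches, by = NULL)

# Keep a pair only if, for every variable, the exposed value lies in
# [candidate - lower, candidate + upper] (bounds assumed inclusive).
keep <- rep(TRUE, nrow(pairs))
for (i in seq_along(variables_with_range_matching)) {
  variable <- variables_with_range_matching[i]
  bounds <- range_of_variables_with_range_matching[[i]]
  value_exposed <- pairs[[paste0(variable, "_exposed")]]
  value_candidate <- pairs[[paste0(variable, "_candidate")]]
  keep <- keep &
    value_exposed >= value_candidate - bounds[1] &
    value_exposed <= value_candidate + bounds[2]
}
matched <- pairs[keep, ]
```

Here only candidate 2 is retained: candidate 3 fails on bmi (25 is above 19 + 5.5) and candidate 4 fails on age (40 is below 43 - 1).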

rule_for_matching_on_dates

Character. The options are the following:

'exact': the value of time_variable_in_exposed must be included between the extremes of the interval identified by time_variables_in_candidate_matches
'with range': the value of time_variable_in_exposed with a range, specified ..., must overlap the interval identified by time_variables_in_candidate_matches
'historical': the value of time_variable_in_exposed minus a time, specified ..., must be included between the extremes of the interval identified by time_variables_in_candidate_matches
'historical with range': the value of time_variable_in_exposed with a range, specified ..., minus a time, specified ..., must overlap the interval identified by time_variables_in_candidate_matches

In the current implementation only 'exact' matching of dates is available.

exclude_sameUoO

Boolean, default TRUE. If this is TRUE, an exposed unit cannot be matched to a candidate match with the same value of the variable unit_of_observation.

algorithm_for_matching

String. It represents the mechanism used to produce the matching and/or each step of the bootstrapping. The possible values are:

'naive': matching is attempted directly
'with threshold': matching is done in batches, attempting to avoid ever creating datasets with more rows than a specified threshold. If this is specified, threshold needs to be specified as well.

threshold

Integer. Available if algorithm_for_matching is set to 'with threshold'. This represents the maximum number of records that will be produced during a step of the matching/bootstrapping. This maximum will be estimated and the procedure will be split in batches in such a way that this threshold is not attained. The procedure is explained with an example here. If this is not specified, the procedure will not be split in batches.

Example: threshold = 10000000,
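The general idea of batching against a threshold can be sketched as follows. This is only one possible scheme, written in base R under stated assumptions, and may differ from the procedure the package actually uses: the number of pairs per exposed is estimated as the number of candidates with the same value of a hypothetical exact-matching variable (sex), and exposed units are grouped greedily so that the estimated size of each batch stays at or below the threshold.

```r
# Hypothetical toy data and a deliberately small threshold.
exposed <- data.frame(person_id = 1:6,
                      sex = c("F", "F", "F", "M", "M", "M"))
candidate_matches <- data.frame(person_id = 101:110,
                                sex = c(rep("F", 4), rep("M", 6)))
threshold <- 10

# Estimate how many pairs each exposed would generate.
n_candidates <- table(candidate_matches$sex)
exposed$estimated_pairs <- as.integer(n_candidates[as.character(exposed$sex)])

# Greedy batching: open a new batch when adding the next exposed would
# push the estimated batch size above the threshold.
batch <- integer(nrow(exposed))
current_batch <- 1L
running_total <- 0
for (i in seq_len(nrow(exposed))) {
  if (running_total > 0 &&
      running_total + exposed$estimated_pairs[i] > threshold) {
    current_batch <- current_batch + 1L
    running_total <- 0
  }
  running_total <- running_total + exposed$estimated_pairs[i]
  batch[i] <- current_batch
}
exposed$batch <- batch

# Match one batch at a time and stack the results.
matched <- do.call(rbind, lapply(split(exposed, exposed$batch), function(chunk) {
  merge(chunk, candidate_matches, by = "sex",
        suffixes = c("_exposed", "_candidate"))
}))
```

In practice each batch would be sampled or written to disk before the next one is processed, so that only one batch of pairs is held in memory at a time.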

seeds_for_sampling

This is a vector of numbers that will be used in the random steps of the procedure. The vector must be ... long.

technical_details_of_matching

...

output_matching

...

Arguments describing the sampling

sample_size_per_exposed

If all matched records are to be saved, this must be set to N. If sampling of matched records is requested, this is an integer storing the number of matches that must be sampled for each exposed.
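Sampling a fixed number of matches per exposed can be sketched in base R as below. The pairs dataset, its column names and the single seed are hypothetical. The sketch assumes sampling without replacement, and that an exposed with fewer matches than requested keeps all of them.

```r
# Hypothetical dataset of matched pairs.
pairs <- data.frame(
  person_id_exposed   = c(1, 1, 1, 2, 2, 3),
  person_id_candidate = c(11, 12, 13, 21, 22, 31)
)
sample_size_per_exposed <- 2
set.seed(123)  # hypothetical seed, for reproducibility

# Pick at most `size` row indices among those of one exposed.
# sample.int() on the length avoids the surprising behaviour of sample()
# when it is given a single number.
sample_rows <- function(rows, size) {
  if (length(rows) <= size) return(rows)
  rows[sample.int(length(rows), size)]
}

rows_by_exposed <- split(seq_len(nrow(pairs)), pairs$person_id_exposed)
picked <- unlist(lapply(rows_by_exposed, sample_rows,
                        size = sample_size_per_exposed))
sampled <- pairs[sort(picked), ]
```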

type_of_sampling

'with replacement', 'without replacement', ...

Arguments describing the bootstrapping

methodology_for_bootstrapping

The default for this argument is "No bootstrapping". Other options are "Sample exposed" and "Sample units of observations", and are described here.

In the current implementation only 'Sample units of observations' is available, and it is the default if number_of_bootstrapping_samples is specified.

number_of_bootstrapping_samples

This argument is only allowed if methodology_for_bootstrapping is different from "No bootstrapping".
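The sketch below shows, in base R, the general pattern of bootstrapping as a sequence of matching steps. It does not attempt to reproduce either of the two documented strategies: it simply resamples the exposed units with replacement and repeats the matching once per bootstrap sample. The data, the seeds and the matching variable (sex) are hypothetical.

```r
# Hypothetical toy data.
exposed <- data.frame(person_id = 1:4, sex = c("F", "F", "M", "M"))
candidate_matches <- data.frame(person_id = 5:10,
                                sex = c("F", "F", "F", "M", "M", "M"))
number_of_bootstrapping_samples <- 3
seeds <- c(11, 22, 33)  # hypothetical, one seed per bootstrap sample

bootstrap_samples <- lapply(seq_len(number_of_bootstrapping_samples), function(b) {
  set.seed(seeds[b])
  # Draw as many units as there are in exposed, with replacement.
  units <- exposed$person_id
  drawn <- units[sample.int(length(units), length(units), replace = TRUE)]
  resampled <- exposed[match(drawn, exposed$person_id), ]
  # One matching step on the resampled dataset.
  pairs <- merge(resampled, candidate_matches, by = "sex",
                 suffixes = c("_exposed", "_candidate"))
  pairs$bootstrap_sample <- b
  pairs
})
```

Each bootstrap sample requires a full matching step, which is why the number of samples multiplies the cost of the procedure.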

Auxiliary arguments

temporary_folder

Name of the folder where temporary files are stored.

Back to the Wiki Contents

Subfunctions

...

Back to the Wiki Contents

Flow of the function

...

Back to the Wiki Contents

How to install and use the package

...

Back to the Wiki Contents

Examples

Example 1

...

Back to the Wiki Contents

Example 2

...

Back to the Wiki Contents

Example 3

...

Back to the Wiki Contents

Example 4

...

Back to the Wiki Contents

Example 5

...

Back to the Wiki Contents


Contents

Context

Purpose

Structure of input data

Data model of the exposed dataset

Data model of the candidate_matches dataset

Arguments of the function

Subfunctions

Flow of the function

How to install and use the package

Examples

Example 1

Example 2

Example 3

Example 4

Example 5
