Olga Paoletti edited this page Jun 15, 2022 · 13 revisions

CreateSpells

Context

Units of observation may have multiple observation periods, during which they may be observed by different observers. An example is the records of OBSERVATION_PERIODS in the ConcePTION CDM. This function is designed to select an observation window for the participation of each unit of observation in the study: see recommendation C.4 in Table 2 of Wang SV, Schneeweiss S, Berger ML, Brown J, de Vries F, Douglas I, et al. Reporting to Improve Reproducibility and Facilitate Validity Assessment for Healthcare Database Studies V1.0. Pharmacoepidemiol Drug Saf. 2017 Sep;26(9):1018–32. It is not designed to compute episodes of treatment: for that functionality, which is much more sophisticated, we recommend the package AdhereR.

Purpose

CreateSpells takes as input a dataset with multiple time windows per unit of observation. Multiple categories of time windows may be recorded per unit, and time windows of the same unit may overlap, even within the same category. The purpose of the function is to create a dataset where the time windows of each unit and category are disjoint (a time window that is disjoint from the others is called a spell). Additionally, a category '_overall' is added, where time windows are processed regardless of their category. As an option, the overlap of pairs of categories is also processed: each pair of categories is associated with the spells during which both categories are recorded.
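As an illustration of the idea only (this is not the code of the function, which is written in R), the construction of disjoint spells can be sketched in Python. The function name build_spells, the tuple layout, and the sample categories "GP" and "HOSP" are hypothetical. The rule for opening a new spell follows the description in the Action section.

```python
from datetime import date, timedelta
from itertools import groupby


def build_spells(rows, gap_allowed):
    """rows: iterable of (id, category, start_date, end_date) tuples.

    Returns a list of (id, category, spell_num, spell_start, spell_end).
    """
    rows = list(rows)
    categories = {category for _, category, _, _ in rows}
    overall = [(uid, "_overall", start, end) for uid, _, start, end in rows]
    # With several categories, '_overall' is appended as a copy;
    # with a single category, it replaces that category.
    rows = rows + overall if len(categories) > 1 else overall
    rows.sort()  # by id, category, start_date, end_date

    spells = []
    for (uid, category), group in groupby(rows, key=lambda r: (r[0], r[1])):
        spell_num = 0
        running_end = None  # latest end date seen so far in this group
        for _, _, start, end in group:
            limit = None if running_end is None else running_end + timedelta(days=gap_allowed)
            if limit is None or start > limit:
                # First window of the group, or gap too large: open a new spell.
                spell_num += 1
                spells.append([uid, category, spell_num, start, end])
                running_end = end
            else:
                # The window overlaps or is close enough: extend the current spell.
                running_end = max(running_end, end)
                spells[-1][4] = running_end
    return [tuple(s) for s in spells]


windows = [
    ("p1", "GP", date(2020, 1, 1), date(2020, 3, 31)),
    ("p1", "GP", date(2020, 3, 1), date(2020, 6, 30)),    # overlaps the first window
    ("p1", "HOSP", date(2020, 6, 1), date(2020, 12, 31)),
    ("p1", "GP", date(2021, 6, 1), date(2021, 12, 31)),   # far from the others
]
spells = build_spells(windows, gap_allowed=1)
for spell in spells:
    print(spell)
```

In this example the two overlapping "GP" windows collapse into one spell, and the '_overall' category joins the 2020 windows of both categories into a single spell from 2020-01-01 to 2020-12-31.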

Structure of input data

A dataset with multiple rows per unit of observation, each containing a start date and an end date (a time window) and labeled with a categorical variable. Time windows of the same unit of observation may overlap freely. The only requirements are that start dates are never missing and that the start date of each row is not after its end date. End dates may be missing, but in that case the option replace_missing_end_date must be specified.
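The requirements above can be sketched as a small preparation step. This is a hedged Python illustration, not the R code of the function: the name prepare_windows and the tuple layout are hypothetical, and raising an error on a failed check is an assumption, since this page does not say how the function reacts.

```python
from datetime import date


def prepare_windows(rows, replace_missing_end_date=None):
    """rows: iterable of (id, category, start_date, end_date); end_date may be None."""
    cleaned = []
    for uid, category, start, end in rows:
        if start is None:
            raise ValueError(f"missing start date for id {uid!r}")
        if end is None:
            if replace_missing_end_date is None:
                raise ValueError(
                    f"missing end date for id {uid!r}: specify replace_missing_end_date"
                )
            end = replace_missing_end_date  # fill the missing end date
        if start > end:
            continue  # exclude windows that start after they end
        cleaned.append((uid, category, start, end))
    return cleaned


raw = [
    ("p1", "GP", date(2020, 1, 1), None),               # open-ended window
    ("p1", "GP", date(2020, 5, 1), date(2020, 4, 1)),   # start after end
]
cleaned = prepare_windows(raw, replace_missing_end_date=date(2020, 12, 31))
print(cleaned)
```

Here the open-ended window is closed at the replacement date and the inverted window is dropped.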

Main parameters

  • dataset: name of the dataset
  • id: variable containing the identifier of the unit of observation
  • start_date: variable containing the start date
  • end_date: variable containing the end date
  • category (optional): categorical variable
  • replace_missing_end_date: (optional). When specified, it contains a date to replace end_date when it is missing.
  • overlap: (optional) default FALSE. If TRUE, overlaps of pairs of categories are processed as well.
  • only_overlaps: (optional) if only_overlaps is TRUE, skip the calculation of the spells
  • dataset_overlap: (optional) if overlap is TRUE, dataset_overlap contains the name under which the function must store the overlap dataset at the end of the execution
  • gap_allowed: the number of days allowed between two consecutive time windows for them to be merged into the same spell

Structure of output data

  • Output 1 (returned as output of the function):
    • id: variable containing the identifier of the unit of observation
    • spell_start_date: variable containing the start date
    • spell_end_date: variable containing the end date
    • category: categorical variable
    • spell_num: number of spell (the first has number 1)
  • Output 2 (assigned in memory to the label dataset_overlap):
    • id: variable containing the identifier of the unit of observation
    • spell_start_date: variable containing the start date
    • spell_end_date: variable containing the end date
    • category: categorical variable containing pairs of categories
    • spell_num: number of spell (the first has number 1)

Action

  • Transform from YYMMDD int/char to date
  • Check that
    • start_date never missing
    • end_date never missing (or replace it with the default date if the option is specified)
    • end_date always >= start_date
    • in case of overlap = TRUE:
      • more than one category
  • Exclude periods with the start date after the end date
  • In case the optional parameter "category" is specified:
    • If there is more than one category, create a copy of the dataset, replace its categories with ‘_overall’, and append it to the original dataset; otherwise, replace ‘category’ with ‘_overall’
    • Sort dataset by id, category, start_date, and end_date *
  • Generate row_id as an integer row identifier for each observation period stratifying by id and category *
  • Generate lag_end_date
    • For row_id = 1 use the end_date and for row_id > 1 use the lagged end_date
    • Transform it to numeric
    • Calculate the cumulative max stratified by id and category * (this step requires a numeric variable)
    • Transform it back to date
  • Generate num_spell
    • Assign 0 if row_id > 1 and start_date <= lag_end_date + gap_allowed; otherwise assign 1
    • Calculate the cumulative sum stratified by id and category *
  • Calculate entry_spell_category and exit_spell_category as, respectively, the min of start_date and the max of end_date, stratifying by id, num_spell, and category *
  • Drop duplicated rows
  • This is the first dataset to be kept in memory (output_spells_category)
  • If overlap is TRUE:
    • Exclude records with category ‘_overall’
    • Create the list of pairs of categories
    • For each pair of categories A and B:
      • create two temporary datasets: outputA restricted to category A, and outputB restricted to category B
      • Rename entry_spell as entry_spell_category_A, exit_spell as exit_spell_category_A. Same for B
      • Perform a full join of the two datasets using the id as key
      • Keep rows that satisfy the condition: (entry_spell_category_A <= exit_spell_category_B AND exit_spell_category_A >= entry_spell_category_B) OR (entry_spell_category_B < exit_spell_category_A AND exit_spell_category_B >= entry_spell_category_A)
      • generate entry_spell as max(entry_spell_category_A, entry_spell_category_B) stratified by id
      • generate exit_spell as min(exit_spell_category_A, exit_spell_category_B) stratified by id
      • generate category as A_B
      • sort per id and entry_spell
      • generate num_spell as the enumeration of rows within id
    • Bind the dataset corresponding to each pair
    • This is the second dataset to be kept in memory (dataset_overlap)

* If the parameter "category" is not specified, this step is performed without stratifying by category
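The overlap branch of the steps above can be sketched as follows. This is a Python illustration under stated assumptions, not the R code of the function: the name overlap_spells and the tuple layout are hypothetical, the join on id is written as a nested loop, and only the standard interval-intersection test (the first half of the condition listed above) is applied.

```python
from datetime import date
from itertools import combinations


def overlap_spells(spells):
    """spells: list of (id, category, entry_spell, exit_spell), already
    disjoint within each id and category.

    Returns a list of (id, 'A_B', num_spell, entry_spell, exit_spell).
    """
    spells = [s for s in spells if s[1] != "_overall"]
    categories = sorted({s[1] for s in spells})
    result = []
    for cat_a, cat_b in combinations(categories, 2):
        side_a = [s for s in spells if s[1] == cat_a]
        side_b = [s for s in spells if s[1] == cat_b]
        pair_rows = []
        for uid_a, _, entry_a, exit_a in side_a:
            for uid_b, _, entry_b, exit_b in side_b:
                if uid_a != uid_b:
                    continue  # the join key is the id
                if entry_a <= exit_b and exit_a >= entry_b:
                    # The two spells intersect: keep the common part.
                    pair_rows.append((uid_a, max(entry_a, entry_b), min(exit_a, exit_b)))
        pair_rows.sort()  # by id and entry_spell
        counter = {}
        for uid, entry, exit_ in pair_rows:
            counter[uid] = counter.get(uid, 0) + 1  # enumerate rows within id
            result.append((uid, f"{cat_a}_{cat_b}", counter[uid], entry, exit_))
    return result


category_spells = [
    ("p1", "GP", date(2020, 1, 1), date(2020, 6, 30)),
    ("p1", "HOSP", date(2020, 6, 1), date(2020, 12, 31)),
    ("p1", "_overall", date(2020, 1, 1), date(2020, 12, 31)),
    ("p2", "GP", date(2020, 1, 1), date(2020, 1, 31)),
    ("p2", "HOSP", date(2020, 3, 1), date(2020, 3, 31)),   # no overlap with GP
]
overlaps = overlap_spells(category_spells)
print(overlaps)
```

In this example only p1 has a period in which both categories are recorded, so the result contains a single "GP_HOSP" spell covering June 2020.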
