- Drift is one of the top priorities and concerns of data scientists and machine learning engineers.
- From day 1, the data that our models utilize to infer the predictions is already different from the data on which they trained.
- Errors in data collection
- Changes in the way people behave
- Time gaps that alter what is considered a good prediction and what is not.
-
Concept drift (model drift) is formaly defined as “a change in the joint probability distribution, i.e.,
$P_t(X,y) \neq P_{t+n}(X,y).$ ” - We can decompose the joint probability
$P(X,y)$ into smaller components to better understand what changes in these components can trigger concept drift:-
Covariate shift
$P(X)$ (also known as input drift, data drift, or population drift): occurs when there are changes in the distribution of the input variables (i.e., features).- Detection:
- Covariate shift can be detected on a univariate level, also referred to as feature drift
- It may also be analyzed on a multi-variate (e.g: train the classifier to differentiate training features and inference features) level across the entire feature space distribution.
- Detection:
-
Prior probability shift
$P(y)$ (label drift, unconditional class shift, or prior probability shift): occurs when there are changes in the distribution of the class variable$y$ . -
Posterior class shift
$P(y | X)$ (concept shift, or "real concept drift") refers to changes in the relationship between the input variables and the target variables.
-
Covariate shift
- Gradual: a gradual transition will happen over time when new concepts come into play. For example, in a movie recommendation task, movies, genres, and user preferences all change gradually over time.
- Sudden: a drift can happen suddenly, for example, when a sensor is replaced by another one with a different calibration.
- Incremental: drift can also happen in a sequence of small steps, such that it is only noticed after a long period of time. An example may be when a sensor wears down and becomes less accurate over time.
- Blip: spikes or blips are basically outliers or one-off occurrences that influence the model. This may be a war or some exceptional event. Of course, there may be recurring anomalies that take place—with no clue as to when they may happen again.
- To measure the statistics distance between the tested distribution and the reference distribution of a given feature.
- Many machine learning models leverage dozens, hundreds, or even thousands of different features. In scenarios of high dimensionality, looking for drift only at the feature level will quickly lead to an overload of alerts due to the granularity of the measurement method.
- Solution:
- Model-based detection: train a classifier to differetiate between the
$X_{reference}$ assigning with class$0$ , and$X_{current}$ assigning with class$1$ . If the classifierAUC_ROC> 0.5, meaning that the classifier is able to detect the difference between$X_{reference}$ and$X_{current}$ , hence there is a multivariate drift.- Note:
$X_{reference}$ and$X_{current}$ have the same set of feature columns
- Note:
- Model-based detection: train a classifier to differetiate between the
- Method 1 - Statistics Distance: to ensure the model prediction distribution doesn’t drastically change from a baseline. The baseline can be a training production window of data or a training or validation dataset.
- This is useful when there is the delayed ground truth to compare against production model decisions.
- Method 2 - Change in feature importance
-
Statistical distance measures are defined between two distributions: reference distribution and current distribution.
Metrics Usage Symmetry Range Threshold Note Population Stability Index (PSI) #1: monitor top features for feature drift; monitor prediction output and ground truths for concept drift
#2: Numeric and CategoricalSymmetry PSI < 0.1no significant population change0.1 < PSI < 0.2moderate population changePSI > 0.2significant population changePSI bins and PSI quantiles (calculating the same metric with 2 different binning strategies) Pearson’s Chi-Square Distance To be updated Kullback–Leibler (KL) Divergence #1: One distribution has a high variance relative (training data) to small sample size (inference data)
#2: Numerical (binning required) and CategoricalAsymmetry $KL \in [0,\infty)$ When $KL=0$ , it means that the two distributions are identical.Jensen Shannon Numerical (binning required) and Categorical Symmetry Between 0 and 1 a bounded variant of KL divergence Wasserstein Distance (a.k.a. Earth Mover’s distance) Numerical Kolmogorov–Smirnov
