Drift in Machine Learning (ML)

Introduction

Drift is one of the top priorities and concerns of data scientists and machine learning engineers.
From day 1, the data that our models utilize to infer the predictions is already different from the data on which they trained.

What causes drift in ML?

Errors in data collection
Changes in the way people behave
Time gaps that alter what is considered a good prediction and what is not.

Type of ML drifts

Concept drift (model drift) is formaly defined as “a change in the joint probability distribution, i.e., $P_t(X,y) \neq P_{t+n}(X,y).$”
We can decompose the joint probability $P(X,y)$ into smaller components to better understand what changes in these components can trigger concept drift:
- Covariate shift $P(X)$ (also known as input drift, data drift, or population drift): occurs when there are changes in the distribution of the input variables (i.e., features).
  - Detection:
    - Covariate shift can be detected on a univariate level, also referred to as feature drift
    - It may also be analyzed on a multi-variate (e.g: train the classifier to differentiate training features and inference features) level across the entire feature space distribution.
- Prior probability shift $P(y)$ (label drift, unconditional class shift, or prior probability shift): occurs when there are changes in the distribution of the class variable $y$.
- Posterior class shift $P(y | X)$ (concept shift, or "real concept drift") refers to changes in the relationship between the input variables and the target variables.

Drift Patterns

Gradual: a gradual transition will happen over time when new concepts come into play. For example, in a movie recommendation task, movies, genres, and user preferences all change gradually over time.
Sudden: a drift can happen suddenly, for example, when a sensor is replaced by another one with a different calibration.
Incremental: drift can also happen in a sequence of small steps, such that it is only noticed after a long period of time. An example may be when a sensor wears down and becomes less accurate over time.
Blip: spikes or blips are basically outliers or one-off occurrences that influence the model. This may be a war or some exceptional event. Of course, there may be recurring anomalies that take place—with no clue as to when they may happen again.

Data Drift

Univariate Drift Detection

To measure the statistics distance between the tested distribution and the reference distribution of a given feature.

Multivariate Drift Detection

Many machine learning models leverage dozens, hundreds, or even thousands of different features. In scenarios of high dimensionality, looking for drift only at the feature level will quickly lead to an overload of alerts due to the granularity of the measurement method.
Solution:
- Model-based detection: train a classifier to differetiate between the $X_{reference}$ assigning with class $0$, and $X_{current}$ assigning with class $1$. If the classifier AUC_ROC > 0.5, meaning that the classifier is able to detect the difference between $X_{reference}$ and $X_{current}$, hence there is a multivariate drift.
  - Note: $X_{reference}$ and $X_{current}$ have the same set of feature columns

Concept Drift Detection

Method 1 - Statistics Distance: to ensure the model prediction distribution doesn’t drastically change from a baseline. The baseline can be a training production window of data or a training or validation dataset.
- This is useful when there is the delayed ground truth to compare against production model decisions.
Method 2 - Change in feature importance

Statistical Distance Metrics for Drift Detection

Statistical distance measures are defined between two distributions: reference distribution and current distribution.

Metrics	Usage	Symmetry	Range	Threshold	Note
Population Stability Index (PSI)	#1: monitor top features for feature drift; monitor prediction output and ground truths for concept drift #2: Numeric and Categorical	Symmetry		`PSI < 0.1` no significant population change `0.1 < PSI < 0.2` moderate population change `PSI > 0.2` significant population change	PSI bins and PSI quantiles (calculating the same metric with 2 different binning strategies)
Pearson’s Chi-Square Distance	To be updated
Kullback–Leibler (KL) Divergence	#1: One distribution has a high variance relative (training data) to small sample size (inference data) #2: Numerical (binning required) and Categorical	Asymmetry	$KL \in [0,\infty)$	When $KL=0$, it means that the two distributions are identical.
Jensen Shannon	Numerical (binning required) and Categorical	Symmetry	Between 0 and 1		a bounded variant of KL divergence
Wasserstein Distance (a.k.a. Earth Mover’s distance)	Numerical
Kolmogorov–Smirnov

Reference

Using Statistical Distances for Machine Learning Observability

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Drift in Machine Learning (ML)

Introduction

What causes drift in ML?

Type of ML drifts

Drift Patterns

Data Drift

Univariate Drift Detection

Multivariate Drift Detection

Concept Drift Detection

Statistical Distance Metrics for Drift Detection

Reference

FilesExpand file tree

drift_in_ml.md

Latest commit

History

drift_in_ml.md

File metadata and controls

Drift in Machine Learning (ML)

Introduction

What causes drift in ML?

Type of ML drifts

Drift Patterns

Data Drift

Univariate Drift Detection

Multivariate Drift Detection

Concept Drift Detection

Statistical Distance Metrics for Drift Detection

Reference