Customize Data Drift
How to change data drift detection methods and conditions.
All Metrics and Presets that evaluate shift in data distributions use the default Data Drift algorithm. It automatically selects the drift detection method based on the column type (text, categorical, numerical) and volume.
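As a rough illustration of how such a selection could work, the sketch below encodes the defaults listed in the method tables on this page. The function name pick_default_method and its arguments are hypothetical, and the actual algorithm may apply extra rules that are not shown here.

```python
def pick_default_method(column_type: str, n_objects: int, n_labels: int = 2) -> str:
    """Return a default drift method name for a column (illustrative only).

    n_labels is only used for categorical columns; 2 means a binary column.
    """
    small = n_objects <= 1000
    if column_type == "numerical":
        return "ks" if small else "wasserstein"
    if column_type == "categorical":
        if not small:
            return "jensenshannon"
        # Binary columns use the Z-test, other categorical columns Chi-Square.
        return "z" if n_labels <= 2 else "chisquare"
    if column_type == "text":
        return "perc_text_content_drift" if small else "abs_text_content_drift"
    raise ValueError(f"unknown column type: {column_type}")


print(pick_default_method("numerical", n_objects=5000))
print(pick_default_method("categorical", n_objects=500, n_labels=4))
```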
You can override the defaults by passing a custom parameter to the chosen Metric or Preset. You can modify the drift detection method (choose from 20+ available), thresholds, or both.
You can also implement fully custom drift detection methods.
Pre-requisites:
- You know how to use Data Definition to map column types.
Data drift parameters
Setting conditions for data drift works differently from the usual Test API (with gt, lt, etc.). This accounts for nuances like the varying role of thresholds across drift detection methods, where “greater” can be better or worse depending on the method.
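To make the point about threshold direction concrete, here is a minimal sketch that groups the tabular methods listed on this page by what their score means. The helper is_drift and the two sets are hypothetical names used only for illustration. They are not part of the library.

```python
# Methods whose score is a p-value: a smaller score signals drift.
P_VALUE_METHODS = {
    "ks", "chisquare", "z", "anderson", "fisher_exact", "cramer_von_mises",
    "g-test", "mannw", "es", "t_test", "empirical_mmd", "TVD",
}
# Methods whose score is a distance or divergence: a larger score signals drift.
DISTANCE_METHODS = {"wasserstein", "kl_div", "psi", "jensenshannon", "hellinger", "ed"}


def is_drift(method: str, score: float, threshold: float) -> bool:
    """Apply the comparison direction that matches the method's score type."""
    if method in P_VALUE_METHODS:
        return score < threshold
    if method in DISTANCE_METHODS:
        return score >= threshold
    raise ValueError(f"unknown method: {method}")


# The same score and threshold lead to opposite conclusions.
print(is_drift("ks", score=0.2, threshold=0.1))   # p-value above the threshold
print(is_drift("psi", score=0.2, threshold=0.1))  # distance above the threshold
```

This is why a single gt or lt condition would not carry over between methods.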
Dataset-level
Dataset drift share. You can set the share of drifting columns that signals dataset drift (default: 0.5) in the relevant Metrics or Presets. For example, to set it at 70%, pass drift_share=0.7.
This will detect dataset drift if over 70% of columns are drifting, using defaults for each column.
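The dataset-level rule can be sketched in plain Python as follows. This is not the library's code: the function dataset_drift is a hypothetical helper, and the strict "over" comparison follows the wording above, so check the library for how the exact boundary is treated.

```python
from typing import Mapping


def dataset_drift(column_drift: Mapping[str, bool], drift_share: float = 0.5) -> bool:
    """Flag dataset drift when the share of drifting columns exceeds drift_share."""
    if not column_drift:
        raise ValueError("at least one column is required")
    share = sum(column_drift.values()) / len(column_drift)
    # Assumption: strictly "over" the threshold, as described in the text.
    return share > drift_share


# Hypothetical per-column results: 3 of 4 columns drift (75%).
flags = {"age": True, "income": True, "city": True, "plan": False}
print(dataset_drift(flags, drift_share=0.7))
```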
Drift methods. You can also specify the drift detection methods used on the column level. For example, to use PSI (Population Stability Index) for all columns in the dataset, pass stattest="psi" together with drift_share=0.7.
This will check if over 70% of columns are drifting, using the PSI method with default thresholds.
See all available methods in the table below.
Drift thresholds. You can set thresholds for each method. For example, to use PSI with a threshold of 0.3 for categorical columns, pass cat_stattest="psi" and cat_stattest_threshold=0.3.
In this case, if PSI is ≥ 0.3 for any categorical column, drift will be detected for that column. The rest of the checks will use defaults: default methods for numerical and text columns (if present), and 50% as the drift_share threshold.
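To show what a PSI check against a 0.3 threshold involves, here is a minimal pure-Python sketch for a categorical column. It uses the standard PSI formula, the sum over categories of (current share − reference share) × ln(current share / reference share). The function psi and the eps smoothing for empty categories are assumptions made for this illustration and may differ from the library's implementation.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def psi(reference: Sequence[Hashable], current: Sequence[Hashable], eps: float = 1e-4) -> float:
    """Population Stability Index between two non-empty categorical samples."""
    ref_counts, cur_counts = Counter(reference), Counter(current)
    score = 0.0
    for category in set(ref_counts) | set(cur_counts):
        # eps replaces empty shares so the logarithm stays defined (an assumption).
        ref_share = max(ref_counts[category] / len(reference), eps)
        cur_share = max(cur_counts[category] / len(current), eps)
        score += (cur_share - ref_share) * math.log(cur_share / ref_share)
    return score


# Hypothetical column: the share of "premium" drops from 50% to 10%.
reference = ["basic"] * 50 + ["premium"] * 50
current = ["basic"] * 90 + ["premium"] * 10

score = psi(reference, current)
drift_detected = score >= 0.3  # threshold from the example above
print(score, drift_detected)
```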
Column-level
For column-level metrics, such as ValueDrift(), you can set the drift method and threshold directly for each column using the stattest and stattest_threshold parameters.
All parameters
Use the following parameters to pass chosen drift methods. See methods and their defaults below.
Parameter | Description | Applies To |
---|---|---|
stattest | Defines the drift detection method for a given column (if one column is tested), or all columns in the dataset (if multiple columns are tested and the method can apply to all columns). | ValueDrift(), DriftedColumnsCount(), DataDriftPreset() |
stattest_threshold | Sets the drift threshold in a given column or all columns. The threshold meaning varies based on the drift detection method, e.g., it can be the value of a distance metric or a p-value of a statistical test. | ValueDrift(), DriftedColumnsCount(), DataDriftPreset() |
drift_share | Defines the share of drifting columns as a condition for Dataset Drift. Default: 0.5 | DriftedColumnsCount(), DataDriftPreset() |
cat_stattest, cat_stattest_threshold | Sets the drift method and/or threshold for all categorical columns. | DriftedColumnsCount(), DataDriftPreset() |
num_stattest, num_stattest_threshold | Sets the drift method and/or threshold for all numerical columns. | DriftedColumnsCount(), DataDriftPreset() |
per_column_stattest, per_column_stattest_threshold | Sets the drift method and/or threshold for the listed columns (accepts a dictionary). | DriftedColumnsCount(), DataDriftPreset() |
text_stattest, text_stattest_threshold | Sets the drift method and/or threshold for all text columns. | DriftedColumnsCount(), DataDriftPreset() |
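When several of these parameters are set at once, it helps to think about which one applies to a given column. The sketch below shows one plausible resolution order (per-column first, then per-type, then the global stattest). This precedence is an assumption, not something stated on this page, and resolve_method is a hypothetical helper, so verify the actual behavior before relying on it.

```python
from typing import Dict, Optional


def resolve_method(
    column: str,
    column_type: str,  # "cat", "num" or "text"
    stattest: Optional[str] = None,
    cat_stattest: Optional[str] = None,
    num_stattest: Optional[str] = None,
    text_stattest: Optional[str] = None,
    per_column_stattest: Optional[Dict[str, str]] = None,
) -> Optional[str]:
    """Pick the method for one column; None means 'fall back to the default'."""
    # Assumed precedence: most specific setting wins.
    if per_column_stattest and column in per_column_stattest:
        return per_column_stattest[column]
    by_type = {"cat": cat_stattest, "num": num_stattest, "text": text_stattest}
    return by_type.get(column_type) or stattest


print(resolve_method("Salary", "num", stattest="psi", num_stattest="wasserstein"))
```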
Data drift detection methods
Tabular data
The following methods apply to tabular data: numerical or categorical columns in data definition. Pass them using the stattest (or num_stattest, etc.) parameter.
StatTest | Applicable to | Drift score |
---|---|---|
ks: Kolmogorov–Smirnov (K-S) test | tabular data; only numerical; default method for numerical data if ≤ 1000 objects | returns p_value; drift detected when p_value < threshold; default threshold: 0.05 |
chisquare: Chi-Square test | tabular data; only categorical; default method for categorical with > 2 labels if ≤ 1000 objects | returns p_value; drift detected when p_value < threshold; default threshold: 0.05 |
z: Z-test | tabular data; only categorical; default method for binary data if ≤ 1000 objects | returns p_value; drift detected when p_value < threshold; default threshold: 0.05 |
wasserstein: Wasserstein distance (normed) | tabular data; only numerical; default method for numerical data if > 1000 objects | returns distance; drift detected when distance ≥ threshold; default threshold: 0.1 |
kl_div: Kullback-Leibler divergence | tabular data; numerical and categorical | returns divergence; drift detected when divergence ≥ threshold; default threshold: 0.1 |
psi: Population Stability Index (PSI) | tabular data; numerical and categorical | returns psi_value; drift detected when psi_value ≥ threshold; default threshold: 0.1 |
jensenshannon: Jensen-Shannon distance | tabular data; numerical and categorical; default method for categorical if > 1000 objects | returns distance; drift detected when distance ≥ threshold; default threshold: 0.1 |
anderson: Anderson-Darling test | tabular data; only numerical | returns p_value; drift detected when p_value < threshold; default threshold: 0.05 |
fisher_exact: Fisher’s Exact test | tabular data; only categorical | returns p_value; drift detected when p_value < threshold; default threshold: 0.05 |
cramer_von_mises: Cramer-Von-Mises test | tabular data; only numerical | returns p_value; drift detected when p_value < threshold; default threshold: 0.05 |
g-test: G-test | tabular data; only categorical | returns p_value; drift detected when p_value < threshold; default threshold: 0.05 |
hellinger: Hellinger Distance (normed) | tabular data; numerical and categorical | returns distance; drift detected when distance ≥ threshold; default threshold: 0.1 |
mannw: Mann-Whitney U-rank test | tabular data; only numerical | returns p_value; drift detected when p_value < threshold; default threshold: 0.05 |
ed: Energy distance | tabular data; only numerical | returns distance; drift detected when distance ≥ threshold; default threshold: 0.1 |
es: Epps-Singleton test | tabular data; only numerical | returns p_value; drift detected when p_value < threshold; default threshold: 0.05 |
t_test: T-Test | tabular data; only numerical | returns p_value; drift detected when p_value < threshold; default threshold: 0.05 |
empirical_mmd: Empirical-MMD | tabular data; only numerical | returns p_value; drift detected when p_value < threshold; default threshold: 0.05 |
TVD: Total-Variation-Distance | tabular data; only categorical | returns p_value; drift detected when p_value < threshold; default threshold: 0.05 |
Text data
Text drift detection applies to columns with raw text data, as specified in data definition. Pass them using the stattest (or text_stattest) parameter.
StatTest | Description | Drift score |
---|---|---|
perc_text_content_drift: Text content drift (domain classifier, with statistical hypothesis testing) | Applies only to text data. Trains a classifier model to distinguish between text in “current” and “reference” datasets. Default for text data when ≤ 1000 objects. | |
abs_text_content_drift: Text content drift (domain classifier) | Applies only to text data. Trains a classifier model to distinguish between text in “current” and “reference” datasets. Default for text data when > 1000 objects. | |
Text descriptors drift. If you work with raw text data, you can also check for distribution drift in text descriptors (such as text length). To use this method, first compute the selected text descriptors. Then, use numerical or categorical drift detection methods as usual.
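The two steps can be sketched without any library: compute a descriptor (here, text length), then compare the resulting numerical samples. The sketch uses a simplified 1-D Wasserstein distance that only handles samples of equal size and divides by the reference standard deviation. Both the helper names and this normalization are assumptions for illustration, not the library's implementation.

```python
import statistics
from typing import List, Sequence


def text_length(texts: Sequence[str]) -> List[int]:
    """Descriptor step: turn each text into its character count."""
    return [len(text) for text in texts]


def normed_wasserstein(reference: Sequence[float], current: Sequence[float]) -> float:
    """1-D Wasserstein distance for equal-size samples, scaled by the reference std."""
    if len(reference) != len(current):
        raise ValueError("this sketch only handles samples of equal size")
    # For equal-size samples the distance is the mean gap between sorted values.
    distance = statistics.fmean(
        abs(r - c) for r, c in zip(sorted(reference), sorted(current))
    )
    return distance / max(statistics.pstdev(reference), 1e-3)


# Hypothetical data: current texts are noticeably longer.
reference_texts = ["hi", "hello", "thanks", "bye"]
current_texts = ["hello there", "many thanks!", "see you later", "good morning"]

score = normed_wasserstein(text_length(reference_texts), text_length(current_texts))
drift_detected = score >= 0.1  # default threshold listed for wasserstein
print(score, drift_detected)
```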
Add a custom method
If you do not find a suitable drift detection method, you can implement a custom function.
We recommend writing a specific instance of the StatTest class for that function. It takes the following parameters:
Parameter | Type | Description |
---|---|---|
name | str | A short name used to reference the Stat Test from the options (registered globally). |
display_name | str | A long name displayed in the Report. |
func | Callable | The StatTest function. |
allowed_feature_types | List[str] | The list of allowed feature types for this function (cat, num). |
The StatTest function itself should match the signature (reference_data: pd.Series, current_data: pd.Series, feature_type: str, threshold: float) -> Tuple[float, bool].
Accepts:
- reference_data: pd.Series - The reference data series.
- current_data: pd.Series - The current data series to compare.
- feature_type: str - The type of feature being analyzed.
- threshold: float - The test threshold for drift detection.
Returns:
- score: float - The Stat Test score (actual value).
- drift_detected: bool - Indicates whether drift is detected at the given threshold.
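As an illustration of a function with this shape, here is a minimal sketch of a custom drift score: the absolute shift in the mean, measured in reference standard deviations. The name mean_shift_stattest and the scoring rule are hypothetical. The function is typed for plain sequences so that it runs without pandas, whereas in the library the inputs would be pd.Series.

```python
import statistics
from typing import Sequence, Tuple


def mean_shift_stattest(
    reference_data: Sequence[float],
    current_data: Sequence[float],
    feature_type: str,
    threshold: float,
) -> Tuple[float, bool]:
    """Return (score, drift_detected) for a numerical column."""
    # Guard against a zero standard deviation in constant reference data.
    ref_std = max(statistics.pstdev(reference_data), 1e-9)
    shift = abs(statistics.fmean(current_data) - statistics.fmean(reference_data))
    score = shift / ref_std
    # Distance-style score: drift when the score reaches the threshold.
    return score, score >= threshold


# Hypothetical data: the current mean is 5 units above the reference mean.
reference = [10, 12, 11, 13, 9]
current = [15, 17, 16, 18, 14]

score, drift_detected = mean_shift_stattest(reference, current, "num", threshold=1.0)
print(score, drift_detected)
```

A function like this would then be passed as func when creating the StatTest instance, with allowed_feature_types limited to num.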