For an intro, read Core Concepts and check quickstarts for LLMs or ML.

Text Evals

Use to summarize results of output-level text or LLM evals.

Data definition:

Metric | Description | Parameters | Test Defaults
TextEvals()
Optional:
  • columns
As in Metrics included in ValueStats.
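
For illustration, a minimal sketch of running this Preset. The import paths, the DataDefinition(text_columns=...) field, and the Report.run() call are assumptions based on typical usage of the evidently API and may differ in your installed version:

import pandas as pd
from evidently import Report, Dataset, DataDefinition   # assumed import paths
from evidently.presets import TextEvals                  # assumed import path

data = pd.DataFrame({"response": ["Hello, how can I help?", "Please retry later.", "Done!"]})

# Map the column type so the text column is summarized (field name is an assumption)
dataset = Dataset.from_pandas(data, data_definition=DataDefinition(text_columns=["response"]))

report = Report([TextEvals()])   # or TextEvals(columns=["response"]) to limit the columns
my_eval = report.run(dataset)    # pass a second (reference) dataset to enable comparison defaults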

Columns

Use to aggregate descriptor results or check data quality at the column level.

Data definition: map column types.

Value stats

Descriptive statistics.

Metric | Description | Parameters | Test Defaults
ValueStats()
  • Small Preset, column-level.
  • Computes various descriptive stats (min, max, mean, quantiles, most common, etc.)
  • Returns different stats based on the column type (text, categorical, numerical, datetime).
Required:
  • column
Optional:
  • No reference. As in individual Metrics.
  • With reference. As in individual Metrics.
MinValue()
  • Column-level.
  • Returns min value for a given numerical column.
  • Metric result: value.
Required:
  • column
Optional:
  • No reference. N/A.
  • With reference. Fails if the min value differs by more than 10% (+/-).
StdValue()
  • Column-level.
  • Computes the standard deviation of a given numerical column.
  • Metric result: value.
Required:
  • column
Optional:
  • No reference. N/A.
  • With reference. Fails if the standard deviation differs by more than 10% (+/-).
MeanValue()
  • Column-level.
  • Computes the mean value of a given numerical column.
  • Metric result: value.
Required:
  • column
Optional:
  • No reference. N/A.
  • With reference. Fails if the mean value differs by more than 10%.
MaxValue()
  • Column-level.
  • Computes the max value of a given numerical column.
  • Metric result: value.
Required:
  • column
Optional:
  • No reference. N/A.
  • With reference. Fails if the max value is higher than in the reference.
MedianValue()
  • Column-level.
  • Computes the median value of a given numerical column.
  • Metric result: value.
Required:
  • column
Optional:
  • No reference. N/A.
  • With reference. Fails if the median value differs by more than 10% (+/-).
QuantileValue()
  • Column-level.
  • Computes the quantile value of a given numerical column.
  • Defaults to 0.5 if no quantile is specified.
  • Metric result: value.
Required:
  • column
Optional:
  • No reference. N/A.
  • With reference. Fails if quantile value differs by more than 10% (+/-).
CategoryCount()

Example:
CategoryCount(
    column="city",
    category="NY")
  • Column-level.
  • Counts occurrences of the specified category.
  • Metric result: count, share.
Required:
  • column
  • category
Optional:
  • No reference. N/A.
  • With reference. Fails if the specified category is not present.
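
For illustration, a sketch of combining several column-level stats in one Report. The evidently import paths, the quantile parameter name, and the tests=[gte(18)] custom condition helper are assumptions and may differ in your version:

import pandas as pd
from evidently import Report                  # assumed import path
from evidently.metrics import ValueStats, MinValue, QuantileValue, CategoryCount
from evidently.tests import gte               # assumed helper for custom test conditions

data = pd.DataFrame({"age": [21, 34, 45, 52], "city": ["NY", "NY", "Lon", "Ber"]})

report = Report([
    ValueStats(column="age"),                    # descriptive stats for one column
    MinValue(column="age", tests=[gte(18)]),     # custom condition instead of the default
    QuantileValue(column="age", quantile=0.75),  # defaults to 0.5 if not set
    CategoryCount(column="city", category="NY"),
])
my_eval = report.run(data)   # add a reference dataset as the second argument to use the +/-10% defaults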

Column data quality

Column-level data quality metrics.

Data definition: map column types.

Metric | Description | Parameters | Test Defaults
MissingValueCount()
  • Column-level.
  • Counts the number and share of missing values.
  • Metric result: count, share.
Required:
  • column
Optional:
  • No reference: Fails if there are missing values.
  • With reference: Fails if share of missing values is >10% higher.
NewCategoriesCount() (Coming soon)
  • Column-level.
  • Counts new categories compared to reference (reference required).
  • Metric result: count, share.
Required:
  • column
Optional:
  • With reference: Expect 0.
MissingCategoriesCount() (Coming soon)
  • Column-level.
  • Counts missing categories compared to reference.
  • Metric result: count, share.
Required:
  • column
Optional:
  • With reference: Expect 0.
InRangeValueCount()

Example:
InRangeValueCount(
    column="age",
    left=1, right=18)
  • Column-level.
  • Counts the number and share of values in the set range.
  • Metric result: count, share.
Required:
  • column
  • left
  • right
Optional:
  • No reference: N/A.
  • With reference: Fails if column contains values out of the min-max reference range.
OutRangeValueCount()
  • Column-level.
  • Counts the number and share of values out of the set range.
  • Metric result: count, share.
Required:
  • column
  • left
  • right
Optional:
  • No reference: N/A.
  • With reference: Fails if any value is out of min-max reference range.
InListValueCount()
  • Column-level.
  • Counts the number and share of values in the set list.
  • Metric result: count, share.
Required:
  • column
  • values
Optional:
  • No reference: N/A.
  • With reference: Fails if any value is out of the list.
OutListValueCount()

Example:
OutListValueCount(
    column="city",
    values=["Lon", "NY"])
  • Column-level.
  • Counts the number and share of values out of the set list.
  • Metric result: count, share.
Required:
  • column
  • values
Optional:
  • No reference: N/A.
  • With reference: Fails if any value is out of the list.
UniqueValueCount()
  • Column-level.
  • Counts the number and share of unique values.
  • Metric result: values (dict with count, share).
Required:
  • column
Optional:
  • No reference: N/A.
  • With reference: Fails if the share of unique values differs by >10% (+/-).
MostCommonValueCount() (Coming soon)
  • Column-level.
  • Identifies the most common value and provides its count/share.
  • Metric result: value: count, share.
Required:
  • column
Optional:
  • No reference: Fails if most common value share is ≥80%.
  • With reference: Fails if most common value share differs by >10% (+/-).
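
For illustration, a sketch of common column-level data quality checks. The evidently import paths and the Report.run() call are assumptions; the metric parameters (column, left, right, values) are as listed above:

import pandas as pd
from evidently import Report   # assumed import path
from evidently.metrics import (
    MissingValueCount, InRangeValueCount, OutListValueCount, UniqueValueCount)

data = pd.DataFrame({"age": [5, 17, 25, None], "city": ["NY", "Lon", "NY", "Par"]})

report = Report([
    MissingValueCount(column="age"),                     # count and share of missing values
    InRangeValueCount(column="age", left=1, right=18),   # values inside the set range
    OutListValueCount(column="city", values=["Lon", "NY"]),
    UniqueValueCount(column="city"),
])
my_eval = report.run(data)   # pass a reference dataset as the second argument to use the reference-based defaults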

Dataset

Use for exploratory data analysis and data quality checks.

Data definition: map column types, ID and timestamp if available.

Dataset stats

Descriptive statistics.

Metric | Description | Parameters | Test Defaults
DataSummaryPreset()
  • Large Preset.
  • Combines DatasetStats and ValueStats for all or specified columns.
  • Metric result: all included Metrics.
  • See Preset page.
Optional:
  • columns
As in individual Metrics.
DatasetStats()
  • Small preset.
  • Dataset-level.
  • Calculates descriptive dataset stats, including columns by type, rows, missing values, empty columns, etc.
  • Metric result: all included Metrics.
None
  • No reference: As in included Metrics.
  • With reference: As in included Metrics.
RowCount()
  • Dataset-level.
  • Counts the number of rows.
  • Metric result: value.
Optional:
  • No reference: N/A.
  • With reference: Fails if row count differs by >10%.
ColumnCount()
  • Dataset-level.
  • Counts the number of columns.
  • Metric result: value.
Optional:
  • No reference: N/A.
  • With reference: Fails if not equal to reference.
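
For illustration, a sketch of dataset-level stats with a reference dataset, which enables the comparison-based defaults (e.g. RowCount fails if the row count differs by >10%). The evidently import paths and the Report.run() signature are assumptions:

import pandas as pd
from evidently import Report                      # assumed import path
from evidently.presets import DataSummaryPreset   # assumed import path
from evidently.metrics import RowCount, ColumnCount

current = pd.DataFrame({"age": [21, 34, 45], "city": ["NY", "Lon", "NY"]})
reference = pd.DataFrame({"age": [20, 30, 40, 50], "city": ["NY", "NY", "Lon", "Ber"]})

report = Report([
    DataSummaryPreset(),   # DatasetStats + ValueStats for all columns
    RowCount(),
    ColumnCount(),
])
my_eval = report.run(current, reference)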

Dataset data quality

Dataset-level data quality metrics.

Data definition: map column types, ID and timestamp if available.

Metric | Description | Parameters | Test Defaults
ConstantColumnsCount()
  • Dataset-level.
  • Counts the number of constant columns.
  • Metric result: value.
Optional:
  • No reference: Fails if there is at least one constant column.
  • With reference: Fails if count is higher than in reference.
EmptyRowsCount()
  • Dataset-level.
  • Counts the number of empty rows.
  • Metric result: value.
Optional:
  • No reference: Fails if there is at least one empty row.
  • With reference: Fails if share differs by >10%.
EmptyColumnsCount()
  • Dataset-level.
  • Counts the number of empty columns.
  • Metric result: value.
Optional:
  • No reference: Fails if there is at least one empty column.
  • With reference: Fails if count is higher than in reference.
DuplicatedRowCount()
  • Dataset-level.
  • Counts the number of duplicated rows.
  • Metric result: value.
Optional:
  • No reference: Fails if there is at least one duplicated row.
  • With reference: Fails if share differs by >10% (+/-).
DuplicatedColumnsCount()
  • Dataset-level.
  • Counts the number of duplicated columns.
  • Metric result: value.
Optional:
  • No reference: Fails if there is at least one duplicated column.
  • With reference: Fails if count is higher than in reference.
DatasetMissingValueCount()
  • Dataset-level.
  • Calculates the number and share of missing values.
  • Displays the number of missing values per column.
  • Metric result: value.
Required:
  • columns
Optional:
  • No reference: Fails if there are missing values.
  • With reference: Fails if the share of missing values differs from reference by >10% (+/-).
AlmostEmptyColumnCount() (Coming soon)
  • Dataset-level.
  • Counts almost empty columns (95% empty).
  • Metric result: value.
Optional:
  • No reference: Fails if there is at least one almost empty column.
  • With reference: Fails if count is higher than in reference.
AlmostConstantColumnsCount()
  • Dataset-level.
  • Counts almost constant columns (95% identical values).
  • Metric result: value.
Optional:
  • No reference: Fails if there is at least one almost constant column.
  • With reference: Fails if count is higher than in reference.
RowsWithMissingValuesCount() (Coming soon)
  • Dataset-level.
  • Counts rows with missing values.
  • Metric result: value.
Optional:
  • No reference: Fails if there is at least one row with missing values.
  • With reference: Fails if share differs by >10% (+/-)
ColumnsWithMissingValuesCount()
  • Dataset-level.
  • Counts columns with missing values.
  • Metric result: value.
Optional:
  • No reference: Fails if there is at least one column with missing values.
  • With reference: Fails if count is higher than in reference.
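
For illustration, a sketch of dataset-level data quality checks without a reference, so the no-reference defaults above apply (e.g. fail on any constant, empty, or duplicated entry). The evidently import paths and the include_tests flag are assumptions:

import pandas as pd
from evidently import Report   # assumed import path
from evidently.metrics import (
    ConstantColumnsCount, EmptyColumnsCount, DuplicatedRowCount, DatasetMissingValueCount)

data = pd.DataFrame({"a": [1, 1, 1], "b": [None, None, None], "c": [1, 2, 2]})

report = Report([
    ConstantColumnsCount(),       # column "a" is constant
    EmptyColumnsCount(),          # column "b" is empty
    DuplicatedRowCount(),         # the last two rows are duplicates
    DatasetMissingValueCount(columns=["a", "b", "c"]),
], include_tests=True)            # assumed flag to attach the default test conditions
my_eval = report.run(data)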

Data Drift

Use to detect distribution drift for text, tabular, or embeddings data, or over computed text descriptors. 20+ drift methods are listed separately: text and tabular, embeddings.

Data definition: map column types, ID and timestamp if available.

Metric explainers. Check data drift metrics explainers.
Metric | Description | Parameters | Test Defaults
DataDriftPreset()
  • Large Preset.
  • Requires reference.
  • Calculates data drift for all or set columns.
  • Uses the default or set method.
  • Returns drift score for each column.
  • Visualizes all distributions.
  • Metric result: all Metrics.
  • Preset page.
Optional:
  • columns
  • method
  • cat_method
  • num_method
  • per_column_method
  • threshold
  • cat_threshold
  • num_threshold
  • per_column_threshold
See drift options.
  • With reference: Data drift defaults, depending on column type. See drift methods.
DriftedColumnsCount()
  • Dataset-level.
  • Requires reference.
  • Calculates the number and share of drifted columns in the dataset.
  • Each column is tested for drift using the default algorithm or set method.
  • Returns only the total number of drifted columns.
  • Metric result: count, share.
Optional:
  • columns
  • method
  • cat_method
  • num_method
  • per_column_method
  • threshold
  • cat_threshold
  • num_threshold
  • per_column_threshold
See drift options.
  • With reference: Fails if 50% of columns are drifted.
ValueDrift()
  • Column-level.
  • Requires reference.
  • Calculates data drift for a defined column (num, cat, text).
  • Visualizes distributions.
  • Metric result: value.
Required:
  • column
Optional:
  • method
  • threshold
See drift options.
  • With reference: Data drift defaults, depending on column type. See drift methods.
MultivariateDrift() (Coming soon)
  • Dataset-level.
  • Requires reference.
  • Computes a single dataset drift score.
  • Default method: share of drifted columns.
  • Metric result: value.
Optional:
  • columns
  • method
See drift options.
  • With reference: Defaults for method. See methods.
EmbeddingDrift() (Coming soon)
  • Column-level.
  • Requires reference.
  • Calculates data drift for embeddings.
  • Requires embedding columns set in data definition.
  • Metric result: value.
Required:
  • embeddings
  • method
See embedding drift options.
  • With reference: Defaults for method. See methods.
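
For illustration, a sketch of drift checks with overridden methods and thresholds. The evidently import paths and the drift method names ("wasserstein", "psi", "ks") are assumptions based on the drift options referenced above; the parameter names (num_method, cat_method, method, threshold) are as listed:

import numpy as np
import pandas as pd
from evidently import Report                     # assumed import path
from evidently.presets import DataDriftPreset    # assumed import path
from evidently.metrics import DriftedColumnsCount, ValueDrift

rng = np.random.default_rng(0)
reference = pd.DataFrame({"age": rng.normal(40, 5, 500), "city": rng.choice(["NY", "Lon"], 500)})
current = pd.DataFrame({"age": rng.normal(55, 5, 500), "city": rng.choice(["NY", "Ber"], 500)})

report = Report([
    DataDriftPreset(num_method="wasserstein", cat_method="psi"),  # per-type drift methods
    DriftedColumnsCount(),                                        # share of drifted columns
    ValueDrift(column="age", method="ks", threshold=0.05),        # single-column drift check
])
my_eval = report.run(current, reference)   # reference data is required for drift metrics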

Correlations

Use for exploratory data analysis, drift monitoring (correlation changes) or to check alignment between scores (e.g. LLM-based descriptors against human labels).

Data definition: map column types.

Metric | Description | Parameters | Test Defaults
DatasetCorrelations() (Coming soon)
  • Calculates the correlations between all or set columns in the dataset.
  • Supported methods: Pearson, Spearman, Kendall, Cramer_V.
Optional: N/A
Correlation() (Coming soon)
  • Calculates the correlation between two defined columns.
Required:
  • column_x
  • column_y
Optional:
  • method (default: pearson, available: pearson, spearman, kendall, cramer_v)
  • Test conditions
N/A
CorrelationChanges() (Coming soon)
  • Dataset-level.
  • Reference required.
  • Checks the number of correlation violations (significant changes in correlation strength between columns) across all or set columns.
Optional:
  • columns
  • method (default: pearson, available: pearson, spearman, kendall, cramer_v)
  • corr_diff (default: 0.25)
  • Test conditions
  • With reference: Fails if at least one correlation violation is detected.

Classification

Use to evaluate quality on a classification task (probabilistic, non-probabilistic, binary and multi-class).

Data definition: map prediction and target columns and classification type.

General

Use for binary classification and aggregated results for multi-class.

Metric explainers. Check classification metrics explainers.
Metric | Description | Parameters | Test Defaults
ClassificationPreset()
  • Large Preset with many classification Metrics and visuals.
  • See Preset page.
  • Metric result: all Metrics.
Optional:
  • probas_threshold
As in individual Metrics.
ClassificationQuality()
  • Small Preset.
  • Summarizes quality Metrics in a single widget.
  • Metric result: all Metrics.
Optional:
  • probas_threshold
As in individual Metrics.
LabelCount() (Coming soon)
  • Distribution of predicted classes.
  • Can visualize class balance and/or probability distribution.
Required:
  • Set at least one visualization: class_balance, prob_distribution.
Optional:
N/A
Accuracy()
  • Calculates accuracy.
  • Metric result: value.
Optional:
  • No reference: Fails if lower than dummy model accuracy.
  • With reference: Fails if accuracy differs by >20%.
Precision()
  • Calculates precision.
  • Visualizations available: Confusion Matrix, PR Curve, PR Table.
  • Metric result: value.
Required:
  • Set at least one visualization: conf_matrix, pr_curve, pr_table.
Optional:
  • probas_threshold (default: None or 0.5 for probabilistic classification)
  • top_k
  • Test conditions
  • No reference: Fails if Precision is lower than the dummy model.
  • With reference: Fails if Precision differs by >20%.
Recall()
  • Calculates recall.
  • Visualizations available: Confusion Matrix, PR Curve, PR Table.
  • Metric result: value.
Required:
  • Set at least one visualization: conf_matrix, pr_curve, pr_table.
Optional:
  • No reference: Fails if lower than dummy model recall.
  • With reference: Fails if Recall differs by >20%.
F1Score()
  • Calculates F1 Score.
  • Metric result: value.
Required:
  • Set at least one visualization: conf_matrix.
Optional:
  • No reference: Fails if lower than dummy model F1.
  • With reference: Fails if F1 differs by >20%.
TPR()
  • Calculates True Positive Rate (TPR).
  • Metric result: value.
Required:
  • Set at least one visualization: pr_table.
Optional:
  • No reference: Fails if TPR is lower than the dummy model.
  • With reference: Fails if TPR differs by >20%.
TNR()
  • Calculates True Negative Rate (TNR).
  • Metric result: value.
Required:
  • Set at least one visualization: pr_table.
Optional:
  • No reference: Fails if TNR is lower than the dummy model.
  • With reference: Fails if TNR differs by >20%.
FPR()
  • Calculates False Positive Rate (FPR).
  • Metric result: value.
Required:
  • Set at least one visualization: pr_table.
Optional:
  • No reference: Fails if FPR is higher than the dummy model.
  • With reference: Fails if FPR differs by >20%.
FNR()
  • Calculates False Negative Rate (FNR).
  • Metric result: value.
Required:
  • Set at least one visualization: pr_table.
Optional:
  • No reference: Fails if FNR is higher than the dummy model.
  • With reference: Fails if FNR differs by >20%.
LogLoss()
  • Calculates Log Loss.
  • Metric result: value.
Required:
  • Set at least one visualization: pr_table.
Optional:
  • No reference: Fails if LogLoss is higher than the dummy model (equals 0.5 for a constant model).
  • With reference: Fails if LogLoss differs by >20%.
RocAUC()
  • Calculates ROC AUC.
  • Can visualize PR curve or table.
  • Metric result: value.
Required:
  • Set at least one visualization: pr_table, roc_curve.
Optional:
  • No reference: Fails if ROC AUC is ≤ 0.5.
  • With reference: Fails if ROC AUC differs by >20%.
Lift() (Coming soon)
  • Calculates lift.
  • Can visualize lift curve or table.
  • Metric result: value.
Required:
  • Set at least one visualization: lift_table, lift_curve.
Optional:
N/A
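
For illustration, a sketch of binary classification quality checks. The evidently import paths, the BinaryClassification mapping fields (prediction_probas, etc.), and the boolean visualization flags (conf_matrix=True, ...) are assumptions; the parameter names themselves (probas_threshold, conf_matrix, pr_curve, roc_curve) are as listed above:

import pandas as pd
from evidently import Report, Dataset, DataDefinition   # assumed import paths
from evidently import BinaryClassification              # assumed mapping class
from evidently.metrics import Precision, Recall, F1Score, RocAUC

data = pd.DataFrame({
    "target": [1, 0, 1, 1, 0],
    "prediction": [0.9, 0.2, 0.7, 0.4, 0.1],   # predicted probabilities for the positive class
})

# Map the classification type and columns (field names are an assumption)
definition = DataDefinition(
    classification=[BinaryClassification(target="target", prediction_probas="prediction")])
dataset = Dataset.from_pandas(data, data_definition=definition)

report = Report([
    Precision(conf_matrix=True, probas_threshold=0.5),   # at least one visualization is required
    Recall(pr_curve=True),
    F1Score(conf_matrix=True),
    RocAUC(roc_curve=True),
])
my_eval = report.run(dataset)   # add a reference dataset to enable the >20% comparison defaults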

Dummy metrics:

By label

Use when you have multiple classes and want to evaluate quality separately.

Metric | Description | Parameters | Test Defaults
ClassificationQualityByLabel()
  • Small Preset summarizing classification quality Metrics by label.
  • Metric result: all Metrics.
None
As in individual Metrics.
PrecisionByLabel()
  • Calculates precision by label in multiclass classification.
  • Metric result (dict): label: value.
Optional:
  • No reference: Fails if Precision is lower than the dummy model.
  • With reference: Fails if Precision differs by >20%.
F1ByLabel()
  • Calculates F1 Score by label in multiclass classification.
  • Metric result (dict): label: value.
Optional:
  • No reference: Fails if F1 is lower than the dummy model.
  • With reference: Fails if F1 differs by >20%.
RecallByLabel()
  • Calculates recall by label in multiclass classification.
  • Metric result (dict): label: value.
Optional:
  • No reference: Fails if Recall is lower than the dummy model.
  • With reference: Fails if Recall differs by >20%.
RocAUCByLabel()
  • Calculates ROC AUC by label in multiclass classification.
  • Metric result (dict): label: value.
Optional:
  • No reference: Fails if ROC AUC is ≤ 0.5.
  • With reference: Fails if ROC AUC differs by >20%.

Regression

Use to evaluate the quality of a regression model.

Data definition: map prediction and target columns.

Metric explainers. Check regression metrics explainers.
Metric | Description | Parameters | Test Defaults
RegressionPreset()
  • Large Preset.
  • Includes a wide range of regression metrics with rich visuals.
  • Metric result: all metrics.
  • See Preset page.
None. As in individual metrics.
RegressionQuality()
  • Small Preset.
  • Summarizes key regression metrics in a single widget.
  • Metric result: all metrics.
None. As in individual metrics.
MeanError()
  • Calculates the mean error.
  • Visualizations available: Error Plot, Error Distribution, Error Normality.
  • Metric result: mean_error, error_std.
Required:
  • Set at least one visualization: error_plot, error_distr, error_normality.
Optional:
  • No reference/With reference: Expects the Mean Error to be near zero; fails if it is skewed and the condition eq = approx(absolute=0.1 * error_std) is violated.
MAE()
  • Calculates Mean Absolute Error (MAE).
  • Visualizations available: Error Plot, Error Distribution, Error Normality.
  • Metric result: mean_absolute_error, absolute_error_std.
Required:
  • Set at least one visualization: error_plot, error_distr, error_normality.
Optional:
  • No reference: Fails if MAE is higher than the dummy model predicting the median target value.
  • With reference: Fails if MAE differs by >10%.
RMSE()
  • Calculates Root Mean Square Error (RMSE).
  • Metric result: rmse.
Optional:
  • No reference: Fails if RMSE is higher than the dummy model predicting the mean target value.
  • With reference: Fails if RMSE differs by >10%.
MAPE()
  • Calculates Mean Absolute Percentage Error (MAPE).
  • Visualizations available: Percentage Error Plot.
  • Metric result: mean_perc_absolute_error, perc_absolute_error_std.
Required:
  • Set at least one visualization: perc_error_plot.
Optional:
  • No reference: Fails if MAPE is higher than the dummy model predicting the weighted median target value.
  • With reference: Fails if MAPE differs by >10%.
R2Score()
  • Calculates R² (Coefficient of Determination).
  • Metric result: r2score.
Optional:
  • No reference: Fails if R² ≤ 0.
  • With reference: Fails if R² differs by >10%.
AbsMaxError()
  • Calculates Absolute Maximum Error.
  • Metric result: abs_max_error.
Optional:
  • No reference: Fails if absolute maximum error is higher than the dummy model predicting the median target value.
  • With reference: Fails if it differs by >10%.
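
For illustration, a sketch of regression quality checks. The evidently import paths, the Regression mapping class/fields, and the boolean visualization flags (error_plot=True, ...) are assumptions; the metric names and parameters are as listed above:

import pandas as pd
from evidently import Report, Dataset, DataDefinition   # assumed import paths
from evidently import Regression                        # assumed mapping class
from evidently.metrics import MeanError, MAE, RMSE, R2Score

data = pd.DataFrame({
    "target": [3.1, 2.4, 5.6, 4.0],
    "prediction": [2.9, 2.8, 5.1, 4.3],
})

# Map which columns hold the target and the prediction (field names are an assumption)
definition = DataDefinition(regression=[Regression(target="target", prediction="prediction")])
dataset = Dataset.from_pandas(data, data_definition=definition)

report = Report([
    MeanError(error_distr=True),   # at least one visualization is required
    MAE(error_plot=True),
    RMSE(),
    R2Score(),
])
my_eval = report.run(dataset)   # add a reference dataset to enable the >10% comparison defaults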

Dummy metrics:

Ranking

Use to evaluate ranking, search / retrieval or recommendations.

Data definition: map prediction and target columns and ranking type. Some metrics require additional training data.

Metric explainers. Check ranking metrics explainers.
Metric | Description | Parameters | Test Defaults
RecSysPreset()
  • Large Preset.
  • Includes a range of recommendation system metrics.
  • Metric result: all metrics.
  • See Preset page.
None. As in individual metrics.
RecallTopK()
  • Calculates Recall at the top K retrieved items.
  • Metric result: value.
Required:
  • k
Optional:
  • No reference: Tests if recall > 0.
  • With reference: Fails if Recall differs by >10%.
FBetaTopK()
  • Calculates F-beta score at the top K retrieved items.
  • Metric result: value.
Required:
  • k
Optional:
  • No reference: Tests if F-beta > 0.
  • With reference: Fails if F-beta differs by >10%.
PrecisionTopK()
  • Calculates Precision at the top K retrieved items.
  • Metric result: value.
Required:
  • k
Optional:
  • No reference: Tests if Precision > 0.
  • With reference: Fails if Precision differs by >10%.
MAP()
  • Calculates Mean Average Precision at the top K retrieved items.
  • Metric result: value.
Required:
  • k
Optional:
  • No reference: Tests if MAP > 0.
  • With reference: Fails if MAP differs by >10%.
NDCG()
  • Calculates Normalized Discounted Cumulative Gain at the top K retrieved items.
  • Metric result: value.
Required:
  • k
Optional:
  • No reference: Tests if NDCG > 0.
  • With reference: Fails if NDCG differs by >10%.
MRR()
  • Calculates Mean Reciprocal Rank at the top K retrieved items.
  • Metric result: value.
Required:
  • k
Optional:
  • No reference: Tests if MRR > 0.
  • With reference: Fails if MRR differs by >10%.
HitRate()
  • Calculates Hit Rate at the top K retrieved items.
  • Metric result: value.
Required:
  • k
Optional:
  • No reference: Tests if Hit Rate > 0.
  • With reference: Fails if Hit Rate differs by >10%.
ScoreDistribution()
  • Computes the predicted score entropy (KL divergence).
  • Applies only when the recommendations_type is a score.
  • Metric result: value.
Required:
  • k
Optional:
  • No reference: value.
  • With reference: value.
Personalization() (Coming soon)
  • Calculates Personalization score at the top K recommendations.
  • Metric result: value.
Required:
  • k
Optional:
  • No reference: Tests if Personalization > 0.
  • With reference: Fails if Personalization differs by >10%.
ARP() (Coming soon)
  • Computes Average Recommendation Popularity at the top K recommendations.
  • Requires a training dataset.
  • Metric result: value.
Required:
  • k
Optional:
  • No reference: Tests if ARP > 0.
  • With reference: Fails if ARP differs by >10%.
Coverage() (Coming soon)
  • Calculates Coverage at the top K recommendations.
  • Requires a training dataset.
  • Metric result: value.
Required:
  • k
Optional:
  • No reference: Tests if Coverage > 0.
  • With reference: Fails if Coverage differs by >10%.
GiniIndex() (Coming soon)
  • Calculates Gini Index at the top K recommendations.
  • Requires a training dataset.
  • Metric result: value.
Required:
  • k
Optional:
  • No reference: Tests if Gini Index < 1.
  • With reference: Fails if Gini Index differs by >10%.
Diversity() (Coming soon)
  • Calculates Diversity at the top K recommendations.
  • Requires item features.
  • Metric result: value.
Required:
  • k
  • item_features
Optional:
  • No reference: Tests if Diversity > 0.
  • With reference: Fails if Diversity differs by >10%.
Serendipity() (Coming soon)
  • Calculates Serendipity at the top K recommendations.
  • Requires a training dataset.
  • Metric result: value.
Required:
  • k
  • item_features
Optional:
  • No reference: Tests if Serendipity > 0.
  • With reference: Fails if Serendipity differs by >10%.
Novelty() (Coming soon)
  • Calculates Novelty at the top K recommendations.
  • Requires a training dataset.
  • Metric result: value.
Required:
  • k
Optional:
  • No reference: Tests if Novelty > 0.
  • With reference: Fails if Novelty differs by >10%.

Optional parameters relevant for RecSys metrics:

  • no_feedback_user: bool = False. Specifies whether to include users who did not select any of the items when computing the quality metric. Default: False.

  • min_rel_score: Optional[int] = None. Sets the minimum relevance score for an item to be considered relevant when calculating the quality metrics for non-binary targets (e.g., if the target is a rating or a custom score).
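
For illustration, a sketch of top-K ranking metrics using the k and min_rel_score parameters described above. The evidently import paths are assumptions, and the ranking type plus the user/item/prediction/target columns must be mapped in the data definition (mapping field names omitted here, as they depend on your version):

import pandas as pd
from evidently import Report   # assumed import path
from evidently.metrics import RecallTopK, PrecisionTopK, NDCG, MRR

# Toy interaction log: per user, a ranked list of recommended items and a rating used as relevance
data = pd.DataFrame({
    "user_id":    [1, 1, 1, 2, 2, 2],
    "item_id":    ["a", "b", "c", "a", "d", "e"],
    "prediction": [1, 2, 3, 1, 2, 3],   # rank of the recommended item
    "target":     [5, 0, 4, 0, 3, 0],   # rating; min_rel_score turns it into binary relevance
})

report = Report([
    RecallTopK(k=3, min_rel_score=3),   # ratings >= 3 count as relevant
    PrecisionTopK(k=3),
    NDCG(k=3),
    MRR(k=3),
])
# Run on a Dataset whose data definition maps the ranking type and columns, e.g.:
# my_eval = report.run(ranking_dataset, reference_dataset)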