Introduction
Core concepts and components of the Evidently Python library.
The Evidently Python library is an open-source tool designed to evaluate, test and monitor the quality of AI systems, from experimentation to production. You can use the evaluation library on its own, or as part of the Monitoring Platform (self-hosted or Evidently Cloud).
This page gives a closer look at the library.
At a glance
Here is how the core evaluation workflow works:
Pass the input data
Pass text, tabular or embeddings data to evaluate.
Run evaluations
Use Presets or configure your own Report. Optionally, add Tests to get pass/fail results. There are 100+ built-in evals, and you can also add your own. They cover:
Type | Example checks |
---|---|
🔡 Text qualities | Length, sentiment, special symbols, pattern matches, etc. |
📝 LLM outputs | Semantic similarity, relevance, faithfulness, custom LLM judges, etc. |
🛢 Data quality | Missing values, duplicates, min-max ranges, correlations, etc. |
📊 Data drift | 20+ tests and distance metrics to detect distribution drift. |
🎯 Classification | Accuracy, precision, recall, ROC AUC, confusion matrix, bias, etc. |
📈 Regression | MAE, ME, RMSE, error distribution, error normality, error bias, etc. |
🗂 Ranking (inc. RAG) | NDCG, MAP, MRR, Hit Rate, etc. |
🛒 Recommenders | Serendipity, novelty, diversity, popularity bias, etc. |
Explore results
- View visual Reports directly in Python environments like Jupyter or Colab.
- Export as JSON, Python dictionary, pandas DataFrame, or HTML file.
- Upload to Evidently Platform to store and track over time.
Exportability makes it easy to integrate Evidently with existing workflows and tools.
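To make the workflow concrete, here is a minimal end-to-end sketch. It assumes the current Evidently Python API (the `Dataset`, `Report`, and Preset objects described later on this page); exact import paths and export helpers may differ slightly between versions.

```python
import pandas as pd

from evidently import Dataset, DataDefinition, Report
from evidently.presets import DataSummaryPreset

# 1. Pass the input data: any pandas DataFrame with your logs or outputs
df = pd.DataFrame({
    "question": ["How old is the universe?", "What is the speed of light?"],
    "answer": ["13.8 billion years old.", "Close to 299,792 km per second."],
})
eval_data = Dataset.from_pandas(
    df,
    data_definition=DataDefinition(text_columns=["question", "answer"]),
)

# 2. Run evaluations: use a Preset or configure your own Report
report = Report([DataSummaryPreset()])
my_eval = report.run(eval_data, None)  # current data only, no reference

# 3. Explore results
my_eval          # renders a visual Report in Jupyter or Colab
my_eval.dict()   # Python dictionary
my_eval.json()   # JSON string
# An HTML export helper is also available; check the docs for the exact method name.
```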
Here is an example Report with a data overview. Other evaluations can be presented in the same way:
This page explains how evaluations work in Evidently. If you prefer code:
- Try a quickstart for LLM evaluation or ML evaluation.
- Read the step-by-step user guide.
That said, we recommend giving this a read once to understand the core library components.
Core concepts
Dataset
First, you need to get your input data ready. For example, generate outputs from your ML or LLM system that you want to evaluate.
- Tabular data. Prepare your data as a pandas DataFrame.
- Flexible structure. The table can include any combination of numerical, categorical, text, metadata (including timestamps or IDs), and embedding columns.
Create a Dataset. Once you have the data, you must create an Evidently `Dataset` object.
Some evaluations may require specific columns or data types to be present. For example, to evaluate classification quality, you need both predictions and actual labels. To specify where they are located in your table, you can map the data schema using Data Definition.
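As an illustration, here is a sketch of mapping a schema with `DataDefinition`. The column names are hypothetical, and task-specific mappings (such as classification targets and predictions) are configured in the same object; see the data definition docs for the exact parameters.

```python
import pandas as pd
from evidently import Dataset, DataDefinition

df = pd.DataFrame({
    "question": ["How old is the universe?"],
    "answer": ["13.8 billion years old."],
    "feedback": ["positive"],     # hypothetical metadata column
    "response_time_ms": [412],    # hypothetical numerical column
})

# Tell Evidently which columns are text, categorical, or numerical
definition = DataDefinition(
    text_columns=["question", "answer"],
    categorical_columns=["feedback"],
    numerical_columns=["response_time_ms"],
)

eval_data = Dataset.from_pandas(df, data_definition=definition)
```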
Here are a few examples of data inputs Evidently can handle:
LLM logs. Pass any text columns with inputs/outputs, context or ground truth.
Question | Context | Answer |
---|---|---|
How old is the universe? | The universe is believed to have originated from the Big Bang that occurred 13.8 billion years ago. | 13.8 billion years old. |
What’s the lifespan of Baobab trees? | Baobab trees can live up to 2,500 years. They are often called the “Tree of Life”. | Up to 2,500 years. |
What is the speed of light? | The speed of light in a vacuum is approximately 299,792 kilometers per second (186,282 miles per second). | Close to 299,792 km per second. |
These are examples: your data can have a different structure.
Two datasets. Typically you evaluate a single (`current`) dataset. Optionally, you can include a second (`reference`) dataset. Both must have identical structures. When to use two datasets (see the sketch after this list):
- Side-by-side comparison. This lets you compare outputs or data quality across two periods, prompt/model versions, etc.
- Data drift detection (required). You can detect distribution shifts by comparing datasets, such as this week’s data to the previous one.
- Simplify test setup. You can automatically generate test conditions (e.g., min-max ranges) from the reference dataset without manual configuration.
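Here is a sketch of working with two datasets, using drift detection as an example. It assumes the `DataDriftPreset` and the `report.run(current, reference)` call order; the column names and values are made up for illustration.

```python
import pandas as pd
from evidently import Dataset, DataDefinition, Report
from evidently.presets import DataDriftPreset

# Two stand-in tables with identical structure,
# e.g. this week's logs (current) vs. the previous week's (reference)
current_df = pd.DataFrame({"latency_ms": [120, 340, 95], "model_version": ["v2", "v2", "v2"]})
reference_df = pd.DataFrame({"latency_ms": [110, 130, 100], "model_version": ["v1", "v1", "v1"]})

definition = DataDefinition(
    numerical_columns=["latency_ms"],
    categorical_columns=["model_version"],
)
current_data = Dataset.from_pandas(current_df, data_definition=definition)
reference_data = Dataset.from_pandas(reference_df, data_definition=definition)

# Passing both datasets enables side-by-side comparison and drift checks
report = Report([DataDriftPreset()])
my_eval = report.run(current_data, reference_data)
```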
Data sampling. For large datasets (millions of rows), evals can take some time. The runtime depends on:
- the specific evaluation: some are more computationally intensive than others
- your dataset: e.g., if you run column-level evals and have lots of columns
- your infrastructure: data is processed in-memory.
If the computation takes too long, it’s often more efficient to use samples. For example, in data drift detection, you can apply random or stratified sampling.
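Sampling itself happens before you create the `Dataset`, with plain pandas. A quick sketch (the fraction, seed, and column names are arbitrary):

```python
import numpy as np
import pandas as pd

# Stand-in for a large table; replace with your own data
big_df = pd.DataFrame({
    "segment": np.random.choice(["a", "b", "c"], size=1_000_000),
    "value": np.random.rand(1_000_000),
})

# Random sampling: keep 5% of rows
sampled_df = big_df.sample(frac=0.05, random_state=42)

# Stratified sampling: sample 5% within each segment
stratified_df = (
    big_df.groupby("segment", group_keys=False)
    .apply(lambda g: g.sample(frac=0.05, random_state=42))
)
```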
Once your `Dataset` is ready, you can run evaluations.
Descriptors
To evaluate text data and LLM outputs, you need Descriptors.
A Descriptor is a row-level score or label that assesses a specific quality of a given text. It’s different from metrics (like accuracy or precision) that give a score for an entire dataset. You can use descriptors to assess LLM outputs in summarization, Q&A, chatbots, agents, RAGs, etc.
A simple example of a descriptor is `TextLength`. Descriptors range from simple deterministic checks to complex ML-based ones. For example, LLM-based descriptors can help label responses as “relevant” or “not relevant” using an evaluation prompt.
Descriptors can also use two texts at once, like checking `SemanticSimilarity` between two columns to compare a new response to the reference one.
You can use built-in descriptors, configure templates (like LLM judges or regular expressions) or add custom checks in Python. Each Descriptor returns a result that can be:
- Numerical. Any scores like symbol count or sentiment score.
- Categorical. Labels or binary “true”/“false” results for pattern matches.
- Text string. Like explanations generated by an LLM.
Evidently adds the computed descriptor values directly to the dataset.
This helps with debugging: for example, you can sort the results to find negative responses. You can view the results as a pandas DataFrame or on the Evidently Platform.
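A sketch of adding descriptors and inspecting the enriched data. It assumes the built-in `TextLength`, `Sentiment`, and `SemanticSimilarity` descriptors and the `add_descriptors` / `as_dataframe` helpers; some descriptors need extra dependencies (e.g., an embedding model for semantic similarity), and exact argument names may vary by version.

```python
import pandas as pd
from evidently import Dataset, DataDefinition
from evidently.descriptors import TextLength, Sentiment, SemanticSimilarity

df = pd.DataFrame({
    "question": ["How old is the universe?"],
    "answer": ["13.8 billion years old."],
    "reference_answer": ["The universe is about 13.8 billion years old."],
})
eval_data = Dataset.from_pandas(
    df,
    data_definition=DataDefinition(text_columns=["question", "answer", "reference_answer"]),
)

# Each descriptor adds a row-level score or label as a new column
eval_data.add_descriptors(descriptors=[
    TextLength("answer", alias="Length"),
    Sentiment("answer", alias="Sentiment"),
    SemanticSimilarity(columns=["answer", "reference_answer"], alias="Similarity"),
])

# The computed values are appended to the data: sort, filter, and debug as usual
eval_data.as_dataframe().head()
```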
After you get the Descriptors, you can use them to compute Metrics and Tests.
Reports and Metrics
A Report lets you structure and run evals on the dataset or column level.
You can generate Reports after you get the descriptors, or for any existing dataset like a table with ML model logs. Use Reports to:
- summarize the computed text descriptors across all inputs
- analyze any tabular dataset (descriptive stats, quality, drift)
- evaluate AI system performance (regression, classification, ranking, etc.)
Each Report runs a computation and visualizes a set of Metrics and conditional Tests. If you pass two datasets, you get a side-by-side comparison.
The easiest way to start is by using Presets.
Presets
Presets are pre-configured evaluation templates.
They help compute multiple related Metrics using a single line of code. Evidently has a number of comprehensive Presets (see all) for specific evaluation scenarios: from exploratory data analysis to AI quality assessments. For example:
`TextEvals` summarizes the scores from all text descriptors.
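Continuing the descriptor sketch above, summarizing all computed descriptors takes one Preset (assuming the `TextEvals` Preset from the current API):

```python
from evidently import Report
from evidently.presets import TextEvals

# eval_data is a Dataset with descriptors added, as in the earlier sketch
report = Report([TextEvals()])
my_eval = report.run(eval_data, None)  # current data only, no reference
my_eval  # renders the descriptor summary in a notebook
```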
Metrics
Each Preset is made of individual Metrics. You can also create your own custom Report by listing the Metrics you want to include.
- You can combine multiple Metrics and Presets in a Report.
- You can include both built-in Metrics and custom Metrics.
Built-in Metrics range from simple statistics like `MeanValue` or `MissingValueCount` to complex algorithmic evals like `DriftedColumnsCount`.
Each Metric computes a single value and has an optional visual representation (or several to choose from). For convenience, there are also small Presets that combine a handful of scores in a single widget, like `ValueStats` that shows many relevant descriptive value statistics at once.
Similarly, `DatasetStats` gives a quick overview of all dataset-level stats, and `ClassificationQuality` computes multiple metrics like Precision, Recall, Accuracy, ROC AUC, etc.
Explore all Built-in Metrics.
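A sketch of a custom Report that lists individual Metrics and small Presets by name. The Metric classes are the built-in ones named above, but the import paths for the small Presets and their exact arguments may differ by version; the column names and datasets reuse the earlier sketches.

```python
from evidently import Report
from evidently.metrics import MeanValue, MissingValueCount, DriftedColumnsCount
from evidently.presets import ValueStats, DatasetStats  # import path assumed; see the metrics reference

report = Report([
    MeanValue(column="Length"),          # single-value metric for one column
    MissingValueCount(column="answer"),  # data quality check
    DriftedColumnsCount(),               # algorithmic eval, needs a reference dataset
    ValueStats(column="Length"),         # small Preset: many descriptive stats in one widget
    DatasetStats(),                      # small Preset: dataset-level overview
])

# eval_data and reference_eval_data are two Datasets with the same structure,
# created as shown in the earlier sketches
my_eval = report.run(eval_data, reference_eval_data)
```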
Test Suites and Tests
Reports are great for analysis and debugging, or logging metrics during monitoring. However, in many cases, you don’t want to review all the scores but run a conditional check to confirm that nothing is off. In this case, Tests are a great option.
Tests
Tests let you validate your results against specific expectations. You create a Test by adding a condition parameter to a Metric. Each Test will calculate a given value, check it against the rule, and report a pass/fail result.
- You can run multiple Tests in one go.
- You can create Tests on the dataset or column level.
- You can formulate custom conditions or use defaults.
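A sketch of turning Metrics into Tests by adding conditions, assuming the `tests=` parameter on Metrics and the condition helpers in `evidently.tests` (column names reuse the earlier sketches):

```python
from evidently import Report
from evidently.metrics import MeanValue, MissingValueCount
from evidently.tests import eq, gt, lt

report = Report([
    # Column-level checks with explicit pass/fail conditions
    MissingValueCount(column="answer", tests=[eq(0)]),    # no missing answers allowed
    MeanValue(column="Length", tests=[gt(50), lt(500)]),  # mean text length within a range
])
my_eval = report.run(eval_data, None)
my_eval  # the Report now includes a tab with test outcomes
```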
A Test Suite is a collection of individual Tests. It works as an extension to a Report. Once you configure Tests, your Report will get an additional tab that shows a summary of outcomes.
You can navigate the results by test outcome.
Each Test results in one of the following statuses:
- Pass: The condition was met.
- Fail: The condition wasn’t met.
- Warning: The condition wasn’t met, but the check is marked as non-critical.
- Error: Something went wrong with the Test itself, such as an execution error.
You can view extra details to debug. For example, if you run a Test to check that less than 5% of LLM responses fall outside the approved length, you can see the corresponding distribution:
Test Conditions
Evidently has a powerful API to set up Test conditions.
- Manual setup. You can add thresholds to Metrics one by one, using simple syntax like greater than (`gt`) or less than (`lt`). By picking different Metrics to test against, you can formulate fine-grained conditions like “less than 10% of texts can fall outside 10–100 character length.”
- Manual setup with reference. If you have a reference dataset (like a previous data batch), you can set conditions relative to it. For example, you can check if the min-max value range stays within ±5% of the reference range without setting exact thresholds.
- Automatic setup. You can run any Test using built-in defaults. These are either:
  - Heuristics. For example, the Test on missing values assumes none should be present.
  - Heuristics relative to reference. Here, conditions adjust to a reference. For instance, the Test on missing values assumes their share should stay within ±10% of the reference.
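Continuing the sketch above, manual thresholds use the condition helpers directly. The reference-relative form below (a `Reference` wrapper with a relative margin) is an assumption about the current API, so treat it as illustrative and check the test conditions docs for the exact syntax.

```python
from evidently import Report
from evidently.metrics import MeanValue, MissingValueCount
from evidently.tests import gt, lt, lte, Reference  # Reference: assumed helper for reference-relative bounds

report = Report([
    # Manual setup: explicit thresholds
    MeanValue(column="Length", tests=[gt(10), lt(100)]),
    # Manual setup with reference: stay within +/-10% of the reference value (assumed syntax)
    MissingValueCount(column="answer", tests=[lte(Reference(relative=0.1))]),
])
my_eval = report.run(eval_data, reference_eval_data)
```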
Test Presets
For even faster setup, there are Test Presets. Each Metric Preset has a corresponding Test Preset that you can enable as an add-on. When you do this:
- Evidently adds a predefined set of Tests to your Report.
- These Tests use default conditions, either static or inferred from the reference dataset.
For example:
- Data Summary. The Metric Preset gives an overview and stats for all columns. The Test Suite checks for quality issues like missing values, duplicates, etc. across all values.
- Classification. The Metric Preset shows quality metrics like precision or recall. The Test Suite verifies these metrics against a baseline, like a dummy baseline calculated by Evidently or previous model performance.
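A sketch of enabling a Test Preset as an add-on, assuming the `include_tests` flag on the Report:

```python
from evidently import Report
from evidently.presets import DataSummaryPreset

# Adds the Preset's predefined Tests with default conditions;
# when a reference dataset is passed, some defaults are inferred from it
report = Report([DataSummaryPreset()], include_tests=True)
my_eval = report.run(current_data, reference_data)
```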
Building your workflow
You can use Evidently Reports and Test Suites on their own or as part of a monitoring system.
Independent use
Reports are great for exploratory evals:
- Ad hoc evals. Run one-time analyses on your data, models or LLM outputs.
- Experiments. Compare models, prompts, or datasets side by side.
- Debugging. Investigate data or model issues.
Test Suites are great for automated checks like:
- Data validation. Test inputs and outputs in prediction pipelines.
- CI/CD and regression testing. Check AI system performance after updates.
- Safety testing. Run structured behavioral tests like adversarial testing.
For automation, you can integrate Evidently with tools like Airflow. You can trigger actions based on Test results, such as sending alerts or halting a pipeline.
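For example, a pipeline step might run a Report with Tests and halt when anything fails. The exact layout of the exported results varies by version, so this sketch searches the result dictionary generically for status fields instead of relying on specific keys.

```python
def collect_statuses(obj, found=None):
    """Recursively collect all 'status' fields from the exported result dict."""
    if found is None:
        found = []
    if isinstance(obj, dict):
        for key, value in obj.items():
            if key == "status" and isinstance(value, str):
                found.append(value)
            else:
                collect_statuses(value, found)
    elif isinstance(obj, list):
        for item in obj:
            collect_statuses(item, found)
    return found

# my_eval is the result of running a Report with Tests, as in the sketches above
statuses = collect_statuses(my_eval.dict())

if any(status.upper() == "FAIL" for status in statuses):
    # e.g. raise to halt an Airflow task, or send an alert instead
    raise RuntimeError("Evidently Tests failed, stopping the pipeline")
```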
As part of the Platform
You can use Reports together with the Evidently Platform in production workflows:
- Reports serve as a metric computation layer, running evaluations on your data.
- The Platform lets you store, compare, track and alert on evaluation results.
Reports are stored as JSON files, which can be natively parsed to visualize metrics on a Dashboard.
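A sketch of uploading an evaluation result to Evidently Cloud, assuming the `CloudWorkspace` client from the SDK. The token and project ID are placeholders, and the exact method names (such as `add_run`) may differ between versions; check the platform docs for your setup.

```python
from evidently.ui.workspace import CloudWorkspace

ws = CloudWorkspace(
    token="YOUR_API_TOKEN",            # placeholder: an Evidently Cloud API token
    url="https://app.evidently.cloud",
)

# my_eval is a Report run result, as in the earlier sketches
ws.add_run("YOUR_PROJECT_ID", my_eval, include_data=False)  # appears on the project Dashboard
```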
This setup works for both experiments and production monitoring. For example:
- Experiments. Log evaluations while experimenting with prompts or model versions. Use the Platform to compare runs and track progress.
- Regression Tests. Use Test Suites to validate updates on your golden dataset. Debug failures and maintain a history of results on the Platform.
- Batch Monitoring. Integrate Reports into your data pipelines to compute Metrics for data batches. Use the Platform for performance tracking and alerting.
Evidently Cloud also offers managed evaluations to generate Reports directly on the platform, and other features such as synthetic data and test generation.
Platform deployment options. You can choose:
- Self-host the open-source platform version.
- Sign up for Evidently Cloud (Recommended).
The Evidently Platform has additional features beyond evaluation: from synthetic data to tracing.