For a general introduction, check Core Concepts.

Generate descriptors

Imports

import pandas as pd

from evidently.future.datasets import Dataset
from evidently.future.datasets import DataDefinition
from evidently.future.datasets import Descriptor
from evidently.future.descriptors import *

Note. Some Descriptors that use vocabulary-based checks (like OOVWordsPercentage() for out-of-vocabulary words) require downloading nltk dictionaries:

import nltk

nltk.download('words')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('vader_lexicon')

Basic flow

Add scores when creating Dataset. Create a Dataset from a dataframe df with text data, set the Data Definition and add descriptors at once. Specify the column you are generating scores for:

eval_dataset = Dataset.from_pandas(
    pd.DataFrame(df),
    data_definition=DataDefinition(
        text_columns=["question", "answer"]),
    descriptors=[
        Sentiment("answer", alias="Sentiment"),
        TextLength("answer", alias="Length"),
        IncludesWords("answer", words_list=['sorry', 'apologize'], alias="Denials"),
    ]
)

Add scores to existing Dataset. You can also add descriptors to the Dataset object later using add_descriptors. For example, first set the data schema:

eval_dataset = Dataset.from_pandas(
    pd.DataFrame(df),
    data_definition=DataDefinition(text_columns=["question", "answer"]),
)

Then, add the scores to this Dataset:

eval_dataset.add_descriptors(descriptors=[
    Sentiment("answer", alias="Sentiment"),
    TextLength("answer", alias="Length"),
    IncludesWords("answer", words_list=['sorry', 'apologize'], alias="Denials"),
])

Export results. You can get the Dataset with newly added descriptors as a DataFrame:

eval_dataset.as_dataframe()
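
For example, to keep the scored data for later analysis, you can assign the result and save it with plain pandas (the file name below is just an illustration):

df_with_scores = eval_dataset.as_dataframe()

# Persist the dataset together with the computed descriptor columns
df_with_scores.to_csv("eval_results.csv", index=False)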

Customization

All descriptors and parameters. See the reference table for the full list of available descriptors and their parameters.

Alias. It is best to add an alias to each Descriptor to make it easier to reference. This name shows up in visualizations and column headers. It’s especially handy if you’re using checks like regular expressions with word lists, where the auto-generated title could get very long.

eval_dataset.add_descriptors(descriptors=[
    WordCount("answer", alias="Words"),
])

Descriptor parameters. Some Descriptors have required parameters. For example, if you’re testing for competitor mentions using the Contains Descriptor, add the list of items:

eval_dataset.add_descriptors(descriptors=[
    Contains("answer", items=["AcmeCorp", "YetAnotherCorp"], alias="Competitors")
])

Multi-column descriptors. Some evals use more than one column: for example, matching a new answer against a reference, or measuring semantic similarity. In this case, pass both columns as a parameter:

eval_dataset.add_descriptors(descriptors=[
    SemanticSimilarity(columns=["question", "answer"], alias="Semantic_Match")
])

LLM-as-a-judge. There are also built-in descriptors that prompt an external LLM to return an evaluation score. You can add them like any other descriptor, but you must also provide an API key to use the corresponding LLM.

eval_dataset.add_descriptors(descriptors=[
    DeclineLLMEval("answer", alias="Contains_Denial")
])
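
For example, with the default OpenAI-backed judges, one common way to pass the key is through an environment variable. This is a minimal sketch; check the LLM judge guide for the options your version supports:

import os

# Assumption: the default judge uses OpenAI and reads the key from the environment
os.environ["OPENAI_API_KEY"] = "YOUR_KEY"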

Custom LLM evals. Check the LLM judge guide on using built-in and custom LLM-based evaluators.

Custom programmatic evals. You can also add checks via custom Python functions.
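
If you only need a quick one-off check, a simple workaround (a plain pandas sketch, not the dedicated custom-descriptor API) is to compute the score yourself before building the Dataset and declare it as a regular column. The numerical_columns argument is an assumption about your DataDefinition version:

# Compute a do-it-yourself score with plain pandas
df["exclamation_count"] = df["answer"].str.count("!")

custom_eval_dataset = Dataset.from_pandas(
    pd.DataFrame(df),
    data_definition=DataDefinition(
        text_columns=["question", "answer"],
        numerical_columns=["exclamation_count"],  # assumed parameter name
    ),
)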

Get a Report

Once you have computed the descriptors, you can summarize the results using Reports. This lets you get statistics for all descriptors, visualize their distributions, and run conditional tests.

Imports

from evidently.future.report import Report
from evidently.future.presets import TextEvals
from evidently.future.metrics import *
from evidently.future.tests import *

Text Evals

The easiest way to get the Report is through the TextEvals Preset: it instantly summarizes all Descriptor values for a specific column.

To configure the Report and run it for the eval_dataset:

report = Report([
    TextEvals()
])
my_eval = report.run(eval_dataset, None)

You can view the Report in Python, export the outputs (HTML, JSON, Python dictionary), or upload it to the Evidently platform. Learn more in output formats.

my_eval
# my_eval.json()
# ws.add_report(project.id, my_eval, include_data=True)

Using Metrics

Under the hood, the TextEvals Preset generates ValueStats Metrics for each Descriptor. To have more control or use other available Metrics, you can create a custom Report referencing descriptors just like any other column in the dataset.

Custom Report. For example, you can visualize only the mean values of descriptors:

custom_report = Report([
    MeanValue(column="Length"),
    MeanValue(column="Sentiment")
])

my_custom_eval = custom_report.run(eval_dataset, None)
my_custom_eval.json()

Drift detection. You can also run more complex checks, like comparing the distribution of text length between two batches of data. This requires two datasets: the example below passes the same eval_dataset twice purely for illustration.

custom_report = Report([
    ValueDrift(column="Length"),
])

my_custom_eval = custom_report.run(eval_dataset, eval_dataset)
my_custom_eval.json()

Run Tests

You can add test conditions to Metrics to obtain Pass/Fail results. For example:

  • Test that no response has a negative Sentiment (lower than 0).

  • Test that no response has a Length of over 150 symbols.

tests = Report([
    MinValue(column="Sentiment", tests=[gte(0)]),
    MaxValue(column="Length", tests=[lte(150)]),
])

my_test_eval = tests.run(eval_dataset, None)
my_test_eval
# my_test_eval.json()

This adds a Test Suite to the Report for clear pass/fail outcomes.

You can use different Tests depending on the column type. For example, to check that the chatbot does not deny an answer, use CategoryCount and test that there are no True labels in the Denials column (meaning, no denials were detected).

from evidently.future.metrics.column_statistics import CategoryCount

tests = Report([
    CategoryCount(column="Denials", category=True, count_tests=[eq(0)])
])

my_test_eval = tests.run(eval_dataset, None)
my_test_eval
# my_test_eval.json()

Report and Tests API. Check separate guides on generating Reports and setting Test conditions.

List of all Metrics. Check the Reference table. Consider using column-level Metrics like MinValue, MeanValue, MaxValue, QuantileValue, OutRangeValueCount and CategoryCount.
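
For instance, here is a quick sketch combining several of these column-level Metrics over the descriptors computed above (the quantile argument for QuantileValue is an assumption; check the Reference table for exact parameters):

summary_report = Report([
    MinValue(column="Sentiment"),
    MaxValue(column="Length"),
    QuantileValue(column="Length", quantile=0.95),  # assumed parameter name
])

summary_eval = summary_report.run(eval_dataset, None)
summary_eval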