For a general introduction, check Core Concepts.

Generate descriptors

Imports

import pandas as pd

from evidently.future.datasets import Dataset
from evidently.future.datasets import DataDefinition
from evidently.future.datasets import Descriptor
from evidently.future.descriptors import *

Note. Some Descriptors that use vocabulary-based checks (like OOVWordsPercentage() for out-of-vocabulary words) require downloading nltk dictionaries:

import nltk

nltk.download('words')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('vader_lexicon')

Basic flow

Add scores when creating Dataset. Create a Dataset from a dataframe df with text data, set the Data Definition and add descriptors at once. Specify the column you are generating scores for:

eval_dataset = Dataset.from_pandas(
    pd.DataFrame(df),
    data_definition=DataDefinition(
        text_columns=["question", "answer"]),
    descriptors=[
        Sentiment("answer", alias="Sentiment"),
        TextLength("answer", alias="Length"),
        IncludesWords("answer", words_list=['sorry', 'apologize'], alias="Denials"),
    ]
)

Add scores to existing Dataset. You can also add descriptors to the Dataset object later using add_descriptors. For example, first set the data schema:

eval_dataset = Dataset.from_pandas(
    pd.DataFrame(df),
    data_definition=DataDefinition(text_columns=["question", "answer"]),
)

Then, add the scores to this Dataset:

eval_dataset.add_descriptors(descriptors=[
    Sentiment("answer", alias="Sentiment"),
    TextLength("answer", alias="Length"),
    IncludesWords("answer", words_list=['sorry', 'apologize'], alias="Denials"),
])

Export results. You can get the Dataset with newly added descriptors as a DataFrame:

eval_dataset.as_dataframe()
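
For example, to keep the scored data for later analysis, you can assign the result and save it with plain pandas (the file name below is just an illustration):

df_with_scores = eval_dataset.as_dataframe()

# Persist the dataset together with the computed descriptor columns
df_with_scores.to_csv("eval_results.csv", index=False)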

Customization

All descriptors and parameters. See the reference table for the full list of available descriptors and their parameters.

Alias. It is best to add an alias to each Descriptor to make it easier to reference. This name shows up in visualizations and column headers. It’s especially handy if you’re using checks like regular expressions with word lists, where the auto-generated title could get very long.

eval_dataset.add_descriptors(descriptors=[
    WordCount("answer", alias="Words"),
])

Descriptor parameters. Some Descriptors have required parameters. For example, if you’re testing for competitor mentions using the Contains Descriptor, add the list of items:

eval_dataset.add_descriptors(descriptors=[
    Contains("answer", items=["AcmeCorp", "YetAnotherCorp"], alias="Competitors")
])

Multi-column descriptors. Some evals use more than one column: for example, matching a new answer against a reference, or measuring semantic similarity. In this case, pass both columns as a parameter:

eval_dataset.add_descriptors(descriptors=[
    SemanticSimilarity(columns=["question", "answer"], alias="Semantic_Match")
])

LLM-as-a-judge. There are also built-in descriptors that prompt an external LLM to return an evaluation score. You can add them like any other descriptor, but you must also provide an API key to use the corresponding LLM.

eval_dataset.add_descriptors(descriptors=[
    DeclineLLMEval("answer", alias="Contains_Denial")
])
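
For example, with the default OpenAI-backed judges, one common way to pass the key is through an environment variable. This is a minimal sketch; check the LLM judge guide for the options your version supports:

import os

# Assumption: the default judge uses OpenAI and reads the key from the environment
os.environ["OPENAI_API_KEY"] = "YOUR_KEY"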

Custom LLM evals. Check the LLM judge guide on using built-in and custom LLM-based evaluators.

Custom programmatic evals. You can also add checks via custom Python functions.
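
If you only need a quick one-off check, a simple workaround (a plain pandas sketch, not the dedicated custom-descriptor API) is to compute the score yourself before building the Dataset and declare it as a regular column. The numerical_columns argument is an assumption about your DataDefinition version:

# Compute a do-it-yourself score with plain pandas
df["exclamation_count"] = df["answer"].str.count("!")

custom_eval_dataset = Dataset.from_pandas(
    pd.DataFrame(df),
    data_definition=DataDefinition(
        text_columns=["question", "answer"],
        numerical_columns=["exclamation_count"],  # assumed parameter name
    ),
)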

Get a Report

Once you have computed the descriptors, you can summarize the results using Reports. This lets you get statistics for all descriptors, visualize their distributions, and run conditional tests.

Imports

from evidently.future.report import Report
from evidently.future.presets import TextEvals
from evidently.future.metrics import *
from evidently.future.tests import *

Text Evals

The easiest way to get the Report is through the TextEvals Preset: it instantly summarizes all Descriptor values for a specific column.

To configure the Report and run it for the eval_dataset:

report = Report([
    TextEvals()
])
my_eval = report.run(eval_dataset, None)

You can view the Report in Python, export the outputs (HTML, JSON, Python dictionary), or upload it to the Evidently platform. Learn more in output formats.

my_eval
# my_eval.json()
# ws.add_report(project.id, my_eval, include_data=True)

Using Metrics

Under the hood, the TextEvals Preset generates ValueStats Metrics for each Descriptor. To have more control or use other available Metrics, you can create a custom Report referencing descriptors just like any other column in the dataset.

Custom Report. For example, you can visualize only the mean values of descriptors:

custom_report = Report([
    MeanValue(column="Length"),
    MeanValue(column="Sentiment")
])

my_custom_eval = custom_report.run(eval_dataset, None)
my_custom_eval.json()

Drift detection. You can also run more complex checks, like comparing the distribution of text length between two batches of data. This requires two datasets: the example below passes the same eval_dataset twice purely for illustration.

custom_report = Report([
    ValueDrift(column="Length"),
])

my_custom_eval = custom_report.run(eval_dataset, eval_dataset)
my_custom_eval.json()

Run Tests

You can add test conditions to Metrics to obtain Pass/Fail results. For example:

  • Test that no response has a negative Sentiment (lower than 0).

  • Test that no response has a Length of over 150 symbols.

tests = Report([
    MinValue(column="Sentiment", tests=[gte(0)]),
    MaxValue(column="Length", tests=[lte(150)]),
])

my_test_eval = tests.run(eval_dataset, None)
my_test_eval
# my_test_eval.json()

This adds a Test Suite to the Report for clear pass/fail outcomes.

You can use different Tests depending on the column type. For example, to check that the chatbot does not deny an answer, use CategoryCount and test that there are no True labels in the Denials column (meaning, no denials were detected).

from evidently.future.metrics.column_statistics import CategoryCount

tests = Report([
    CategoryCount(column="Denials", category=True, count_tests=[eq(0)])
])

my_test_eval = tests.run(eval_dataset, None)
my_test_eval
# my_test_eval.json()

Report and Tests API. Check separate guides on generating Reports and setting Test conditions.

List of all Metrics. Check the Reference table. Consider using column-level Metrics like MinValue, MeanValue, MaxValue, QuantileValue, OutRangeValueCount and CategoryCount.
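
For instance, here is a quick sketch combining several of these column-level Metrics over the descriptors computed above (the quantile argument for QuantileValue is an assumption; check the Reference table for exact parameters):

summary_report = Report([
    MinValue(column="Sentiment"),
    MaxValue(column="Length"),
    QuantileValue(column="Length", quantile=0.95),  # assumed parameter name
])

summary_eval = summary_report.run(eval_dataset, None)
summary_eval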