Descriptors
How to run evaluations for text data.
For a general introduction, check Core Concepts.
Generate descriptors
Imports
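A typical import block looks like this (a minimal sketch assuming a recent Evidently release; exact import paths can differ between versions):

```python
import pandas as pd

from evidently import Dataset, DataDefinition
from evidently.descriptors import Sentiment, TextLength, Contains
```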
Note. Some Descriptors that use vocabulary-based checks (like OOVWordsPercentage() for out-of-vocabulary words) require downloading nltk dictionaries:
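For example (the exact set of dictionaries depends on which Descriptors you use):

```python
import nltk

nltk.download("words")
nltk.download("wordnet")
nltk.download("omw-1.4")
nltk.download("vader_lexicon")
```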
Basic flow
Add scores when creating Dataset. Create a Dataset from a dataframe df with text data, set the Data Definition, and add descriptors at once. Specify the column you are generating scores for:
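A minimal sketch, assuming a dataframe df with "question" and "answer" text columns (the column names are illustrative):

```python
eval_dataset = Dataset.from_pandas(
    pd.DataFrame(df),
    data_definition=DataDefinition(text_columns=["question", "answer"]),
    descriptors=[
        Sentiment("answer", alias="Sentiment"),   # sentiment score of each answer
        TextLength("answer", alias="Length"),     # answer length in symbols
    ],
)
```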
Add scores to existing Dataset. You can also add descriptors to the Dataset object later using add_descriptors. For example, first set the data schema:
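A sketch with the same illustrative columns as above:

```python
eval_dataset = Dataset.from_pandas(
    pd.DataFrame(df),
    data_definition=DataDefinition(text_columns=["question", "answer"]),
)
```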
Then, add the scores to this Dataset:
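For instance, adding the same two descriptors to the eval_dataset created above:

```python
eval_dataset.add_descriptors(descriptors=[
    Sentiment("answer", alias="Sentiment"),
    TextLength("answer", alias="Length"),
])
```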
Export results. You can get the Dataset with newly added descriptors as a DataFrame:
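A sketch, assuming the as_dataframe() export available in recent versions:

```python
df_with_scores = eval_dataset.as_dataframe()
df_with_scores.head()
```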
Customization
All descriptors and parameters. See a reference table with all descriptors and parameters.
Alias. It is best to add an alias to each Descriptor to make it easier to reference. This name shows up in visualizations and column headers. It’s especially handy if you’re using checks like regular expressions with word lists, where the auto-generated title could get very long.
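For instance, reusing the TextLength descriptor from the examples above:

```python
TextLength("answer")                   # column gets an auto-generated name
TextLength("answer", alias="Length")   # column shows up as "Length"
```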
Descriptor parameters. Some Descriptors have required parameters. For example, if you’re testing for competitor mentions using the Contains Descriptor, add the list of items:
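A sketch with hypothetical competitor names:

```python
Contains(
    "answer",
    items=["AcmeCorp", "FooBar Inc"],   # hypothetical list of competitor names
    alias="Competitor_Mentions",
)
```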
Multi-column descriptors. Some evals use more than one column. For example, to match a new answer against reference, or measure semantic similarity. In this case, pass both columns using parameters:
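For example, a semantic similarity check between the answer and a reference column (the reference_answer column name is illustrative; this check typically requires the sentence-transformers dependency):

```python
from evidently.descriptors import SemanticSimilarity

SemanticSimilarity(
    columns=["answer", "reference_answer"],   # both columns must exist in the dataset
    alias="Semantic_Match",
)
```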
LLM-as-a-judge. There are also built-in descriptors that prompt an external LLM to return an evaluation score. You can add them like any other descriptor, but you must also provide an API key to use the corresponding LLM.
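A hedged sketch, assuming the built-in DeclineLLMEval judge and OpenAI as the provider (descriptor names and the way you pass keys can differ between versions):

```python
import os

from evidently.descriptors import DeclineLLMEval

os.environ["OPENAI_API_KEY"] = "YOUR_KEY"   # the judge calls an external LLM

eval_dataset.add_descriptors(descriptors=[
    DeclineLLMEval("answer", alias="Denials"),   # labels responses that decline to answer
])
```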
Custom LLM evals. Check the LLM judge guide on using built-in and custom LLM-based evaluators.
Custom programmatic evals. You can also add checks via custom Python functions.
Get a Report
Once you have computed the descriptors, you can summarize the results using Reports. This lets you get stats for all descriptors, visualize their distributions, and run conditional tests.
Imports
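Assuming a recent Evidently release:

```python
from evidently import Report
from evidently.presets import TextEvals
```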
Text Evals
The easiest way to get the Report is through the TextEvals Preset: it instantly summarizes all Descriptor values for a specific column. To configure the Report and run it for the eval_dataset:
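A minimal sketch:

```python
report = Report([
    TextEvals(),
])

my_eval = report.run(eval_dataset, None)   # no reference dataset in this run
my_eval
```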
You can view the Report in Python, export the outputs (HTML, JSON, Python dictionary) or upload it to the Evidently platform. Check more in output formats.
Using Metrics
Under the hood, the TextEvals Preset generates ValueStats Metrics for each Descriptor. To have more control or to use other available Metrics, you can create a custom Report, referencing descriptors just like any other column in the dataset.
Custom Report. For example, you can visualize only the mean values of descriptors:
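A sketch using the column-level MeanValue Metric, referencing descriptors by their aliases:

```python
from evidently.metrics import MeanValue

report = Report([
    MeanValue(column="Length"),
    MeanValue(column="Sentiment"),
])

my_eval = report.run(eval_dataset, None)
```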
Drift detection. You can also run more complex checks, like comparing the distribution of text length between two batches of data. (This requires two datasets).
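A sketch, assuming two Dataset objects that both contain the "Length" descriptor (the dataset names are illustrative):

```python
from evidently.metrics import ValueDrift

report = Report([
    ValueDrift(column="Length"),
])

# first argument: current batch, second argument: reference batch
my_eval = report.run(eval_dataset_2, eval_dataset_1)
```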
Run Tests
You can add test conditions to Metrics to obtain Pass/Fail results. For example:
- Test that no response has a negative Sentiment (lower than 0).
- Test that no response has a Length of over 150 symbols.
This adds a Test Suite to the Report for clear pass/fail outcomes.
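A sketch of both conditions, using the gte/lte test helpers:

```python
from evidently.metrics import MinValue, MaxValue
from evidently.tests import gte, lte

report = Report([
    MinValue(column="Sentiment", tests=[gte(0)]),   # no response with negative sentiment
    MaxValue(column="Length", tests=[lte(150)]),    # no response longer than 150 symbols
])

my_eval = report.run(eval_dataset, None)
```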
You can use different Tests depending on the column type. For example, to check that the chatbot does not deny an answer, use CategoryCount and test that there are no True labels in this column (meaning, no denials detected).
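A sketch, assuming a "Denials" descriptor column with True/False labels (the alias is illustrative):

```python
from evidently.metrics import CategoryCount
from evidently.tests import eq

report = Report([
    CategoryCount(column="Denials", category=True, tests=[eq(0)]),   # expect zero True labels
])

my_eval = report.run(eval_dataset, None)
```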
Report and Tests API. Check separate guides on generating Reports and setting Test conditions.
List of all Metrics. Check the Reference table. Consider using column-level Metrics like MinValue, MeanValue, MaxValue, QuantileValue, OutRangeValueCount and CategoryCount.