LLM-based descriptors use an external LLM for evaluation. You can:

  • Use built-in evaluators (with pre-written prompts), or

  • Run evals for custom criteria you configure.

Pre-requisites:

  • You know how to use descriptors to evaluate text data.

Imports

from evidently.future.datasets import Descriptor
from evidently.features.llm_judge import BinaryClassificationPromptTemplate
from evidently.future.descriptors import LLMEval, ToxicityLLMEval, ContextQualityLLMEval, DeclineLLMEval

Built-in LLM evals

Available descriptors. Check all available built-in LLM evals in the reference table.

There are built-in evaluators for popular criteria, like detecting toxicity or if the text contains a refusal. These built-in descriptors:

  • Default to binary classifiers.

  • Default to using gpt-4o-mini model from OpenAI.

  • Return a label, the reasoning for the decision, and an optional score.

OpenAI key. Add the token as the environment variable: see docs.

import os
os.environ["OPENAI_API_KEY"] 

Run a single-column eval. For example, to evaluate whether responsecontains any toxicity:

eval_df.add_descriptors(descriptors=[
    ToxicityLLMEval("response", alias="toxicity"),
])

View the results as usual:

eval_df.as_dataframe()

Example output:

Run a multi-column eval. Some evaluators naturally require two columns. For example, to evaluate Context Quality (“does it have enough information to answer the question?”), you must run this evaluation over your context column, and pass the question column as a parameter.

eval_df.add_descriptors(descriptors=[
    ContextQualityLLMEval("context", alias="good_context", question="question"),
])

Example output:

Parametrize evaluators. You can switch the output format from category to score (0 to 1) or exclude the reasoning to get only the label:

eval_df.add_descriptors(descriptors=[
    DeclineLLMEval("response", alias="refusal", include_reasoning=False),
    ToxicityLLMEval("response", alias="toxicity", include_category=False),
    PIILLMEval("response", alias="PII", include_score=True), 
])

Column names. The alias you set defines the column name with the category. If you enable the score result as well, it will get the “Alias score” name.

Change the LLM. To choose to use a different model for the evals.

eval_df.add_descriptors(descriptors=[
    ToxicityLLMEval("response", alias="toxicity", provider="openai", model="gpt-3.5-turbo"),
])

Custom LLM evals

You can also create a custom LLM evaluator using the provided templates:

  • Choose a template.

  • Specify the evaluation criteria (grading logic and names of categories)

Evidently will then generate the complete evaluation prompt to send to the selected LLM together with the evaluation data.

Evaluate a single column

Binary classification template. For example, to evaluate if the text is “concise”:

conciseness = BinaryClassificationPromptTemplate(
        criteria = """Conciseness refers to the quality of being brief and to the point, while still providing all necessary information.
            A concise response should:
            - Provide the necessary information without extra details or repetition.
            - Be brief yet comprehensive enough to address the query.
            - Use simple and direct language to convey the message effectively.
        """,
        target_category="concise",
        non_target_category="verbose",
        uncertainty="unknown",
        include_reasoning=True,
        pre_messages=[("system", "You are a judge which evaluates text.")],
        )      

You do not need to explicitly ask the LLM to classify your input into two classes, return reasoning, or format the output. This is already part of the Evidently template.

To apply this descriptor for your data, pass the template name to the LLMEval descriptor:

eval_df.add_descriptors(descriptors=[
    LLMEval("response", 
            template=conciseness, 
            provider = "openai", 
            model = "gpt-4o-mini", 
            alias="Conciseness"),
    ])

Publish results as usual:

eval_df.as_dataframe()

Another example. This template is very flexible. For instance, you can use to decide if the question is appropriate to the scope of your LLM application. A simplified prompt:

appropriate_scope = BinaryClassificationPromptTemplate(
        pre_messages=[("system", "You are a judge which evaluates questions sent to a student tutoring app.")],
        criteria = """An appropriate question is any educational query related to
        - academic subjects (e.g., math, science, history)
        - general world knowledge or skills
        An inappropriate question is any question that is:
        - unrelated to educational goals, such as personal preferences, pranks, or opinions
        - offensive or aimed to provoke a biased response.
        """,
        target_category="appropriate",
        non_target_category="inappropriate",
        uncertainty="unknown",
        include_reasoning=True,
        )

Apply the template:

eval_df.add_descriptors(descriptors=[
    LLMEval("question", 
            template=appropriate_scope, 
            provider = "openai", 
            model = "gpt-4o-mini", 
            alias="appropriate_q"),
    ])

Example output:

Evaluate multiple columns

A custom evaluator can also use multiple columns. To implement this, mention the second {column_name} inside your evaluation criteria.

Example. To evaluate if the response is faithful to the context:

hallucination = BinaryClassificationPromptTemplate(
        pre_messages=[("system", "You are a judge which evaluates correctness of responses by comparing them to the trusted information source.")],
        criteria = """An hallucinated response is any response that
        - Contradicts the information provided in the source.
        - Adds any new information not provided in the source.
        - Gives a response not based on the source, unless it's a refusal or a clarifying question.

        A faithful response is the response that
        - Correctly uses the information from the source, even if it only partially.
        - A response that declines to answer.
        - A response that asks a clarifying question.

        Source:
        =====
        {context}
        =====
        """,
        target_category="hallucinated",
        non_target_category="faithful",
        uncertainty="unknown",
        include_reasoning=True,
        )

You do not need to include the primary column name in the evaluation prompt, since it’s already part of the template. You choose this column when you apply the descriptor.

When applying the descriptor, include the second column using the additional_columns parameter:

eval_df.add_descriptors(descriptors=[
    LLMEval("response", 
            template=hallucination, 
            provider = "openai", 
            model = "gpt-4o-mini", 
            alias="hallucination", 
            additional_columns={"context": "context"}),
])

Get the results as usual:

eval_df.as_dataframe()

Example output:

Parameters

LLMEval

ParameterDescriptionOptions
templateSets a specific template for evaluation.BinaryClassificationPromptTemplate
providerThe provider of the LLM to be used for evaluation.openai
modelSpecifies the model used for evaluationAny available provider model (e.g., gpt-3.5-turbogpt-4
additional_columnsA dictionary of additional columns present in your dataset to include in the evaluation prompt.

Use it to map the column name to the placeholder name you reference in the criteria. For example: ({"mycol": "question"}.
Custom dictionary (optional)

BinaryClassificationPromptTemplate

ParameterDescriptionOptions
criteriaFree-form text defining evaluation criteria.Custom string (required)
target_categoryName of the target category you want to detect (e.g., you care about its precision/recall more than the other).

The choice of “target” category has no impact on the evaluation itself. However, it can be useful for later quality evaluations of your LLM judge.
Custom category (required)
non_target_categoryName of the non-target category.Custom category (required)
uncertaintyCategory to return when the provided information is not sufficient to make a clear determination.unknown (Default), target, non_target
include_reasoningSpecifies whether to include the LLM-generated explanation of the result.True (Default), False
pre_messagesList of system messages that set context or instructions before the evaluation task.

Use it to explain the evaluator role (“you are an expert..”) or context (“your goal is to grade the work of an intern..”).
Custom string (optional)