Need help? Ask on Discord.

1. Set up your environment

This quickstart shows both local open-source and cloud workflows.

You will run a simple evaluation in Python and explore results in Evidently Cloud.

1.1. Set up Evidently Cloud

  • Sign up for a free Evidently Cloud account.

  • Create an Organization when you log in for the first time, then copy your organization ID.

  • Get an API token: click the Key icon in the left menu, then generate and save the token.

1.2. Installation and imports

Install the Evidently Python library:

!pip install evidently[llm]
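
Note: some shells (for example, zsh) treat square brackets as glob patterns. If the install command fails, quote the package spec:

!pip install "evidently[llm]"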

Import the components needed to run the evals:

import pandas as pd
from evidently.future.datasets import Dataset
from evidently.future.datasets import DataDefinition
from evidently.future.datasets import Descriptor
from evidently.future.descriptors import *
from evidently.future.report import Report
from evidently.future.presets import TextEvals
from evidently.future.metrics import *
from evidently.future.tests import *

And the component to connect to Evidently Cloud:

from evidently.ui.workspace.cloud import CloudWorkspace

1.3. Create a Project

Connect to Evidently Cloud using your API token:

ws = CloudWorkspace(token="YOUR_API_TOKEN", url="https://app.evidently.cloud")

Create a Project within your Organization, or connect to an existing Project:

project = ws.create_project("My project name", org_id="YOUR_ORG_ID")
project.description = "My project description"
project.save()

# or project = ws.get_project("PROJECT_ID")

2. Prepare a toy dataset

Let’s create a toy chatbot dataset with “question” and “answer” columns.

data = [
    ["What is the chemical symbol for gold?", "The chemical symbol for gold is Au."],
    ["What is the capital of Japan?", "The capital of Japan is Tokyo."],
    ["Tell me a joke.", "Why don't programmers like nature? It has too many bugs!"],
    ["What is the boiling point of water?", "The boiling point of water is 100 degrees Celsius (212 degrees Fahrenheit)."],
    ["Who painted the Mona Lisa?", "Leonardo da Vinci painted the Mona Lisa."],
    ["What’s the fastest animal on land?", "The cheetah is the fastest land animal, capable of running up to 75 miles per hour."],
    ["Can you help me with my math homework?", "I'm sorry, but I can't assist with homework."],
    ["How many states are there in the USA?", "There are 50 states in the USA."],
    ["What’s the primary function of the heart?", "The primary function of the heart is to pump blood throughout the body."],
    ["Can you tell me the latest stock market trends?", "I'm sorry, but I can't provide real-time stock market trends. You might want to check a financial news website or consult a financial advisor."]
]

columns = ["question", "answer"]

eval_df = pd.DataFrame(data, columns=columns)
# eval_df.head()

Collecting live data: you can also trace inputs and outputs from your LLM app and download the dataset created from traces for evals. Check the Tracing Quickstart.

3. Score individual outputs

Let’s evaluate all “Answers” for:

  • Sentiment: from -1 for negative to 1 for positive.

  • Text length: character count.

  • Denials: detect whether the chatbot declined to answer. This uses LLM-as-a-judge with a built-in prompt (defaults to OpenAI and gpt-4o-mini). It returns a label and an explanation.

Each such evaluation is a “descriptor”. You can run built-in or custom evals.

No OpenAI key? There is an alternative below.

Set the OpenAI key as an environment variable. See the OpenAI docs.

import os
os.environ["OPENAI_API_KEY"] = "YOUR_KEY"

Create an Evidently dataset and run all the evals:

eval_dataset = Dataset.from_pandas(
    eval_df,
    data_definition=DataDefinition(),
    descriptors=[
        Sentiment("answer", alias="Sentiment"),
        TextLength("answer", alias="Length"),
        DeclineLLMEval("answer", alias="Denials"), # or IncludesWords("answer", words_list=["sorry", "apologize"], alias="Denials")
    ])

Alternative. Without an OpenAI key, use a deterministic eval that checks whether the answer contains “sorry” or “apologize” (True/False). Swap DeclineLLMEval for IncludesWords, as shown below.
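
A minimal sketch of the same dataset creation with the deterministic check (same aliases as above):

eval_dataset = Dataset.from_pandas(
    eval_df,
    data_definition=DataDefinition(),
    descriptors=[
        Sentiment("answer", alias="Sentiment"),
        TextLength("answer", alias="Length"),
        IncludesWords("answer", words_list=["sorry", "apologize"], alias="Denials"),
    ])

The “Denials” column then holds True/False instead of the “DECLINE” label, which is why the matching test in step 4 switches to category=True.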

This adds new scores directly to your source data. You can preview them locally in pandas:

eval_dataset.as_dataframe()
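
Since as_dataframe() returns a regular pandas DataFrame, you can also spot-check specific rows with standard pandas. A minimal sketch (the “DECLINE” label assumes the LLM judge; with IncludesWords, filter on True instead):

df = eval_dataset.as_dataframe()

# show only the answers labeled as denials
print(df[df["Denials"] == "DECLINE"][["question", "answer", "Denials"]])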

4. Run tests

Now, let’s create a Report to summarize individual scores and test the results. We check that:

  • Sentiment is non-negative (greater than or equal to 0).

  • Text length is at most 150 characters.

  • Denials: there are none.

report = Report([
    TextEvals(),
    MinValue(column="Sentiment", tests=[gte(0)]),
    MaxValue(column="Length", tests=[lte(150)]),
    CategoryCount(column="Denials", category="DECLINE", tests=[eq(0)]) # or CategoryCount(column="Denials", category=True, tests=[eq(0)]) if using IncludesWords
])

my_eval = report.run(eval_dataset, None)
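
If you work in a notebook, you can also preview the results inline before uploading (this assumes a Jupyter-style environment that renders Evidently objects):

my_eval  # displays the Report summary in a notebook cell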

Tests are optional. You can simply run the Report with TextEvals() to summarize the results.
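
For example, a descriptive-only Report looks like this:

report = Report([
    TextEvals(),
])

my_eval = report.run(eval_dataset, None)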

5. Explore the results

Upload the Report and include raw data for detailed analysis:

ws.add_run(project.id, my_eval, include_data=True)

View the Report. Go to Evidently Cloud, open your Project, navigate to “Reports” in the left menu, and open the Report. You will see the score summary and the dataset with the new descriptor columns.

See Test results. In the “Tests” tab of the Report, you will get a pass/fail summary of the conditions defined above.

Explore. For example, sort the dataset to find all answers labeled as denials.

Get a Dashboard. As you run repeated evals, you may want to track the results over time. Go to the “Dashboard” tab in the left menu and enter “Edit” mode. Add a new tab and select the “Descriptors” template.

You’ll see a set of panels that show descriptor values. Each currently shows a single data point. As you log ongoing evaluation results, you can track trends and set up alerts.
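
For example, each later batch can be scored with the same Report and uploaded to the same Project to extend the time series (new_eval_dataset below is a hypothetical later batch, prepared the same way as in step 3):

# score a new batch with the same Report configuration
new_eval = report.run(new_eval_dataset, None)

# upload it to the same Project to add a new point to the Dashboard panels
ws.add_run(project.id, new_eval, include_data=True)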