LLM Evaluation
Evaluate text outputs in under 5 minutes
Need help? Ask on Discord.
1. Set up your environment
This quickstart shows both local open-source and cloud workflows.
You will run a simple evaluation in Python and explore results in Evidently Cloud.
1.1. Set up Evidently Cloud
- Sign up for a free Evidently Cloud account.
- Create an Organization if you log in for the first time. Get the ID of your Organization.
- Get an API token. Click the Key icon in the left menu. Generate and save the token.
1.2. Installation and imports
Install the Evidently Python library:
Components to run the evals:
Components to connect with Evidently Cloud:
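As a rough sketch, the install and import steps above might look like the block below. The import paths assume a recent Evidently release (0.7 or later) and may differ in older versions; the `EVIDENTLY_READY` flag is a hypothetical helper added only so the snippet fails gracefully when the library is missing.

```python
# Install first (run in a terminal or a notebook cell):
#   pip install evidently

try:
    import pandas as pd

    # Components to run the evals (assumed paths for Evidently 0.7+).
    from evidently import Dataset, DataDefinition, Report
    from evidently.presets import TextEvals
    from evidently.descriptors import (
        Sentiment,
        TextLength,
        DeclineLLMEval,
        IncludesWords,
    )
    from evidently.metrics import MinValue, MaxValue, CategoryCount
    from evidently.tests import gte, lte, eq

    # Component to connect with Evidently Cloud.
    from evidently.ui.workspace import CloudWorkspace

    EVIDENTLY_READY = True  # hypothetical flag, not part of Evidently
except ImportError as err:
    EVIDENTLY_READY = False
    print(f"Evidently is not importable yet: {err}")
```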
1.3. Create a Project
Connect to Evidently Cloud using your API token:
Create a Project within your Organization, or connect to an existing Project:
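A minimal sketch of both steps, assuming the `CloudWorkspace` client from Evidently 0.7 or later. The helper name `connect_project` and the environment variable names `EVIDENTLY_API_TOKEN` and `EVIDENTLY_ORG_ID` are hypothetical, chosen here to keep the token out of source code.

```python
import os


def connect_project(token, org_id=None, project_id=None, name="My project name"):
    """Return an Evidently Cloud workspace and a Project (existing or new)."""
    # Assumed import path for Evidently 0.7+.
    from evidently.ui.workspace import CloudWorkspace

    ws = CloudWorkspace(token=token, url="https://app.evidently.cloud")
    if project_id is not None:
        # Connect to an existing Project by its ID.
        project = ws.get_project(project_id)
    else:
        # Create a new Project inside the given Organization.
        project = ws.create_project(name, org_id=org_id)
        project.description = "My project description"
        project.save()
    return ws, project


# Hypothetical variable names; nothing runs unless the token is set.
token = os.environ.get("EVIDENTLY_API_TOKEN")
if token:
    ws, project = connect_project(token, org_id=os.environ.get("EVIDENTLY_ORG_ID"))
```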
2. Prepare a toy dataset
Let’s create a toy demo chatbot dataset with “Questions” and “Answers”.
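One way to build such a dataset is sketched below. The question and answer pairs are made up for illustration, and the last two rows are deliberate refusals so that the "Denials" check has something to find.

```python
# Made-up question/answer pairs, for illustration only.
data = [
    ["What is the chemical symbol for gold?", "The chemical symbol for gold is Au."],
    ["What is the capital of Japan?", "The capital of Japan is Tokyo."],
    ["Tell me a joke.", "Why don't programmers like nature? Too many bugs!"],
    ["Who painted the Mona Lisa?", "Leonardo da Vinci painted the Mona Lisa."],
    ["How do I fix my car engine?", "I'm sorry, but I can't help with that."],
    ["Can you book a flight for me?", "I apologize, I am not able to book flights."],
]
columns = ["question", "answer"]

try:
    import pandas as pd

    # Evidently works with pandas DataFrames as input.
    eval_df = pd.DataFrame(data, columns=columns)
    print(eval_df.head())
except ImportError:
    eval_df = None  # pandas is not installed yet
```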
3. Score individual outputs
Let’s evaluate all “Answers” for:
- Sentiment: from -1 for negative to 1 for positive.
- Text length: character count.
- Denials: detect if the chatbot declined to answer. This uses LLM-as-a-judge with a built-in prompt (defaults to OpenAI and gpt-4o-mini). It returns a label and an explanation.
Each such evaluation is a “descriptor”. You can run built-in or custom evals.
No OpenAI key? There is an alternative below.
Set the OpenAI key as an environment variable. See OpenAI docs.
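For example, you could set the variable from Python before running the evals. `OPENAI_API_KEY` is the variable name the OpenAI client reads; the value `"YOUR_KEY"` is a placeholder, and exporting the variable in your shell works just as well.

```python
import os

# Only set the placeholder if the key is not already exported in the shell.
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = "YOUR_KEY"  # placeholder, replace with your key
```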
Create an Evidently dataset and run all the evals:
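The sketch below shows one way this step might look, assuming the descriptor API of Evidently 0.7 or later. The wrapper function `score_answers` and its `use_llm_judge` flag are hypothetical conveniences, not part of the library.

```python
def score_answers(eval_df, use_llm_judge=True):
    """Attach Sentiment, Length and Denials descriptors to the "answer" column."""
    # Assumed import paths for Evidently 0.7+.
    from evidently import Dataset, DataDefinition
    from evidently.descriptors import (
        Sentiment,
        TextLength,
        DeclineLLMEval,
        IncludesWords,
    )

    if use_llm_judge:
        # LLM-as-a-judge: requires OPENAI_API_KEY to be set.
        denials = DeclineLLMEval("answer", alias="Denials")
    else:
        # Deterministic alternative: no API key needed.
        denials = IncludesWords(
            "answer", words_list=["sorry", "apologize"], alias="Denials"
        )

    return Dataset.from_pandas(
        eval_df,
        data_definition=DataDefinition(),
        descriptors=[
            Sentiment("answer", alias="Sentiment"),
            TextLength("answer", alias="Length"),
            denials,
        ],
    )


# Usage, given a DataFrame with an "answer" column:
# eval_dataset = score_answers(eval_df)
# eval_dataset.as_dataframe()  # source data plus the new descriptor columns
```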
Alternative. Without an OpenAI key, use a deterministic eval that checks whether “sorry” or “apologize” appears in the answer (True/False). Swap the LLM judge for the IncludesWords descriptor.
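To make the deterministic alternative concrete, here is a plain-Python sketch of the kind of check it performs. This is not Evidently's implementation: the function name `includes_any_word` is hypothetical, and it matches exact lowercase tokens only, so a variant like "apologized" would not count.

```python
import re


def includes_any_word(text, words=("sorry", "apologize")):
    """Return True if any of the given words appears as a token in the text."""
    # Lowercase and split into word-like tokens, keeping apostrophes.
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    return any(word in tokens for word in words)


print(includes_any_word("I'm sorry, but I can't help with that."))
print(includes_any_word("The capital of Japan is Tokyo."))
```

The first call flags the refusal, the second does not. A keyword check like this is cheap and repeatable, but it misses refusals phrased without either word, which is where the LLM judge helps.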
This adds new scores directly to your source data. You can preview them locally in pandas:
4. Run tests
Now, let’s create a Report to summarize individual scores and test the results. We check that:
- Sentiment is non-negative (greater than or equal to 0).
- Text length is at most 150 characters.
- Denials: there are none.
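Each of these conditions is evaluated over the whole dataset rather than per row. A plain-Python sketch of the logic, using made-up scores, could look like this. The `"DECLINE"` and `"OK"` labels are an assumption about what the denial judge returns, and `run_checks` is a hypothetical helper, not an Evidently function.

```python
def run_checks(rows):
    """Evaluate the three dataset-level conditions on row-level scores."""
    sentiments = [row["Sentiment"] for row in rows]
    lengths = [row["Length"] for row in rows]
    denials = [row["Denials"] for row in rows]
    return {
        # The lowest sentiment score must be at least 0.
        "sentiment_non_negative": min(sentiments) >= 0,
        # The longest answer must have at most 150 characters.
        "length_at_most_150": max(lengths) <= 150,
        # No answer may be labeled as a denial.
        "no_denials": denials.count("DECLINE") == 0,
    }


# Made-up scores for two answers.
scored = [
    {"Sentiment": 0.4, "Length": 27, "Denials": "OK"},
    {"Sentiment": -0.1, "Length": 38, "Denials": "DECLINE"},
]
print(run_checks(scored))
```

With these sample scores the sentiment and denial checks fail, while the length check passes.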
Tests are optional. You can simply run the Report with TextEvals() to summarize the results.
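A sketch of the Report with and without tests, assuming the metric and test helpers from Evidently 0.7 or later (`MinValue`, `MaxValue`, `CategoryCount`, `gte`, `lte`, `eq`). The `"DECLINE"` category is an assumed label from the denial judge, and `build_report` is a hypothetical wrapper.

```python
def build_report(with_tests=True):
    """Build a Report that summarizes descriptors and optionally tests them."""
    # Assumed import paths for Evidently 0.7+.
    from evidently import Report
    from evidently.presets import TextEvals
    from evidently.metrics import MinValue, MaxValue, CategoryCount
    from evidently.tests import gte, lte, eq

    metrics = [TextEvals()]  # summary of all descriptor columns
    if with_tests:
        metrics += [
            # Lowest sentiment score is greater than or equal to 0.
            MinValue(column="Sentiment", tests=[gte(0)]),
            # Longest answer is at most 150 characters.
            MaxValue(column="Length", tests=[lte(150)]),
            # Number of answers labeled as denials equals 0.
            CategoryCount(column="Denials", category="DECLINE", tests=[eq(0)]),
        ]
    return Report(metrics)


# Usage, given the scored Evidently dataset from step 3:
# my_eval = build_report().run(eval_dataset, None)
```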
5. Explore the results
Upload the Report and include raw data for detailed analysis:
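Assuming the workspace client exposes an `add_run` method, as in recent Evidently releases, the upload could be sketched as follows. The `upload_run` wrapper is a hypothetical helper.

```python
def upload_run(ws, project, my_eval):
    """Send a Report run to Evidently Cloud together with the scored dataset."""
    # include_data=True uploads the raw rows and descriptor columns,
    # not only the aggregated Report summary.
    ws.add_run(project.id, my_eval, include_data=True)


# Usage, given the workspace, Project and Report run from the earlier steps:
# upload_run(ws, project, my_eval)
```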
View the Report. Go to Evidently Cloud, open your Project, navigate to “Reports” in the left menu, and open the Report. You will see the scores summary and the dataset with new descriptor columns.
See Test results. Inside the “Tests” tab of the Report, you will get a pass/fail summary.
Explore. For example, sort to find all answers with “Denials”.
Get a Dashboard. As you run repeated evals, you may want to track the results over time. Go to the “Dashboard” tab in the left menu and enter the “Edit” mode. Add a new tab, and select the “Descriptors” template.
You’ll see a set of panels that show descriptor values. Each has a single data point. As you log ongoing evaluation results, you can track trends and set up alerts.