An experiment evaluates the performance of your LLM system (simple prompt or multi-step LLM chain/agent) against a dataset and a set of evaluation metrics.

An experiment logged on Literal AI

Run an experiment from your code

Complex multi-step LLM systems are heavily dependent on your code and infrastructure. Literal AI lets you evaluate your LLM systems from your own code and then log the results on Literal AI.

Here is a naive example of how you can run an experiment with Literal AI:

See installation to get your API key and instantiate the SDK.
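The sketch below assumes the `LITERAL_API_KEY` environment variable is set. `run_my_agent` and `score_output` stand in for your own LLM system and evaluation metric, and the exact `create_experiment` / `experiment.log` payload shapes are assumptions that may differ slightly in your SDK version:

```python
from literalai import LiteralClient

client = LiteralClient()  # reads LITERAL_API_KEY from the environment

# Group the evaluation results under one experiment (name and params are illustrative).
experiment = client.api.create_experiment(
    name="naive-experiment",
    params={"model": "gpt-4o", "temperature": 0},
)


def run_my_agent(question: str) -> str:
    # Stand-in for your own prompt, chain or agent.
    return "Paris is the capital of France."


def score_output(output: str, expected: str) -> float:
    # Stand-in evaluation metric: exact match.
    return 1.0 if output.strip() == expected.strip() else 0.0


data = [
    {
        "question": "What is the capital of France?",
        "expected": "Paris is the capital of France.",
    },
]

for item in data:
    output = run_my_agent(item["question"])
    value = score_output(output, item["expected"])
    # Log one result per input, together with its scores.
    experiment.log(
        {
            "input": {"question": item["question"]},
            "output": {"content": output},
            "scores": [{"name": "exact-match", "type": "AI", "value": value}],
        }
    )
```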

The best way to run an experiment is to use a Dataset to store your inputs and expected outputs. This way, you can track which data each experiment ran on and compare the results of different experiments.

Using a dataset to run an experiment is very similar to the previous example, except that you iterate over the items of the dataset.
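Here is a sketch reusing the client and helpers from the previous example, assuming a dataset named `capital-cities-qa` already exists on Literal AI. The `get_dataset` lookup, the `dataset_id` parameter and the item fields shown are assumptions to adapt to your own data:

```python
# Fetch the dataset that holds your inputs and expected outputs.
dataset = client.api.get_dataset(name="capital-cities-qa")

experiment = client.api.create_experiment(
    name="dataset-experiment",
    dataset_id=dataset.id,
    params={"model": "gpt-4o"},
)

for item in dataset.items:
    output = run_my_agent(item.input["question"])
    value = score_output(output, item.expected_output["content"])
    experiment.log(
        {
            "datasetItemId": item.id,  # ties the result back to the dataset item
            "input": item.input,
            "output": {"content": output},
            "scores": [{"name": "exact-match", "type": "AI", "value": value}],
        }
    )
```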

If you are evaluating a prompt living on Literal AI, you can bind it to the experiment to track the performance of the prompt.
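For instance, assuming a prompt named `qa-prompt` is stored on Literal AI, binding it could look like the following sketch (the `prompt_id` parameter name is an assumption; check the SDK reference for your version):

```python
# Fetch the prompt version under evaluation from Literal AI.
prompt = client.api.get_prompt(name="qa-prompt")

# Bind it to the experiment so the results are attached to that prompt version.
experiment = client.api.create_experiment(
    name="qa-prompt-experiment",
    dataset_id=dataset.id,
    prompt_id=prompt.id,
)
```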

Run an experiment on Literal AI

Experiments can also be run directly on Literal AI, which lets you run them without managing any infrastructure. However, these experiments can only evaluate prompts managed on Literal AI; see Prompt Management.

To run an experiment, you need:

  • a Prompt to iterate on
  • an Evaluator
  • a Dataset
1. Prompt to iterate on

From the Prompt Playground, make modifications to your prompt and save it:

Prompt to iterate on

2. Create an Evaluator

Currently, Literal AI supports AI Evaluators, which you can configure from the Prompt Playground. AI Evaluators differ from typical prompts in that they specify a Structured Output to yield a Score.

From the Prompt Playground, create a prompt which generates a score with Structured Output:

Evaluator Prompt

3. Select a dataset

Select a dataset or create one. In the upper right corner, click the play icon:

Run Experiment on Dataset

4. Configure prompts

Configure Prompts

For both the prompt to evaluate and the AI Evaluator, you should specify how to resolve template variables with your dataset input, expectedOutput and metadata. The AI Evaluator configuration also lets you reference the evaluated prompt’s output through the output key.

Map Prompt variables

5. Follow progress on your Experiment

Running the experiment will redirect you to the Experiment details page, where you can track progress!

Track Experiment progress

More Evaluators to come soon!

Compare experiments

You can only compare two experiments if they were run on the same dataset.

Comparing two experiments run on the same dataset.