An experiment evaluates the performance of your LLM system (simple prompt or multi-step LLM chain/agent) against a Dataset and a set of Evaluation Metrics.

Experiment distribution chart

Run Experiments from Literal AI

Run an Experiment on a Prompt against a Dataset and a set of Scorers from Literal AI.

Experiments can be run directly from the Prompt Playground. This allows you to run experiments without having to manage any infrastructure.

1. Prompt to iterate on

Go to the Prompt Playground, make modifications to your prompt and vibe-check it.

If you are unsure where to start, select one of our examples from the top right corner.

2. Pick a Dataset and select Scorers

In the upper right corner, click “Experiment on Dataset”.

Experiment on Dataset

Specify how to resolve the prompt variables from your dataset's input, expectedOutput and metadata columns. The Scorer configuration exposes the prompt's completion through the output key.

Running the experiment will redirect you to the Experiment details page, where you can track progress!

More Evaluators to come soon!

Compare experiments

You can only compare two experiments if they were run on the same dataset.

Comparing two experiments run on the same dataset.

Run an experiment from your code

Complex multi-step LLM systems depend heavily on your code and infrastructure. Literal AI lets you evaluate your LLM systems from your own code and then log the results to Literal AI.

Here is a naive example of how you can run an experiment with Literal AI:

See installation to get your API key and instantiate the SDK.
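The sketch below is a minimal, illustrative version assuming the Python SDK's `LiteralClient`, `api.create_experiment` and `experiment.log` interfaces. The `my_llm_app` and `score_output` functions are hypothetical placeholders for your own application and scoring logic, and the exact score fields may vary; check the SDK reference.

```python
import os
from literalai import LiteralClient

literalai_client = LiteralClient(api_key=os.environ["LITERAL_API_KEY"])

# Create an experiment to group the results of this evaluation run.
experiment = literalai_client.api.create_experiment(
    name="my-first-experiment",
    params=[{"temperature": 0.2}],  # any parameters worth recording
)


def my_llm_app(question: str) -> str:
    # Placeholder for your actual prompt, chain or agent call.
    return f"Echo: {question}"


def score_output(output: str, expected_output: str) -> list:
    # Placeholder scorer: naive exact match. Replace with your own
    # evaluation logic; score fields follow the SDK's score schema.
    value = 1.0 if output == expected_output else 0.0
    return [{"name": "exact-match", "type": "AI", "value": value}]


inputs = [{"question": "What is an experiment?"}]
expected_outputs = [{"answer": "A run of your LLM system against a dataset and scorers."}]

for input_, expected in zip(inputs, expected_outputs):
    output = my_llm_app(input_["question"])
    scores = score_output(output, expected["answer"])

    # Log one experiment item with its input, output and scores.
    experiment.log(
        {
            "input": input_,
            "output": {"content": output},
            "scores": scores,
        }
    )
```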

The best way to run an experiment is to use a Dataset to store your inputs and expected outputs. This way, you can track which data your experiment ran on and compare the results of different experiments.

Using a dataset to run an experiment is very similar to the previous example, except that you iterate over the items of the dataset:
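A possible sketch, reusing the hypothetical `my_llm_app` and `score_output` helpers from the previous example and assuming the SDK's `api.get_dataset`; the item shapes (`question` and `answer` keys) are illustrative, and the `datasetItemId` key links each logged result back to its dataset item.

```python
import os
from literalai import LiteralClient

literalai_client = LiteralClient(api_key=os.environ["LITERAL_API_KEY"])

# Fetch the dataset holding your inputs and expected outputs.
dataset = literalai_client.api.get_dataset(name="my-dataset")  # hypothetical dataset name

experiment = literalai_client.api.create_experiment(
    name="experiment-on-my-dataset",
    params=[{"temperature": 0.2}],
)

for item in dataset.items:
    # my_llm_app and score_output are the placeholder helpers from the previous example.
    output = my_llm_app(item.input["question"])
    scores = score_output(output, item.expected_output["answer"])

    # Link each logged result back to the dataset item it came from.
    experiment.log(
        {
            "datasetItemId": item.id,
            "input": item.input,
            "output": {"content": output},
            "scores": scores,
        }
    )
```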

If you are evaluating a prompt living on Literal AI, you can bind it to the experiment to track the performance of the prompt.
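A sketch of what that could look like, assuming the prompt is fetched with `api.get_prompt`; the exact parameter for binding the prompt to the experiment (shown here as `prompt_id`) is an assumption, so check the SDK reference.

```python
import os
from literalai import LiteralClient

literalai_client = LiteralClient(api_key=os.environ["LITERAL_API_KEY"])

# Fetch the prompt version you are evaluating from Literal AI.
prompt = literalai_client.api.get_prompt(name="my-prompt")  # hypothetical prompt name

# Bind the prompt to the experiment so its scores are attributed
# to this specific prompt version.
experiment = literalai_client.api.create_experiment(
    name="experiment-on-my-prompt",
    prompt_id=prompt.id,  # assumed parameter name; check the SDK reference
    params=[{"temperature": 0.2}],
)
```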