Experiments
Experiments enable continuous improvement of your Prompt/Agent: they let you verify that each change is a net improvement.
An experiment evaluates the performance of your LLM system (simple prompt or multi-step LLM chain/agent) against a Dataset and a set of Evaluation Metrics.
Experiment distribution chart
Run Experiments from Literal AI
Run an Experiment on a Prompt, against a Dataset and a set of Scorers, from Literal AI.
Experiments can be run directly from the Prompt Playground. This lets you run experiments without having to manage any infrastructure.
Prompt to iterate on
Go to the Prompt Playground, make modifications to your prompt and vibe-check it.
If you are not sure where to start, select one of our examples from the top-right corner.
Pick a Dataset and select Scorers
In the upper right corner, click “Experiment on Dataset”.
Experiment on Dataset
You should specify how to resolve prompt variables with your dataset's input, expectedOutput and metadata columns. The Scorer configuration lets you use the prompt's completion through the output key.
Running the experiment will redirect you to the Experiment details page, where you can track progress!
More Evaluators to come soon!
Compare experiments
Comparing two experiments run on the same dataset.
Run an experiment from your code
Complex, multi-step LLM systems depend heavily on your code and infrastructure. Literal AI lets you evaluate your LLM systems from your own code and then log the results to Literal AI.
Here is a naive example of how you can run an experiment with Literal AI:
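The snippet below is a minimal sketch using the Python SDK (literalai). It assumes authentication through the LITERAL_API_KEY environment variable and that your SDK version exposes client.api.create_experiment and experiment.log; run_agent and score_output are placeholders for your own application and scoring logic, so check the SDK reference for exact signatures and score types.

```python
import os

from literalai import LiteralClient

# Assumes LITERAL_API_KEY is set in your environment.
client = LiteralClient(api_key=os.environ["LITERAL_API_KEY"])

# Create an experiment to group this run's evaluation results.
experiment = client.api.create_experiment(name="my-first-experiment")


def run_agent(question: str) -> str:
    """Placeholder for your LLM system (prompt, chain or agent)."""
    return "Paris is the capital of France."


def score_output(output: str, expected: str) -> float:
    """Placeholder scorer: checks that the expected answer appears in the output."""
    return 1.0 if expected.lower() in output.lower() else 0.0


samples = [{"question": "What is the capital of France?", "expected": "Paris"}]

for sample in samples:
    output = run_agent(sample["question"])
    score = score_output(output, sample["expected"])

    # Log one experiment item: its input, output and scores.
    # The "CODE" type marks a code-based scorer; adjust to your workspace's score types.
    experiment.log(
        {
            "input": {"question": sample["question"]},
            "output": {"content": output},
            "scores": [{"name": "answer_correctness", "type": "CODE", "value": score}],
        }
    )
```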
Link to a Dataset
The best way to run an experiment is to use a Dataset to store your inputs and expected outputs. This way, you can track which data each experiment ran on and compare the results of different experiments.
Using a dataset to run an experiment is very similar to the previous example, except that you are iterating over the items of the dataset:
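Here is a sketch of the same flow driven by a Dataset, again assuming the Python SDK: client.api.get_dataset fetching a dataset by name (each item exposing id, input and expected_output), create_experiment accepting a dataset_id, and a datasetItemId key linking each logged result back to its item. The dataset name and scorer are illustrative; verify the exact names against the SDK reference.

```python
import os

from literalai import LiteralClient

client = LiteralClient(api_key=os.environ["LITERAL_API_KEY"])

# Fetch the dataset that stores your inputs and expected outputs.
dataset = client.api.get_dataset(name="my-dataset")

# Attach the experiment to the dataset so results stay linked to the data.
experiment = client.api.create_experiment(
    name="my-dataset-experiment",
    dataset_id=dataset.id,
)


def run_agent(inputs: dict) -> str:
    """Placeholder for your LLM system."""
    return "some completion"


def score_output(output: str, expected) -> float:
    """Placeholder scorer."""
    return 1.0 if expected and str(expected) in output else 0.0


# Iterate over the dataset items instead of hard-coded samples.
for item in dataset.items:
    output = run_agent(item.input)
    score = score_output(output, item.expected_output)

    experiment.log(
        {
            "datasetItemId": item.id,  # links the result back to the dataset item
            "input": item.input,
            "output": {"content": output},
            "scores": [{"name": "answer_correctness", "type": "CODE", "value": score}],
        }
    )
```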
Link to a Prompt
If you are evaluating a prompt living on Literal AI, you can bind it to the experiment to track the performance of the prompt.
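Below is a sketch of binding a Literal AI prompt to the experiment, assuming client.api.get_prompt(name=...), prompt.format_messages(...) to resolve the template variables, and a prompt id parameter on create_experiment (shown here as prompt_id; the exact parameter name may differ across SDK versions). The OpenAI call and model name are illustrative stand-ins for your own LLM provider.

```python
import os

from literalai import LiteralClient
from openai import OpenAI

client = LiteralClient(api_key=os.environ["LITERAL_API_KEY"])
openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Pull the prompt template and dataset managed on Literal AI.
prompt = client.api.get_prompt(name="my-prompt")
dataset = client.api.get_dataset(name="my-dataset")

# Bind the prompt to the experiment to track its performance over time.
experiment = client.api.create_experiment(
    name="my-prompt-experiment",
    dataset_id=dataset.id,
    prompt_id=prompt.id,  # parameter name may differ depending on the SDK version
)

for item in dataset.items:
    # Resolve the prompt variables with the dataset item's input.
    messages = prompt.format_messages(**item.input)

    completion = openai_client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model
        messages=messages,
    )
    output = completion.choices[0].message.content or ""

    # Simple containment check against the expected output; replace with your own scorer.
    score = 1.0 if str(item.expected_output) in output else 0.0

    experiment.log(
        {
            "datasetItemId": item.id,
            "input": item.input,
            "output": {"content": output},
            "scores": [{"name": "answer_correctness", "type": "CODE", "value": score}],
        }
    )
```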