An experiment evaluates the performance of your LLM system (simple prompt or multi-step LLM chain/agent) against a Dataset and a set of Evaluation Metrics.

Experiment distribution chart

Run Experiments from Literal AI

Run an Experiment on a Prompt against a Dataset and a set of Scorers from Literal AI.

Experiments can be run directly from the Prompt Playground. This allows you to run experiments without having to manage any infrastructure.

1. Prompt to iterate on

Go to the Prompt Playground, make modifications to your prompt and vibe-check it.

If you are unsure where to start, select one of our examples from the top right corner.

2. Pick a Dataset and select Scorers

In the upper right corner, click “Experiment on Dataset”.

Experiment on Dataset

Specify how to resolve the prompt variables from your dataset's input, expectedOutput and metadata columns. The Scorer configuration exposes the prompt's completion through the output key.

Running the experiment will redirect you to the Experiment details page, where you can track progress!

More Evaluators to come soon!

Compare experiments

You can only compare two experiments if they were run on the same dataset.

Comparing two experiments run on the same dataset.

Run an experiment from your code

Complex multi-step LLM systems depend heavily on your code and infrastructure. Literal AI lets you evaluate your LLM systems from your own code and then log the results to Literal AI.

Here is a naive example of how you can run an experiment with Literal AI:

See installation to get your API key and instantiate the SDK.
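The sketch below is a minimal, illustrative version assuming the Python SDK's `LiteralClient`, `api.create_experiment` and `experiment.log` interfaces. The `my_llm_app` and `score_output` functions are hypothetical placeholders for your own application and scoring logic, and the exact score fields may vary; check the SDK reference.

```python
import os
from literalai import LiteralClient

literalai_client = LiteralClient(api_key=os.environ["LITERAL_API_KEY"])

# Create an experiment to group the results of this evaluation run.
experiment = literalai_client.api.create_experiment(
    name="my-first-experiment",
    params=[{"temperature": 0.2}],  # any parameters worth recording
)


def my_llm_app(question: str) -> str:
    # Placeholder for your actual prompt, chain or agent call.
    return f"Echo: {question}"


def score_output(output: str, expected_output: str) -> list:
    # Placeholder scorer: naive exact match. Replace with your own
    # evaluation logic; score fields follow the SDK's score schema.
    value = 1.0 if output == expected_output else 0.0
    return [{"name": "exact-match", "type": "AI", "value": value}]


inputs = [{"question": "What is an experiment?"}]
expected_outputs = [{"answer": "A run of your LLM system against a dataset and scorers."}]

for input_, expected in zip(inputs, expected_outputs):
    output = my_llm_app(input_["question"])
    scores = score_output(output, expected["answer"])

    # Log one experiment item with its input, output and scores.
    experiment.log(
        {
            "input": input_,
            "output": {"content": output},
            "scores": scores,
        }
    )
```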

The best way to run an experiment is to use a Dataset to store your inputs and expected outputs. This way, you can track which data your experiment ran on and compare the results of different experiments.

Using a dataset to run an experiment is very similar to the previous example, except that you iterate over the items of the dataset:
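A possible sketch, reusing the hypothetical `my_llm_app` and `score_output` helpers from the previous example and assuming the SDK's `api.get_dataset`; the item shapes (`question` and `answer` keys) are illustrative, and the `datasetItemId` key links each logged result back to its dataset item.

```python
import os
from literalai import LiteralClient

literalai_client = LiteralClient(api_key=os.environ["LITERAL_API_KEY"])

# Fetch the dataset holding your inputs and expected outputs.
dataset = literalai_client.api.get_dataset(name="my-dataset")  # hypothetical dataset name

experiment = literalai_client.api.create_experiment(
    name="experiment-on-my-dataset",
    params=[{"temperature": 0.2}],
)

for item in dataset.items:
    # my_llm_app and score_output are the placeholder helpers from the previous example.
    output = my_llm_app(item.input["question"])
    scores = score_output(output, item.expected_output["answer"])

    # Link each logged result back to the dataset item it came from.
    experiment.log(
        {
            "datasetItemId": item.id,
            "input": item.input,
            "output": {"content": output},
            "scores": scores,
        }
    )
```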

If you are evaluating a prompt living on Literal AI, you can bind it to the experiment to track the performance of the prompt.
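A sketch of what that could look like, assuming the prompt is fetched with `api.get_prompt`; the exact parameter for binding the prompt to the experiment (shown here as `prompt_id`) is an assumption, so check the SDK reference.

```python
import os
from literalai import LiteralClient

literalai_client = LiteralClient(api_key=os.environ["LITERAL_API_KEY"])

# Fetch the prompt version you are evaluating from Literal AI.
prompt = literalai_client.api.get_prompt(name="my-prompt")  # hypothetical prompt name

# Bind the prompt to the experiment so its scores are attributed
# to this specific prompt version.
experiment = literalai_client.api.create_experiment(
    name="experiment-on-my-prompt",
    prompt_id=prompt.id,  # assumed parameter name; check the SDK reference
    params=[{"temperature": 0.2}],
)
```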