Introduction

Are you an AI Engineer who has just been tasked with switching to the latest LLM release?
Follow this tutorial to ship your changes confidently with Literal AI! 🚀

To illustrate the flow of experiments on Literal AI, we will walk you through swapping gpt-4o for gpt-4o-mini in a RAG application. We will use an already deployed LLM application that answers questions about the Chainlit documentation using gpt-4o.

In short, it’s a Retrieval Augmented Generation (RAG) application, with access to the Chainlit documentation and cookbooks.

RAG chatbot: left, the chatbot (Chainlit); right, the monitoring (Literal AI)

If you want to learn more about building RAG applications, check out the code here.
You can try the chatbot at https://help.chainlit.io or via Discord.

You can use Literal AI to build a validation dataset from production data or hand-labelled items. For our RAG application, we already have a dataset named "Test dataset to ship RAG".
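For reference, here is a minimal sketch of how such a dataset could be assembled by hand with the Python client (set up in the next section). The create_dataset and create_item helpers and their arguments are assumptions to double-check against the SDK version you use.

from literalai import LiteralClient

literal_client = LiteralClient()  # reads LITERAL_API_KEY from the environment

# Assumed helper: create an empty dataset to hold hand-labelled items
new_dataset = literal_client.api.create_dataset(
    name="My validation dataset",
    description="Hand-labelled questions and ground-truth answers",
)

# Assumed helper: add one hand-labelled question/answer pair
new_dataset.create_item(
    input={"content": "What is Chainlit?"},
    expected_output={"content": "Chainlit is an open-source Python framework for building conversational AI apps."},
)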

We can run any kind of experiment on the dataset, with the evaluation metrics of our choice.
Here, we will show the Experiments flow by checking how semantically close gpt-4o-mini’s answers are to the expected ground truths.

In a real-world scenario, you would check against a few metrics before shipping the change: context relevancy, answer similarity, latency, cost, etc.

Setup

Get your API key and connect to Literal AI!

The cell below will prompt you for your LITERAL_API_KEY and create a LiteralClient, which we will use to fetch our dataset and push the results of our experiments 🤗

import os
import getpass

from literalai import LiteralClient

def _set_env(var: str):
    if not os.environ.get(var):
        os.environ[var] = getpass.getpass(f"{var}: ")

_set_env("LITERAL_API_KEY")

literal_client = LiteralClient()

Get dataset

Here’s what our dataset looks like on Literal AI. It contains:

  1. the questions in the Input column
  2. the answers (ground truths) in the Output column
  3. the intermediate steps taken by our RAG agent in the dashed box

We will fetch the whole dataset, but focus on input and output for this tutorial.

Questions and Answers dataset

# Adapt below to your own dataset
dataset = literal_client.api.get_dataset(name="Test dataset to ship RAG")

print(f"Number of samples in dataset = {len(dataset.items)}")

Run experiment against gpt-4o-mini

Load embedding model

We compute Answer Semantic Similarity using gte-base-en-v1.5 hosted on HuggingFace 🤗

Check out the MTEB Leaderboard to pick the right embedding model for your task.

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer('Alibaba-NLP/gte-base-en-v1.5', trust_remote_code=True);
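As a quick sanity check, you can embed a couple of sentences and compare them: semantically close sentences should score noticeably higher than unrelated ones.

# Sanity check: related sentences should score higher than unrelated ones
emb_a = model.encode("Chainlit lets you build conversational AI interfaces in Python.")
emb_b = model.encode("You can create chat UIs for LLM applications with Chainlit.")
emb_c = model.encode("The weather in Paris is mild in spring.")

print(float(cos_sim(emb_a, emb_b)))  # expect a relatively high similarity
print(float(cos_sim(emb_a, emb_c)))  # expect a noticeably lower similarity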

Create experiment

Let’s start by creating a new experiment for our dataset.

It’s good practice to provide a meaningful name summarizing the changes you made.
In the params field, you can pass the exhaustive list of parameters that characterize the experiment you are about to run.

experiment = dataset.create_experiment(
    name="Trial with gpt-4o-mini",
    params={ 
        "model": "gpt-4o-mini",
        "type": "output similarity", 
        "embedding-model": "Alibaba-NLP/gte-base-en-v1.5", 
    }
)

Test each sample

Simply loop over the dataset and, for each entry:

  • send the question to the locally modified version of our application (using gpt-4o-mini)
  • compute the cosine similarity between the ground truth and the answer returned by the application
  • log the resulting value as a score on our experiment!

import requests
from urllib.parse import quote
from tqdm import tqdm

for item in tqdm(dataset.items):
    question = item.input["content"]["args"][0]

    # Answer from the locally modified version of the RAG application (gpt-4o-mini)
    # (URL-encode the question since it is passed in the request path)
    response = requests.get(f"http://localhost/app/{quote(question, safe='')}")
    answer = response.json()["answer"]
    answer_embedding = model.encode(answer)

    # Ground truth
    ground_truth = item.expected_output["content"]
    ground_truth_embedding = model.encode(ground_truth)

    similarity = float(cos_sim(answer_embedding, ground_truth_embedding))
    
    experiment.log({
        "datasetItemId": item.id,
        "scores": [ {
            "name": "Answer Semantic Similarity",
            "type": "AI",
            "value": similarity
        } ],
        "input": { "question": question },
        "output": { "answer": answer }
    })
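As mentioned earlier, a real-world run would log several scores per item. Below is a minimal sketch of how the loop above could be extended to also record latency; the "Latency (s)" score name is an arbitrary choice for illustration, and the score type simply mirrors the one used above (check the Literal AI docs for the available types).

import time

for item in tqdm(dataset.items):
    question = item.input["content"]["args"][0]

    # Time the round trip to the locally modified application
    start = time.perf_counter()
    response = requests.get(f"http://localhost/app/{quote(question, safe='')}")
    latency_s = time.perf_counter() - start

    answer = response.json()["answer"]
    similarity = float(cos_sim(model.encode(answer),
                               model.encode(item.expected_output["content"])))

    experiment.log({
        "datasetItemId": item.id,
        "scores": [
            { "name": "Answer Semantic Similarity", "type": "AI", "value": similarity },
            { "name": "Latency (s)", "type": "AI", "value": latency_s },
        ],
        "input": { "question": question },
        "output": { "answer": answer }
    })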

Compare experiments on Literal AI 🎉

Here is the comparison between the gpt-4o and gpt-4o-mini experiments on Literal AI!

Comparing gpt-4o vs gpt-4o-mini experiments

We already had the benchmark experiment based on the gpt-4o model.