Swap LLMs and Validate App Performance
Introduction
Are you an AI Engineer recently tasked with switching to the latest LLM release?
Follow this tutorial to ship your changes confidently with Literal AI! 🚀
To illustrate the flow of experiments on Literal AI, we will walk you through swapping `gpt-4o` with `gpt-4o-mini` on a RAG application. We will use an already deployed LLM application which answers questions on the Chainlit documentation, using `gpt-4o`.
In short, it’s a Retrieval Augmented Generation (RAG) application, with access to the Chainlit documentation and cookbooks.
RAG chatbot: left, the chatbot (Chainlit); right, the monitoring (Literal AI)
If you want to learn more about building RAG applications, check out the code here.
Play with it on https://help.chainlit.io or via Discord.
You can use Literal AI to build a validation dataset from production data or hand-labelled items.
For our RAG application, we already have the dataset "Test dataset to ship RAG".
We can run any kind of experiments on the dataset, with the evaluation metrics of our choice.
Here, we will show the Experiments flow by checking how semantically dissimilar `gpt-4o-mini` answers are from the expected ground truths.
In a real-world scenario, you would check against a few metrics before shipping the change: context relevancy, answer similarity, latency, cost, etc.
Setup
Get your API key and connect to Literal AI!
The below cell will prompt you for your `LITERAL_API_KEY` and create a `LiteralClient`, which we will use to get our dataset and push the results of our experiments 🤗
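For reference, a minimal sketch of that cell, assuming the `literalai` Python SDK:

```python
import getpass
import os

from literalai import LiteralClient

# Prompt for the API key if it is not already set in the environment.
if "LITERAL_API_KEY" not in os.environ:
    os.environ["LITERAL_API_KEY"] = getpass.getpass("Enter your LITERAL_API_KEY: ")

# Create the client we will reuse to fetch the dataset and log experiment results.
literal_client = LiteralClient(api_key=os.environ["LITERAL_API_KEY"])
```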
Get dataset
Here’s what our dataset looks like on Literal AI. It contains:
- the questions in the Input column
- the answers (ground truths) in the Output column
- the intermediate steps taken by our RAG agent in the dashed box
We will fetch the whole dataset, but focus on `input` and `output` for this tutorial.
Questions and Answers dataset
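As an illustration, fetching the dataset might look like the sketch below; the `get_dataset(name=...)` call and the item attribute names (`items`, `input`, `expected_output`) are assumptions about the SDK surface:

```python
# Fetch the dataset by its name on Literal AI (assumed SDK call).
dataset = literal_client.api.get_dataset(name="Test dataset to ship RAG")

print(f"Fetched {len(dataset.items)} items")
print(dataset.items[0].input)            # the question
print(dataset.items[0].expected_output)  # the ground-truth answer
```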
Run experiment against gpt-4o-mini
Load embedding model
We compute Answer Semantic Similarity using gte-base-en-v1.5 hosted on HuggingFace 🤗
Check out the MTEB Leaderboard to pick the right embedding model for your task.
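Loading the model with `sentence-transformers` is straightforward; note that `gte-base-en-v1.5` requires `trust_remote_code=True` to load its custom architecture:

```python
from sentence_transformers import SentenceTransformer

# Load the embedding model from the HuggingFace Hub.
embedding_model = SentenceTransformer(
    "Alibaba-NLP/gte-base-en-v1.5",
    trust_remote_code=True,
)
```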
Create experiment
Let’s start with creating a new experiment for our dataset.
It’s good practice to provide a meaningful name summarizing the changes you made.
In the `params` field, you can pass the exhaustive list of parameters that characterize the experiment you are about to run.
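Here is a sketch of what that could look like, assuming the dataset object exposes a `create_experiment` helper; the name and `params` values are only examples:

```python
# Create a new experiment attached to the dataset (assumed SDK helper).
experiment = dataset.create_experiment(
    name="Swap to gpt-4o-mini",
    params={
        "model": "gpt-4o-mini",
        "embedding_model": "Alibaba-NLP/gte-base-en-v1.5",
        "metric": "answer_semantic_similarity",
    },
)
```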
Test each sample
Simply loop over the dataset and, for each entry:
- send the `question` to the locally modified version of our application (using `gpt-4o-mini`)
- compute the cosine similarity between the ground truth and the generated answer
- log the resulting value as a score on our experiment, as sketched below!
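In the sketch below, `answer_question` is a hypothetical stand-in for the locally modified RAG application running `gpt-4o-mini`, and the item fields and the shape of the `experiment.log` payload (score `name`, `type`, `value`) are assumptions about the SDK:

```python
from sentence_transformers import util

for item in dataset.items:
    question = item.input["content"]              # assumed item structure
    ground_truth = item.expected_output["content"]

    # Call the locally modified application (gpt-4o-mini under the hood).
    answer = answer_question(question)            # hypothetical helper

    # Cosine similarity between the ground truth and the generated answer.
    embeddings = embedding_model.encode([ground_truth, answer])
    similarity = float(util.cos_sim(embeddings[0], embeddings[1]))

    # Attach the score to our experiment (assumed payload shape).
    experiment.log(
        {
            "datasetItemId": item.id,
            "input": {"question": question},
            "output": {"answer": answer},
            "scores": [
                {
                    "name": "answer_semantic_similarity",
                    "type": "AI",
                    "value": similarity,
                }
            ],
        }
    )
```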
Compare experiments on Literal AI 🎉
Here is the comparison between the `gpt-4o` and `gpt-4o-mini` experiments on Literal AI!
Comparing gpt-4o vs gpt-4o-mini experiments
We already had the benchmark experiment based on the `gpt-4o` model.