The source code of this notebook can be found in the Literal AI GitHub Cookbooks.
This notebook shows you how to validate changes to your RAG application against context relevancy.
We rely on Ragas to evaluate that metric and on Literal AI to visualize our iterative experiments.
First, we create a dataset from an example RAG application. Second, we evaluate the impact of a retrieval parameter change (the number of retrieved contexts) on context relevancy.
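The snippets below assume the Literal AI and OpenAI clients are already initialized. A minimal setup could look like the following sketch, assuming the API keys are available as environment variables:

```python
import os

from literalai import LiteralClient
from openai import OpenAI

# Assumes LITERAL_API_KEY and OPENAI_API_KEY are set in the environment.
literal_client = LiteralClient(api_key=os.environ["LITERAL_API_KEY"])
openai_client = OpenAI()

# Optional: have Literal AI automatically log OpenAI calls.
literal_client.instrument_openai()
```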
```python
import chromadb

chroma_client = chromadb.Client()
collection = chroma_client.get_or_create_collection("Biography")
collection.add(
    documents=[
        "My name is John.",
        "My job is coding.",
        "My dog's name is Fido. Fido is an expert fetcher.",
    ],
    ids=["id1", "id2", "id3"],
)
```
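To sanity-check retrieval before wiring it into the RAG pipeline, you can query the collection directly. The question below is purely illustrative:

```python
# Retrieve the two nearest documents for a sample question.
sanity_check = collection.query(query_texts=["What's my name?"], n_results=2)

# Chroma returns one list of documents per query text.
print(sanity_check["documents"][0])
```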
```python
PROMPT_NAME = "RAG prompt"
template_messages = [
    {
        "role": "system",
        "content": "You are a helpful assistant that always answers questions. Keep it short, and if available prefer responding with code.",
    },
    {
        "role": "user",
        "content": "Answer the question based on the context below.\nContext:\n{{#context}}\n{{.}}\n{{/context}}\nQuestion:\n{{question}}\nAnswer:",
    },
]

prompt = literal_client.api.get_or_create_prompt(
    name=PROMPT_NAME, template_messages=template_messages
)
```
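For reference, `format_messages` fills the Mustache placeholders with the retrieved contexts and the question, producing the list of messages sent to the chat model. The context and question below are illustrative values:

```python
# Render the template with an example context and question.
example_messages = prompt.format_messages(
    context=["My name is John."],
    question="What's my name?",
)
# example_messages is a list of {"role": ..., "content": ...} dicts ready for the OpenAI API.
```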
```python
@literal_client.step(type="run", name="RAG")
def rag(user_query: str):
    # Retrieval step: fetch the two nearest documents and log them on the step.
    with literal_client.step(type="retrieval", name="Retrieve") as step:
        step.input = {"question": user_query}
        results = collection.query(query_texts=[user_query], n_results=2)
        step.output = results

    messages = prompt.format_messages(context=results["documents"][0], question=user_query)

    completion = openai_client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=messages,
    )
    return completion.choices[0].message.content


def main():
    # Each question gets its own thread: user message, RAG run, assistant message.
    questions = ["What's my name?", "What's my job?"]
    for idx, question in enumerate(questions):
        with literal_client.thread(name=f"Question {idx+1}") as thread:
            literal_client.message(content=question, type="user_message", name="User")
            answer = rag(question)
            literal_client.message(content=answer, type="assistant_message", name="My Assistant")


main()

# Network requests by the SDK are performed asynchronously.
# Invoke flush() to guarantee the completion of all requests prior to process termination.
# WARNING: If you run a continuous server, you should not use this method.
literal_client.flush()
```
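The next cell reads items from a Literal AI dataset built from the runs logged above. As a rough sketch, you could add each logged "RAG" run step to a new dataset; the dataset name, the pagination, and the exact SDK calls here are assumptions to adapt to your SDK version:

```python
# Create a dataset in Literal AI (the name is illustrative).
dataset = literal_client.api.create_dataset(name="Biography questions")

# Fetch the threads logged by main() and attach each "RAG" run step as a dataset item.
threads = literal_client.api.get_threads(first=10).data
for thread in threads:
    for step in thread.steps:
        if step.name == "RAG":
            dataset.add_step(step.id)
```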
```python
import json
from typing import List

from literalai import DatasetItem

items = dataset.items

# Get the retrieved contexts for each question.
contexts = []
for item in items:
    retrieve_step = next(step for step in item.intermediary_steps if step["name"] == "Retrieve")
    contexts.append(retrieve_step["output"]["documents"][0])

# Data samples, in the format expected by Ragas.
# No ground truth needed since we will evaluate context relevancy.
data_samples = {
    "question": [item.input["args"][0] for item in items],
    "answer": [item.expected_output["content"] for item in items],
    "contexts": contexts,
    "ground_truth": [""] * len(items),
}
```
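With `data_samples` in place, the Ragas evaluation and the push to Literal AI could look like the sketch below. It assumes a ragas version that still exposes the `context_relevancy` metric and an OpenAI key available for scoring; the experiment name and the `create_experiment`/`log` calls are assumptions to check against your SDK version:

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import context_relevancy

# Score context relevancy for each question / retrieved-contexts pair.
results = evaluate(Dataset.from_dict(data_samples), metrics=[context_relevancy])
scores = results.to_pandas()["context_relevancy"].tolist()

# Log the scores as an experiment in Literal AI so runs with different
# retrieval settings (two contexts vs. one) can be compared side by side.
experiment = dataset.create_experiment(name="Retrieval with 2 contexts")
for item, score in zip(items, scores):
    experiment.log({
        "datasetItemId": item.id,
        "scores": [{"name": "context_relevancy", "type": "AI", "value": score}],
    })
```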
Comparing the two experiments in Literal AI, you can visualize the difference in retrieved contexts: two for experiment A versus one for experiment B.
Context relevancy captures the ratio of question-relevant facts in retrieved contexts.
When we retrieve irrelevant contexts (the two facts about the dog do not help answer the question), context relevancy drops to 1/3.
Once we limit ourselves to a single context, we retrieve exactly the one useful fact, which yields a maximum context relevancy of 1.