In this guide you will learn how to build, monitor, evaluate and improve a conversational application.
The chatbot uses Retrieval Augmented Generation (RAG) to retrieve contexts for answer generation.
In combination with Literal AI, we will use the following tools:
First, we pick Pinecone as the vector database for our RAG pipeline. Any other vector store could be used instead.
Make sure to store your API keys for Literal AI, OpenAI and Pinecone in a .env file, as follows.
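For example (the Literal AI and OpenAI variable names below are the clients' usual defaults and may differ in your setup; the Pinecone variables match the code later in this guide):

```
LITERAL_API_KEY="..."
OPENAI_API_KEY="..."
PINECONE_API_KEY="..."
PINECONE_CLOUD="..."
PINECONE_REGION="..."
```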
Also make sure you have documents stored in this vector database.
Beyond the initial imports, here is how the Python code for this chatbot is structured.
First, we create a Pinecone vector index. This vector database stores the documents and the embeddings of the data that you want your chatbot to use. For example, you can embed and store the pages of a technical product's documentation.
Check out the Embed the documentation
section to populate your Pinecone index with your own documents.
```python
pinecone_client = Pinecone(api_key=os.environ.get("PINECONE_API_KEY"))

pinecone_spec = ServerlessSpec(
    cloud=os.environ.get("PINECONE_CLOUD"),
    region=os.environ.get("PINECONE_REGION"),
)

def create_pinecone_index(name, client, spec):
    """Create a Pinecone index if it does not already exist."""
    if name not in client.list_indexes().names():
        client.create_index(name, dimension=1536, metric="cosine", spec=spec)
    return pinecone_client.Index(name)

pinecone_index = create_pinecone_index("literal-rag-index", pinecone_client, pinecone_spec)
```
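If your index is still empty, populating it can look roughly like the sketch below. This is a minimal illustration, not the code from the Embed the documentation section: the `chunks` list, the id scheme and the metadata layout are assumptions, chosen so that the retrieval code later in this guide finds a `text` field in the metadata.

```python
from openai import OpenAI

# Hypothetical sketch: embed documentation chunks and upsert them into the index.
sync_openai_client = OpenAI()  # uses OPENAI_API_KEY from the environment

# Placeholder chunks; in practice these come from your own documentation.
chunks = ["Literal AI is a platform for LLM observability...", "Threads group steps..."]

response = sync_openai_client.embeddings.create(
    input=chunks, model="text-embedding-ada-002"
)

pinecone_index.upsert(
    vectors=[
        {"id": f"chunk-{i}", "values": item.embedding, "metadata": {"text": chunk}}
        for i, (chunk, item) in enumerate(zip(chunks, response.data))
    ]
)
```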
We then use a prompt defined in a file, or fetched from Literal AI. The prompt definition includes the name, template_messages, tools and settings. Here is what the prompt definition looks like:
{"name":"RAG prompt - Tooled","template_messages":[{"role":"system","content":"You are a helpful assistant identifying questions related to Literal AI documentation. Common concepts in the Literal AI doc include: Observability, Prompt, Monitoring, Thread, Step, Run, Generation, Tag, Score, User, Evaluation, Dataset, Experiment, Projects and Roles. Keep it short, and if available prefer responding with code."}],"tools":[{"type":"function","function":{"name":"get_relevant_documentation_chunks","description":"Gets the relevant documentation chunks for a given question or query.","parameters":{"type":"object","properties":{"question":{"type":"string","description":"The question or query to answer on the documentation."},"top_k":{"type":"integer","description":"The number of documentation chunks to retrieve, e.g. 5."}},"required":["question"]}}}],"settings":{"provider":"openai","stop":[],"model":"gpt-4-turbo-preview","top_p":1,"max_tokens":512,"temperature":0,"presence_penalty":0,"frequency_penalty":0}}
We initialize a user session with the prompt. When a user question comes in, we trigger our RAG agent logic,
which we will study in the next step.
```python
prompt_path = os.path.join(os.getcwd(), "app/prompts/rag.json")

# We load the RAG prompt in Literal AI to track prompt iterations and
# enable LLM replays from Literal AI.
with open(prompt_path, "r") as f:
    rag_prompt = json.load(f)

prompt = client.api.get_or_create_prompt(
    name=rag_prompt["name"],
    template_messages=rag_prompt["template_messages"],
    settings=rag_prompt["settings"],
    tools=rag_prompt["tools"],
)

@cl.on_chat_start
async def on_chat_start():
    """Send a welcome message and set up the initial user session on chat start."""
    await cl.Message(
        content="Welcome, please ask me anything about the Literal AI documentation!",
        disable_feedback=True,
    ).send()

    cl.user_session.set("messages", prompt.format_messages())
    cl.user_session.set("settings", prompt.settings)
    cl.user_session.set("tools", prompt.tools)

@cl.on_message
async def main(message: cl.Message):
    """Main message handler for incoming user messages."""
    await rag_agent(message.content)
```
Vanilla RAG pipelines usually take in a user’s query and immediately
use it to retrieve the relevant contexts from their vector database.
To highlight the typical flow of a more general LLM-based application (an agent),
we instead perform a first LLM call that exposes the retrieval tool.
We thus let the “planner” decide whether to make use of retrieval: if the
question is a general-purpose one, we don’t need to query the vector database.
```python
async def llm_tool(question):
    """Generate a response from the LLM based on the user's question."""
    messages = cl.user_session.get("messages", []) or []
    messages.append({"role": "user", "content": question})

    settings = cl.user_session.get("settings", {}) or {}
    settings["tools"] = cl.user_session.get("tools")
    settings["tool_choice"] = "auto"

    response = await openai_client.chat.completions.create(
        messages=messages,
        **settings,
    )
    response_message = response.choices[0].message
    messages.append(response_message)
    return response_message
```
The relevant documents from the vector store are then retrieved and used
as context to answer the user query.
@cl.step(name="Embed",type="embedding")asyncdefembed(question, model="text-embedding-ada-002"):""" Embed a question using the specified model andreturn the embedding.""" embedding =await openai_client.embeddings.create(input=question, model=model)return embedding.data[0].embedding@cl.step(name="Retrieve",type="retrieval")asyncdefretrieve(embedding, top_k):""" Retrieve top_k closest vectors from the Pinecone index using the provided embedding."""if pinecone_index ==None:raise Exception("Pinecone index not initialized") response = pinecone_index.query( vector=embedding, top_k=top_k, include_metadata=True)return response.to_dict()@cl.step(name="Retrieval",type="tool")asyncdefget_relevant_documentation_chunks(question, top_k=5):""" Retrieve relevant documentation chunks based on the question embedding.""" embedding =await embed(question) retrieved_chunks =await retrieve(embedding, top_k)return[match["metadata"]["text"]formatchin retrieved_chunks["matches"]]
Based on the tool results - the relevant contexts retrieved from Pinecone -
we let an LLM generate an answer to the user query.
```python
async def llm_answer(tool_results):
    """Generate an answer from the LLM based on the results of tool calls."""
    messages = cl.user_session.get("messages", []) or []
    messages.extend(tool_results)

    settings = cl.user_session.get("settings", {}) or {}
    settings["stream"] = True

    stream = await openai_client.chat.completions.create(
        messages=messages,
        **settings,
    )

    curr_step = cl.context.current_step
    if not curr_step:
        return ""

    async for part in stream:
        if token := part.choices[0].delta.content or "":
            await curr_step.stream_token(token)

    await curr_step.update()
    messages.append({"role": "assistant", "content": curr_step.output})
    return curr_step.output
```
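The `rag_agent` function that ties these steps together is not shown in this section. Here is a minimal sketch of how the pieces could be wired together; the orchestration details and the tool-result message format are assumptions, not the guide's actual implementation.

```python
@cl.step(name="RAG Agent", type="run")
async def rag_agent(question):
    """Hypothetical orchestration: plan, optionally retrieve, then answer."""
    # 1. Let the "planner" LLM decide whether to call the retrieval tool.
    response_message = await llm_tool(question)

    if not response_message.tool_calls:
        # No retrieval needed: the planner answered directly.
        return response_message.content

    # 2. Run the retrieval tool for each requested call.
    tool_results = []
    for tool_call in response_message.tool_calls:
        args = json.loads(tool_call.function.arguments)
        chunks = await get_relevant_documentation_chunks(**args)
        tool_results.append(
            {
                "role": "tool",
                "tool_call_id": tool_call.id,
                "name": tool_call.function.name,
                "content": "\n".join(chunks),
            }
        )

    # 3. Generate the final answer from the retrieved contexts.
    return await llm_answer(tool_results)
```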
Note how the different steps of the RAG pipeline use step decorators: @cl.step(name="...", type="..."). This is done so that each step can be followed within its thread in Literal AI.
If you are using Chainlit (as in the example above), threads are logged automatically.
If you prefer another stack, such as FastAPI, Flask or TypeScript, you can use the Literal AI SDK directly.
For Python, use:
```python
with client.thread() as thread:
    # do something
```
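For example, here is a minimal, self-contained sketch; the `answer_question` function is a hypothetical placeholder for your own pipeline code.

```python
from literalai import LiteralClient

literal_client = LiteralClient()  # assumes LITERAL_API_KEY is set in the environment

def answer_question(question: str) -> str:
    # Hypothetical placeholder for your own pipeline (retrieval + generation).
    return "..."

# Everything executed inside the context manager is logged to the same thread.
with literal_client.thread() as thread:
    answer = answer_question("How do I create a dataset in Literal AI?")
```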
The threads or conversations will become visible in the Literal AI UI.
3. Run a few manual sample iterations and collect feedback
For initial testing purposes, you can manually ask your chatbot a few questions to get familiar with the process and the results. The results, and the steps the chatbot took to arrive at an answer, are visible in the Literal AI UI under “Threads”. For a real application you will want more extensive testing, but this is useful to get a first impression of the model.
Users can give feedback in the chatbot application to indicate how happy they are with the result. This gives you an indication of how well the model behaves.
As an administrator, you can also give feedback on the chatbot’s answers in threads that have already taken place.
In order to test the chatbot’s model, you need to create a dataset. First, decide on the evaluation criteria that you want to test. Ragas defines various metrics that can be tested in isolation, such as faithfulness, answer relevancy, context recall and answer correctness. In this tutorial, we are going to calculate faithfulness and answer correctness. Faithfulness measures the factual consistency of the answer given the context. It is computed by dividing the number of truthful claims that can be inferred from the given context by the total number of claims in the chatbot’s generated answer. Answer correctness measures the accuracy of the generated answer compared to the ground truth.
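Concretely (a small illustrative example, not taken from the Ragas documentation):

$$
\text{faithfulness} = \frac{\text{number of claims in the answer that can be inferred from the given context}}{\text{total number of claims in the answer}}
$$

For example, if a generated answer makes 5 claims and 4 of them are supported by the retrieved context, its faithfulness score is 4/5 = 0.8.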
The dataset for this experiment will consist of examples that we add by hand. The examples have an input and an expected_output. The input is the user query, and the expected_output is the expected answer from the chatbot.
```python
# Create Dataset in Literal AI
dataset = literal_client.api.create_dataset(
    name="TestDataset",
    description="A dataset to store samples.",
    metadata={"isDemo": True},
)

# The items with ground truth to add to the dataset
items = [
    {
        "input": "What is Literal AI?",
        "expected_output": "Literal AI is a platform for LLM observability, evaluation and prompt collaboration.",
    },
    {
        "input": "What does Literal AI offer?",
        "expected_output": "The platform has a UI and a Python and TypeScript client.",
    },
]

# Add items to Dataset in Literal AI
for item in items:
    dataset.create_item(
        input={"content": item["input"]},
        expected_output={"content": item["expected_output"]},
    )
```
Note that for a serious evaluation you would want many more than two examples.
In order to evaluate the chatbot, we need to run the test cases that we saved in the dataset against the chatbot’s LLM, and add the chatbot’s answers and retrieved contexts to the item list. Run the input questions from the test dataset on the chatbot. In this example, Chainlit is used: open the Chainlit chatbot and ask the questions there, in two separate chats. These threads will be visible in the Literal AI UI, under “Threads”.
Test Threads
Test Thread
Next, you want to add the chatbot’s answers and retrieved contexts to the list of test items that we created in the previous step; in the next step we will use this list to evaluate with Ragas. Make sure to use the Step ID of the LLM step and the Step ID of the Retrieve step in each thread.
```python
import ast

# Replace with your step IDs
llm_steps = ["b5f69f7e-a9d2-4a6f-8b1a-61e8a2c0a374", "b165f8c2-ee08-44ab-9b3a-ae05a669ccaa"]
context_steps = ["3c848131-bb52-4f38-85c1-afdc1422e748", "2d96877d-4beb-4337-978b-f8c19daf3527"]

# Add answers and retrieved contexts to the item list
for idx, llm_step in enumerate(llm_steps):
    llm_step_data = literal_client.api.get_step(llm_step)
    items[idx]["answer"] = llm_step_data.output["content"]

    contexts = []
    contexts_data = literal_client.api.get_step(context_steps[idx])
    contexts_data_json = ast.literal_eval(contexts_data.output["content"])["matches"]
    for context in contexts_data_json:
        contexts.append(context["id"])
    items[idx]["context"] = contexts
```
Next, we can evaluate the two generated answers on faithfulness and answer correctness. For the evaluation, we use Ragas, a Python framework for evaluating RAG pipelines. Note that this evaluation is done outside of Literal AI (offline).
```python
from ragas import evaluate
from ragas.metrics import faithfulness, answer_correctness
from datasets import Dataset

test_dataset = {
    'question': [item["input"] for item in items],
    'answer': [item["answer"] for item in items],
    'contexts': [item["context"] for item in items],
    'ground_truth': [item["expected_output"] for item in items],
}

result = evaluate(
    Dataset.from_dict(test_dataset),
    metrics=[faithfulness, answer_correctness],
)

df = result.to_pandas()
print(df.head())
```
Finally, you can analyze the results of the evaluation. The results will be shown in the terminal.
Evaluation result using RAGAS
You can see that both answers get the highest score for faithfulness, meaning that the chatbot’s answers are factually consistent with the given context. Answer correctness is also high for the first result, but low for the second. This matches human judgment if we look at the examples we created. Based on these results, you can decide to improve the chatbot’s model or prompt.
You can also see the experiment results in Literal AI, if you upload the experiment.
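As a rough sketch of what that upload could look like with the Python SDK, assuming an experiment API in the style of the dataset API used earlier (the method names create_experiment and experiment.log, and the item fields, are assumptions; check the Literal AI documentation for the exact interface):

```python
# Hypothetical sketch: push the Ragas scores back to Literal AI as an experiment.
# The exact API (create_experiment, experiment.log) and field names are assumptions.
experiment = literal_client.api.create_experiment(
    name="Ragas evaluation", dataset_id=dataset.id
)

for row in df.to_dict("records"):
    experiment.log(
        {
            "scores": [
                {"name": "faithfulness", "type": "AI", "value": row["faithfulness"]},
                {"name": "answer_correctness", "type": "AI", "value": row["answer_correctness"]},
            ],
            "input": {"content": row["question"]},
            "output": {"content": row["answer"]},
        }
    )
```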