Evaluation
Learn how to evaluate your LLM applications and agents.
Why evaluate?
Evaluation is key to enabling continuous deployment of LLM-based applications and to verifying that newer versions perform better than previous ones. As AI applications grow in complexity, they tend to chain multiple steps, and capturing the user experience accurately requires understanding each of those steps.
Literal AI lets you log and monitor the various steps of your LLM application. With that visibility, you can continuously improve the performance of your LLM system by scoring each level against the metrics that matter most (a minimal scoring sketch follows the table):
| Level | Metrics |
| --- | --- |
| LLM Generation | Hallucination, toxicity, etc. |
| Agent Run | Task completion, number of intermediate steps |
| Conversation Thread | User satisfaction |
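To illustrate how scores map to these levels, here is a minimal, self-contained sketch. `Score`, `LoggedStep`, and `score_step` are hypothetical names for illustration, not the Literal AI SDK.

```python
# A minimal sketch of attaching scores at different levels of an application.
# All names here are illustrative, not the Literal AI SDK.
from dataclasses import dataclass, field

@dataclass
class Score:
    name: str    # e.g. "hallucination", "task_completion", "user_satisfaction"
    value: float # normalized to [0, 1]
    level: str   # "generation" | "agent_run" | "thread"

@dataclass
class LoggedStep:
    step_id: str
    level: str
    scores: list[Score] = field(default_factory=list)

def score_step(step: LoggedStep, name: str, value: float) -> None:
    """Attach a metric to a logged step so it can be compared across versions."""
    step.scores.append(Score(name=name, value=value, level=step.level))

# Usage: score an individual LLM generation and the agent run that contains it.
generation = LoggedStep(step_id="gen-1", level="generation")
score_step(generation, "hallucination", 0.1)

run = LoggedStep(step_id="run-1", level="agent_run")
score_step(run, "task_completion", 1.0)
```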
A common example is vanilla Retrieval-Augmented Generation (RAG), which augments a Large Language Model (LLM) with domain-specific data. Typical metrics to score a RAG pipeline against include context relevancy, faithfulness, and answer relevancy.
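As a sketch of how one such metric can be computed with an LLM-as-a-judge, the snippet below scores faithfulness using the OpenAI Python SDK. The judge model, prompt wording, and numeric-only reply convention are assumptions to adapt to your own rubric.

```python
# A sketch of an LLM-as-a-judge faithfulness check for a RAG pipeline.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge_faithfulness(context: str, answer: str) -> float:
    """Ask a judge model whether the answer is grounded in the retrieved context."""
    prompt = (
        "Rate how faithful the ANSWER is to the CONTEXT on a scale from 0 to 1, "
        "where 1 means every claim is supported by the context. "
        "Reply with the number only.\n\n"
        f"CONTEXT:\n{context}\n\nANSWER:\n{answer}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed judge model; swap in your own
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    # Sketch-level parsing: assumes the judge complied with "number only".
    return float(response.choices[0].message.content.strip())

score = judge_faithfulness(
    context="Literal AI logs LLM generations, agent runs, and threads.",
    answer="Literal AI logs LLM generations.",
)
```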
How to think about evaluation?
Scores are a crucial part of developing and improving your LLM application or agent. The table below maps who evaluates, when, and with which metrics; a sketch of a dataset-based evaluation loop follows it.
| Who? | When? | Type of eval metrics | Example |
| --- | --- | --- | --- |
| End-User | In production | Explicit feedback (👍👎) | Thumbs-up or down on a chatbot’s answer |
| End-User | In production | Implicit feedback based on a product metric | User conversion to the paid offering increases by 15% |
| LLM-as-a-Judge | In production | AI evaluation (without ground truth) | Hallucination, context relevancy, etc. |
| LLM-as-a-Judge | During iteration | AI evaluation against a dataset (with or without ground truth) | Hallucination, conciseness, helpfulness, context relevancy, answer similarity, etc. |
| Domain Expert | During iteration | Human evaluation against a dataset (with or without ground truth) | Hallucination, conciseness, helpfulness, context relevancy, answer similarity, etc. |
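For the during-iteration rows, a dataset with ground truth lets you compute metrics such as answer similarity automatically. Below is a minimal sketch using embedding cosine similarity with the OpenAI Python SDK; the dataset shape and the `run_app` stub are hypothetical, and the embedding model choice is an assumption.

```python
# A sketch of scoring answer similarity against a ground-truth dataset.
import math
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> list[float]:
    """Embed a text with an assumed embedding model."""
    return client.embeddings.create(
        model="text-embedding-3-small", input=text
    ).data[0].embedding

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def run_app(question: str) -> str:
    # Hypothetical stub: replace with your application under test.
    return "Generations, runs, threads."

dataset = [
    {"input": "What does Literal AI log?", "expected": "Generations, runs, threads."},
]

for item in dataset:
    answer = run_app(item["input"])
    similarity = cosine(embed(answer), embed(item["expected"]))
    print(item["input"], round(similarity, 3))
```

The same loop structure works for judge-based metrics without ground truth: swap the similarity call for an LLM-as-a-judge scorer like the faithfulness sketch above.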