Why evaluate?

Evaluation is key to enabling continuous deployment of LLM-based applications: it is how you ensure that newer versions perform at least as well as previous ones. As AI applications grow in complexity, they tend to chain multiple steps, so capturing the user experience well requires understanding each of the steps that make up the application.

Literal AI lets you log and monitor the various steps of your LLM application. By doing so, you can continuously improve the performance of your LLM system, tracking the metrics most relevant to each level:

| Level | Metrics |
| --- | --- |
| LLM Generation | Hallucination, toxicity, etc. |
| Agent Run | Task completion, number of intermediate steps |
| Conversation Thread | User satisfaction |
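
As an illustration of these three levels, here is a minimal sketch in plain Python. The `Score`, `LLMGeneration`, `AgentRun`, and `ConversationThread` names are hypothetical and chosen for this example; this is not the Literal AI SDK, only a model of how scores can be attached at each level.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Score:
    """A metric attached to one level of the application."""
    name: str                      # e.g. "hallucination", "task_completion", "user_satisfaction"
    value: float                   # normalized to [0, 1]
    comment: Optional[str] = None


@dataclass
class LLMGeneration:
    prompt: str
    completion: str
    scores: list[Score] = field(default_factory=list)


@dataclass
class AgentRun:
    generations: list[LLMGeneration] = field(default_factory=list)
    scores: list[Score] = field(default_factory=list)


@dataclass
class ConversationThread:
    runs: list[AgentRun] = field(default_factory=list)
    scores: list[Score] = field(default_factory=list)


# Score each level independently: the generation for hallucination,
# the run for task completion, and the thread for user satisfaction.
generation = LLMGeneration(
    prompt="What is RAG?",
    completion="RAG augments LLMs with retrieved, domain-specific context.",
)
generation.scores.append(Score(name="hallucination", value=0.0))

run = AgentRun(generations=[generation])
run.scores.append(Score(name="task_completion", value=1.0, comment="Answered in one step"))

thread = ConversationThread(runs=[run])
thread.scores.append(Score(name="user_satisfaction", value=1.0, comment="👍 from end-user"))
```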

A common example is vanilla Retrieval Augmented Generation (RAG), which augments Large Language Models (LLMs) with domain-specific data. Metrics you can score a RAG pipeline against include context relevancy, faithfulness, and answer relevancy.
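
For concreteness, here is a minimal sketch of such a vanilla RAG pipeline. The OpenAI client usage, the `gpt-4o-mini` model name, and the toy keyword-overlap retriever are assumptions made for illustration; the point is that the question, the retrieved contexts, and the generated answer are exactly the artifacts that metrics like context relevancy, faithfulness, and answer relevancy are computed over.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# A toy in-memory "index"; a real application would use a vector store.
DOCUMENTS = [
    "Literal AI lets you log and monitor the steps of an LLM application.",
    "Retrieval Augmented Generation augments an LLM with domain-specific data.",
]


def retrieve(question: str, k: int = 2) -> list[str]:
    """Naive keyword-overlap retrieval standing in for a vector search."""
    question_words = set(question.lower().split())
    ranked = sorted(
        DOCUMENTS,
        key=lambda doc: len(question_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def answer(question: str) -> dict:
    """Run the RAG pipeline and return the artifacts used by RAG metrics."""
    contexts = retrieve(question)
    context_block = "\n".join(contexts)
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical model choice
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context_block}\n\nQuestion: {question}"},
        ],
    )
    # question, contexts, and answer are the inputs to context relevancy,
    # faithfulness, and answer relevancy scoring.
    return {
        "question": question,
        "contexts": contexts,
        "answer": completion.choices[0].message.content,
    }
```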

How to think about evaluation?

Scores are a crucial part of developing and improving your LLM application or agent.

| Who? | When? | Type of eval metrics | Example |
| --- | --- | --- | --- |
| End-User | In Production | Explicit feedback (👍👎) | Thumbs-up or down on a chatbot's answer |
| End-User | In Production | Implicit feedback based on a product metric | User conversion to paid offering increases by 15% |
| LLM-as-a-Judge | In Production | AI evaluation (without ground truth) | Hallucination, context relevancy, etc. |
| LLM-as-a-Judge | During Iteration | AI evaluation against a Dataset (with or without ground truth) | Hallucination, conciseness, helpfulness, context relevancy, answer similarity, etc. |
| Domain Expert | During Iteration | Human evaluation against a Dataset (with or without ground truth) | Hallucination, conciseness, helpfulness, context relevancy, answer similarity, etc. |
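
To illustrate the LLM-as-a-Judge rows, here is a minimal sketch of an AI evaluation that needs no ground truth, using the OpenAI chat completions API. The judge model, the prompt, and the `judge_faithfulness` helper are assumptions made for this example, not built-in Literal AI evaluators.

```python
import json

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are an impartial evaluator.
Given a context and an answer, rate how faithful the answer is to the context
on a scale from 0 (entirely unsupported) to 1 (fully supported).
Respond with JSON: {"score": <float>, "reason": "<short justification>"}."""


def judge_faithfulness(context: str, answer: str) -> dict:
    """Hypothetical LLM-as-a-judge scorer: no ground truth required."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical judge model
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": f"Context:\n{context}\n\nAnswer:\n{answer}"},
        ],
    )
    return json.loads(completion.choices[0].message.content)


# Example: score a generation logged in production.
verdict = judge_faithfulness(
    context="Literal AI lets you log and monitor the steps of an LLM application.",
    answer="Literal AI can monitor each step of your LLM app.",
)
print(verdict["score"], verdict["reason"])
```

During iteration, the same judge can be run against a dataset of logged inputs and outputs (with or without reference answers); a domain expert can then review the judge's verdicts on that dataset.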

Leverage Literal AI