Why evaluate?

Evaluation is key to enabling continuous deployment of LLM-based applications and to ensuring that newer versions perform better than previous ones. To capture the user experience accurately, you must understand the multiple steps that make up the application; as AI applications grow in complexity, they tend to chain more and more steps.

Literal AI lets you log & monitor the various steps of your LLM application. By doing so, you can continuously improve the performance of your LLM system by building the most relevant metrics at each level:

| Level | Metrics |
| --- | --- |
| LLM Generation | Hallucination, Toxicity, etc. |
| Agent Run | Task completion, Number of intermediate steps |
| Conversation Thread | User satisfaction |

A common example is vanilla Retrieval Augmented Generation (RAG), which augments Large Language Models (LLMs) with domain-specific data. Metrics you can score against include context relevancy, faithfulness, and answer relevancy.
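To make such metrics actionable, each level of the pipeline must first be logged. Below is a minimal sketch of instrumenting a vanilla RAG pipeline with the Python SDK; it assumes `LiteralClient`, its `step` decorator, and OpenAI instrumentation, and the function names, step types, and model are illustrative placeholders to adapt to your own pipeline.

```python
# Minimal sketch: logging the levels of a vanilla RAG pipeline.
# Assumes the Literal AI Python SDK (`literalai`) and the OpenAI client;
# function names, step types, and the model below are placeholders.
from literalai import LiteralClient
from openai import OpenAI

literal_client = LiteralClient()    # expects LITERAL_API_KEY in the environment
openai_client = OpenAI()
literal_client.instrument_openai()  # logs LLM generations automatically

@literal_client.step(type="retrieval")
def retrieve(question: str) -> list[str]:
    # Replace with your own vector-store lookup.
    return ["Some domain-specific context."]

@literal_client.step(type="run")
def rag_answer(question: str) -> str:
    context = retrieve(question)
    completion = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": f"Answer using this context: {context}"},
            {"role": "user", "content": question},
        ],
    )
    return completion.choices[0].message.content

print(rag_answer("What does our refund policy cover?"))
```

With each level logged, scores such as context relevancy can be attached to the retrieval step and faithfulness to the generation, rather than to the application as a whole.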

How to think about evaluation?

Scores are a crucial part of developing and improving your LLM application or agent.

| Who? | When? | Type of eval metrics | Example |
| --- | --- | --- | --- |
| End-User | In Production | Explicit Feedback (👍/👎) | Thumbs-up or down on a chatbot's answer |
| End-User | In Production | Implicit Feedback based on a product metric | User conversion to paid offering increases by 15% |
| LLM-as-a-Judge | In Production | AI evaluation (without ground truth) | Hallucination, context relevancy, etc. |
| LLM-as-a-Judge | During Iteration | AI evaluation against a Dataset (with or without ground truth) | Hallucination, conciseness, helpfulness, context relevancy, answer similarity, etc. |
| Domain Expert | During Iteration | Human evaluation against a Dataset (with or without ground truth) | Hallucination, conciseness, helpfulness, context relevancy, answer similarity, etc. |

Evaluate LLM logs in production automatically

Automating the evaluation of your Run outputs or LLM generations helps you monitor and improve your LLM app in production, especially with large volumes of data.

Configure an evaluation rule

Go to the dashboard and click on Configure Rules to create/manage rules.

Pick a rule type


Once you have picked a Rule Type, you can update existing rules or create new ones.

Manage your Rules

A rule is composed of:

  • Name: A name to identify the rule.
  • Sample Rate: The percentage of outputs that will be evaluated by the rule.
  • Filters: Additional conditions to decide if the rule should be triggered.
  • LLM: The model to use for the evaluation.
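
For illustration only, here is what those fields could look like for a concrete rule; this is not an API object (rules are created from the dashboard), and every value below is made up.

```python
# Purely illustrative: a rule's fields as you might fill them in the dashboard.
example_rule = {
    "name": "hallucination-check",        # identifies the rule
    "sample_rate": 20,                    # evaluate 20% of matching outputs
    "filters": {"prompt": "rag-answer"},  # hypothetical filter condition
    "llm": "gpt-4o-mini",                 # model that runs the evaluation
}
```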

Configure a score schema

A rule that outputs a score is based on a Score Schema, which defines the possible categories used to evaluate the output.

A rule that outputs a tag requires a list of possible tags.

As of now, the prompt used for the evaluation is handled by Literal AI. We are working on letting you specify your own evaluation prompt.

Configure a Rule

Evaluate with custom eval metrics (SDK)

The SDKs provide score creation APIs with all fields exposed. Scores must be tied either to a Step or a Generation object.
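
As a minimal sketch, assuming the Python SDK's `LiteralClient` exposes a score-creation API along these lines (check the SDK reference for the exact method name, score types, and fields), attaching a custom metric to a logged step could look like:

```python
# Minimal sketch: attaching a custom evaluation score to a logged Step.
# Assumes the `literalai` Python SDK; the method name, score type, and step id
# below are assumptions/placeholders to verify against the SDK reference.
from literalai import LiteralClient

client = LiteralClient()  # expects LITERAL_API_KEY in the environment

def conciseness(output: str) -> float:
    # Hypothetical custom metric: share of sentences under 30 words.
    sentences = [s for s in output.split(".") if s.strip()]
    short = [s for s in sentences if len(s.split()) < 30]
    return len(short) / max(len(sentences), 1)

client.api.create_score(
    step_id="<step-uuid>",                  # the Step (or Generation) to score
    name="conciseness",
    type="AI",                              # score type, e.g. AI vs. HUMAN
    value=conciseness("Some model output."),
    comment="Computed with a custom heuristic.",
)
```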

Label LLM logs on Literal AI

The Literal AI application offers an easy way to manage scores: Score Schemas.

Via Score Schemas, admin users can control which types of evaluations are exposed to users in the application. Score Templates come in two flavors: Categorical and Continuous.

Categorical templates let you define a set of categories, each tied to a numeric value. Continuous templates define a minimum and a maximum value between which users pick a score.
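
Purely as an illustration of the two flavors (these dictionaries are not the Literal AI data model), a categorical schema maps labels to numeric values, while a continuous one only bounds the score:

```python
# Illustrative only: the information each Score Schema flavor captures.
categorical_schema = {
    "name": "helpfulness",
    "type": "categorical",
    "categories": {"not helpful": 0, "partially helpful": 0.5, "helpful": 1},
}

continuous_schema = {
    "name": "answer-similarity",
    "type": "continuous",
    "min": 0,
    "max": 1,
}
```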

From a Step or a Generation, admins can then score by selecting a template and filling in the required form fields.

Label an LLM Generation

You can add individual Scores directly from the Logs tab. If you plan to annotate a batch of logs, use the Annotation Queue.

Automate actions based on evaluation results

Add to Datasets/Annotation Queues

🚧 Work in progress, coming soon 🚧

Add labels

Monitor evaluation results

Once your rules are set up, you can monitor their activity in the dashboard.

Monitor your Rules

You can see the number of invocations per rule, as well as the average score of the evaluations.

You can also access the logs of the evaluations to look for potential errors.