Why evaluate?

Evaluation is key to enabling continuous deployment of LLM-based applications and to guaranteeing that newer versions perform better than previous ones. To capture the user experience accurately, you must understand the individual steps that make up your application: as AI applications grow in complexity, they tend to chain multiple steps.

Literal AI lets you log and monitor the various steps of your LLM application. By doing so, you can build the most relevant metrics at each level and continuously improve the performance of your LLM system:

| Level | Metrics |
| --- | --- |
| LLM Generation | Hallucination, Toxicity, etc. |
| Agent Run | Task completion, Number of intermediate steps |
| Conversation Thread | User satisfaction |

A common example is vanilla Retrieval Augmented Generation (RAG), which augments Large Language Models (LLMs) with domain-specific data. Metrics you can score a RAG pipeline against include context relevancy, faithfulness, and answer relevancy.
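
Here is a minimal sketch of logging a vanilla RAG pipeline step by step so that each level above can be scored. It assumes the Literal AI Python SDK with its `LiteralClient`, `step` decorator, and `instrument_openai` helper; check the SDK reference for exact names and signatures, and note that the in-memory knowledge base is purely illustrative.

```python
# Minimal sketch: log a vanilla RAG pipeline step by step.
# Assumes the Literal AI Python SDK's `step` decorator and `instrument_openai`
# helper — check the SDK reference for exact names; the in-memory
# "knowledge base" below is purely illustrative.
from literalai import LiteralClient
from openai import OpenAI

literalai_client = LiteralClient(api_key="YOUR_LITERAL_AI_API_KEY")
openai_client = OpenAI()
literalai_client.instrument_openai()  # LLM generations get logged automatically

KNOWLEDGE_BASE = [
    "AI Evals are configured from the Literal AI dashboard.",
    "Rules can output either a Tag or a Score.",
]

@literalai_client.step(type="retrieval", name="Retrieve")
def retrieve(question: str) -> list[str]:
    # Stand-in for a real vector-store lookup; logged as a retrieval step.
    words = question.lower().split()
    return [doc for doc in KNOWLEDGE_BASE if any(w in doc.lower() for w in words)]

@literalai_client.step(type="run", name="RAG Agent")
def rag_agent(question: str) -> str:
    context = "\n".join(retrieve(question))
    completion = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": f"Answer using only this context:\n{context}"},
            {"role": "user", "content": question},
        ],
    )
    return completion.choices[0].message.content

print(rag_agent("How do I configure AI Evals?"))
```

Each decorated function shows up as its own step in Literal AI, so you can attach scores at the generation, run, or thread level.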

How to think about evaluation?

Scores are a crucial part of developing and improving your LLM application or agent.

| Who? | When? | Type of eval metrics | Example |
| --- | --- | --- | --- |
| End-User | In Production | Explicit Feedback (👍/👎) | Thumbs-up or down on a chatbot’s answer |
| End-User | In Production | Implicit Feedback based on a product metric | User conversion to paid offering increases by 15% |
| LLM-as-a-Judge | In Production | AI evaluation (without ground truth) | Hallucination, context relevancy, etc. |
| LLM-as-a-Judge | During Iteration | AI evaluation against a Dataset (with or without ground truth) | Hallucination, conciseness, helpfulness, context relevancy, answer similarity, etc. |
| Domain Expert | During Iteration | Human evaluation against a Dataset (with or without ground truth) | Hallucination, conciseness, helpfulness, context relevancy, answer similarity, etc. |

Evaluate LLM logs in production automatically

Automating the evaluation of your Run outputs or LLM generations helps you monitor and improve your LLM app in production, especially at large volumes of data.

Configure an AI Eval

Go to the dashboard and click on Configure AI Evals to create/manage AI Evals.

Configure AI Evals

You will be able to update existing rules or create new ones:

Manage your Rules

A rule is composed of the following fields, illustrated in the sketch after this list:

  • Name: A name to identify the rule.
  • Type: The type of output, either Tag or Score.
  • Sample Rate: The percentage of outputs that will be evaluated by the rule.
  • Filters: Additional conditions to decide if the rule should be triggered.
  • LLM: The model to use for the evaluation.
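
To make these fields concrete, here is a purely illustrative sketch of a Score-type rule. Rules are created and edited in the dashboard rather than in code; the field names below simply mirror the list above.

```python
# Purely illustrative — rules are configured from the Literal AI dashboard,
# not through code; this dict only mirrors the fields listed above.
hallucination_rule = {
    "name": "hallucination-check",        # Name: identifies the rule
    "type": "Score",                      # Type: either "Tag" or "Score"
    "sample_rate": 0.2,                   # Sample Rate: evaluate 20% of outputs
    "filters": {"prompt": "rag-answer"},  # Filters: hypothetical trigger condition
    "llm": "gpt-4o",                      # LLM: model used to run the evaluation
}
```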

Tag-based Rules

A rule which outputs a tag requires a list of possible tags and an LLM configuration. The LLM logs are automatically tagged with the most relevant tag based on their semantic meaning.

Under the hood, Literal AI uses an optimized classifier prompt parametrized with the given tags. This classifier prompt is managed by Literal AI and not configurable.
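
For example, a Tag rule for a support assistant might supply a list of tags like the one below; the classifier prompt that selects among them is managed by Literal AI.

```python
# Illustrative list of tags a Tag-based rule could be configured with;
# the classifier prompt that picks among them is managed by Literal AI.
possible_tags = ["billing", "technical-issue", "feature-request", "small-talk"]
```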

For a custom LLM-as-a-Judge evaluation, consider using Score-based rules.

Score-based Rules

A rule which outputs a Score relies on an Evaluator configuration.

Literal AI lets you customize an Evaluator Prompt directly from the Prompt Playground:

Structured Output

The structured output option lets you enforce a Score schema for the evaluation result. Based on the selected Score schema, which contains the target score categories with their descriptions, the LLM will automatically score your production logs.
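
For instance, a hypothetical "Faithfulness" Score schema could look like the sketch below; the actual schema is defined in the Literal AI application, so the structure and field names here are only illustrative.

```python
# Hypothetical "Faithfulness" Score schema — the real schema is created in the
# Literal AI application; the structure below is only illustrative.
faithfulness_schema = {
    "name": "Faithfulness",
    "categories": [
        {"label": "Unfaithful", "value": 0,
         "description": "The answer contradicts or is not supported by the retrieved context."},
        {"label": "Partially faithful", "value": 0.5,
         "description": "The answer is mostly grounded but adds unsupported details."},
        {"label": "Faithful", "value": 1,
         "description": "Every claim in the answer is supported by the retrieved context."},
    ],
}
```

With structured output enabled, the LLM judge is constrained to return one of these categories for every evaluated log.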

Enforce a Score Schema

Once the evaluation prompt is saved, you can select it from the Configure Evaluator option, which concludes your AI Evals setup:

Configure Evaluator

Evaluate with custom eval metrics (SDK)

The SDKs provide score creation APIs with all fields exposed. Scores must be tied to either a Step or a Generation object.
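
Here is a minimal sketch, assuming the Python SDK's `client.api.create_score` method; field names and allowed `type` values may differ between SDK versions, so check the SDK reference.

```python
# Minimal sketch of attaching a custom eval metric to a logged step.
# Assumes the Literal AI Python SDK exposes `client.api.create_score`;
# exact field names and `type` values may differ — check the SDK reference.
from literalai import LiteralClient

client = LiteralClient(api_key="YOUR_LITERAL_AI_API_KEY")

client.api.create_score(
    step_id="STEP_ID",         # the Step (or Generation) the score is tied to
    name="context-relevancy",  # your custom eval metric
    type="AI",                 # e.g. "AI" for LLM-as-a-judge, "HUMAN" for manual labels
    value=0.8,
    comment="Retrieved chunks cover 4 of the 5 claims made in the answer.",
)
```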

Label LLM logs on Literal AI

The Literal AI application offers an easy way to manage scores: Score Schema.

Via Score Schemas, admin users control which types of evaluations are exposed to users in the application. Score Templates come in two flavors: Categorical and Continuous.

Categorical templates let you define a set of categories, each tied to a numeric value. Continuous templates define a minimum and a maximum value; users then pick a value in that range to score.
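
As an illustration, the two flavors could map to structures like the following; actual templates are created in the Literal AI application, so treat this as a sketch.

```python
# Illustrative only — Score templates are managed in the Literal AI application.
categorical_template = {
    "name": "User Satisfaction",
    "type": "Categorical",
    "categories": [{"label": "Dissatisfied", "value": 0}, {"label": "Satisfied", "value": 1}],
}
continuous_template = {
    "name": "Helpfulness",
    "type": "Continuous",
    "min": 0,   # reviewers pick any value in [min, max]
    "max": 10,
}
```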

From a Step or a Generation, admins can then score by selecting a template and filling in the required form fields.

Label an LLM Generation

You can add individual Scores directly from the Logs tab. If you plan to annotate a batch of logs, use the annotation queue.

Automate actions based on evaluation results

Add to Datasets/Annotation Queues

🚧 Work in progress, coming soon 🚧

Add labels

Monitor evaluation results

Once your rules are set up, you can monitor their activity in the dashboard.

Monitor your Rules

You can see the number of invocations per rule as well as the average score of the evaluations.

You can also access the logs of the evaluations to look for potential errors.