Learn how to evaluate your LLM applications and agents.
Evaluations can be applied at several levels of an LLM application, each with its own metrics (a sketch of how scores can attach to each level follows the table):

| Level | Metrics |
| --- | --- |
| LLM Generation | Hallucination, toxicity, etc. |
| Agent Run | Task completion, number of intermediate steps |
| Conversation Thread | User satisfaction |
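As a rough illustration of these levels, the sketch below attaches scores to the object they describe. All class and field names here are hypothetical and not tied to any particular SDK:

```python
from dataclasses import dataclass, field

# Hypothetical structures for illustration only; not a specific library's API.
@dataclass
class Score:
    name: str     # e.g. "hallucination", "task_completion", "user_satisfaction"
    value: float  # numeric value (a 0..1 rating or a raw count)

@dataclass
class Generation:
    scores: list[Score] = field(default_factory=list)            # LLM Generation level

@dataclass
class AgentRun:
    generations: list[Generation] = field(default_factory=list)
    scores: list[Score] = field(default_factory=list)            # Agent Run level

@dataclass
class ConversationThread:
    runs: list[AgentRun] = field(default_factory=list)
    scores: list[Score] = field(default_factory=list)            # Conversation Thread level

# Metrics from the table attach at their natural level:
run = AgentRun(generations=[Generation(scores=[Score("hallucination", 0.1)])])
run.scores.append(Score("task_completion", 1.0))
run.scores.append(Score("intermediate_steps", 4))
thread = ConversationThread(runs=[run], scores=[Score("user_satisfaction", 0.9)])
```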
Different evaluators come into play at different points in the application lifecycle (an LLM-as-a-Judge sketch follows the table):

| Who? | When? | Type of eval metrics | Example |
| --- | --- | --- | --- |
| End-User | In Production | Explicit feedback (👍/👎) | Thumbs-up or thumbs-down on a chatbot's answer |
| End-User | In Production | Implicit feedback based on a product metric | User conversion to the paid offering increases by 15% |
| LLM-as-a-Judge | In Production | AI evaluation (without ground truth) | Hallucination, context relevancy, etc. |
| LLM-as-a-Judge | During Iteration | AI evaluation against a Dataset (with or without ground truth) | Hallucination, conciseness, helpfulness, context relevancy, answer similarity, etc. |
| Domain Expert | During Iteration | Human evaluation against a Dataset (with or without ground truth) | Hallucination, conciseness, helpfulness, context relevancy, answer similarity, etc. |
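For the LLM-as-a-Judge rows above, the sketch below shows one way a judge could score hallucination for a single generation. It assumes the OpenAI Python SDK; the `judge_hallucination` helper, model name, and rubric wording are placeholders, not a prescribed implementation:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge_hallucination(question: str, context: str, answer: str) -> float:
    """Ask a judge model how much of `answer` is unsupported by `context`.

    Returns a score between 0 (fully grounded) and 1 (entirely unsupported).
    Hypothetical helper for illustration; rubric and model are placeholders.
    """
    prompt = (
        "You are an evaluator. Given a question, the retrieved context, and an answer, "
        "rate how much of the answer is NOT supported by the context.\n"
        "Respond with a single number between 0 (fully supported) and 1 (entirely unsupported).\n\n"
        f"Question: {question}\nContext: {context}\nAnswer: {answer}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return float(response.choices[0].message.content.strip())

# Example: judge one generation without ground truth
score = judge_hallucination(
    question="When was the company founded?",
    context="The company was founded in 2019 in Berlin.",
    answer="It was founded in 2017.",
)
print(score)  # expect a value close to 1 for this unsupported answer
```

During iteration, the same judge can be run against a Dataset of test cases instead of live traffic, optionally comparing the answer to a ground-truth reference for metrics such as answer similarity.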