Automatically evaluate your LLM logs in production, monitor performance and detect issues.
Automating the evaluation of your Run outputs or LLM generations can significantly help detect patterns and areas of improvement for your LLM app in production, especially with large volumes of data.
An Online Eval is composed of a target to evaluate: an Agent Run or an LLM Generation.

To create an Online Eval, go to the Online Evals page and click on the + button in the upper right corner of the table.
Create Online Eval
Once the Online Eval is in place, your Runs or LLM Generations are automatically evaluated.
You can check the distribution of scores on an Online Eval’s page:
Online Eval Scores Distribution
If an Online Eval fails on a Run or LLM Generation, the Log column shows the error message.
The SDKs provide Score creation APIs with all fields exposed.
If your metrics are code-based, or combine LLM calls with arithmetic operations (as Ragas does), you can use the SDKs to create scores directly from your application code.
Scores must be tied either to a Step or a Generation object.
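As an illustrative sketch of this flow, the snippet below computes a simple code-based metric in application code and assembles it into a Score payload tied to a Generation. The helper names, the field names in the payload, and the `gen-123` identifier are assumptions for illustration, not the exact SDK surface; consult the SDK's Score creation API for the real signatures.

```python
# Sketch: a code-based metric attached as a Score to a Generation.
# All names below (exact_match, build_score_payload, the payload keys,
# and the "gen-123" id) are illustrative assumptions, not the SDK's API.

def exact_match(expected: str, actual: str) -> float:
    """A minimal code-based metric: 1.0 on exact match, else 0.0."""
    return 1.0 if expected.strip() == actual.strip() else 0.0

def build_score_payload(generation_id: str, value: float) -> dict:
    """Assemble the kind of fields a Score creation API would expect."""
    return {
        "name": "exact-match",
        "type": "CODE",                  # a code-based (non-LLM) metric
        "generation_id": generation_id,  # Scores are tied to a Step or Generation
        "value": value,
    }

# Evaluate one generation's output and build its score payload.
payload = build_score_payload("gen-123", exact_match("Paris", "Paris "))
```

In a real application, the payload would then be sent through the SDK's Score creation API rather than kept as a local dictionary.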
The concept of a Score on a Thread is not well-defined at this stage.
Automation of actions based on evaluation results is coming soon!