From the Literal AI Platform

Automating the evaluation of your Run outputs or LLM generations can significantly help detect patterns and areas of improvement for your LLM app in production, especially with large volumes of data.

An Online Eval is composed of:

  • Name: A name to identify the rule.
  • Log Type: The type of log to evaluate, either Agent Run or LLM Generation.
  • Sample Rate: The percentage of logs to evaluate.
  • Filters: Additional conditions to selectively evaluate certain logs.
  • Scorer: The scorer to use for the evaluation.
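
Conceptually, the Log Type, Filters, and Sample Rate together decide whether a given log gets scored. The sketch below is purely illustrative of how these fields combine; the OnlineEval class, the should_evaluate helper, and the field names are hypothetical and do not correspond to a Literal AI API.

```python
import random
from dataclasses import dataclass, field

@dataclass
class OnlineEval:
    """Hypothetical representation of an Online Eval rule (not an SDK object)."""
    name: str
    log_type: str          # "Agent Run" or "LLM Generation"
    sample_rate: float     # percentage of matching logs to evaluate, 0-100
    filters: dict = field(default_factory=dict)
    scorer: str = "answer-relevancy"  # placeholder scorer name

def should_evaluate(rule: OnlineEval, log: dict) -> bool:
    """Decide whether a log is picked up by the rule: type match, then filters, then sampling."""
    if log.get("type") != rule.log_type:
        return False
    # Filters: only keep logs whose metadata matches every condition.
    if any(log.get("metadata", {}).get(k) != v for k, v in rule.filters.items()):
        return False
    # Sample Rate: evaluate roughly `sample_rate` percent of the remaining logs.
    return random.random() * 100 < rule.sample_rate

# Example: score 20% of Agent Runs tagged "checkout".
rule = OnlineEval(name="Checkout relevancy", log_type="Agent Run",
                  sample_rate=20, filters={"tag": "checkout"})
```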

To create an Online Eval, go to the Online Evals page and click on the + button in the upper right corner of the table.

Create Online Eval

Once the Online Eval is in place, your Runs or LLM Generations are automatically evaluated.

You can check the distribution of scores on an Online Eval’s page:

Online Eval Scores Distribution

If an Online Eval fails on a Run or LLM Generation, the Log column shows the error message.

From the SDKs

The SDKs provide Score creation APIs with all fields exposed.

If your metrics are code-based, or combine LLM calls with arithmetic operations as Ragas does, you can use the SDKs directly to create scores from your application code.

Scores must be tied to either a Step or a Generation object.
The concept of a Score on a Thread is not well-defined at this stage.
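
As a minimal sketch, assuming the Python SDK's LiteralClient and a create_score API whose fields mirror the Score fields described above (the exact method signature and the "CODE" score type label are assumptions; check your SDK reference), a code-based metric could attach a score to a Step like this:

```python
from literalai import LiteralClient  # assumed import; adjust to your SDK version

# Assumes LITERAL_API_KEY is set in the environment.
client = LiteralClient()

def exact_match(expected: str, actual: str) -> float:
    """A simple code-based metric: 1.0 when the output matches the reference exactly."""
    return 1.0 if expected.strip().lower() == actual.strip().lower() else 0.0

def score_step(step_id: str, expected: str, actual: str) -> None:
    """Attach a Score to an existing Step by id.

    The create_score call shape (step_id, name, type, value, comment) is an
    assumption modeled on the Score fields exposed by the SDKs; the exact
    parameters and type labels may differ in your SDK version.
    """
    client.api.create_score(
        step_id=step_id,
        name="exact-match",
        type="CODE",
        value=exact_match(expected, actual),
        comment="Computed in application code",
    )

# Usage: score_step("<step-id>", expected="Paris", actual=llm_answer)
```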

Automation of actions based on evaluation results is coming soon!