AIP Evals

LLMs are powerful but inherently unpredictable due to their non-deterministic nature. Before putting an LLM-backed workflow into production or changing an existing implementation, you need confidence that it behaves as expected.

AIP Evals helps you build that confidence by providing the means to evaluate your LLM-based functions and prompts. You can use AIP Evals to:

  • Create test cases and define evaluation criteria.
  • Debug, iterate, and improve functions and prompts.
  • Compare different models, such as GPT-4o and GPT-4o mini, on your functions.
  • Examine variance across multiple runs.

Evals overview

Core concepts

Evaluation suite: The collection of test cases and evaluation functions used to benchmark function performance.

Evaluation function: The method used to compare the actual output of a function against the expected output(s).

Test cases: Defined sets of inputs and expected outputs that are passed into evaluation functions during evaluation suite runs.

Metrics: The results of evaluation functions. Metrics are produced per test case and can be compared in aggregate or individually between runs.
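
As a rough illustration of how these concepts fit together, the sketch below models them in plain TypeScript. It is not the AIP Evals API; all type and function names here (TestCase, EvaluationFunction, runSuite, and so on) are hypothetical and exist only to show the shape of the data flowing through an evaluation suite run.

```typescript
// Illustrative sketch only — hypothetical names, not the AIP Evals API.

// A test case pairs an input with one or more expected outputs.
interface TestCase {
  name: string;
  input: string;
  expectedOutputs: string[];
}

// A metric is the result of an evaluation function for a single test case.
interface Metric {
  name: string;
  value: number; // e.g. 1 for pass, 0 for fail, or a similarity score
}

// An evaluation function compares the actual output against the expected output(s)
// and produces a metric for that test case.
type EvaluationFunction = (actual: string, expected: string[]) => Metric;

// Example evaluation function: exact match against any of the expected outputs.
const exactMatch: EvaluationFunction = (actual, expected) => ({
  name: "exactMatch",
  value: expected.includes(actual.trim()) ? 1 : 0,
});

// An evaluation suite run passes each test case's input to the function under
// test and applies the evaluation function, yielding one metric per test case.
function runSuite(
  testCases: TestCase[],
  evaluate: EvaluationFunction,
  fn: (input: string) => string, // the LLM-backed function under test
): Map<string, Metric> {
  const results = new Map<string, Metric>();
  for (const testCase of testCases) {
    results.set(testCase.name, evaluate(fn(testCase.input), testCase.expectedOutputs));
  }
  return results;
}
```

In practice, AIP Evals manages suites, runs, and metrics for you; the sketch only shows how test cases, evaluation functions, and metrics relate during a run.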

To get started, create an evaluation suite for logic functions or create an evaluation suite for general functions, and learn more about evaluation run configurations.