AIP Evals is a testing environment for evaluating the performance of your AIP Logic functions, AIP Agent functions, or code-authored functions. It is designed specifically to help you manage the non-deterministic nature of LLMs. AIP Evals allows you to create test cases, define evaluation functions to measure performance, and compare the results against previous versions of your function. It helps you build the confidence needed to put LLM-backed functions into production or to change an existing implementation.
You can use AIP Evals to:
Create test cases and define evaluation criteria.
Debug, iterate, and improve functions and prompts.
Compare the performance of different models on your functions.
Examine variance across multiple runs.
Core concepts
Evaluation suite: The collection of test cases and evaluation functions used to benchmark function performance.
Evaluation function: The method used to compare a function's actual output against the expected output(s).
Test cases: Defined sets of inputs and expected outputs that are passed into evaluation functions during evaluation suite runs.
Metrics: The results of evaluation functions. Metrics are produced per test case and can be compared in aggregate or individually between runs.
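To make the relationship between these concepts concrete, the sketch below shows a minimal exact-match evaluation in TypeScript. The type names (TestCase, Metric), the helper functions, and the exact-match check are illustrative assumptions only and are not the AIP Evals API; in practice, you create test cases and evaluation functions through the AIP Evals interface described above.

```typescript
// Illustrative sketch only: these types and names are hypothetical and do not
// reflect the actual AIP Evals API; they show how the core concepts relate.

// A test case pairs an input with the output we expect the function to produce.
interface TestCase {
  input: string;
  expectedOutput: string;
}

// A metric is the per-test-case result that an evaluation function reports.
interface Metric {
  name: string;
  value: number; // e.g. 1 for pass, 0 for fail
}

// The function under test (e.g. an AIP Logic or code-authored function).
type FunctionUnderTest = (input: string) => Promise<string>;

// A simple exact-match evaluation function: compare actual vs. expected output.
async function exactMatchEval(
  fn: FunctionUnderTest,
  testCase: TestCase
): Promise<Metric> {
  const actualOutput = await fn(testCase.input);
  return {
    name: "exact_match",
    value: actualOutput.trim() === testCase.expectedOutput.trim() ? 1 : 0,
  };
}

// Running an evaluation suite: apply the evaluation function to every test case
// and aggregate the per-test-case metrics into an overall pass rate.
async function runSuite(
  fn: FunctionUnderTest,
  suite: TestCase[]
): Promise<number> {
  const metrics = await Promise.all(suite.map((tc) => exactMatchEval(fn, tc)));
  const passes = metrics.filter((m) => m.value === 1).length;
  return passes / suite.length;
}
```

An exact match is the simplest possible evaluation function; because LLM-backed functions are non-deterministic, evaluation functions often need looser criteria, and comparing aggregate metrics across multiple runs helps you examine that variance.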