LLMs are powerful but inherently unpredictable due to their non-deterministic nature. To put an LLM-backed workflow into production, or to make changes to an existing implementation, you need confidence that it behaves as expected.
AIP Evals helps you build that confidence by providing the means to evaluate your LLM-based functions and prompts. You can use AIP Evals to:
Create test cases and define evaluation criteria.
Debug, iterate, and improve functions and prompts.
Compare different models, such as GPT-4o and GPT-4o mini, on your functions.
Examine variance across multiple runs.
Core concepts
Evaluation suite: The collection of test cases and evaluation functions used to benchmark function performance.
Evaluation function: The method used to compare the actual output of a function against the expected output(s).
Test cases: Defined sets of inputs and expected outputs that are passed into evaluation functions during evaluation suite runs.
Metrics: The results of evaluation functions. Metrics are produced per test case and can be compared in aggregate or individually between runs.
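To make these concepts concrete, the sketch below shows how they relate to one another. It is illustrative only and does not reflect the AIP Evals API: the `TestCase`, `EvaluationFunction`, and `EvaluationSuite` types and the `runSuite` helper are hypothetical names chosen for this example.

```typescript
// Illustrative sketch only: hypothetical types showing how test cases,
// evaluation functions, and metrics fit together in an evaluation suite.

// A test case pairs an input with the output we expect.
interface TestCase {
  input: string;
  expectedOutput: string;
}

// An evaluation function compares actual output against expected output
// and returns a metric (here, a simple 0/1 exact-match score).
type EvaluationFunction = (actual: string, expected: string) => number;

const exactMatch: EvaluationFunction = (actual, expected) =>
  actual.trim().toLowerCase() === expected.trim().toLowerCase() ? 1 : 0;

// An evaluation suite bundles test cases with the evaluation functions
// used to benchmark the function under test.
interface EvaluationSuite {
  testCases: TestCase[];
  evaluationFunctions: Record<string, EvaluationFunction>;
}

// Run every test case through the function under test and collect
// per-case metrics, which can then be compared in aggregate or
// individually between runs.
async function runSuite(
  suite: EvaluationSuite,
  functionUnderTest: (input: string) => Promise<string>,
): Promise<Array<Record<string, number>>> {
  const metrics: Array<Record<string, number>> = [];
  for (const testCase of suite.testCases) {
    const actual = await functionUnderTest(testCase.input);
    const caseMetrics: Record<string, number> = {};
    for (const [name, evaluate] of Object.entries(suite.evaluationFunctions)) {
      caseMetrics[name] = evaluate(actual, testCase.expectedOutput);
    }
    metrics.push(caseMetrics);
  }
  return metrics;
}
```

In practice, running the same suite against two versions of a function or prompt, or against two different models, produces comparable sets of per-case metrics, which is what makes side-by-side comparison and variance analysis possible.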