Create an evaluation suite

An evaluation suite is a collection of test cases and evaluation functions used to benchmark a given function's performance. Running an evaluation suite executes the given function once per test case and applies the evaluators associated with the suite to each output.
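
Conceptually, a suite run is a loop: for each test case, the function under test is called with that test case's inputs, and every evaluator is then applied to the resulting output. The TypeScript sketch below illustrates that loop; the interfaces and the runSuite helper are illustrative assumptions, not part of the AIP Evals API.

```typescript
// Hypothetical illustration only; AIP Evals runs this loop for you in the platform.
interface TestCase<I, O> {
  name: string;
  input: I;     // arguments passed to the function under test
  expected: O;  // expected value that evaluators compare against
}

type Evaluator<O> = (actual: O, expected: O) => boolean | number;

function runSuite<I, O>(
  fn: (input: I) => O,
  cases: TestCase<I, O>[],
  evaluators: Evaluator<O>[],
): Array<{ name: string; results: Array<boolean | number> }> {
  return cases.map((testCase) => {
    const actual = fn(testCase.input);
    return {
      name: testCase.name,
      results: evaluators.map((evaluate) => evaluate(actual, testCase.expected)),
    };
  });
}
```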

To get started with evaluation suites in AIP Logic, refer to evaluation suites for Logic functions. Alternatively, evaluation suites for functions authored in Code Repositories can be created and opened directly from the Published tab by navigating to Code Repositories > Code > Functions > Published.

If you have not already added test cases directly from the Logic Preview run panel, you will need to save your Logic Function before creating an evaluation suite. After saving, you can select Set up tests, which will create the initial evaluation suite.

Evaluations side panel in an AIP Logic function.

Add test cases

You can create test cases for an evaluation suite by manually defining them or by using an object set. To manually define a test case, select Add new test in the bottom-left of the evaluation suite view. Give each test case a name and define the input(s) and their expected values in the appropriate columns. You can select the purple AIP star icon next to the test case name to generate a suggested name.

Generate a suggested test case name.

In this example, the suggested name of Negative Review On Food Quality adds more information than Test case 1:

Suggested names offer a brief description of the test case parameters.

You can edit evaluation suite columns by selecting Edit test case parameters. Here you can add, remove, or reorder test case columns and their respective types.

Configure static test cases.

Alternatively, you can use an object set to define test cases. Each test case will be represented by an object from the selected object set. To choose an object set, select Change to object set backed in the top-right of the test case editor. This will open the object set selection dialog, where you can define the object set and the object properties that you want to use. Note that switching to object set-backed test cases will remove all existing test cases.

Configure object-set-backed test cases.

Evaluators

An evaluator is a method used to assess the output of a tested function against expected outputs. An evaluator may return a simple true/false result, but can also produce numeric values such as a semantic distance. An evaluation suite without evaluators is still useful for executing a function across multiple scenarios and manually reviewing each output; evaluators, however, make it possible to measure and quantify run results at scale, since they produce comparable performance indicators.

AIP Evals provides some built-in evaluators that will be described in the following section. You can also define custom evaluation functions to measure performance based on specific criteria.

Add an evaluator

To add an evaluator, select + Add at the top of the test case table. This will open a selection panel where you can choose from a list of built-in evaluators, Marketplace deployed evaluators, or custom evaluators.

Select evaluators for your test cases.

Once you have chosen an evaluator and selected + Add, the evaluator will be added to your test case table. You can then configure the evaluator by mapping the function output to the Actual value column and the expected value to the Expected value column in your test case table.

Configure your evaluators.

Built-in evaluators

Examples of built-in evaluation functions include:

  • Exact boolean match: Checks if the actual boolean is exactly equal to the expected boolean.
  • Exact string match: Checks if the actual string is exactly equal to the expected string.
  • Regex match: Checks if the actual string matches the expected regular expression.
  • Levenshtein distance: A string metric for measuring the difference between two sequences. Calculates the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into the other (see the sketch after this list).
  • String length: Checks if the length of the actual string falls within the expected range.
  • Keyword checker: Checks if specific keywords are present in the actual text.
  • Exact object match: Checks if the actual object is exactly equal to the expected object.
  • Object set contains: Checks if the actual object is exactly equal to one of the objects in the target object set.
  • Object set size range: Checks if the size of the provided object set lies within the expected range.
  • Integer range: Checks if the actual value lies within the range of expected values. Only integers are supported.
  • Floating-point range: Checks if the actual value lies within the range of expected values. All numeric types are supported as parameters.
  • Temporal range: Checks if the actual value lies within the range of expected values. Only Date and Timestamp values are supported.
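
As an illustration of how the Levenshtein distance evaluator scores string outputs, the following sketch shows the standard dynamic-programming formulation of the metric. It is a simplified example, not the evaluator's actual implementation.

```typescript
// Minimal dynamic-programming sketch of Levenshtein distance,
// illustrating the metric used by the built-in evaluator.
function levenshtein(a: string, b: string): number {
  const dp: number[][] = Array.from({ length: a.length + 1 }, () =>
    new Array<number>(b.length + 1).fill(0),
  );
  for (let i = 0; i <= a.length; i++) dp[i][0] = i; // delete all of a[0..i)
  for (let j = 0; j <= b.length; j++) dp[0][j] = j; // insert all of b[0..j)
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      const substitutionCost = a[i - 1] === b[j - 1] ? 0 : 1;
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,                    // deletion
        dp[i][j - 1] + 1,                    // insertion
        dp[i - 1][j - 1] + substitutionCost, // substitution (or match)
      );
    }
  }
  return dp[a.length][b.length];
}

// Example: levenshtein("kitten", "sitting") === 3
```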

Marketplace deployed evaluation functions

Selecting a Marketplace deployed function will open a setup wizard to guide you through the installation process. Below are examples of Marketplace functions, with more to come:

  • Rubric grader: A general-purpose, LLM-backed evaluator for grading generated text based on a dynamic marking rubric.
  • ROUGE score: Recall-Oriented Understudy for Gisting Evaluation (ROUGE) is a set of metrics used to evaluate the quality of machine-generated text, particularly in tasks like summarization and translation. Higher ROUGE scores indicate a closer match to the reference text, suggesting better performance of the machine-generated content. A simplified sketch of one ROUGE variant follows this list.
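
The sketch below illustrates the idea behind ROUGE with a simplified ROUGE-1 recall calculation: the fraction of reference unigrams that also appear in the generated text. It is illustrative only; the Marketplace evaluator computes a fuller set of ROUGE metrics.

```typescript
// Simplified sketch of ROUGE-1 recall. Not the Marketplace evaluator's implementation.
function rouge1Recall(generated: string, reference: string): number {
  const tokenize = (text: string) => text.toLowerCase().split(/\s+/).filter(Boolean);

  // Count unigrams in the generated text so overlaps are clipped per occurrence.
  const generatedCounts = new Map<string, number>();
  for (const token of tokenize(generated)) {
    generatedCounts.set(token, (generatedCounts.get(token) ?? 0) + 1);
  }

  // Count how many reference unigrams are covered by the generated text.
  const referenceTokens = tokenize(reference);
  let overlap = 0;
  for (const token of referenceTokens) {
    const remaining = generatedCounts.get(token) ?? 0;
    if (remaining > 0) {
      overlap++;
      generatedCounts.set(token, remaining - 1);
    }
  }
  return referenceTokens.length === 0 ? 0 : overlap / referenceTokens.length;
}
```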

Custom evaluation functions

Custom evaluation functions allow you to select previously published functions. These can be functions on objects written in Code Repositories or other AIP Logic functions. Currently, custom evaluation functions must return either boolean or numeric types.
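
As an illustration only, the sketch below shows what custom evaluators returning boolean and numeric results might look like as TypeScript functions published from Code Repositories. The function names and parameters are hypothetical, not a prescribed evaluator contract.

```typescript
// Hypothetical custom evaluators; each returns a boolean or a number,
// as required for custom evaluation functions.
export function sentimentLabelMatches(actual: string, expected: string): boolean {
  // Normalize casing and surrounding whitespace before comparing labels.
  return actual.trim().toLowerCase() === expected.trim().toLowerCase();
}

export function relativeLengthScore(actual: string, expected: string): number {
  // Numeric evaluator: how close the generated text's length is to the expected length.
  if (expected.length === 0) {
    return actual.length === 0 ? 1 : 0;
  }
  return Math.min(actual.length, expected.length) / Math.max(actual.length, expected.length);
}
```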

After creating an evaluation suite, learn more about evaluation suite run configurations.