Create an evaluation suite

An evaluation suite is the collection of evaluation functions and test cases used to build performance benchmarks for a given AIP Logic function. To create an evaluation suite, you must configure the evaluation functions and define the test cases that will be passed to evaluation functions during evaluation suite runs.

Note that some Evaluations features, such as creating test cases from object sets, are not available from the Logic run panel, and can only be accessed from the Evaluations application. For Evaluations functionality within Logic, refer to the Logic Evaluations getting started page. This page details functionality available in the Evaluations application.

If you have not already added test cases directly from the Logic run panel, you will need to save your Logic function before creating an evaluation suite. After saving, you can select Set up tests, which will take you to the Evaluations application.

Evaluations side panel in an AIP Logic function.

Add test cases

In Evaluations, you can create test cases by using an object set or by manually defining them. To manually define a test case, select Add test case in the upper right. Give each test case a name and select the input(s) and their respective expected values. The actual output value is automatically included as part of the test case and does not need to be configured.

The test case configuration screen.

Evaluation functions

An evaluation function is the method used to compare the actual output of a Logic function against the expected output(s). You can configure an evaluation function by selecting parameters for the actual Logic function output value and the expected output value. Depending on the evaluation function, you may need to configure other parameters. Evaluation suites can include built-in functions, Marketplace deployed functions, or custom evaluation functions.

Built-in evaluation functions

Examples of built-in evaluation functions include:

  • Exact string match: Checks if the actual string is exactly equal to the expected string.
  • Integer range: Checks if the actual value lies within the range of expected values. Only integers are supported.
  • Exact boolean match: Checks if the actual boolean is exactly equal to the expected boolean.
  • Exact object match: Checks if the actual object is exactly equal to the expected object.
  • Floating-point range: Checks if the actual value lies within the range of expected values. All numeric types are supported as parameters.
  • Temporal range: Checks if the actual value lies within the range of expected values. Only Date and Timestamp values are supported.
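
The built-in checks above are configured entirely through the Evaluations interface, but their comparison semantics are simple. The following TypeScript sketch illustrates approximately what a few of them compare; it is purely illustrative (including the assumption of inclusive range bounds) and is not the actual built-in implementation.

```typescript
// Illustrative sketch only; not the actual built-in implementations.

// Exact string match: the actual string must equal the expected string exactly.
const exactStringMatch = (actual: string, expected: string): boolean =>
    actual === expected;

// Floating-point range: the actual value must fall within the expected range
// (inclusive bounds are an assumption made for this sketch).
const floatingPointRange = (actual: number, min: number, max: number): boolean =>
    actual >= min && actual <= max;

// Temporal range: the actual Date must fall between the expected bounds.
const temporalRange = (actual: Date, start: Date, end: Date): boolean =>
    actual.getTime() >= start.getTime() && actual.getTime() <= end.getTime();
```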

Marketplace deployed functions

Selecting a Marketplace deployed function will open a setup wizard to guide you through the installation process. Below are examples of Marketplace functions, with more to come:

  • Rubric grader: A general purpose LLM-backed evaluator for grading generated text based on a dynamic marking rubric.
  • ROUGE score: Recall-Oriented Understudy for Gisting Evaluation (ROUGE) is a set of metrics used to evaluate the quality of machine-generated text, particularly in tasks such as summarization and translation. Higher ROUGE scores indicate a closer match to the reference text, suggesting better performance of the machine-generated content; a conceptual sketch of the score follows this list.
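
To make the ROUGE metric concrete, the sketch below computes a simple ROUGE-1 recall score, which measures the fraction of reference unigrams that also appear in the generated text. This is a conceptual illustration only; the Marketplace function's actual implementation and supported ROUGE variants may differ.

```typescript
// Conceptual ROUGE-1 recall: fraction of reference unigrams that also appear
// in the generated text (each generated occurrence is matched at most once).
function rouge1Recall(generated: string, reference: string): number {
    const tokenize = (text: string): string[] =>
        text.toLowerCase().split(/\s+/).filter(token => token.length > 0);

    // Count unigrams in the generated text.
    const generatedCounts = new Map<string, number>();
    for (const token of tokenize(generated)) {
        generatedCounts.set(token, (generatedCounts.get(token) ?? 0) + 1);
    }

    const referenceTokens = tokenize(reference);
    if (referenceTokens.length === 0) {
        return 0;
    }

    // Clipped overlap: each generated occurrence can match one reference token.
    let overlap = 0;
    for (const token of referenceTokens) {
        const remaining = generatedCounts.get(token) ?? 0;
        if (remaining > 0) {
            overlap += 1;
            generatedCounts.set(token, remaining - 1);
        }
    }
    return overlap / referenceTokens.length;
}
```

For example, `rouge1Recall("the cat sat on the mat", "the cat is on the mat")` scores 5/6, since only "is" from the reference text is missing from the generated text.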

Custom evaluation functions

Custom evaluation functions allow you to select previously published functions. These can be functions on objects written in Code Repositories or other AIP Logic functions. Currently, custom evaluation functions must return either boolean or numeric types.
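
For example, a custom evaluation function published from Code Repositories might look like the following. This is a minimal sketch assuming the TypeScript Functions API (`@Function()` from `@foundry/functions-api`); the class, method, and parameter names are hypothetical.

```typescript
import { Function } from "@foundry/functions-api";

export class CustomEvaluators {
    /**
     * Hypothetical custom evaluation function: passes when the actual Logic
     * function output matches the expected value, ignoring case and
     * surrounding whitespace. It returns a boolean, satisfying the current
     * requirement that custom evaluation functions return boolean or numeric
     * types.
     */
    @Function()
    public caseInsensitiveMatch(actual: string, expected: string): boolean {
        return actual.trim().toLowerCase() === expected.trim().toLowerCase();
    }
}
```

Once published, a function like this can be selected as a custom evaluation function, with its parameters mapped to the actual and expected output values when you configure the evaluation suite.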

Configure evaluation functions

To configure an evaluation function, select Add evaluation function from the configuration panel on the right side of the evaluation suite.

A new evaluation suite.

You can choose from a series of built-in or Marketplace deployed functions. You also have the option of selecting a custom evaluation function.

Evaluation function selection window.

The Produced metrics field allows you to name the metric displayed in the evaluations metrics dashboard. For example, instead of the default "isExactMatch", you might rename the metric to something more meaningful for your use case, such as "classificationIsCorrect".

Evaluation function configuration panel with function parameters.

Once configured as described above, the evaluation function will be available in its evaluation suite alongside the test cases you have added.