Documentation

Data connectivity & integrationPipeline BuilderPipeline managementCreate custom functions

Create custom functions

Custom functions enable you to save a series of transforms as a single transform for reuse across your pipeline. This capability is useful for repeating logic across your pipeline while managing it in a single place.

A custom function consists of a name (required), description (optional), function arguments (optional), and the function definition (required).

Two steps are required to create a custom function:

Define your logic as a series of transform boards.
Convert the transform boards into a new custom function.

Below, we will review an example for each step.

Define a series of transform boards

Suppose you have a table of users, and you want to create a primary key to uniquely identify each user. You know that each user’s first_name, last_name, and first_login_date is a unique combination. You would like to add a column primary_key of type String to the dataset, which is a hash of those three columns, then drop first_name, last_name, and first_login_date. Finally, you want to keep only one row per user if there are duplicates, retaining the row with the lowest age value.

Screenshot of input table with users

First, combine first_name, last_name, and first_login_date into one column. You can use the Concatenate strings transform to add those three columns together.

Screenshot of concatenate strings transform

For the Separator field, enter a -. Then, choose each column in the Expressions dropdown. For the first two columns, choose first_name and last_name. However, first_login_date column is not an option for us to choose from for our third field. This is because it is a Date type, and the Concatenate strings function only accepts String types.

Screenshot of concatenate strings transform without date column

To resolve this, insert a Cast expression from the Expressions tab. The parameters will be first_login_date for the Expression and String for the Type to cast it to. This prevents you from needing to change first_login_date globally, which would affect all downstream transform boards.

Screenshot of concatenate strings transform with cast expression

Once you select Apply, the output table should look like this:

Screenshot of output table after concatenate strings transform

Now, you must de-identify the data within primary_key. One way to do this is to create a hash for each value in primary_key by applying the Hash sha256 transform. You can do this in the same board by selecting the Reuse value option to replace Concatenate strings with Hash sha256.

Screenshot of reuse value with hash sha256 transform

This option keeps the existing values and makes them the first input to the new transform.

Screenshot of hash sha256 transform

After selecting Apply, validate that your output matches what you would expect: a primary_key: String column that contains unique data about each row.

Screenshot of intermediate output table preview

Now, you will notice that the first and last row in primary_key are the same. You want to keep the row with the lower age of 25, and drop first_name, last_name, and first_login_date. To do this, add an Aggregate transform. Input primary_key in the first field Group by columns, and age in the second field Aggregations with a Min expression:

Screenshot of aggregate transform

Finally, the output table should look like the image below. You can verify that there is only one row for primary_key = b3c01... and age is 25.

Screenshot of final output table preview

Convert a series of transform boards into a custom function

Now, suppose that you want to reuse this logic in three different places within your pipeline. We can convert our logic into a custom function by using Shift + Down Arrow to select both boards, then selecting the + button in the top bar.

Screenshot of converting multiple transform boards to custom function

This will take you to the custom function creation page. You will notice that the column inputs first_name, last_name, first_login_date, and age are crossed out; this is because the function you are creating will be generic for any four inputs of the correct types within your pipeline regardless of column name.

Screenshot of custom function creation page

To define these inputs, select Add argument four times. Configure the argument names by clicking into the yellow box.

Screenshot of custom function arguments initial configuration

Then, configure the argument types by clicking into the gear icon to the right of each argument. Here, you want the first two arguments to be type Expression<String> to represent columns of String type, the third argument to be type Expression<Date>, and the last column to be type Expression<Integer>.

Screenshot of custom function arguments complete configuration

Now, you can add your new arguments to the two boards as before. However, they will now be available in the Parameters section instead of the Columns section.

Screenshot of custom function with arguments

Once you have finished your configuration, name the function Primary key generator, give it an optional description, and select Apply all changes.

Screenshot of custom function arguments finalized configuration

The new function is now available for use in any transform path. The next time you need to create a primary key with this pattern, you can search Primary key generator in the transform dropdown.

Screenshot of custom function available in search bar

You can populate your custom function with the original columns: first_name, last_name, first_login_date, and age.

Screenshot of custom function configuration within transform path

By populating the relevant column names into your custom function, you are producing the same result as if you had created the function with the same transforms in the path. However, by creating the custom function, you can save the function logic to reuse across your pipeline rather than recreating logic with separate transform boards.

Screenshot of final output table preview

←

PREVIOUSBuild settings

NEXTShow and hide nodes

→