Examples

High-Level Transforms

It’s common for data transformations in Java to read, process, and write DataFrame objects. Recall that in the Java API, a DataFrame is represented by a Dataset<Row>. If your data transformation depends on DataFrame objects, you can define a high-level Transform. A high-level Transform accepts inputs of type Dataset<Row> and expects the compute function to return a single output of type Dataset<Row>. Alternatively, you can define a more general low-level Transform and explicitly call the asDataFrame() method to access a Dataset<Row> containing your input dataset.

To define a high-level Transform, you define a compute function that takes in any number of inputs of type Dataset<Row> and returns a single output of type Dataset<Row>.

Automatic registration

Here is an example of how to define a Transform by creating a class called HighLevelAutoTransform in the myproject.datasets package:

```java
package myproject.datasets;

import com.palantir.transforms.lang.java.api.Compute;
import com.palantir.transforms.lang.java.api.Input;
import com.palantir.transforms.lang.java.api.Output;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

/**
 * This is an example high-level Transform intended for automatic registration.
 */
public final class HighLevelAutoTransform {

    // The class for an automatically registered Transform contains the compute
    // function and information about the input/output datasets.
    // Automatic registration requires "@Input" and "@Output" annotations.
    @Compute
    @Output("/path/to/output/dataset")
    public Dataset<Row> myComputeFunction(@Input("/path/to/input/dataset") Dataset<Row> myInput) {
        // The compute function for a high-level Transform returns an output of type "Dataset<Row>".
        return myInput.limit(10);
    }
}
```

High-level Transforms support multiple inputs and a single output. Thus, each input parameter must be annotated with @Input (which contains the full path to your input dataset), and the compute function must be annotated with @Output (which contains the full path to your output dataset).
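For example, here is a minimal sketch of an automatically registered high-level Transform with two inputs. The class name, dataset paths, and the shared “id” join column are hypothetical placeholders.

```java
package myproject.datasets;

import com.palantir.transforms.lang.java.api.Compute;
import com.palantir.transforms.lang.java.api.Input;
import com.palantir.transforms.lang.java.api.Output;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

/**
 * A sketch of a high-level Transform with two inputs, intended for automatic registration.
 */
public final class HighLevelMultiInputTransform {

    // Each input parameter carries its own "@Input" annotation; the single
    // output is still declared with "@Output" on the compute function.
    @Compute
    @Output("/path/to/output/dataset")
    public Dataset<Row> myComputeFunction(
            @Input("/path/to/first/input") Dataset<Row> first,
            @Input("/path/to/second/input") Dataset<Row> second) {
        // Hypothetical logic: join the two inputs on a shared "id" column.
        return first.join(second, "id");
    }
}
```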

Now, you can add this Transform to your project’s Pipeline by calling the Pipeline.autoBindFromPackage() method in your PipelineDefiner implementation:

```java
package myproject;

import com.palantir.transforms.lang.java.api.Pipeline;
import com.palantir.transforms.lang.java.api.PipelineDefiner;

public final class MyPipelineDefiner implements PipelineDefiner {

    @Override
    public void define(Pipeline pipeline) {
        // Provide the Java package containing any Transforms you want to
        // automatically register.
        pipeline.autoBindFromPackage("myproject.datasets");
    }
}
```

Manual registration

Here is an example of how to define the compute function for a Transform by creating a class called HighLevelManualFunction in the myproject.datasets package:

```java
package myproject.datasets;

import com.palantir.transforms.lang.java.api.Compute;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

/**
 * This is an example compute function for a high-level Transform intended for manual registration.
 */
public final class HighLevelManualFunction {

    // The class for a manually registered Transform contains just the compute function.
    @Compute
    public Dataset<Row> myComputeFunction(Dataset<Row> myInput) {
        // The compute function for a high-level Transform returns an output of type "Dataset<Row>".
        return myInput.limit(10);
    }
}
```

Now, in your PipelineDefiner implementation, you finish defining your Transform using HighLevelTransform.builder(), and you add this Transform to your project’s Pipeline by calling Pipeline.register():

```java
package myproject;

// Assumes the builder class is provided by the same transforms API package.
import com.palantir.transforms.lang.java.api.HighLevelTransform;
import com.palantir.transforms.lang.java.api.Pipeline;
import com.palantir.transforms.lang.java.api.PipelineDefiner;
import myproject.datasets.HighLevelManualFunction;

public final class MyPipelineDefiner implements PipelineDefiner {

    @Override
    public void define(Pipeline pipeline) {
        // This is a sample manual registration for a high-level Transform.
        HighLevelTransform highLevelManualTransform = HighLevelTransform.builder()
                // Pass in the compute function to use. Here, "HighLevelManualFunction" corresponds
                // to the class name for a compute function for a high-level Transform.
                .computeFunctionInstance(new HighLevelManualFunction())
                // Pass in the input dataset(s) to use.
                // "myInput" corresponds to an input parameter for your compute function.
                .putParameterToInputAlias("myInput", "/path/to/input/dataset")
                // Pass in the output dataset to use.
                .returnedAlias("/path/to/output/dataset")
                .build();
        pipeline.register(highLevelManualTransform);
    }
}
```

High-level Transforms support multiple inputs and a single output. Each input dataset for your compute function should be provided using putParameterToInputAlias(). This method takes the name of an input parameter on your compute function, followed by the full path to your input dataset. For instance, in the example above, “myInput” is an input parameter name in myComputeFunction(). Provide the full path to your output dataset using returnedAlias().
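If your compute function takes more than one Dataset<Row> parameter, call putParameterToInputAlias() once per parameter inside define(). Here is a minimal sketch; the compute function class MyJoinFunction, its parameter names, and the dataset paths are hypothetical.

```java
// A sketch of manual registration for a compute function with two inputs.
// "MyJoinFunction", its parameter names, and the dataset paths are hypothetical.
HighLevelTransform multiInputTransform = HighLevelTransform.builder()
        .computeFunctionInstance(new MyJoinFunction())
        // One alias per input parameter of the compute function.
        .putParameterToInputAlias("first", "/path/to/first/input")
        .putParameterToInputAlias("second", "/path/to/second/input")
        // The single output is still provided with "returnedAlias()".
        .returnedAlias("/path/to/output/dataset")
        .build();
pipeline.register(multiInputTransform);
```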

Low-Level Transforms

A low-level Transform can be used if you’re writing data transformations that depend on DataFrame objects or files.

Automatic registration

Let’s say you’re using automatic registration. Here is an example of how to create a Transform by defining a class called LowLevelAutoTransform in the myproject.datasets package:

```java
package myproject.datasets;

import com.palantir.transforms.lang.java.api.Compute;
import com.palantir.transforms.lang.java.api.FoundryInput;
import com.palantir.transforms.lang.java.api.FoundryOutput;
import com.palantir.transforms.lang.java.api.Input;
import com.palantir.transforms.lang.java.api.Output;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

/**
 * This is an example low-level Transform intended for automatic registration.
 */
public final class LowLevelAutoTransform {

    // The class for an automatically registered Transform contains the compute
    // function and information about the input/output datasets.
    // Automatic registration requires "@Input" and "@Output" annotations.
    @Compute
    public void myComputeFunction(
            @Input("/path/to/input/dataset") FoundryInput myInput,
            @Output("/path/to/output/dataset") FoundryOutput myOutput) {
        Dataset<Row> limited = myInput.asDataFrame().read().limit(10);
        // The compute function for a low-level Transform writes to the output dataset(s),
        // instead of returning the output(s).
        myOutput.getDataFrameWriter(limited).write();
    }
}
```

Low-level Transforms support multiple input and output datasets. Thus, each input parameter must be annotated with @Input (which contains the full path to your input dataset), and each output parameter must be annotated with @Output (which contains the full path to your output dataset).
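As a sketch, an automatically registered low-level Transform with two inputs and two outputs could look like the following. The class name, dataset paths, and the trimming logic are hypothetical placeholders.

```java
package myproject.datasets;

import com.palantir.transforms.lang.java.api.Compute;
import com.palantir.transforms.lang.java.api.FoundryInput;
import com.palantir.transforms.lang.java.api.FoundryOutput;
import com.palantir.transforms.lang.java.api.Input;
import com.palantir.transforms.lang.java.api.Output;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

/**
 * A sketch of a low-level Transform with two inputs and two outputs,
 * intended for automatic registration.
 */
public final class LowLevelMultiIoTransform {

    // Every input parameter is annotated with "@Input" and every output
    // parameter with "@Output"; nothing is returned.
    @Compute
    public void myComputeFunction(
            @Input("/path/to/first/input") FoundryInput firstInput,
            @Input("/path/to/second/input") FoundryInput secondInput,
            @Output("/path/to/first/output") FoundryOutput firstOutput,
            @Output("/path/to/second/output") FoundryOutput secondOutput) {
        Dataset<Row> first = firstInput.asDataFrame().read();
        Dataset<Row> second = secondInput.asDataFrame().read();
        // Hypothetical logic: write a trimmed copy of each input to its own output.
        firstOutput.getDataFrameWriter(first.limit(10)).write();
        secondOutput.getDataFrameWriter(second.limit(10)).write();
    }
}
```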

Now, you can add this Transform to your project’s Pipeline by calling the Pipeline.autoBindFromPackage() method in your PipelineDefiner implementation:

```java
package myproject;

import com.palantir.transforms.lang.java.api.Pipeline;
import com.palantir.transforms.lang.java.api.PipelineDefiner;

public final class MyPipelineDefiner implements PipelineDefiner {

    @Override
    public void define(Pipeline pipeline) {
        // Provide the Java package containing any Transforms you want to
        // automatically register.
        pipeline.autoBindFromPackage("myproject.datasets");
    }
}
```

Manual registration

Now, let’s say you’re using manual registration. In this case, you would define a class that contains just your compute function. Here is an example of how to define a class called LowLevelManualFunction in the myproject.datasets package:

```java
package myproject.datasets;

import com.palantir.transforms.lang.java.api.Compute;
import com.palantir.transforms.lang.java.api.FoundryInput;
import com.palantir.transforms.lang.java.api.FoundryOutput;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

/**
 * This is an example compute function for a low-level Transform intended for manual registration.
 */
public final class LowLevelManualFunction {

    // The class for a manually registered Transform contains just the compute function.
    @Compute
    public void myComputeFunction(FoundryInput myInput, FoundryOutput myOutput) {
        Dataset<Row> limited = myInput.asDataFrame().read().limit(10);
        // The compute function for a low-level Transform writes to the output dataset(s),
        // instead of returning the output(s).
        myOutput.getDataFrameWriter(limited).write();
    }
}
```

Now, in your PipelineDefiner implementation, you create your actual Transform object using LowLevelTransform.builder(), and you add this Transform to your project’s Pipeline by calling Pipeline.register():

```java
package myproject;

// Assumes the builder class is provided by the same transforms API package.
import com.palantir.transforms.lang.java.api.LowLevelTransform;
import com.palantir.transforms.lang.java.api.Pipeline;
import com.palantir.transforms.lang.java.api.PipelineDefiner;
import myproject.datasets.LowLevelManualFunction;

public final class MyPipelineDefiner implements PipelineDefiner {

    @Override
    public void define(Pipeline pipeline) {
        // This is a sample manual registration for a low-level Transform.
        LowLevelTransform lowLevelManualTransform = LowLevelTransform.builder()
                // Pass in the compute function to use. Here, "LowLevelManualFunction" corresponds
                // to the class name for a compute function for a low-level Transform.
                .computeFunctionInstance(new LowLevelManualFunction())
                // Pass in the input dataset(s) to use.
                // "myInput" corresponds to an input parameter for your compute function.
                .putParameterToInputAlias("myInput", "/path/to/input/dataset")
                // Pass in the output dataset(s) to use.
                // "myOutput" corresponds to an output parameter for your compute function.
                .putParameterToOutputAlias("myOutput", "/path/to/output/dataset")
                .build();
        pipeline.register(lowLevelManualTransform);
    }
}
```

Low-level Transforms support multiple input and output datasets. Each input dataset for your compute function should be provided using putParameterToInputAlias(), and each output dataset should be provided using putParameterToOutputAlias(). These methods take the name of an input or output parameter on your compute function, followed by the full path to the corresponding dataset. For instance, in the example above, “myInput” and “myOutput” are parameter names in myComputeFunction(). Recall that the compute function for a low-level Transform writes to its output datasets rather than returning a value; this is why the output datasets are passed in as parameters to the compute function.
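For a compute function with more than one input or output, call the corresponding alias method once per parameter inside define(). Here is a minimal sketch; the compute function class MySplitFunction, its parameter names, and the dataset paths are hypothetical.

```java
// A sketch of manual registration for a low-level compute function with
// two inputs and two outputs. "MySplitFunction", its parameter names, and
// the dataset paths are hypothetical.
LowLevelTransform multiIoTransform = LowLevelTransform.builder()
        .computeFunctionInstance(new MySplitFunction())
        // One alias per input parameter of the compute function.
        .putParameterToInputAlias("firstInput", "/path/to/first/input")
        .putParameterToInputAlias("secondInput", "/path/to/second/input")
        // One alias per output parameter of the compute function.
        .putParameterToOutputAlias("firstOutput", "/path/to/first/output")
        .putParameterToOutputAlias("secondOutput", "/path/to/second/output")
        .build();
pipeline.register(multiIoTransform);
```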