Incremental computation is an efficient method of performing transforms to generate an output dataset. By leveraging the build history of a transform, incremental computation avoids the need to recompute the entire output dataset every time a transform is run.
For end-to-end guidance on how to create and manage incremental pipelines, see the building pipelines section.
In this section, we examine the benefits of incremental transforms by first considering a code example that does not use incremental transforms:
from transforms.api import transform, Input, Output


@transform(
    students=Input('/examples/students_hair_eye_color'),
    processed=Output('/examples/hair_eye_color_processed')
)
def filter_hair_color(students, processed):
    # type: (TransformInput, TransformOutput) -> None
    students_df = students.dataframe()
    # this is inefficient, as the filter function is performed over the entire input rather than on new data only
    processed.write_dataframe(students_df.filter(students_df.hair == 'Brown'))
If any new data is added to the /examples/students_hair_eye_color input dataset, the filter() is performed over the entire input, rather than just the new data added to the input. This wastes both compute resources and time.
If a transform can become aware of its build history, it can be smarter about how to compute its output. More specifically, it can use the changes made to the inputs to modify the output dataset. This process of using already materialized data when re-materializing tables is called incremental computation. Without incremental computation, the output dataset is always replaced by the latest output of the transform.
Let’s go back to the example transform shown above. The transform performs a filter() over the students dataset to write out students with brown hair. With incremental computation, if data about two new students is appended to students, the transform can use information about its build history to append only the new brown-haired students to the output:
+---+-----+-----+------+                  +---+-----+-----+------+
| id| hair|  eye|   sex|                  | id| hair|  eye|   sex|
+---+-----+-----+------+     Build 1      +---+-----+-----+------+
| 17|Black|Green|Female|    --------->    | 18|Brown|Green|Female|
| 18|Brown|Green|Female|                  +---+-----+-----+------+
| 19|  Red|Black|Female|
+---+-----+-----+------+
...                                       ...
+---+-----+-----+------+     Build 2      +---+-----+-----+------+
| 20|Brown|Amber|Female|    --------->    | 20|Brown|Amber|Female|
| 21|Black| Blue|  Male|                  +---+-----+-----+------+
+---+-----+-----+------+
The example transform above can therefore be rewritten using incremental logic with the following syntax:
from transforms.api import transform, incremental, Input, Output


@incremental()
@transform(
    students=Input('/examples/students_hair_eye_color'),
    processed=Output('/examples/hair_eye_color_processed')
)
def filter_hair_color(students, processed):
    # type: (IncrementalTransformInput, IncrementalTransformOutput) -> None
    students_df = students.dataframe('added')
    processed.write_dataframe(students_df.filter(students_df.hair == 'Brown'))
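To make the two builds illustrated earlier concrete, the following is a minimal plain-Python simulation of the idea. It does not use the transforms API: the class `SimulatedIncrementalFilter` and its attributes are hypothetical names introduced only for illustration, and the rows elided by "..." in the tables are ignored. It assumes the input is append-only between builds.

```python
# Plain-Python simulation of incremental filtering; NOT the transforms API.
# All names below are hypothetical and exist only for this illustration.

ROWS_BUILD_1 = [
    {"id": 17, "hair": "Black", "eye": "Green", "sex": "Female"},
    {"id": 18, "hair": "Brown", "eye": "Green", "sex": "Female"},
    {"id": 19, "hair": "Red", "eye": "Black", "sex": "Female"},
]
ROWS_BUILD_2 = [
    {"id": 20, "hair": "Brown", "eye": "Amber", "sex": "Female"},
    {"id": 21, "hair": "Black", "eye": "Blue", "sex": "Male"},
]


class SimulatedIncrementalFilter:
    """Stand-in for a transform that remembers its build history."""

    def __init__(self):
        self.rows_seen = 0     # input rows consumed by earlier builds
        self.output = []       # materialized output, appended to on each build
        self.rows_scanned = 0  # total rows filtered across all builds

    def build(self, input_rows):
        # Only rows appended since the previous build are read.
        added = input_rows[self.rows_seen:]
        self.rows_scanned += len(added)
        new_output = [row for row in added if row["hair"] == "Brown"]
        self.output.extend(new_output)
        self.rows_seen = len(input_rows)
        return new_output


def full_recompute(input_rows):
    """Non-incremental equivalent: filter the entire input every time."""
    return [row for row in input_rows if row["hair"] == "Brown"]


job = SimulatedIncrementalFilter()
students = list(ROWS_BUILD_1)
build_1 = job.build(students)   # scans 3 rows

students.extend(ROWS_BUILD_2)   # two new students are appended
build_2 = job.build(students)   # scans only the 2 new rows

print([row["id"] for row in build_1])  # [18]
print([row["id"] for row in build_2])  # [20]
print(job.output == full_recompute(students))  # True
```

In this simulation the incremental job scans 5 rows in total across both builds, whereas recomputing from scratch on each build would scan 3 and then 5 rows, and both approaches end with the same output.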
For more information on incremental transforms and the @incremental decorator, proceed to the Incremental Transforms Reference.
It is safe to compute a join when one of the datasets is a reference table that is read in full and the other is read incrementally. However, reading both datasets in a join incrementally requires special handling; see the example on leveraging incremental transforms to join large datasets.
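The safe case described above can be sketched in plain Python. This is an illustration under stated assumptions rather than transforms API code: the `EYE_CODES` reference table, the `eye_code` column, and the helper `join_with_reference` are all hypothetical, and the reference table is assumed not to change between builds.

```python
# Plain-Python simulation with hypothetical data; NOT the transforms API.

# Reference table: small, and read in full on every build.
EYE_CODES = {"Green": "GRN", "Black": "BLK", "Amber": "AMB", "Blue": "BLU"}

# Rows appended to the incremental dataset in each build.
BUILD_1_ROWS = [
    {"id": 17, "eye": "Green"},
    {"id": 18, "eye": "Green"},
    {"id": 19, "eye": "Black"},
]
BUILD_2_ROWS = [
    {"id": 20, "eye": "Amber"},
    {"id": 21, "eye": "Blue"},
]


def join_with_reference(rows, reference):
    """Inner-join rows against the complete reference table on eye color."""
    return [
        {**row, "eye_code": reference[row["eye"]]}
        for row in rows
        if row["eye"] in reference
    ]


# Incremental path: each build joins only the newly added rows against the
# full reference table and appends the result to the existing output.
incremental_output = []
incremental_output.extend(join_with_reference(BUILD_1_ROWS, EYE_CODES))
incremental_output.extend(join_with_reference(BUILD_2_ROWS, EYE_CODES))

# Full recompute over all rows, for comparison.
full_output = join_with_reference(BUILD_1_ROWS + BUILD_2_ROWS, EYE_CODES)

print(incremental_output == full_output)  # True
```

The two results match here because every new row is compared against the complete reference table. If both sides were read incrementally, a row added to one side in a later build would never be compared with rows that arrived on the other side in earlier builds, which is why that case needs special handling.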
Incremental computation is now supported for lightweight transforms. View an example lightweight incremental transform below.