Python transforms provide support for aborting a transaction, allowing a job to complete successfully without updating the output dataset (no new data is written). This is achieved by using the `transform` decorator and calling `.abort()` on the `TransformOutput` object.
Aborting transactions lets you prevent the output dataset, and therefore its downstream datasets, from updating under certain conditions. As soon as your output dataset updates, downstream datasets are considered out-of-date (stale) and will update the next time they are built (either manually or via a scheduled build). Aborting provides an alternative to failing the build, which makes it easier to identify when something is actually failing.
Aborted transactions will appear as grayed-out, successful jobs in your dataset transaction history. This enables you to differentiate at a glance whether a successful build resulted in a committed transaction or not.
Examples of when you may want to abort a transaction:

- Adding a validation dataset that calls `abort()` after a dataset that always updates but does not always produce changed output. This saves compute resources by avoiding unnecessary downstream updates.
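The control flow of such a validation step can be sketched in plain Python. The `TransformOutput` class below is a hypothetical stand-in for the real Foundry class, and the validation rule (every row must have a `trip_date`) is an assumed example, not part of the transforms API:

```python
# Minimal pure-Python sketch of the abort() control flow described above.
# TransformOutput here is a hypothetical stand-in for the real Foundry class.

class TransformOutput:
    """Stand-in: records whether data was committed or the transaction aborted."""
    def __init__(self):
        self.committed_rows = None   # None means no transaction was committed
        self.aborted = False

    def abort(self):
        # Mirrors Foundry's .abort(): the job succeeds, nothing is committed.
        self.aborted = True

    def write_dataframe(self, rows):
        self.committed_rows = rows


def validate_and_publish(rows, processed):
    """Abort when validation fails, so downstream datasets stay untouched."""
    if not all(r.get("trip_date") is not None for r in rows):
        processed.abort()
    else:
        processed.write_dataframe(rows)


out = TransformOutput()
validate_and_publish([{"trip_date": None}], out)
# Aborted: the job "succeeds" but the output view is unchanged.
assert out.aborted and out.committed_rows is None
```

Because the job still succeeds, downstream schedules see no new transaction and skip their unnecessary rebuilds.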
Below is a simple notional example where we want to ensure the dataset is updated only if data for today has arrived in the input dataset.
```python
from transforms.api import transform, Input, Output
from datetime import date


@transform(
    holiday_trips=Input('/examples/trips'),
    processed=Output('/examples/trips_processed')
)
def update_daily_trips(holiday_trips, processed):
    holiday_trips_df = holiday_trips.dataframe()
    todays_trips_df = holiday_trips_df.filter(holiday_trips_df.trip_date == date.today())
    if todays_trips_df.count() == 0:
        processed.abort()
    else:
        processed.write_dataframe(todays_trips_df)
```
Using `if len(todays_trips_df.head(1)) == 0` will usually return a result faster than `if todays_trips_df.count() == 0`, as it only checks for the existence of at least one row rather than counting all rows unnecessarily.
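The two checks are logically equivalent. A plain-Python analogue (using list slicing in place of Spark's `head(1)`) illustrates the equivalence, though the performance benefit only materializes on distributed DataFrames, where `count()` must scan every partition:

```python
# Plain-Python analogue of the two emptiness checks. With Spark,
# df.head(1) fetches at most one row while df.count() scans everything;
# here list slicing stands in for head(1).
def is_empty_via_head(rows):
    return len(rows[:1]) == 0   # analogue of len(df.head(1)) == 0

def is_empty_via_count(rows):
    return len(rows) == 0       # analogue of df.count() == 0

for sample in ([], [1], [1, 2, 3]):
    assert is_empty_via_head(sample) == is_empty_via_count(sample)
```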
When a job is marked as "ignored", the computation does not run, because Foundry determines that the job specs are not stale. When a transaction is aborted, the job does run and completes successfully; however, the output dataset is left unchanged and no transaction is committed.
Incremental transforms read the dataset views of both the inputs and outputs using only committed transactions. This means aborted transactions are ignored when performing incremental computation.
When a transaction is explicitly aborted on all outputs of an incremental transform, the next build will read (and therefore reprocess) the inputs as if the aborted transaction never occurred, and can thus still run incrementally. If a transaction is aborted on only a subset of the outputs, the build cannot run incrementally: the outputs with aborted transactions record a previous input transaction range (because aborted transactions are ignored), while the outputs with committed transactions record the current input transaction range. This mismatch in input transaction ranges means the transform can no longer run incrementally.
In a multi-output incremental transform, if a transaction is explicitly aborted on a subset of the outputs, the next build will run as a snapshot, with the failed incremental computation check `Provenance records for the previous build are inconsistent`. If `require_incremental=True` is set, the build will fail with the error `InconsistentProvenanceRecords`. This is because the current view of the outputs will have been produced by different input transactions.