Abort transactions

Python transforms provides support for aborting a transaction, allowing a job to complete successfully while leaving the output dataset unchanged (no new data is written to the dataset). This is achieved by using the transform decorator and calling .abort() on the TransformOutput object.

Aborting transactions is useful when you need to prevent the output dataset, and the datasets downstream of it, from updating under certain conditions. As soon as your output dataset updates, downstream datasets are considered out-of-date (stale) and will update when they are next built (either manually or via a scheduled build). Aborting also provides an alternative to failing the build, making it easier to identify when something is actually failing (see the sketch below).
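As a minimal sketch of this difference, using hypothetical dataset paths and only the APIs shown elsewhere on this page:

```python
from transforms.api import transform, Input, Output


@transform(
    raw=Input('/examples/raw'),           # hypothetical input path
    out=Output('/examples/processed')     # hypothetical output path
)
def my_transform(raw, out):
    df = raw.dataframe()
    if len(df.head(1)) == 0:
        # Raising an exception would fail the build, which is noisy when
        # "no new data" is an expected condition rather than an error:
        # raise ValueError('no data to process')

        # Aborting instead completes the job successfully, commits no
        # transaction, and does not mark downstream datasets as stale.
        out.abort()
    else:
        out.write_dataframe(df)
```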

Aborted transactions will appear as grayed-out, successful jobs in your dataset transaction history. This enables you to differentiate at a glance whether a successful build resulted in a committed transaction or not.

[Screenshot: transaction committed vs aborted]

Examples of when you may want to abort a transaction:

  • You have a custom condition for updating your dataset that is based on the contents of the input datasets.
  • Your dataset requires force building because it never becomes stale; for example, a dataset that fetches data via API calls rather than reading input datasets.
  • You have a writeback dataset that always updates when it is scheduled (see the detailed example further below).

Adding a validation dataset that uses abort(), placed after a dataset that always updates but does not always produce changed output, saves compute resources by avoiding unnecessary updates downstream (see the sketch below).
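As a rough illustration, the following sketch uses an incremental transform to compare the freshly fetched data against the previous output, aborting when nothing has changed. The dataset paths are hypothetical, and the pattern of reading the previous output view via output.dataframe('previous', schema) assumes an incremental transform; adapt it to your setup:

```python
from transforms.api import transform, incremental, Input, Output


@incremental()
@transform(
    fetched=Input('/examples/api_fetched_data'),          # hypothetical path
    validated=Output('/examples/api_fetched_validated')   # hypothetical path
)
def validate(fetched, validated):
    new_df = fetched.dataframe('current')
    previous_df = validated.dataframe('previous', new_df.schema)

    # Compute the symmetric difference; if it is empty, the fetch produced
    # exactly the same data as last time and the update can be skipped.
    unchanged = (len(new_df.exceptAll(previous_df).head(1)) == 0
                 and len(previous_df.exceptAll(new_df).head(1)) == 0)

    if unchanged:
        validated.abort()  # job succeeds, but downstream is left untouched
    else:
        validated.set_mode('replace')
        validated.write_dataframe(new_df)
```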

Example: notional custom condition

Below is a simple notional example where we want to ensure that the dataset is only updated if data for today has arrived in the input dataset.

```python
from transforms.api import transform, Input, Output
from datetime import date


@transform(
    holiday_trips=Input('/examples/trips'),
    processed=Output('/examples/trips_processed')
)
def update_daily_trips(holiday_trips, processed):
    holiday_trips_df = holiday_trips.dataframe()
    todays_trips_df = holiday_trips_df.filter(holiday_trips_df.trip_date == date.today())
    if todays_trips_df.count() == 0:
        processed.abort()
    else:
        processed.write_dataframe(todays_trips_df)
```
Tip

Using if len(todays_trips_df.head(1)) == 0 will usually return a result faster than if todays_trips_df.count() == 0, as it only checks for the existence of at least one row rather than unnecessarily counting all rows.

How does an aborted transaction differ from an ignored job?

When a job is marked as "ignored", the computation does not run because Foundry determines that the job specs are not stale. When a transaction is aborted, the job does run and completes successfully; however, the output dataset is left unchanged and no transaction is committed.

[Screenshot: transaction aborted vs job ignored]

How do aborted transactions relate to incremental transactions?

Incremental transforms read the dataset views of both the inputs and the outputs using only committed transactions. This means they ignore aborted transactions when performing incremental computation.

When a transaction is explicitly aborted on all outputs of an incremental transform, the next build will read (and therefore reprocess) the inputs as if the aborted transaction never occurred, and can thus still run incrementally. If a transaction is aborted on only a subset of the outputs, the build will not be able to run incrementally: because aborted transactions are ignored, the job specs for the outputs with aborted transactions still reference a previous input transaction range, while the job specs for the outputs with committed transactions reference the current input transaction range. This mismatch in input transaction ranges means the transform can no longer run incrementally, as sketched below.
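For example, in a notional two-output incremental transform (the dataset paths and the status column are illustrative), the abort must be applied to every output to keep subsequent builds incremental:

```python
from transforms.api import transform, incremental, Input, Output


@incremental()
@transform(
    events=Input('/examples/events'),          # hypothetical path
    clean=Output('/examples/events_clean'),
    errors=Output('/examples/events_errors')
)
def process_events(events, clean, errors):
    new_events = events.dataframe()  # incremental default: only unprocessed rows
    if len(new_events.head(1)) == 0:
        # Abort on ALL outputs: aborting only one of them would leave the
        # outputs' job specs pointing at different input transaction ranges,
        # forcing the next build to run as a snapshot.
        clean.abort()
        errors.abort()
    else:
        clean.write_dataframe(new_events)
        errors.write_dataframe(new_events.filter(new_events.status == 'error'))
```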

Warning

In a multi-output incremental transform, if a transaction is explicitly aborted on a subset of the outputs, the next build will run as a snapshot, with the incremental computation check failing with the message Provenance records for the previous build are inconsistent. If require_incremental=True is set, the build will instead fail with the error InconsistentProvenanceRecords. This is because the current view of the outputs will have been produced by different input transactions.