Transforms classes

Class – Description

  • Check – Wraps up an Expectation such that it can be registered with Data Health.
  • FileStatus – A collections.namedtuple capturing details about a FoundryFS file.
  • FileSystem(foundry_fs[, read_only]) – A filesystem object for reading and writing dataset files.
  • IncrementalTransformContext(ctx, is_incremental) – TransformContext with added functionality for incremental computation.
  • IncrementalTransformInput(tinput[, prev_txrid]) – TransformInput with added functionality for incremental computation.
  • IncrementalTransformOutput(toutput[, …]) – TransformOutput with added functionality for incremental computation.
  • Input(alias) – Specification of a transform input.
  • Output(alias[, sever_permissions]) – Specification of a transform output.
  • Pipeline() – An object for grouping a collection of Transform objects.
  • Transform(compute_func[, inputs, outputs, ...]) – A callable object that describes a single step of computation.
  • TransformContext(foundry_connector[, parameters]) – Context object that can optionally be injected into the compute function of a transform.
  • TransformInput(rid, branch, txrange, …) – The input object passed into Transform objects at runtime.
  • LightweightInput(alias) – The input object passed into Lightweight Transform objects at runtime.
  • IncrementalLightweightInput(alias) – The input object passed into incremental Lightweight Transform objects at runtime.
  • TransformOutput(rid, branch, txrid, …) – The output object passed into Transform objects at runtime.
  • LightweightOutput(alias) – The output object passed into Lightweight Transform objects at runtime.

Check

class transforms.api.Check

Wraps up an Expectation such that it can be registered with Data Health.

  • expectation
    • Expectation – The expectation to evaluate.
  • name
    • str – The name of the check, used as a stable identifier over time.
  • is_incremental
    • bool – If the transform is running incrementally.
  • on_error
    • (str, Optional) – The action to take if the expectation is not met. Currently supported values are 'WARN' and 'FAIL'.
  • description
    • (str, Optional) – The description of the check.
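
For illustration, the sketch below attaches a Check to a transform Output. The dataset paths are hypothetical, and the transforms.expectations import and the E.primary_key helper are assumed from the Data Expectations library rather than defined in this reference:

```python
from transforms.api import transform_df, Input, Output, Check
from transforms import expectations as E  # assumed Data Expectations import


@transform_df(
    Output(
        "/Project/folder/clean_orders",  # hypothetical dataset path
        checks=Check(E.primary_key("order_id"), "Unique order id", on_error="FAIL"),
    ),
    orders=Input("/Project/folder/raw_orders"),  # hypothetical dataset path
)
def compute(orders):
    # Deduplicate so the primary key expectation can pass.
    return orders.dropDuplicates(["order_id"])
```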

FileStatus

class transforms.api.FileStatus

A collections.namedtuple capturing details about a FoundryFS file.

Create new instance of FileStatus(path, size, modified)

  • count(value) → integer -- return number of occurrences of value
  • index(value[, start[, stop]]) → integer -- return first index of value
    • Raises ValueError if the value is not present
  • modified
    • Alias for field number 2
  • path
    • Alias for field number 0
  • size
    • Alias for field number 1

FileSystem

class transforms.api.FileSystem(foundry_fs, read_only=False)

A filesystem object for reading and writing dataset files.

  • files(glob=None, regex='.*', show_hidden=False, packing_heuristic=None)
    • Create a DataFrame containing the paths accessible within this dataset.
    • The DataFrame is partitioned by file size: each partition contains file paths whose combined size is at most spark.files.maxPartitionBytes bytes, or a single file if that file is larger than spark.files.maxPartitionBytes. The size of a file is calculated as its on-disk size plus spark.files.openCostInBytes.
    • Parameters
      • glob (str ↗, optional) – A unix file matching pattern. Supports globstar; to search recursively for files of a given type (e.g. PDF), use **/*.pdf.
      • regex (str ↗, optional) – A regex pattern against which to match filenames.
      • show_hidden (bool ↗, optional) – Include hidden files, those prefixed with . or _.
      • packing_heuristic (str ↗, optional) – Specify a heuristic to use for bin-packing files into Spark partitions. Possible choices are ffd (First Fit Decreasing) or wfd (Worst Fit Decreasing). While wfd tends to produce a less even distribution, it is much faster, so wfd is recommended for datasets containing a very large number of files. If a heuristic is not specified, one will be selected automatically.
    • Returns
      • DataFrame of (path, size, modified)
    • Return type
      • pyspark.sql.DataFrame
  • ls(glob=None, regex='.*', show_hidden=False)
    • Recurses through all directories and lists all files matching the given patterns, starting from the root directory of the dataset.
    • Parameters
      • glob (str ↗, optional) – A unix file matching pattern. Supports globstar; to search recursively for files of a given type (e.g. PDF), use **/*.pdf.
      • regex (str ↗, optional) – A regex pattern against which to match filenames.
      • show_hidden (bool ↗, optional) – Include hidden files, those prefixed with . or _.
    • Yields
      • FileStatus - The logical path, file size (bytes), modified timestamp (ms since January 1, 1970 UTC).
  • open(path, mode='r', **kwargs)
    • Open a FoundryFS file in the given mode.
    • Parameters
      • path (str ↗) – The logical path of the file in the dataset.
      • mode (str ↗, optional) – The file mode, as accepted by io.open(). Defaults to 'r'.
      • kwargs – Remaining keyword arguments passed to io.open()
    • Returns
      • A Python file-like object attached to the stream.
    • Return type
      • File
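
As a sketch of how these pieces fit together (dataset aliases and column names below are hypothetical), a transform can list files with ls(), open them with open(), and write a summary DataFrame:

```python
from transforms.api import transform, Input, Output


@transform(
    parsed=Output("/Project/folder/parsed_headers"),  # hypothetical alias
    raw=Input("/Project/folder/raw_files"),           # hypothetical alias
)
def compute(ctx, parsed, raw):
    fs = raw.filesystem()

    # fs.ls() yields FileStatus(path, size, modified) tuples.
    header_rows = []
    for status in fs.ls(glob="**/*.csv"):
        # fs.open() returns a Python file-like object; this loop runs on the driver.
        with fs.open(status.path, mode="r") as f:
            header_rows.append((status.path, status.size, f.readline().strip()))

    parsed.write_dataframe(
        ctx.spark_session.createDataFrame(header_rows, ["path", "size", "header"])
    )
```

For distributed processing of very large file sets, fs.files() returns the same information as a Spark DataFrame instead of a driver-side generator.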

IncrementalTransformContext

class transforms.api.IncrementalTransformContext(ctx, is_incremental)

TransformContext with added functionality for incremental computation.

  • auth_header
    • str – The auth header used to run the transform.
  • fallback_branches
    • List[str] – The fallback branches configured when running the transform.
  • is_incremental
    • bool – If the transform is running incrementally.
  • parameters
    • dict of (str, any) – Transform parameters.
  • spark_session
    • pyspark.sql.SparkSession – The Spark session used to run the transform.
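
A minimal sketch of how is_incremental is typically consulted inside an incremental transform (dataset aliases are hypothetical):

```python
from transforms.api import incremental, transform, Input, Output


@incremental()
@transform(
    out=Output("/Project/folder/filtered_events"),  # hypothetical alias
    events=Input("/Project/folder/events"),         # hypothetical alias
)
def compute(ctx, out, events):
    df = events.dataframe()  # only newly added rows when running incrementally
    if not ctx.is_incremental:
        # Full recompute: the default read mode now covers the entire input,
        # so heavier clean-up can be applied here if needed.
        df = df.dropDuplicates()
    out.write_dataframe(df)
```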

IncrementalTransformInput

class transforms.api.IncrementalTransformInput(tinput, prev_txrid=None)

TransformInput with added functionality for incremental computation.

  • dataframe(mode='added')
    • Return a pyspark.sql.DataFrame for the given read mode.
    • Only current, previous and added modes are supported.
    • Parameters
      • mode (str ↗, optional) – The read mode, one of current, previous, added, modified, removed. Defaults to added
    • Returns
      • The dataframe for the dataset.
    • Return type
      • pyspark.sql.DataFrame
  • filesystem(mode='added')
    • Construct a FileSystem object for reading from FoundryFS for the given read mode.
    • Only current, previous and added modes are supported.
    • Parameters
      • mode (str ↗, optional) – The read mode, one of current, previous, added, modified, removed. Defaults to added
    • Returns
      • A filesystem object for the given view.
    • Return type
      • FileSystem
  • pandas()
  • branch
    • str – The branch of the input dataset.
  • path
    • str – The Project path of the input dataset.
  • rid
    • str – The resource identifier of the dataset.
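
The sketch below (hypothetical dataset aliases and columns) reads two views of the same incremental input: the rows added since the previous build and the full current view:

```python
from pyspark.sql import functions as F
from transforms.api import incremental, transform, Input, Output


@incremental()
@transform(
    out=Output("/Project/folder/enriched_events"),  # hypothetical alias
    events=Input("/Project/folder/events"),         # hypothetical alias
)
def compute(out, events):
    added = events.dataframe("added")      # rows appended since the previous build
    current = events.dataframe("current")  # the full current view of the same input

    # Enrich only the new rows, using a lookup computed over the full input.
    first_seen = current.groupBy("user_id").agg(F.min("timestamp").alias("first_seen"))
    out.write_dataframe(added.join(first_seen, "user_id", "left"))
```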

IncrementalTransformOutput

class transforms.api.IncrementalTransformOutput(toutput, prev_txrid=None, mode='replace')

TransformOutput with added functionality for incremental computation.

  • abort()
    • Aborts the transaction, allowing the job to complete successfully without writing any data. See Python Abort for more details.
  • dataframe(mode='current', schema=None)
    • Return a pyspark.sql.DataFrame ↗ for the given read mode.
    • Parameters
      • mode (str ↗, optional) – The read mode, one of current, previous, added, modified, removed. Defaults to current.
      • schema (pyspark.sql.types.StructType, optional) – A PySpark schema to use when constructing an empty dataframe. Required when using read mode ‘previous’.
    • Returns
      • The dataframe for the dataset.
    • Return type
      • pyspark.sql.DataFrame
    • Raises
      • ValueError - If no schema is passed when using mode ‘previous’
  • filesystem(mode='current')
    • Construct a FileSystem object for writing to FoundryFS.
    • Parameters
      • mode (str ↗, optional) – The read mode, one of added, current or previous. Defaults to current. Only the current filesystem is writable.
    • Raises
  • pandas(mode='current')
  • set_mode(mode)
    • Change the write mode of the dataset.
    • Parameters
      • mode (str ↗) – The write mode, one of ‘replace’ or ‘modify’. In modify mode, anything written to the output is appended to the dataset. In replace mode, anything written to the output replaces the dataset. The write mode cannot be changed after data has been written.

  • write_dataframe(df, partition_cols=None, bucket_cols=None, bucket_count=None, sort_by=None, output_format=None, options=None)
    • Write the given DataFrame ↗ to the output dataset.
    • Parameters
      • df (pyspark.sql.DataFrame ↗) – The PySpark dataframe to write.
      • partition_cols (List[str ↗], optional) - Column partitioning to use when writing data.
      • bucket_cols (List[str ↗], optional) – The columns by which to bucket the data. Must be specified if bucket_count is given.
      • bucket_count (int ↗, optional) – The number of buckets. Must be specified if bucket_cols is given.
      • sort_by (List[str ↗], optional) – The columns by which to sort the bucketed data.
      • output_format (str ↗, optional) – The output file format, defaults to ‘parquet’.
      • options (dict ↗, optional) – Extra options to pass through to org.apache.spark.sql.DataFrameWriter#option(String, String).
  • write_pandas(pandas_df)
  • branch
    • str – The branch of the dataset.
  • path
    • str – The Project path of the dataset.
  • rid
    • str – The resource identifier of the dataset.
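
A common use of the ‘previous’ read mode is appending only rows not seen in earlier builds. The sketch below (hypothetical aliases and schema) reads the output's previous transaction, which requires an explicit schema for the first run:

```python
from pyspark.sql import types as T
from transforms.api import incremental, transform, Input, Output

USER_SCHEMA = T.StructType([T.StructField("user_id", T.StringType())])


@incremental()
@transform(
    out=Output("/Project/folder/distinct_users"),  # hypothetical alias
    users=Input("/Project/folder/raw_users"),      # hypothetical alias
)
def compute(out, users):
    new_users = users.dataframe("added").select("user_id").distinct()

    # 'previous' requires a schema so an empty dataframe can be built on the first run.
    already_seen = out.dataframe("previous", schema=USER_SCHEMA)

    out.set_mode("modify")  # append only users not written in earlier builds
    out.write_dataframe(new_users.join(already_seen, "user_id", "left_anti"))
```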

Input

class transforms.api.Input(alias, branch, stop_propagating, stop_requiring, checks, failure_strategy)

Specification of a transform input.

  • Parameters
    • alias (str ↗, optional) – Dataset rid or the absolute Project path of the dataset. If not specified, the parameter is unbound.
    • branch (str ↗, optional) – Branch name to resolve the input dataset to. If not specified, the branch is resolved at build time.
    • stop_propagating (Markings, optional) – Security markings to stop propagating from this input. See the Markings and remove inherited markings documentation.
    • stop_requiring (OrgMarkings, optional) – Org markings to assume on this input.
    • checks (List[Check], Check, optional) – One or more Check objects.
    • failure_strategy (str ↗, optional) – Strategy in case the input fails to update. Must be either continue or fail. If not specified, defaults to fail.

Output

class transforms.api.Output(alias=None, sever_permissions=False, checks=None)

Specification of a transform output.

  • Parameters
    • alias (str ↗, optional) – Dataset rid or the absolute Project path of the dataset. If not specified, the parameter is unbound.
    • sever_permissions (bool ↗, optional) – If true, severs the permissions of the dataset from the permissions of its inputs. Ignored if the parameter is unbound.
    • checks (List[Check], Check, optional) – One or more Check objects.
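
The following sketch (dataset paths are hypothetical) shows how Input and Output options are typically combined in a transform definition:

```python
from transforms.api import transform_df, Input, Output


@transform_df(
    # sever_permissions detaches the output's permissions from those of its inputs.
    Output("/Project/folder/public_summary", sever_permissions=True),  # hypothetical alias
    source=Input(
        "/Project/folder/source",     # hypothetical alias
        branch="master",              # pin the input to a specific branch
        failure_strategy="continue",  # keep building even if this input fails to update
    ),
)
def compute(source):
    return source.groupBy("category").count()
```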

Pipeline

class transforms.api.Pipeline

An object for grouping a collection of Transform objects.

  • add_transforms(*transforms)
    • Register the given Transform objects to the Pipeline instance.
    • Parameters
      • transforms (Transform) – The transforms to register.
    • Raises
  • discover_transforms(*modules)
    • Recursively find and import modules registering every module-level transform.
    • This method recursively finds and imports modules starting at the given module’s path. Each module found is imported and any attribute that is an instance of Transform (as constructed by the transforms decorators) will be registered to the pipeline.
    • Parameters
      • modules (module) – The modules to start searching from.
```python
>>> import myproject
>>> p = Pipeline()
>>> p.discover_transforms(myproject)
```

Each module found is imported. Try to avoid executing code at the module level.

  • transforms
    • List[Transform] – The list of transforms registered to the pipeline.
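
For comparison with discover_transforms, transforms can also be registered explicitly with add_transforms. The module and transform names below are hypothetical:

```python
from transforms.api import Pipeline

# Hypothetical project modules, each defining decorated transform objects.
from myproject.datasets import aggregate, clean

my_pipeline = Pipeline()
my_pipeline.add_transforms(clean.clean_orders, aggregate.daily_totals)
```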

Transform

class transforms.api.Transform(compute_func, inputs=None, outputs=None, profile=None)

A callable object that describes a single step of computation.

A Transform consists of a number of Input specs, a number of Output specs, and a compute function.

It is idiomatic to construct a Transform object using the provided decorators: transform(), transform_df(), and transform_pandas().

Note: The original compute function is exposed via the Transform’s __call__ method.

  • Parameters

    • compute_func (Callable) – The compute function to wrap.
    • inputs (Dict[str ↗, Input]) - A dictionary mapping input names to Input specs.
    • outputs (Dict[str ↗, Output]) – A dictionary mapping output names to Output specs.
    • profile (str ↗, optional) – The name of the Transforms profile to use at runtime.
  • compute(ctx=None, **kwargs)

    • Compute the transform with a context and set of inputs and outputs.
    • Parameters
      • ctx (TransformContext, optional) – A context object passed to the transform if it requests one.
      • kwargs (TransformInput or TransformOutput) – The input, output and context objects to be passed by name into the compute function.
    • Returns
      • The output objects after running the transform.
    • Return Type
  • version

    • str – A string that is used to compare two versions of a transform when considering logic staleness.
    • For example, a SQL transform may take the hash of the SQL query. Ideally the SQL query would be transformed into a format that yields the same version for transforms of equivalent semantic meaning; for instance, the SQL query select A, B from foo; should have the same version as the SQL query select A, B from (select * from foo);.
    • If no version is specified, the version of the repository will be used.
    • Raises
      • ValueError – If the object hash of the compute function cannot be computed.
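
In practice a Transform is built by applying a decorator rather than by calling the constructor. The sketch below (hypothetical dataset paths, and a pytest-style spark_session fixture that is assumed, not documented here) also shows that __call__ exposes the original compute function, which keeps unit testing straightforward:

```python
from transforms.api import transform_df, Input, Output


@transform_df(
    Output("/Project/folder/filtered"),      # hypothetical alias
    source=Input("/Project/folder/source"),  # hypothetical alias
)
def filter_positive(source):
    return source.filter(source.value > 0)


# filter_positive is now a Transform object; calling it runs the original compute function.
def test_filter_positive(spark_session):  # spark_session: assumed test fixture
    df = spark_session.createDataFrame([(1,), (-2,)], ["value"])
    assert filter_positive(df).count() == 1
```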

TransformContext

class transforms.api.TransformContext(foundry_connector, parameters=None)

Context object that can optionally be injected into the compute function of a transform.

  • auth_header
    • str – The auth header used to run the transform. This auth header has a limited scope and only has the permissions required to run the job. It should not be used for API calls.
  • fallback_branches
    • List[str] – The fallback branches configured when running the transform.
  • parameters
    • dict of (str, any) – Transform parameters.
  • spark_session
    • pyspark.sql.SparkSession – The Spark session used to run the transform.
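
A minimal sketch of requesting the context and using its Spark session (the dataset alias and data are hypothetical):

```python
from transforms.api import transform, Output


@transform(out=Output("/Project/folder/reference_data"))  # hypothetical alias
def compute(ctx, out):
    # The context is injected because the compute function declares a 'ctx' argument.
    df = ctx.spark_session.createDataFrame(
        [("EUR", 1.0), ("USD", 1.08)], ["currency", "rate"]
    )
    out.write_dataframe(df)
```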

TransformInput

class transforms.api.TransformInput(rid, branch, txrange, dfreader, fsbuilder)

The input object passed into Transform objects at runtime.

  • dataframe()
  • filesystem()
    • Construct a FileSystem object for reading from FoundryFS.
    • Returns
      • A FileSystem object for reading from Foundry.
    • Return type
      • FileSystem
  • pandas()
  • branch
    • str – The branch of the input dataset.
  • path
    • str – The Project path of the input dataset.
  • rid
    • str – The resource identifier of the dataset.
  • column_descriptions
    • Dict[str, str] – The column descriptions of the dataset.
  • column_typeclasses
    • Dict[str, str] – The column type classes of the dataset.
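
For smaller datasets that fit in driver memory, the pandas() accessor pairs naturally with TransformOutput.write_pandas(). A short sketch (hypothetical aliases and columns):

```python
from transforms.api import transform, Input, Output


@transform(
    out=Output("/Project/folder/order_totals"),  # hypothetical alias
    orders=Input("/Project/folder/orders"),      # hypothetical alias
)
def compute(out, orders):
    pdf = orders.pandas()  # whole dataset as a pandas DataFrame; must fit in memory
    pdf["total"] = pdf["price"] * pdf["quantity"]
    out.write_pandas(pdf)
```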

LightweightInput

class transforms.api.LightweightInput(alias)

Mimics a subset of the API of TransformInput by delegating to the Foundry Data Sidecar, while extending it with support for various data formats.
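
As a sketch only: the example below assumes the lightweight decorator is available from transforms.api and that LightweightInput exposes pandas() and LightweightOutput exposes write_pandas(), mirroring their non-lightweight counterparts. Dataset aliases are hypothetical.

```python
from transforms.api import lightweight, transform, Input, Output


@lightweight()
@transform(
    out=Output("/Project/folder/clean"),  # hypothetical alias
    src=Input("/Project/folder/raw"),     # hypothetical alias
)
def compute(out, src):
    # src is a LightweightInput; data is served by the Foundry Data Sidecar.
    pdf = src.pandas()           # assumed to mirror TransformInput.pandas()
    out.write_pandas(pdf.dropna())
```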


IncrementalLightweightInput

class transforms.api.IncrementalLightweightInput(alias)

Mimics a subset of the API of IncrementalTransformInput by delegating to the Foundry Data Sidecar, while extending it with support for various data formats. It is the incremental counterpart of LightweightInput.

  • dataframe(mode='added')
    • Return a pandas.DataFrame ↗ with the dataset. This is an alias for pandas().
    • Parameters
      • mode (str ↗, optional) – The read mode, one of current, previous, added, modified, removed. Defaults to added.
    • Returns
      • The dataframe for the dataset.
    • Return type
      • pandas.DataFrame
  • filesystem()
    • Construct a FileSystem object for reading from FoundryFS.
    • Returns
      • A FileSystem object for reading from Foundry.
    • Return type
      • FileSystem
  • pandas(mode='added')
    • Return a pandas.DataFrame ↗ with the dataset.
    • Parameters
      • mode (str ↗, optional) – The read mode, one of current, previous, added, modified, removed. Defaults to added.
    • Returns
      • The dataframe for the dataset.
    • Return type
      • pandas.DataFrame
  • arrow(mode='added')
    • Return a pyarrow.Table ↗ with the dataset.
    • Parameters
      • mode (str ↗, optional) – The read mode, one of current, previous, added, modified, removed. Defaults to added.
    • Returns
      • The table for the dataset.
    • Return type
      • pyarrow.Table
  • polars(lazy=False, mode)
  • path(mode)
    • Return a str ↗ with the path to a folder containing the downloaded dataset files, which may be CSV, Parquet, or Avro files.
    • Parameters
      • mode (str ↗, optional) – The read mode, one of current, previous, added, modified, removed. Defaults to added
    • Returns
      • Path to the directory containing the dataset's files.
    • Return type
      • str
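
A heavily hedged sketch of an incremental lightweight transform: the decorator stacking order, the mode keyword on pandas(), and write_pandas() on the output are assumptions based on the non-lightweight API rather than details documented above. Dataset aliases are hypothetical.

```python
from transforms.api import incremental, lightweight, transform, Input, Output


@lightweight()
@incremental()
@transform(
    out=Output("/Project/folder/new_events"),  # hypothetical alias
    events=Input("/Project/folder/events"),    # hypothetical alias
)
def compute(out, events):
    # Only the rows appended since the previous build (assumed keyword usage).
    added = events.pandas(mode="added")
    out.write_pandas(added)
```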

TransformOutput

class transforms.api.TransformOutput(rid, branch, txrid, dfreader, dfwriter, fsbuilder)

The output object passed into Transform objects at runtime.

  • abort()
    • Aborts the transaction, allowing the job to complete successfully without writing any data. See Python Abort for more details.
  • dataframe()
  • filesystem()
    • Construct a FileSystem object for writing to FoundryFS.
    • Returns
      • A FileSystem object for writing to Foundry.
    • Return type
      • FileSystem
  • pandas()
  • set_mode(mode)
    • Change the write mode of the dataset.
    • Parameters
      • mode (str ↗) – The write mode, one of ‘replace’ or ‘modify’. In modify mode anything written to the output is appended to the dataset. In replace mode, anything written to the output replaces the dataset.
  • write_dataframe(df, partition_cols=None, bucket_cols=None, bucket_count=None, sort_by=None, output_format=None, options=None, column_descriptions=None, column_typeclasses=None)
    • Write the given DataFrame ↗ to the output dataset.
    • Parameters
      • df (pyspark.sql.DataFrame ↗) – The PySpark dataframe to write.
      • partition_cols (List[str ↗], optional) - Column partitioning to use when writing data.
      • bucket_cols (List[str ↗], optional) - The columns by which to bucket the data. Must be specified if bucket_count is given.
      • bucket_count (int ↗, optional) – The number of buckets. Must be specified if bucket_cols is given.
      • sort_by (List[str ↗], optional) - The columns by which to sort the bucketed data.
      • output_format (str ↗, optional) - The output file format, defaults to 'parquet'. The file formats are based on Spark's DataFrameWriter ↗ and other types include 'csv', 'json', 'orc' and 'text'.
      • options (dict ↗, optional) - Extra options to pass through to org.apache.spark.sql.DataFrameWriter#option(String, String).
      • column_descriptions (Dict[str ↗, str ↗], optional) - Map of column names to their string descriptions. This map is intersected with the columns of the DataFrame, and must include descriptions (max 800 characters).
      • column_typeclasses (Dict[str ↗, List[Dict[str ↗, str ↗]]], optional) - Map of column names to their column type classes. Each type class in the list is a Dict[str ↗, str ↗], where only two keys are valid: "name" and "kind". Each of these keys maps to the corresponding string the user wants.
  • write_pandas(pandas_df)
  • branch
    • str – The branch of the dataset.
  • path
    • str – The Project path of the dataset.
  • rid
    • str – The resource identifier of the dataset.
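
A short sketch of write_dataframe with column metadata (dataset aliases, column names, and the "name"/"kind" values are hypothetical illustrations, not prescribed values):

```python
from transforms.api import transform, Input, Output


@transform(
    out=Output("/Project/folder/orders_clean"),  # hypothetical alias
    orders=Input("/Project/folder/orders"),      # hypothetical alias
)
def compute(out, orders):
    df = orders.dataframe().dropDuplicates(["order_id"])

    out.write_dataframe(
        df,
        partition_cols=["order_date"],
        column_descriptions={"order_id": "Unique identifier for the order"},
        column_typeclasses={"order_id": [{"name": "id", "kind": "identifier"}]},
    )
```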

LightweightOutput

class transforms.api.LightweightOutput(alias)

Mimics a subset of the API of TransformOutput by delegating to the Foundry Data Sidecar.