Class | Description |
---|---|
Check | Wraps up an Expectation such that it can be registered with Data Health. |
FileStatus | A collections.namedtuple capturing details about a FoundryFS file. |
FileSystem(foundry_fs[, read_only]) | A filesystem object for reading and writing dataset files. |
IncrementalTransformContext (ctx, is_incremental) | TransformContext with added functionality for incremental computation. |
IncrementalTransformInput (tinput[, prev_txrid]) | TransformInput with added functionality for incremental computation. |
IncrementalTransformOutput (toutput[, …]) | TransformOutput with added functionality for incremental computation. |
Input (alias) | Specification of a transform input. |
Output (alias[, sever_permissions]) | Specification of a transform output. |
Pipeline () | An object for grouping a collection of Transform objects. |
Transform (compute_func[, inputs, outputs, ...]) | A callable object that describes a single step of computation. |
TransformContext (foundry_connector[, parameters]) | Context object that can optionally be injected into the compute function of a transform. |
TransformInput (rid, branch, txrange, …) | The input object passed into Transform objects at runtime. |
LightweightInput (alias) | The input object passed into Lightweight Transform objects at runtime. |
IncrementalLightweightInput (alias) | The input object passed into incremental Lightweight Transform objects at runtime. |
TransformOutput (rid, branch, txrid, …) | The output object passed into Transform objects at runtime. |
LightweightOutput (alias) | The output object passed into Lightweight Transform objects at runtime. |
Check
class transforms.api.Check
Wraps up an Expectation such that it can be registered with Data Health.
expectation
name
is_incremental
on_error
description
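For example, a Check is built from an expectation and attached to an Input or Output spec via its checks argument. A minimal sketch, assuming placeholder dataset paths and an id column:

```python
from transforms.api import transform_df, Input, Output, Check
from transforms import expectations as E


@transform_df(
    Output(
        "/Project/folder/cleaned",  # placeholder output path
        checks=Check(E.primary_key("id"), "valid primary key", on_error="fail"),
    ),
    source=Input("/Project/folder/raw"),  # placeholder input path
)
def compute(source):
    # Deduplicate so the primary key expectation can pass.
    return source.dropDuplicates(["id"])
```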
FileStatus
class transforms.api.FileStatus
A collections.namedtuple
capturing details about a FoundryFS file.
Create new instance of FileStatus(path, size, modified)
count (value) → integer -- return number of occurrences of value
index (value[, start[, stop]]) → integer -- return first index of value
modified
path
size
FileSystem
class transforms.api.FileSystem
(foundry_fs, read_only=False)
A filesystem object for reading and writing dataset files.
files
(glob=None, regex='.*', show_hidden=False, packing_heuristic=None)
Create a DataFrame containing the paths accessible within this dataset. The DataFrame is partitioned by file size, where each partition contains file paths whose combined size is at most spark.files.maxPartitionBytes bytes, or by a single file if it is larger than spark.files.maxPartitionBytes. The size of a file is calculated as its on-disk file size plus the spark.files.openCostInBytes.
glob – a file matching pattern, e.g. *.pdf; to match recursively, use **/*.pdf.
show_hidden – include hidden files, those prefixed with . or _.
packing_heuristic – ffd (First Fit Decreasing) or wfd (Worst Fit Decreasing). While wfd tends to produce a less even distribution, it is much faster, so wfd is recommended for datasets containing a very large number of files. If a heuristic is not specified, one will be selected automatically.
ls
(glob=None, regex='.*', show_hidden=False)
Yields FileStatus – the logical path, file size (bytes), and modified timestamp (ms since January 1, 1970 UTC). show_hidden includes hidden files, those prefixed with . or _.
open
(path, mode='r', **kwargs)
kwargs are keyword arguments passed to io.open().
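As an illustration, the FileSystem of an input or output is obtained with filesystem() inside a transform; the sketch below copies CSV files from input to output and assumes placeholder dataset paths:

```python
from transforms.api import transform, Input, Output


@transform(
    raw=Input("/Project/folder/raw_files"),      # placeholder input path
    processed=Output("/Project/folder/parsed"),  # placeholder output path
)
def compute(raw, processed):
    fs_in = raw.filesystem()
    fs_out = processed.filesystem()
    # Recursively list every CSV file in the input dataset.
    for status in fs_in.ls(glob="**/*.csv"):
        # status is a FileStatus namedtuple: (path, size, modified)
        with fs_in.open(status.path, mode="r") as src, \
                fs_out.open(status.path, mode="w") as dst:
            dst.write(src.read())
```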
IncrementalTransformContext
class transforms.api.IncrementalTransformContext
(ctx, is_incremental)
TransformContext with added functionality for incremental computation.
auth_header
fallback_branches
is_incremental
parameters
spark_session
IncrementalTransformInput
class transforms.api.IncrementalTransformInput
(tinput, prev_txrid=None)
TransformInput with added functionality for incremental computation.
dataframe
(mode='added')
Returns a pyspark.sql.DataFrame for the given read mode.
filesystem
(mode='added')
pandas()
branch
path
rid
IncrementalTransformOutput
class transforms.api.IncrementalTransformOutput
(toutput, prev_txrid=None, mode='replace')
TransformOutput with added functionality for incremental computation.
abort()
dataframe
(mode='current', schema=None)
Raises ValueError – if no schema is passed when using mode 'previous'.
filesystem
(mode='current')
Raises NotImplementedError – not currently supported.
pandas
(mode='current')
set_mode
(mode)
The write mode cannot be changed after data has been written.
write_dataframe
(df, partition_cols=None, bucket_cols=None, bucket_count=None, sort_by=None, output_format=None, options=None)
options – passed to org.apache.spark.sql.DataFrameWriter#option(String, String).
write_pandas
(pandas_df)
branch
path
rid
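To illustrate, these incremental objects are normally obtained by wrapping a transform with the incremental() decorator. A minimal sketch with placeholder dataset paths:

```python
from transforms.api import transform, incremental, Input, Output


@incremental()
@transform(
    events=Input("/Project/folder/events"),   # placeholder input path
    totals=Output("/Project/folder/totals"),  # placeholder output path
)
def compute(ctx, events, totals):
    if ctx.is_incremental:
        # Read only the rows added since the previous build and append them.
        new_rows = events.dataframe("added")
        totals.set_mode("modify")
    else:
        # Fall back to a full recompute.
        new_rows = events.dataframe("current")
        totals.set_mode("replace")
    totals.write_dataframe(new_rows)
```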
Input
class transforms.api.Input
(alias, branch, stop_propagating, stop_requiring, checks)
Specification of a transform input.
checks – One or more Check objects.
The check failure strategy is either continue or fail; if not specified, defaults to fail.
Output
class transforms.api.Output
(alias=None, sever_permissions=False, checks=None)
Specification of a transform output.
checks – One or more Check objects.
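For example, both specs are declared on the transform decorator; branch is optional, and the dataset paths and branch name below are placeholders:

```python
from transforms.api import transform_df, Input, Output


@transform_df(
    Output("/Project/folder/report"),  # placeholder output path
    source=Input("/Project/folder/source", branch="develop"),  # read from a specific branch
)
def compute(source):
    return source.select("id", "name")
```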
Pipeline
class transforms.api.Pipeline
An object for grouping a collection of Transform objects.
add_transforms
(*transforms)
Raises ValueError – if multiple Transform objects write to the same Output alias.
discover_transforms
(*modules)
Module-level attributes of type Transform (as constructed by the transforms decorators) will be registered to the pipeline.
>>> import myproject
>>> p = Pipeline()
>>> p.discover_transforms(myproject)
Each module found is imported. Try to avoid executing code at the module-level.
transforms
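Alongside discover_transforms, Transform objects can also be registered explicitly with add_transforms. A minimal sketch, assuming a hypothetical module myproject.datasets.clean that defines my_transform using one of the transforms decorators:

```python
from transforms.api import Pipeline

from myproject.datasets.clean import my_transform  # hypothetical module and transform

my_pipeline = Pipeline()
my_pipeline.add_transforms(my_transform)
```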
Transform
class transforms.api.Transform
(compute_func, inputs=None, outputs=None, profile=None)
A callable object that describes a single step of computation.
A Transform consists of a number of Input
specs, a number of Output
specs, and a compute function.
It is idiomatic to construct a Transform object using the provided decorators: transform()
, transform_df()
, and transform_pandas()
.
Note: The original compute function is exposed via the Transform’s __call__
method.
Parameters
compute
(ctx=None, **kwargs)
The compute function; its keyword arguments correspond to the Input specs (kwarg is shorthand for keyword argument).
version
A string used to compare versions of a transform's logic; for example, the SQL transform select A, B from foo; should be the same version as the SQL query select A, B from (select * from foo);.
Raises ValueError – if it fails to compute an object hash of the compute function.
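For example, a Transform is usually produced by applying the transform() decorator to a compute function; the keyword names in the decorator become the compute function's parameters. A sketch with placeholder dataset paths:

```python
from transforms.api import transform, Input, Output


@transform(
    out=Output("/Project/folder/combined"),  # placeholder output path
    first=Input("/Project/folder/first"),    # placeholder input path
    second=Input("/Project/folder/second"),  # placeholder input path
)
def combine(out, first, second):
    # first/second are TransformInput objects, out is a TransformOutput.
    out.write_dataframe(first.dataframe().unionByName(second.dataframe()))

# combine is now a Transform object; the original compute function
# remains accessible through its __call__ method.
```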
TransformContext
class transforms.api.TransformContext
(foundry_connector, parameters=None)
Context object that can optionally be injected into the compute function of a transform.
auth_header
fallback_branches
parameters
spark_session
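For instance, the injected context gives access to the Spark session, which lets a transform with no inputs create data from scratch. A sketch with a placeholder output path and made-up rows:

```python
from transforms.api import transform, Output


@transform(out=Output("/Project/folder/constants"))  # placeholder output path
def compute(ctx, out):
    # ctx is the injected TransformContext.
    df = ctx.spark_session.createDataFrame(
        [("EUR", 1.0), ("USD", 1.08)],
        schema=["currency", "rate"],
    )
    out.write_dataframe(df)
```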
TransformInput
class transforms.api.TransformInput
(rid, branch, txrange, dfreader, fsbuilder)
The input object passed into Transform objects at runtime.
dataframe()
filesystem()
pandas()
branch
path
rid
column_descriptions
column_typeclasses
LightweightInput
class transforms.api.LightweightInput
(alias)
Its aim is to mimic a subset of the API of TransformInput
by delegating to the Foundry Data Sidecar while extending it with support for various data formats.
dataframe()
pandas()
filesystem()
pandas()
arrow()
polars(lazy: Optional[bool]=False)
The result depends on the lazy parameter.
path()
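As an illustration, a Lightweight transform is declared by stacking lightweight() on top of transform(); the input then exposes the formats listed above. A sketch assuming placeholder dataset paths, an age column, and that write_table accepts a polars DataFrame:

```python
import polars as pl

from transforms.api import lightweight, transform, Input, Output


@lightweight()
@transform(
    source=Input("/Project/folder/people"),   # placeholder input path
    result=Output("/Project/folder/adults"),  # placeholder output path
)
def compute(source, result):
    lazy_frame = source.polars(lazy=True)  # polars LazyFrame
    result.write_table(lazy_frame.filter(pl.col("age") >= 18).collect())
```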
IncrementalLightweightInput
class transforms.api.IncrementalLightweightInput
(alias)
Its aim is to mimic a subset of the API of IncrementalTransformInput
by delegating to the Foundry Data Sidecar while extending it with support for various data formats. It's the incremental counterpart of LightweightInput.
dataframe
(mode)
pandas()
filesystem()
pandas
(mode)
arrow
(mode)
polars
(lazy=False, mode)
The result depends on the lazy parameter.
path
(mode)
TransformOutput
class transforms.api.TransformOutput
(rid, branch, txrid, dfreader, dfwriter, fsbuilder)
The output object passed into Transform objects at runtime.
abort()
dataframe()
filesystem()
pandas()
set_mode
(mode)
write_dataframe
(df, partition_cols=None, bucket_cols=None, bucket_count=None, sort_by=None, output_format=None, options=None, column_descriptions=None, column_typeclasses=None)
bucket_cols – required if bucket_count is given.
bucket_count – required if bucket_cols is given.
options – passed to org.apache.spark.sql.DataFrameWriter#option(String, String).
write_pandas
(pandas_df)
branch
path
rid
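For example, the partitioning, options, and column_descriptions parameters might be combined as below; the dataset paths, partition column, and description text are placeholders:

```python
from transforms.api import transform, Input, Output


@transform(
    source=Input("/Project/folder/raw_events"),  # placeholder input path
    output=Output("/Project/folder/events"),     # placeholder output path
)
def compute(source, output):
    df = source.dataframe()
    output.write_dataframe(
        df,
        partition_cols=["event_date"],              # hive-style partitioning
        options={"compression": "snappy"},          # passed to DataFrameWriter#option
        column_descriptions={"event_date": "Day the event occurred"},
    )
```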
LightweightOutput
class transforms.api.LightweightOutput
(alias)
Its aim is to mimic a subset of the API of TransformOutput
by delegating to the Foundry Data Sidecar.
filesystem()
dataframe()
pandas()
pandas()
arrow()
polars(lazy: Optional[bool]=False)
The result depends on the lazy parameter.
path()
write_pandas
(pandas_df)
Delegates to write_table.
write_table
(df)
Writes the given dataframe or path to the output dataset. In case a path is given (either as str or pathlib.Path), its value must match the value returned by path_for_write_table.
df – The dataframe to write.
path_for_write_table
The path to be used with write_table.
set_mode
(mode)
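A minimal sketch of the pandas path, assuming placeholder dataset paths and an amount column:

```python
import pandas as pd

from transforms.api import lightweight, transform, Input, Output


@lightweight()
@transform(
    source=Input("/Project/folder/orders"),  # placeholder input path
    result=Output("/Project/folder/large"),  # placeholder output path
)
def compute(source, result):
    df: pd.DataFrame = source.pandas()
    result.write_pandas(df[df["amount"] > 100])  # delegates to write_table
```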