Lightweight transforms represent a new backend for running your Python data processing pipelines, while allowing you to use most of the transform API you are already familiar with.
As individual computers become more powerful, an increasing number of data transformations can be run on a single node. This means that, in the case of small-to-medium sized datasets, transformations can be executed without relying on distributed parallelism. This approach can reduce the overhead associated with the distributed orchestration of Spark executors and enable the use of single-node alternatives for authoring data pipelines, such as Polars or DuckDB.
As we continue to expand our lightweight transforms capabilities, we recommend that you always upgrade your repositories to version 5.400.0 or similar of foundry-transforms-lib-python to stay up to date with our latest features.
Lightweight transforms are built on top of the container orchestration infrastructure, which must be present on your Foundry enrollment to use its features.
This example shows how to use a lightweight transform in a Python transform pipeline. Suppose we have the following Spark pipeline using pandas via @transform_pandas:
```python
from transforms.api import transform_pandas, Input, Output

@transform_pandas(
    Output('/Project/folder/output'),
    df=Input('/Project/folder/input'),
)
def compute(df):
    return (
        df[df['Name'].str.startswith("A")]
        .loc[:, ['Name', 'Age']]
        .sort_values(by="Age")
    )
```
To turn this into a lightweight transform, you need to:
1. Install foundry-transforms-lib-python from the Libraries tab.
2. Add @lightweight on top of your existing decorators, as shown in the following code snippet:
```python
from transforms.api import transform_pandas, Input, Output, lightweight

@lightweight
@transform_pandas(
    Output('/Project/folder/output'),
    df=Input('/Project/folder/input'),
)
def compute(df):
    return (
        df[df['Name'].str.startswith("A")]
        .loc[:, ['Name', 'Age']]
        .sort_values(by="Age")
    )
```
Moving to a lightweight transform, as shown above, approximately doubles the speed of the transform on small data.
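Because the body of compute is plain pandas, its logic can be tried on a small in-memory DataFrame before it runs in a pipeline. The following is a minimal sketch under stated assumptions: the function name filter_names and the sample rows (including the extra City column) are hypothetical and are not part of the original pipeline.

```python
import pandas as pd


def filter_names(df: pd.DataFrame) -> pd.DataFrame:
    """Keep names starting with "A", select two columns, sort by age."""
    return (
        df[df['Name'].str.startswith("A")]
        .loc[:, ['Name', 'Age']]
        .sort_values(by="Age")
    )


# Hypothetical sample rows standing in for the input dataset.
sample = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Aaron', 'Anna'],
    'Age': [34, 28, 19, 41],
    'City': ['Oslo', 'Rome', 'Lima', 'Kyiv'],
})

result = filter_names(sample)
print(result)
```

The result keeps only the Name and Age columns for the three names that start with "A", ordered by ascending age.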
@lightweight, as shown above, is compatible only with @transform_pandas or with @transform pipelines that rely solely on the .pandas() method.
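To illustrate the second case, the sketch below shows the shape a @transform-style compute function might take when it reads its input only through .pandas(). It is not taken from the product documentation: FakeInput and FakeOutput are hypothetical stand-ins so the function can run locally, and the use of write_pandas on the output is an assumption about the output API. In a repository, compute would instead carry the @lightweight and @transform decorators.

```python
import pandas as pd


class FakeInput:
    """Hypothetical stand-in for a transform input; only .pandas() is modelled."""

    def __init__(self, df: pd.DataFrame):
        self._df = df

    def pandas(self) -> pd.DataFrame:
        return self._df


class FakeOutput:
    """Hypothetical stand-in for a transform output; records the written frame."""

    def __init__(self):
        self.written = None

    def write_pandas(self, df: pd.DataFrame) -> None:
        # Assumption: the real output exposes a write_pandas method.
        self.written = df


# In a repository this function would be decorated with
# @lightweight and @transform(output=Output(...), dataset=Input(...)).
def compute(output, dataset):
    df = dataset.pandas()  # the only input accessor this function relies on
    output.write_pandas(
        df[df['Name'].str.startswith("A")]
        .loc[:, ['Name', 'Age']]
        .sort_values(by="Age")
    )


out = FakeOutput()
compute(out, FakeInput(pd.DataFrame({'Name': ['Ann', 'Bo'], 'Age': [30, 25]})))
print(out.written)
```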
Next, we can enable our transform to use Polars for improved scalability.
```python
import polars as pl
from transforms.api import transform, Input, Output, lightweight

@lightweight
@transform(  # we've gone from @transform_pandas to @transform
    output=Output('/Project/folder/output'),
    dataset=Input('/Project/folder/input'),
)
def compute(output, dataset):
    output.write_table(dataset
        .polars()
        .filter(pl.col('Name').str.starts_with('A'))
        .select(['Name', 'Age'])
        .sort(by='Age')
    )
```
This pipeline now uses all your available CPU cores and benefits from Polars' query optimization engine, which can eliminate unnecessary operations and choose more efficient algorithms for executing your operations than pandas.
To learn more about lightweight transforms, continue to the lightweight transforms APIs documentation.