The instructions below step through a simple Python data transformation. If you are just getting started with data transformation, consider going through the batch pipeline tutorial for Pipeline Builder or Code Repositories first.
This tutorial walks through how you can use Transforms Python to transform a spreadsheet of recent meteorite discoveries to a usable dataset ready for analysis.
This tutorial uses data from NASA’s Open Data Portal ↗. You can follow along on your own Code Repository with this sample dataset:
This dataset contains data about meteorites that have been found on Earth. Note that the data has been cleaned to make it easier to work with.
The dataset includes name, mass, classification, and other identifying information for each meteorite, along with the year it was discovered and coordinates of where it was found. It’s good practice to open the CSV to review the data before uploading it into Foundry.
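For a quick local preview, a few lines of pandas are enough. This is a minimal sketch, assuming pandas is installed on your machine; the filename is illustrative and depends on what you named the downloaded file:

```python
# Quick local preview of the sample data before uploading into Foundry.
# The filename is illustrative; use whatever you saved the download as.
import pandas as pd

df = pd.read_csv("meteorite_landings.csv")
print(df.head())    # first few rows: name, mass, class, year, coordinates
print(df.dtypes)    # confirm the column types look reasonable
```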
Get started by creating a Python code repository.
Alternatively, you can copy your local Python repository into Code Repositories with the following steps:

1. Replace your local repository's remote with your code repository:

   ```
   git remote remove origin
   git remote add origin <repository_url>
   ```

   You can find your code repository URL in the top right corner of the Code Repositories interface. Select the green Clone button, then copy the Git remote URL. Confirm this by running `git remote -v` to return the code repository URL.

2. Merge the `master` branch (or another branch of your choosing) in Code Repositories into your local branch: `git merge master`. If an error occurs about refusing to merge unrelated histories, run the command `git merge master --allow-unrelated-histories`. This will remove the current Git history associated with your previous remote GitHub repository. This merge will bring the essential files into your local repository that are required to make commits and changes in Code Repositories.

3. Create a new branch (called `testbranch`, for example): `git checkout -b testbranch`.

4. Push your changes with `git push`, and confirm that the new branch appears in the Code Repositories interface. Verify that checks are successful.

Learn more about local development in Code Repositories.
Navigate to your Transforms Python repository. The default `examples.py` file contains example code to help you get started.

Start by creating a new file in `src/myproject/datasets` and call it `meteor_analysis.py` to organize your analysis. Make sure you import the required functions and classes. Define a transformation that takes your `meteorite_landings` dataset as input and creates `meteorite_landings_cleaned` as its output:
```python
from transforms.api import transform_df, Input, Output
from pyspark.sql import functions as F


@transform_df(
    # replace this with your output dataset path
    Output("/Users/jsmith/meteorite_landings_cleaned"),
    # replace this with your input dataset path
    meteorite_landings=Input("/Users/jsmith/meteorite_landings"),
)
def clean(meteorite_landings):
    <your data transformation logic>
```
Now, suppose you want to filter your input dataset down to any "Valid" meteorites found after the year 1950. Update your data transformation logic to filter the meteorites by `nametype` and `year`:
```python
def clean(meteorite_landings):
    return meteorite_landings.filter(
        meteorite_landings.nametype == 'Valid'
    ).filter(
        meteorite_landings.year >= 1950
    )
```
To build your resulting dataset, commit your changes and select Build in the top right corner. For more information about building datasets in Code Repositories, review the Create a simple batch pipeline tutorial.
With Python Transforms, you can create multiple output datasets in a single Python file.
Let's say you want to filter down even further to only meteorites that were particularly large for their type. To do so, you will need to compute the average mass for each meteorite type and then compare each meteorite's mass against that average.
First, add a data transformation to `meteor_analysis.py` that finds the average mass for each meteorite type. This transformation takes your `meteorite_landings` dataset as input and creates `meteorite_stats` as its output:
```python
@transform_df(
    # output dataset name must be unique
    Output("/Users/jsmith/meteorite_stats"),
    meteorite_landings=Input("/Users/jsmith/meteorite_landings"),
)
def stats(meteorite_landings):
    return meteorite_landings.groupBy("class").agg(
        F.mean("mass").alias("avg_mass_per_class")
    )
```
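If the `groupBy`/`agg` pattern is new to you, you can sanity-check the logic outside Foundry with a tiny, made-up DataFrame. This sketch is not part of the repository and assumes a local pyspark installation:

```python
# Standalone sketch (assumes pyspark installed locally); values are made up.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[1]").getOrCreate()
df = spark.createDataFrame(
    [("L6", 21.0), ("L6", 43.0), ("H5", 10.0)],
    ["class", "mass"],
)
# Produces one row per class: L6 -> 32.0, H5 -> 10.0
df.groupBy("class").agg(F.mean("mass").alias("avg_mass_per_class")).show()
```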
Next, create a data transformation that compares each meteorite's mass to the average mass for its type. The information needed for this transformation is spread across the `meteorite_landings` and `meteorite_stats` tables that you've created so far in this tutorial. You must join the two datasets together and then filter to find meteorites that have a greater-than-average mass:
```python
# this data transformation is based on two input datasets
@transform_df(
    Output("/Users/jsmith/meteorite_enriched"),
    meteorite_landings=Input("/Users/jsmith/meteorite_landings"),
    meteorite_stats=Input("/Users/jsmith/meteorite_stats")
)
def enriched(meteorite_landings, meteorite_stats):
    enriched_together = meteorite_landings.join(
        meteorite_stats, "class"
    )
    greater_mass = enriched_together.withColumn(
        'greater_mass',
        (enriched_together.mass > enriched_together.avg_mass_per_class)
    )
    return greater_mass.filter("greater_mass")
```
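As a design note, the same greater-than-average filter could be expressed without a separate stats dataset by using a window function. The following is an alternative sketch for comparison, not the tutorial's approach:

```python
# Alternative sketch using a window function (not the tutorial's approach):
# compute each class's average in place instead of joining a stats dataset.
from pyspark.sql import Window
from pyspark.sql import functions as F


def enriched_with_window(meteorite_landings):
    class_window = Window.partitionBy("class")
    with_avg = meteorite_landings.withColumn(
        "avg_mass_per_class", F.mean("mass").over(class_window)
    )
    return with_avg.filter(F.col("mass") > F.col("avg_mass_per_class"))
```

The two-dataset approach used in this tutorial has the advantage that `meteorite_stats` is materialized as its own reusable dataset.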
Now, you can further analyze your resulting `meteorite_enriched` dataset by exploring it in Contour.
So far, you’ve created a dataset that contains all types of meteorites that have a greater-than-average mass. Let’s say you want to create separate datasets for each meteorite type. With Transforms Python, you can use a for loop to apply the same data transformation to each type of meteorite. For more information about applying the same data transformation to different inputs, refer to the section on Transform generation.
Create a new file in `src/myproject/datasets` and call it `meteor_class.py`. Note that you can continue writing your data transformation code in the `meteor_analysis.py` file, but this tutorial uses a new file to separate the data transformation logic.
To create separate datasets for each meteorite type, you will filter the `meteorite_enriched` dataset by class. Define a `transform_generator` function that applies this same data transformation logic to each meteorite type you want to analyze:
```python
from transforms.api import transform_df, Input, Output


def transform_generator(sources):
    transforms = []
    for source in sources:
        @transform_df(
            # this will create a different output dataset for each meteorite type
            Output('/Users/jsmith/meteorite_{source}'.format(source=source)),
            my_input=Input('/Users/jsmith/meteorite_enriched')
        )
        # "source=source" captures the value of the source variable in the scope of this function
        def filter_by_source(my_input, source=source):
            return my_input.filter(my_input["class"] == source)

        transforms.append(filter_by_source)
    return transforms


# this will apply the data transformation logic from above to the three provided meteorite types
TRANSFORMS = transform_generator(["L6", "H5", "H4"])
```
This will create a transformation that filters our meteorite dataset by class. Note that we must pass `source=source` into the `filter_by_source` function in order to capture the `source` parameter in the function's scope. Without that default argument, Python's late-binding closures would cause every generated transform to use the final value of `source` from the loop.
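If the `source=source` idiom is unfamiliar, this standalone Python snippet (unrelated to Foundry) shows the difference between late-binding closures and default-argument capture:

```python
# Closures look up loop variables when called, not when defined, so every
# function in fns_late ends up seeing the last value of s.
fns_late = [lambda: s for s in ["L6", "H5", "H4"]]
fns_bound = [lambda s=s: s for s in ["L6", "H5", "H4"]]

print([f() for f in fns_late])   # ['H4', 'H4', 'H4']
print([f() for f in fns_bound])  # ['L6', 'H5', 'H4']
```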
For the initial data transformation created in the `meteor_analysis.py` file, you did not have to do any additional configuration to add Transforms to your project's Pipeline. This is because the default Python project structure uses automatic registration to discover all Transform objects within your `datasets` folder.
To also add this final transformation to your project's Pipeline using automatic registration, you must assign the generated transforms to a variable as a list. In the example above, we used the variable `TRANSFORMS`. For more information about automatic registration and transform generators, refer to the section on Transforms generation in the Transforms Python documentation.
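For contrast, manual registration typically means constructing a `Pipeline` object and adding the transforms yourself. The following is a hedged sketch only; the module path is an assumption about your project layout, and the exact `Pipeline` usage should be verified against the Transforms Python documentation:

```python
# Hypothetical pipeline.py sketch for manual registration. The import path
# below is an assumption about your project layout - verify against the docs.
from transforms.api import Pipeline

from myproject.datasets import meteor_class

my_pipeline = Pipeline()
# unpack the generated list so each transform is registered individually
my_pipeline.add_transforms(*meteor_class.TRANSFORMS)
```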