Getting started

Tip

The instructions below step through a simple Python data transformation. If you are just getting started with data transformation, consider going through the batch pipeline tutorial for Pipeline Builder or Code Repositories first.

This tutorial walks through how you can use Transforms Python to transform a spreadsheet of recent meteorite discoveries to a usable dataset ready for analysis.

About the dataset

This tutorial uses data from NASA’s Open Data Portal ↗. You can follow along on your own Code Repository with this sample dataset:

Download meteorite_landings

This dataset contains data about meteorites that have been found on Earth. Note that the data has been cleaned to make it easier to work with.

The dataset includes name, mass, classification, and other identifying information for each meteorite, along with the year it was discovered and coordinates of where it was found. It’s good practice to open the CSV to review the data before uploading it into Foundry.
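If you would like a quick look at the data before uploading, the short pandas sketch below prints the columns and a few sample rows. The filename meteorite_landings.csv is an assumption based on the download link above; adjust it to match your downloaded file.

```python
import pandas as pd

# Load the downloaded CSV; adjust the filename to match your download
df = pd.read_csv("meteorite_landings.csv")

# List the columns, then preview the first few rows
print(df.columns.tolist())
print(df.head())
```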

Set up a Python code repository

Get started by creating a Python code repository.

  1. Navigate to a Project, and select + New > Code repository.
  2. In the Repository type section, select Data Transforms.
  3. Select Python as the Language template.
  4. Choose to Initialize repository.

Use a local Python repository

Alternatively, you can copy your local Python repository into Code Repositories with the following steps:

  1. Create a new Python code repository, as described above.
  2. In your local repository, remove the previous Git origin (if you cloned it from GitHub, for example): git remote remove origin
  3. Add your code repository's Git remote URL: git remote add origin <repository_url>

You can find your code repository URL in the top right corner of the Code Repositories interface. Select the green Clone button, then copy the Git remote URL.

Confirm the new remote by running git remote -v, which should return the code repository URL.

  4. Merge the current master branch (or another branch of your choosing) in Code Repositories into your local branch: git merge master

If an error occurs about refusing to merge unrelated histories, run the command: git merge master --allow-unrelated-histories. This tells Git to merge the two branches even though they do not share a common commit history, combining your local commits with those from the Code Repositories remote.

This merge will bring essential files to your local repository that are required to make commits and changes in Code Repositories.

  5. Create a new branch and name it (testbranch, for example): git checkout -b testbranch.
  6. Make your changes and commit them to your branch.
  7. Run git push, and confirm that the new branch appears in the Code Repositories interface. Verify that checks are successful.

Learn more about local development in Code Repositories.

Write a Python data transformation

Navigate to your Transforms Python repository. The default examples.py file contains example code to help you get started. Start by creating a new file in src/myproject/datasets, and call it meteor_analysis.py to organize your analysis. Make sure you import the required functions and classes. Define a transformation that takes your meteorite_landings dataset as input and creates meteorite_landings_cleaned as its output:

```python
from transforms.api import transform_df, Input, Output
from pyspark.sql import functions as F


@transform_df(
    # replace this with your output dataset path
    Output("/Users/jsmith/meteorite_landings_cleaned"),
    # replace this with your input dataset path
    meteorite_landings=Input("/Users/jsmith/meteorite_landings"),
)
def clean(meteorite_landings):
    <your data transformation logic>
```

Now, suppose you want to filter your input dataset down to “Valid” meteorites discovered in or after 1950. Update your data transformation logic to filter the meteorites by nametype and year:

```python
def clean(meteorite_landings):
    return meteorite_landings.filter(
        meteorite_landings.nametype == 'Valid'
    ).filter(
        meteorite_landings.year >= 1950
    )
```
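If you want to sanity-check this filter logic outside Foundry, the sketch below runs the same two filters against a tiny hand-made DataFrame in a local Spark session. The sample rows and the local session setup are illustrative assumptions; inside Foundry, Transforms provides the Spark session for you.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

# A tiny, made-up stand-in for the meteorite_landings dataset
sample = spark.createDataFrame(
    [("Sample A", "Valid", 1880), ("Sample B", "Valid", 1982)],
    ["name", "nametype", "year"],
)

# Apply the same filter logic as the clean transform above
result = sample.filter(sample.nametype == 'Valid').filter(sample.year >= 1950)
result.show()  # only the 1982 row should remain
```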

Build your output dataset

To build your resulting dataset, commit your changes and select Build in the top right corner. For more information about building datasets in Code Repositories, review the Create a simple batch pipeline tutorial.

Add to your data transformation

With Python Transforms, you can create multiple output datasets in a single Python file.

Let’s say you want to filter down even further to only meteors that were particularly large for their meteorite type. To do so, you will need to:

  1. Find the average mass for each meteorite type, and
  2. Compare each meteor’s mass to the average mass for its meteor type.

First, add a data transformation to meteor_analysis.py that finds the average mass for each meteorite type. This transformation takes your meteorite_landings dataset as input and creates meteorite_stats as its output:

```python
@transform_df(
    # output dataset name must be unique
    Output("/Users/jsmith/meteorite_stats"),
    meteorite_landings=Input("/Users/jsmith/meteorite_landings"),
)
def stats(meteorite_landings):
    return meteorite_landings.groupBy("class").agg(
        F.mean("mass").alias("avg_mass_per_class")
    )
```

Next, create a data transformation that compares each meteor’s mass to the average mass for its meteor type. The information needed for this transformation is spread across the meteorite_landings and meteorite_stats datasets used so far in this tutorial. You must join the two datasets together and then filter to find meteorites that have a greater-than-average mass:

```python
# this data transformation is based on two input datasets
@transform_df(
    Output("/Users/jsmith/meteorite_enriched"),
    meteorite_landings=Input("/Users/jsmith/meteorite_landings"),
    meteorite_stats=Input("/Users/jsmith/meteorite_stats")
)
def enriched(meteorite_landings, meteorite_stats):
    enriched_together = meteorite_landings.join(
        meteorite_stats, "class"
    )
    greater_mass = enriched_together.withColumn(
        'greater_mass',
        (enriched_together.mass > enriched_together.avg_mass_per_class)
    )
    return greater_mass.filter("greater_mass")
```

Now, you can further analyze your resulting meteorite_enriched dataset by exploring it in Contour.

Apply your data transformation to multiple inputs

So far, you’ve created a dataset that contains all types of meteorites that have a greater-than-average mass. Let’s say you want to create separate datasets for each meteorite type. With Transforms Python, you can use a for loop to apply the same data transformation to each type of meteorite. For more information about applying the same data transformation to different inputs, refer to the section on Transform generation.

Create a new file in src/myproject/datasets and call it meteor_class.py. Note that you can continue writing your data transformation code in the meteor_analysis.py file, but this tutorial uses a new file to separate the data transformation logic.

To create separate datasets for each meteorite type, you will filter the meteorite_enriched dataset by class. Define a transform_generator function that applies this same data transformation logic to each meteorite type you want to analyze:

```python
from transforms.api import transform_df, Input, Output


def transform_generator(sources):
    transforms = []

    for source in sources:
        @transform_df(
            # this will create a different output dataset for each meteorite type
            Output('/Users/jsmith/meteorite_{source}'.format(source=source)),
            my_input=Input('/Users/jsmith/meteorite_enriched')
        )
        # "source=source" captures the value of the source variable in the scope of this function
        def filter_by_source(my_input, source=source):
            return my_input.filter(my_input["class"] == source)

        transforms.append(filter_by_source)

    return transforms


# this will apply the data transformation logic from above to the three provided meteorite types
TRANSFORMS = transform_generator(["L6", "H5", "H4"])
```

This creates a separate transformation for each meteorite type, each one filtering the enriched dataset by class. Note that you must pass source=source into the filter_by_source function in order to capture the source parameter in the function’s scope.
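The default argument is needed because Python closures bind loop variables late. The standalone sketch below (independent of Transforms) shows the difference:

```python
# Without a default argument, every closure reads the loop variable when it
# is called, so all three functions see its final value:
funcs = [lambda: source for source in ["L6", "H5", "H4"]]
print([f() for f in funcs])  # ['H4', 'H4', 'H4']

# Binding the current value as a default argument captures it at definition
# time, which is exactly what source=source does in the transform above:
funcs = [lambda source=source: source for source in ["L6", "H5", "H4"]]
print([f() for f in funcs])  # ['L6', 'H5', 'H4']
```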

Tip

For the initial data transformation created in the meteor_analysis.py file, you did not have to do any additional configuration to add Transforms to your project’s Pipeline. This is because the default Python project structure uses automatic registration to discover all Transform objects within your datasets folder.

To add this final transformation to your project’s Pipeline using automatic registration, you must collect the generated transforms into a list variable. In the example above, we used the variable TRANSFORMS. For more information about automatic registration and transform generators, refer to the section on Transforms generation in the Transforms Python documentation.
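For reference, automatic registration is typically wired up in the project’s pipeline.py file, which discovers every Transform object in the datasets package, including transforms collected in list variables such as TRANSFORMS. The sketch below assumes the default myproject package layout; your project name may differ.

```python
from transforms.api import Pipeline

from myproject import datasets

# Discover all Transform objects (and lists of them) defined in the
# datasets package and register them with the project's pipeline
my_pipeline = Pipeline()
my_pipeline.discover_transforms(datasets)
```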