13 - Create clean “passengers” output datasets, part 2

This content is also available at learn.palantir.com and is presented here for accessibility purposes.

🔨 Task Instructions

Add a new Python file in the new clean folder called passengers_clean.py and replace the default content with the code block below.

from transforms.api import transform_df, Input, Output
from pyspark.sql import functions as F
from cleaning_functions import type_utils as types, cleaning_utils as clean


@transform_df(
    Output("/${space}/Temporary Training Artifacts/${yourName}/Data Engineering Tutorials/Datasource Project: Passengers/data/clean/passengers_clean"),
    source_df=Input("${passengers_preprocessed_RID}"),
)
def compute(source_df):

    # define string columns to be normalized
    normalize_string_columns = [
        'first_name',
        'last_name',
        'flyer_status',
    ]

    # define columns to be cast to dates
    cast_date_columns = [
        'dob',
    ]

    # cast columns to appropriate types using functions from our utils files
    typed_df = types.cast_to_date(source_df, cast_date_columns, "MM/dd/yy")

    # normalize strings and column names using functions from our utils files
    normalized_df = clean.normalize_strings(typed_df, normalize_string_columns)
    normalized_df = clean.normalize_column_names(normalized_df)

    # select columns in the order we want and rename where appropriate
    normalized_df = normalized_df.select(
        'passenger_id',
        'first_name',
        'last_name',
        'country',
        F.col('dob').alias('date_of_birth'),
        'flyer_status',
    )

    return normalized_df

Replace the following:
- Replace ${space} with your space.
- Replace ${yourName} with your /Tutorial Practice Artifacts folder name.
- Replace ${passengers_preprocessed_RID} with the RID of the passengers_preprocessed dataset, which is the output defined in passengers_preprocessed.py.
  
  ℹ️ If Code Assist is underlining your import statements in red, open your repository’s meta.yml file and select the Refresh Code Assist dependencies link at the top of the editor.
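
To make the substitution concrete, here is a small sketch that fills the two path placeholders using Python's string.Template, which happens to share the ${name} syntax. The values MySpace and jdoe are hypothetical stand-ins for your own space and folder name. In practice you edit the string directly in passengers_clean.py; the sketch only shows the shape the finished Output path should have.

```python
from string import Template

# Output path from passengers_clean.py, with its two placeholders intact.
OUTPUT_PATH = Template(
    "/${space}/Temporary Training Artifacts/${yourName}/Data Engineering Tutorials/"
    "Datasource Project: Passengers/data/clean/passengers_clean"
)

# Hypothetical values, for illustration only: substitute your own.
filled_path = OUTPUT_PATH.substitute(space="MySpace", yourName="jdoe")
print(filled_path)
```

Once the placeholders are replaced, no `${...}` markers should remain anywhere in the Output or Input arguments.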
Use the Preview button to see a sampled output of your transform.

ℹ️ Remember that in the Preview helper, you can quickly toggle between a sample of the input and a sample of the output to compare results by clicking alternately between the INPUTS and OUTPUTS blocks on the left side of the helper.
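
When comparing INPUTS and OUTPUTS, it may help to know roughly what a single row should look like before and after cleaning. The sketch below imitates passengers_clean.py on one row in plain Python, without Spark. It rests on assumptions: that normalize_strings trims whitespace and lowercases values, and that cast_to_date parses strings in the given "MM/dd/yy" pattern. The real helpers in cleaning_functions may behave differently, and the sample row is invented.

```python
from datetime import datetime


def clean_passenger_row(row):
    """Imitate passengers_clean.py on one dict-shaped row (illustrative only)."""
    cleaned = dict(row)

    # Assumption: normalize_strings collapses whitespace and lowercases values.
    for name in ("first_name", "last_name", "flyer_status"):
        cleaned[name] = " ".join(cleaned[name].split()).lower()

    # Assumption: cast_to_date parses "MM/dd/yy" strings. "%m/%d/%y" is the
    # closest Python pattern; two-digit-year handling can differ from Spark's.
    date_of_birth = datetime.strptime(cleaned["dob"], "%m/%d/%y").date()

    # Mirror the final select: fixed column order, dob renamed to date_of_birth.
    return {
        "passenger_id": cleaned["passenger_id"],
        "first_name": cleaned["first_name"],
        "last_name": cleaned["last_name"],
        "country": cleaned["country"],
        "date_of_birth": date_of_birth,
        "flyer_status": cleaned["flyer_status"],
    }


# Invented sample row, not taken from the tutorial data.
sample_input = {
    "passenger_id": "P-001",
    "first_name": "  Ada ",
    "last_name": "LOVELACE",
    "country": "GB",
    "dob": "12/10/15",
    "flyer_status": "Gold ",
}
sample_output = clean_passenger_row(sample_input)
print(sample_output)
```

In the Preview helper, the equivalent things to look for are a true date type on date_of_birth, consistently formatted name and status strings, and the column order given in the select.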

Create another new file in the /datasets folder called passenger_flight_alerts_clean.py and replace the default code with the code block below.

from transforms.api import transform_df, Input, Output
from cleaning_functions import cleaning_utils as clean


@transform_df(
    Output("/${space}/Temporary Training Artifacts/${yourName}/Data Engineering Tutorials/Datasource Project: Passengers/data/clean/passenger_flight_alerts_clean"),
    source_df=Input("${passenger_flight_alerts_preprocessed_RID}"),
)
def compute(source_df):

    # normalize column names using functions from our utils files
    normalized_df = clean.normalize_column_names(source_df)

    return normalized_df

Repeat steps 2 and 3 for your passenger_flight_alerts_clean.py transform file. For the ${passenger_flight_alerts_preprocessed_RID} placeholder, refer to the passenger_flight_alerts_preprocessed.py transform file in your repository to obtain the output RID.
Commit your new code to your branch with a meaningful message (e.g., “feature: add clean outputs”).
Build your clean datasets on your branch and confirm the data is normalized and formatted appropriately. You may want to compare, for example, the passengers_preprocessed dataset (on Master) with passengers_clean on your branch.
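
One way to make "normalized" checkable is to test column names against a naming rule. The sketch below assumes that normalized names are lowercase snake_case, which matches the columns selected in passengers_clean.py but is not confirmed as the exact rule applied by normalize_column_names. The helper non_normalized_columns and the second sample list are hypothetical.

```python
import re

# Assumed convention: lowercase letters and digits, separated by single underscores.
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


def non_normalized_columns(column_names):
    """Return the names that do not follow the assumed snake_case convention."""
    return [name for name in column_names if not SNAKE_CASE.fullmatch(name)]


# Column names taken from the select in passengers_clean.py.
clean_columns = [
    "passenger_id",
    "first_name",
    "last_name",
    "country",
    "date_of_birth",
    "flyer_status",
]

# Invented examples of names that would still need normalizing.
raw_style_columns = ["Passenger ID", "first_name", "Flyer-Status"]

print(non_normalized_columns(clean_columns))
print(non_normalized_columns(raw_style_columns))
```

The same function could be applied to the list of column names of a built dataset; an empty result would suggest the names follow the assumed convention.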
Consider replacing the Input paths with the RIDs using the Replace paths with RIDs link in your code files. Remember that this will necessitate a new commit, since it is a change to your code.
If your builds succeed on your branch, create a PR and merge your branch into Master.
Build your clean outputs on the Master branch.