5A. [Repositories] Working with Raw Files in Code Repositories

8 - Preprocess Your Data, Part 2

This content is also available at learn.palantir.com and is presented here for accessibility purposes.

📖 Task Introduction

Repeat the steps from the previous tutorial for your passengers_preprocessed.py transform.

🔨 Task Instructions

  1. Create a new Python file in your /preprocessed folder called passengers_preprocessed.py.

  2. Open your passengers_preprocessed.py file and replace the default code with the code below.

    from transforms.api import transform, Input, Output
    from transforms.verbs.dataframes import sanitize_schema_for_parquet
    
    
    @transform(
        parsed_output=Output("/${space}/Temporary Training Artifacts/${yourName}/Data Engineering Tutorials/Datasource Project: Passengers/datasets/preprocessed/passengers_preprocessed"),
        raw_file_input=Input("${passengers_json_raw_RID}"),
    )
    def read_json(ctx, parsed_output, raw_file_input):
    
        # Create a variable for the filesystem of the input dataset
        filesystem = raw_file_input.filesystem()
    
        # Create a variable for the hadoop path of the files in the input dataset
        hadoop_path = filesystem.hadoop_path
    
        # Create an array of the absolute path of each file in the input dataset
        paths = [f"{hadoop_path}/{f.path}" for f in filesystem.ls()]
    
        # Create a Spark dataframe from all of the JSON files in the input dataset
        df = ctx.spark_session.read.json(paths)
    
        # Write the dataframe to the output dataset, using the sanitize_schema_for_parquet function
        # to make sure that the column names don't contain any special characters that would break the
        # output Parquet file
        parsed_output.write_dataframe(sanitize_schema_for_parquet(df))
    
  3. Replace the following:

    • Replace the ${space} on line 6 with your space.
    • Replace the ${yourName} on line 6 with your Tutorial Practice Artifacts folder name.
    • Replace ${passengers_json_raw_RID} on line 7 with the RID of the passengers_json_raw dataset, which is the output defined in passengers_raw.py.

  4. Using the Preview button in the top right, test your code to ensure the output appears as a dataset rather than a raw file.

  5. If your testing works as expected, commit your code with a meaningful message, e.g., “feature: add passengers_preprocessed”.

  6. Build your passengers_preprocessed.py code on your feature branch and confirm the output dataset contains a two-column mapping of passengers to flight alerts.

  7. Consider replacing the input and output paths in your transform with their associated RIDs.

  8. Once the builds have successfully completed, merge your feature branch into Master using the PR process described in previous tutorials.

  9. Finally, build both preprocessed outputs on the Master branch.
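
To see what the transform in step 2 is doing without running it in Foundry, the sketch below reproduces two of its steps in plain Python: building an absolute path for each file, and cleaning column names before a Parquet write. The FileStatus tuple, the example root path, the file names and the column names are all hypothetical. The sanitize_column_name helper is an illustrative stand-in, not the implementation of sanitize_schema_for_parquet. It only assumes that Spark's Parquet writer rejects the characters listed in the pattern.

```python
import re
from collections import namedtuple

# Hypothetical stand-in for the entries returned by filesystem.ls();
# only the .path attribute used by the transform is modeled here.
FileStatus = namedtuple("FileStatus", ["path"])


def build_paths(hadoop_path, files):
    """Join the dataset root with each file's relative path, as the transform does."""
    return [f"{hadoop_path}/{f.path}" for f in files]


# Characters assumed to be rejected in Parquet column names: space , ; { } ( ) newline tab =
_INVALID_CHARS = re.compile(r"[ ,;{}()\n\t=]")


def sanitize_column_name(name):
    """Illustrative only: replace each rejected character with an underscore."""
    return _INVALID_CHARS.sub("_", name)


# Hypothetical file listing and root path, for illustration only.
files = [FileStatus("passengers_1.json"), FileStatus("passengers_2.json")]
paths = build_paths("hdfs://example/dataset", files)

# Hypothetical column names containing characters that need replacing.
columns = [sanitize_column_name(c) for c in ["passenger id", "flight(alert)"]]
```

In the real transform, the list of paths is passed to ctx.spark_session.read.json, and the sanitizing is handled by sanitize_schema_for_parquet on the whole dataframe rather than name by name.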