This content is also available at learn.palantir.com and is presented here for accessibility purposes.
Repeat the steps from the previous tutorial for your passengers_preprocessed.py transform.
Create a new Python file in your /preprocessed folder called passengers_preprocessed.py.
Open your passengers_preprocessed.py file and replace the default code with the code below.
from transforms.api import transform, Input, Output
from transforms.verbs.dataframes import sanitize_schema_for_parquet


@transform(
    parsed_output=Output("/${space}/Temporary Training Artifacts/${yourName}/Data Engineering Tutorials/Datasource Project: Passengers/datasets/preprocessed/passengers_preprocessed"),
    raw_file_input=Input("${passengers_json_raw_RID}"),
)
def read_json(ctx, parsed_output, raw_file_input):
    # Create a variable for the filesystem of the input dataset
    filesystem = raw_file_input.filesystem()

    # Create a variable for the hadoop path of the files in the input dataset
    hadoop_path = filesystem.hadoop_path

    # Create an array of the absolute path of each file in the input dataset
    paths = [f"{hadoop_path}/{f.path}" for f in filesystem.ls()]

    # Create a Spark dataframe from all of the JSON files in the input dataset
    df = ctx.spark_session.read.json(paths)

    # Write the dataframe to the output dataset, using the sanitize_schema_for_parquet function
    # to make sure that the column names don't contain any special characters that would break the
    # output parquet file
    parsed_output.write_dataframe(sanitize_schema_for_parquet(df))
Replace the following:
${space} on line 6 with your space.
${yourName} on line 6 with your Tutorial Practice Artifacts folder name.
${passengers_json_raw_RID} on line 7 with the RID of the passengers_json_raw dataset, which is the output defined in passengers_raw.py.
Using the Preview button in the top right, test your code to ensure the output appears as a dataset rather than a raw file.
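If the preview does not behave as expected, it can help to check what the paths list in the transform actually contains. The sketch below reproduces that list comprehension outside Foundry. FakeFileStatus, FakeFileSystem, the root path, and the file names are hypothetical stand-ins that mimic only the two members the tutorial code uses (hadoop_path and ls()); they are not part of the transforms API.

```python
from dataclasses import dataclass


# Hypothetical stand-in for the file entries returned by filesystem.ls().
# Only the `path` attribute used in the tutorial code is modelled.
@dataclass
class FakeFileStatus:
    path: str  # file path relative to the dataset root


# Hypothetical stand-in for the object returned by raw_file_input.filesystem().
class FakeFileSystem:
    def __init__(self, hadoop_path, relative_paths):
        self.hadoop_path = hadoop_path
        self._relative_paths = relative_paths

    def ls(self):
        # One entry per file in the (pretend) dataset
        return [FakeFileStatus(p) for p in self._relative_paths]


def absolute_paths(filesystem):
    # Same list comprehension as in the tutorial transform
    return [f"{filesystem.hadoop_path}/{f.path}" for f in filesystem.ls()]


# Made-up root and file names, for illustration only
fs = FakeFileSystem("/example/root", ["passengers_1.json", "passengers_2.json"])
paths = absolute_paths(fs)
print(paths)
```

Each entry joins the dataset root with a file's relative path, so the Spark reader in the transform receives one absolute path per raw file.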
If your testing works as expected, commit your code with a meaningful message, e.g., “feature: add passengers_preprocessed”.
Build your passengers_preprocessed.py code on your feature branch and confirm the output dataset contains a two-column mapping of passengers to flight alerts.
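When checking the output columns, keep in mind that sanitize_schema_for_parquet may have altered column names that contained special characters. The sketch below illustrates the general idea only and is not the library's implementation: sketch_sanitize is a hypothetical helper, the underscore replacement is an assumption, and the character set is the one Spark has traditionally rejected in Parquet column names. The sample column names are made up.

```python
import re

# Characters Spark has traditionally rejected in Parquet column names:
# space , ; { } ( ) newline tab =
# Assumption: the real sanitize_schema_for_parquet may handle these differently.
_INVALID_CHARS = re.compile(r"[ ,;{}()\n\t=]")


def sketch_sanitize(name: str) -> str:
    """Hypothetical helper: replace each problematic character with an underscore."""
    return _INVALID_CHARS.sub("_", name)


# Made-up column names, for illustration only
columns = ["passenger id", "flight(alert)", "name"]
sanitized = [sketch_sanitize(c) for c in columns]
print(sanitized)
```

Names without special characters pass through unchanged, so a clean input schema should look the same before and after this step.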
Consider replacing the input and output paths in the transform with the associated RIDs.
Once the builds have completed successfully, merge your feature branch into Master using the PR process described in previous tutorials.
Finally, build both preprocessed outputs on the Master branch.