This content is also available at learn.palantir.com and is presented here for accessibility purposes.
Among other operations, the code in flight_alerts_clean.py creates two new columns (assignee and comment) to be used in operational downstream workflows (e.g., we may want to assign flight alerts to an operator for action and comment). Learning PySpark is beyond the scope of this track, and there are many online resources for learning code-based data transformation. Let's briefly review the code in flight_alerts_clean.py, however, to illustrate some basic best practices and as a springboard to the PySpark resources available in the Foundry documentation. The text below covers some (but not all) of the code's operationally interesting functions.
When links to documentation are provided in this task, select them and read the sections indicated.
DataFrames are Spark concepts we can only address superficially at this point in the training. In short, a DataFrame is a collection of rows under named columns. Line 14 of the code, for example, creates a DataFrame called priority_mapping composed of aliased columns from priority_mapping_preprocessed. Once that DataFrame is defined, it can be referenced by name later in the transform.
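To make the pattern concrete, here is a minimal sketch of building a DataFrame from aliased columns; the column names are hypothetical, not the ones in the actual file:

from pyspark.sql import functions as F

# Select two columns from the preprocessed dataset and give them new names (aliases).
priority_mapping = priority_mapping_preprocessed.select(
    F.col('value').alias('priority'),
    F.col('mapped_value').alias('priority_description'),
)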
Recall that the import statement brings in the functions module from pyspark.sql (referenced as F). In several places, the code uses F.col to reference specific columns in the DataFrames. Read more about column references on this documentation page. That article also covers the use of withColumn for adding arbitrary columns to your DataFrame.
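For illustration, one plausible way to add the assignee and comment columns with withColumn is sketched below; this is a pattern sketch with placeholder values, not the exact code from flight_alerts_clean.py:

# Add empty string-typed columns that operators can populate downstream.
flight_alerts = flight_alerts.withColumn('assignee', F.lit(None).cast('string'))
flight_alerts = flight_alerts.withColumn('comment', F.lit(None).cast('string'))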
In PySpark, select operates in largely the same way as it does in SQL: it chooses specific columns from one DataFrame to include in another. If DataFrame_A has ten columns but I only need three of them, I can use .select() to create DataFrame_B:

DataFrame_B = DataFrame_A.select('col_1', 'col_2', 'col_3')
As noted in this documentation entry, using .select() to specify your dataset schema is a best practice you should follow whenever possible.
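To show where that advice lands in practice, here is a hedged sketch of a Foundry Python transform that ends with a schema-defining select; it assumes the standard transforms.api decorator, and the dataset paths and column names are hypothetical:

from transforms.api import transform_df, Input, Output


@transform_df(
    Output('/Training/flight_alerts_clean'),         # hypothetical output path
    source_df=Input('/Training/flight_alerts_raw'),  # hypothetical input path
)
def compute(source_df):
    # Ending with an explicit select pins the output schema to exactly these columns.
    return source_df.select('alert_display_name', 'priority', 'status', 'assignee', 'comment')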
Joins in PySpark combine DataFrames in much the same way that joins in SQL combine tables. Review our documentation on PySpark joins. Then read this section, which highlights some anti-patterns and suggested best practices for keeping your join behavior efficient.
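As one example of those best practices, selecting only the columns you need before joining keeps the join small; the sketch below uses hypothetical DataFrame and column names:

# Trim the mapping DataFrame to just the join key and the value we want to bring over.
mapping = priority_mapping.select('priority', 'priority_description')

# Left join onto the alerts so unmatched rows are preserved.
flight_alerts_clean = flight_alerts.join(mapping, on='priority', how='left')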