
4 - Brief Code Review

This content is also available at learn.palantir.com and is presented here for accessibility purposes.

The code in flight_alerts_clean.py does the following:

  1. Takes the two preprocessed mapping datasets and temporarily renames their columns.
  2. Joins the preprocessed flight alerts dataset to the two mapping tables to map the status and priority codes from numbers to text strings.
  3. Selects columns for the output dataset.
  4. Adds two empty string columns (assignee and comment) to be used in operational downstream workflows (e.g., we may want to assign flight alerts to an operator for action and comment).

Learning PySpark is beyond the scope of this track; there are many online resources for learning code-based data transformation. Still, let's briefly review the code in flight_alerts_clean.py to illustrate some basic best practices and as a springboard to the PySpark resources available in the Foundry documentation. The text below extracts some (but not all) of the code's operationally interesting functions.
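
To make the discussion concrete, here is a minimal sketch of what a transform like this might look like. The dataset paths, column names, and join keys below are assumptions for illustration only; they are not taken from the actual flight_alerts_clean.py.

    from pyspark.sql import functions as F
    from transforms.api import transform_df, Input, Output

    # Illustrative sketch only; all dataset paths and column names are assumed.
    @transform_df(
        Output("/Example/datasets/flight_alerts_clean"),
        alerts=Input("/Example/datasets/flight_alerts_preprocessed"),
        priority_preprocessed=Input("/Example/datasets/priority_mapping_preprocessed"),
        status_preprocessed=Input("/Example/datasets/status_mapping_preprocessed"),
    )
    def compute(alerts, priority_preprocessed, status_preprocessed):
        # 1. Temporarily rename the mapping columns so the joins are unambiguous.
        priority_mapping = priority_preprocessed.select(
            F.col("value").alias("priority_code"),
            F.col("mapped_value").alias("priority"),
        )
        status_mapping = status_preprocessed.select(
            F.col("value").alias("status_code"),
            F.col("mapped_value").alias("status"),
        )

        # 2. Join the mapping tables to translate numeric codes into text strings.
        alerts = alerts.join(priority_mapping, on="priority_code", how="left")
        alerts = alerts.join(status_mapping, on="status_code", how="left")

        # 3. Select only the columns needed in the output dataset.
        alerts = alerts.select("alert_display_name", "flight_id", "priority", "status")

        # 4. Add empty string columns for downstream operational workflows.
        return alerts.withColumn("assignee", F.lit("")).withColumn("comment", F.lit(""))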

When links to documentation are provided in this task, select them and read the sections indicated.

DataFrame

DataFrames are Spark concepts we can only address superficially at this point in the training. In short, a DataFrame is a collection of rows organized under named columns. Line 14 of the code, for example, creates a DataFrame called priority_mapping composed of aliased columns from priority_mapping_preprocessed. Once this DataFrame is defined, it can be referenced by name later in the transform.
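
For example, a DataFrame of aliased columns could be defined like this (the source column names here are assumed for illustration):

    # Minimal sketch: build a small mapping DataFrame with renamed (aliased) columns.
    priority_mapping = priority_mapping_preprocessed.select(
        F.col("value").alias("priority_code"),
        F.col("mapped_value").alias("priority"),
    )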

F.col() & .withColumn()

Recall that the import statement brings in the functions module from pyspark.sql, aliased as F. In several places, the code uses F.col() to reference specific columns in the DataFrames. Read more about column references on this documentation page. That article also covers the use of .withColumn() for adding arbitrary columns to your DataFrame.
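
For instance, assuming the column names shown here, F.col() references an existing column while .withColumn() adds (or replaces) one:

    from pyspark.sql import functions as F

    # Reference an existing column by name and recast it in place.
    df = df.withColumn("priority_code", F.col("priority_code").cast("string"))

    # Add a new, constant-valued column.
    df = df.withColumn("assignee", F.lit(""))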

.select()

In PySpark, select operates in largely the same way as it does in SQL: it chooses specific columns from one DataFrame to include in another. If DataFrame_A has ten columns but I only need three of them, I can use .select() to create DataFrame_B: DataFrame_B = DataFrame_A.select("col_1", "col_2", "col_3")

As noted in this documentation entry, using .select() to specify your dataset schema explicitly is a best practice you should follow whenever possible.
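
For example, ending a transform with an explicit .select() (column names assumed here) makes the output schema deliberate rather than incidental:

    # Explicitly declare the output columns so the dataset schema is intentional.
    output_df = joined_df.select(
        "alert_display_name",
        "flight_id",
        "priority",
        "status",
    )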

.join()

Joins in PySpark combine DataFrames in much the same way that joins in SQL combine tables. Review our documentation on PySpark joins. Then read this section, which highlights some anti-patterns and suggested best practices for keeping your join behavior efficient.
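
As a quick illustration (key and column names assumed), trimming a mapping table to just the join key and the column you need before joining keeps the join lean:

    # Keep only the join key and the mapped value, then left-join onto the alerts.
    priority_mapping = priority_mapping.select("priority_code", "priority")
    alerts = alerts.join(priority_mapping, on="priority_code", how="left")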