This content is also available at learn.palantir.com ↗ and is presented here for accessibility purposes.
In this task, you’ll continue addressing data normalization measures, this time focusing on the schema, or column types. For this, you’ll rename and repurpose the utils.py file that the repository provided when it was created.
Right-click the utils.py file in the repository's Files panel and rename it to type_utils.py.
Delete the contents of the file in the code editor window (e.g., with Ctrl+A → Delete).
Copy the code block below, and paste it into the code editor.
Note how the functionality described in the code comments addresses the schema/type issues described above.
from pyspark.sql import functions as F
from pyspark.sql.types import StringType


def cast_to_string(df, string_columns):
    """
    This function takes a dataframe (df) and an array of columns as arguments.
    This function iterates through the list of columns in the dataframe and
    converts them to string types.
    """
    for colm in string_columns:
        df = df.withColumn(colm, F.col(colm).cast(StringType()))
    return df


def cast_to_date(df, string_columns, date_format):
    """
    This function takes a dataframe (df), an array of string columns, and a
    date format (string) as arguments.
    This function iterates through the list of string columns in the dataframe
    and converts them to date types based on the specified date format.
    Example date format: "MM-dd-yyyy"
    """
    for colm in string_columns:
        df = df.withColumn(colm, F.to_date(F.col(colm), date_format))
    return df
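For context, here is a minimal sketch of how these helpers could be called from other code in the repository. The column names and sample values (alert_id, start_date) are hypothetical placeholders for illustration, not part of the tutorial's datasets.

from pyspark.sql import SparkSession

from type_utils import cast_to_date, cast_to_string

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data: a numeric ID and a date stored as a string
df = spark.createDataFrame(
    [(1001, "03-15-2024"), (1002, "04-01-2024")],
    ["alert_id", "start_date"],
)

df = cast_to_string(df, ["alert_id"])                # long -> string
df = cast_to_date(df, ["start_date"], "MM-dd-yyyy")  # string -> date

df.printSchema()  # alert_id: string, start_date: date

Note that both helpers return a new dataframe rather than modifying the input in place, so the result must be reassigned, as shown above.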
Commit your code with the following message: “feature: add type utils.”