
15 - Creating Type Utilities

This content is also available at learn.palantir.com and is presented here for accessibility purposes.

📖 Task Introduction

In this task, you’ll continue addressing data normalization, this time focusing specifically on the schema, or column types. To do so, you’ll rename and repurpose the utils.py file that was generated when the repository was created.

🔨 Task Instructions

  1. Right-click the utils.py file in the repository's Files panel and rename it to type_utils.py.

  2. Delete the contents of the file in the code editor window (e.g., Ctrl+A → Delete).

  3. Copy the code block below, and paste it into the code editor.

    Note how the functionality described in the code comments addresses the schema/type issues described above. A short sketch of how a transform might call these helpers follows these instructions.

    
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType
    
    
    def cast_to_string(df, string_columns):
        """
        This function takes a dataframe (df) and an array of columns as arguments
        This function iterates through the list of columns in the dataframe and
        converts them to string types
        """
        for colm in string_columns:
            df = df.withColumn(colm, F.col(colm).cast(StringType()))
        return df
    
    
    def cast_to_date(df, string_columns, date_format):
        """
        This function takes a dataframe (df), an array of string columns, and a date format (string) as arguments
        This function iterates through the list of string columns in the dataframe and
        converts them to date types based on the specified date format
        Example date format: "MM-dd-yyyy"
        """
        for colm in string_columns:
            df = df.withColumn(colm, F.to_date(F.col(colm), date_format))
        return df
    
  4. Commit your code with the following message: “feature: add type utils.”
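
For reference, here is a minimal sketch of how a downstream transform in this repository might call these helpers. The dataset paths, column names, and date format are hypothetical placeholders, and the import path for type_utils depends on your repository's package structure.

    from transforms.api import transform_df, Input, Output

    # Hypothetical import path; adjust it to match where type_utils.py
    # lives in your repository's package structure.
    from myproject.datasets.type_utils import cast_to_date, cast_to_string


    @transform_df(
        Output("/Example/output/flights_typed"),        # hypothetical output dataset
        source_df=Input("/Example/input/flights_raw"),  # hypothetical input dataset
    )
    def compute(source_df):
        # Cast identifier columns to strings, then parse a string column
        # holding dates like "03-14-2024" into a proper date type.
        source_df = cast_to_string(source_df, ["flight_id", "tail_number"])
        source_df = cast_to_date(source_df, ["flight_date"], "MM-dd-yyyy")
        return source_df

After a build, the output dataset's schema should show flight_id and tail_number as string columns and flight_date as a date column.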