This content is also available at learn.palantir.com ↗ and is presented here for accessibility purposes.
In this task, you’ll continue addressing data normalization measures, this time focusing on the schema, or column types. For this, you’ll rename and repurpose the utils.py file that the repository provided when it was created.
Right-click the utils.py file in the repository's Files panel and rename it to type_utils.py.
Delete the contents of the file in the code editor window (e.g., with Ctrl+A → Delete).
Copy the code block below, and paste it into the code editor.
Note how the functionality described in the code comments addresses the schema/type issues described above.
from pyspark.sql import functions as F
from pyspark.sql.types import StringType


def cast_to_string(df, string_columns):
    """
    This function takes a dataframe (df) and an array of columns as arguments.
    This function iterates through the list of columns in the dataframe and
    converts them to string types.
    """
    for colm in string_columns:
        df = df.withColumn(colm, F.col(colm).cast(StringType()))
    return df


def cast_to_date(df, string_columns, date_format):
    """
    This function takes a dataframe (df), an array of string columns, and a
    date format (string) as arguments.
    This function iterates through the list of string columns in the dataframe
    and converts them to date types based on the specified date format.
    Example date format: "MM-dd-yyyy"
    """
    for colm in string_columns:
        df = df.withColumn(colm, F.to_date(F.col(colm), date_format))
    return df
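For context, here is a minimal sketch of how these helpers could be called from other code in the repository. The column names and sample values (alert_id, start_date) are hypothetical placeholders for illustration, not part of the tutorial's datasets.

from pyspark.sql import SparkSession

from type_utils import cast_to_date, cast_to_string

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data: a numeric ID and a date stored as a string
df = spark.createDataFrame(
    [(1001, "03-15-2024"), (1002, "04-01-2024")],
    ["alert_id", "start_date"],
)

df = cast_to_string(df, ["alert_id"])                # long -> string
df = cast_to_date(df, ["start_date"], "MM-dd-yyyy")  # string -> date

df.printSchema()  # alert_id: string, start_date: date

Note that both helpers return a new dataframe rather than modifying the input in place, so the result must be reassigned, as shown above.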
Commit your code with the following message: “feature: add type utils.”