F.date_add(start, days)
F.date_sub(start, days)
F.add_months(start, months)
F.datediff(end, start)
F.months_between(date1, date2)
F.last_day(date)
F.next_day(date, dayOfWeek)
F.year(column)
F.month(column)
F.dayofmonth(column)
F.hour(column)
F.minute(column)
F.second(column)
F.quarter(column)
F.dayofyear(column)
F.weekofyear(column)
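For example, a few of these can be chained to derive new columns. This is a minimal sketch; the DataFrame `df` and its date column `dt` are assumed:

```python
from pyspark.sql import functions as F

# Hypothetical DataFrame `df` with a date column `dt`:
df = (
    df.withColumn("next_week", F.date_add(F.col("dt"), 7))        # 7 days later
      .withColumn("month_end", F.last_day(F.col("dt")))           # last day of dt's month
      .withColumn("year", F.year(F.col("dt")))                    # calendar year as an integer
      .withColumn("days_since", F.datediff(F.current_date(), F.col("dt")))  # days from dt to today
)
```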
Here's a quick reference:
Format | Example
---|---
yyyy-MM-dd | 1997-01-31
yyyy-MM-dd HH:mm:ss | 1997-01-31 23:59:59
Date formatting string patterns are based on the Java class java.text.SimpleDateFormat. The complete reference is available in the Date & Time Format Syntax Table.
F.to_date(column, format=None)
F.to_timestamp(column, format=None)
F.to_utc_timestamp(timestamp, tz)
F.unix_timestamp(timestamp=None, format='yyyy-MM-dd HH:mm:ss')
F.date_format(date, format)
F.from_unixtime(timestamp, format='yyyy-MM-dd HH:mm:ss')
F.from_utc_timestamp(timestamp, tz)
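As a sketch of how the parsing and formatting functions fit together, using the format patterns from the table above (the string column `date_str` is assumed):

```python
from pyspark.sql import functions as F

# Parse a string like '1997-01-31 23:59:59' into a proper timestamp,
# then format it back out (column `date_str` is hypothetical):
df = df.withColumn("ts", F.to_timestamp(F.col("date_str"), "yyyy-MM-dd HH:mm:ss"))
df = df.withColumn("day", F.date_format(F.col("ts"), "yyyy-MM-dd"))   # string: '1997-01-31'
df = df.withColumn("epoch_seconds", F.unix_timestamp(F.col("ts")))    # long: seconds since epoch
```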
`long` to `timestamp`

Some systems store timestamps as a `long` datatype, in milliseconds. PySpark SQL stores timestamps in seconds. We must divide the `long` version of the timestamp by 1000 to properly cast it to `timestamp`:
```python
casted_timestamp = (F.col('timestamp') / 1000).cast("timestamp")
df = df.withColumn("timestamp", casted_timestamp)
# 1531860192661 => Tuesday, July 17, 2018 8:43:12 PM
```
We can also use `F.from_unixtime(timestamp)` for clarity (note that it returns a formatted string rather than a `timestamp`):

```python
timestamp = F.from_unixtime(F.col('timestamp') / 1000)
df = df.withColumn("timestamp", timestamp)
```
When casting from `long` to `timestamp`, we lose a level of granularity: fractions of a second are not preserved.
F.trunc(date, format)
F.date_trunc(format, timestamp)
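A minimal sketch of truncation, assuming columns `dt` (date) and `ts` (timestamp): note that `F.trunc` takes the column first and operates on dates, while `F.date_trunc` takes the format first and operates on timestamps.

```python
from pyspark.sql import functions as F

# Truncate a date to the first of its month, and a timestamp to the top of its hour
# (columns `dt` and `ts` are hypothetical):
df = df.withColumn("month_start", F.trunc(F.col("dt"), "month"))     # 2018-07-17 => 2018-07-01
df = df.withColumn("hour_start", F.date_trunc("hour", F.col("ts")))  # ... 20:43:12 => ... 20:00:00
```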