There are a number of ways to produce aggregations in PySpark. We recommend this syntax as the most reliable.
Copied!1 2 3
aggregated_df = df.groupBy('state').agg( F.max('city_population').alias('largest_city_in_state') )
state | city | city_population |
---|---|---|
WA | Bellevue | 100000 |
WA | Seattle | 700000 |
PA | Philadelphia | 1500000 |
PA | Pittsburgh | 300000 |
state | largest_city_in_state |
---|---|
PA | 1500000 |
WA | 700000 |
The .groupBy()
method on a DataFrame
takes an arbitrary number of columns over which to perform the aggregations. Alternatively, to aggregate across the whole DataFrame
, include no columns.
Copied!1 2 3 4 5 6 7
aggregated_df = df.groupBy('state', 'county').agg( F.max('city_population').alias('largest_city_in_state_county') ) aggregated_df = df.groupBy().agg( F.max('city_population').alias('largest_city_overall') )
The .agg()
method on a grouped DataFrame
takes an arbitrary number of aggregation functions.
Copied!1 2 3 4
aggregated_df = df.groupBy('state').agg( F.max('city_population').alias('largest_city_in_state'), F.avg('city_population').alias('average_population_in_state') )
By default aggregations produce columns of the form
aggregation_name(target_column)
. However, column names in Foundry cannot contain parentheses or other non-alphanumeric characters. Alias each aggregation to a specific name instead.
Pivot tables in PySpark work very similarly to ordinary grouped aggregations.
Copied!1
pivoted_df = df.groupBy('equipment').pivot('sensor').mean('value')
equipment | sensor | value |
---|---|---|
A | temperature | 60 |
A | temperature | 40 |
B | speed | 6 |
A | speed | 3 |
equipment | temperature | speed |
---|---|---|
A | 50 | 3 |
B | null | 7 |
Learn more about PySpark aggregate functions. ↗
avg(column)
/ mean(column)
collect_list(column)
collect_set(column)
count(column)
corr(x, y)
x
and y
.covar_pop(col1, col2)
covar_samp(col1, col2)
countDistinct(column, *cols)
first(column, ignorenulls=False)
grouping(column)
grouping_id(*cols)
kurtosis(column)
last(column, ignorenulls=False)
max(column)
min(column)
skewness(column)
stddev(column)
stddev_pop(column)
stddev_samp(column)
sum(column)
sumDistinct(column)
var_pop(column)
var_samp(column)
variance(column)