Aggregation and pivot tables

Aggregation Syntax

There are a number of ways to produce aggregations in PySpark. We recommend this syntax as the most reliable.

Copied!
1 2 3 aggregated_df = df.groupBy('state').agg( F.max('city_population').alias('largest_city_in_state') )
statecitycity_population
WABellevue100000
WASeattle700000
PAPhiladelphia1500000
PAPittsburgh300000
statelargest_city_in_state
PA1500000
WA700000

Examples

The .groupBy() method on a DataFrame takes an arbitrary number of columns over which to perform the aggregations. Alternatively, to aggregate across the whole DataFrame, include no columns.

Copied!
1 2 3 4 5 6 7 aggregated_df = df.groupBy('state', 'county').agg( F.max('city_population').alias('largest_city_in_state_county') ) aggregated_df = df.groupBy().agg( F.max('city_population').alias('largest_city_overall') )

The .agg() method on a grouped DataFrame takes an arbitrary number of aggregation functions.

Copied!
1 2 3 4 aggregated_df = df.groupBy('state').agg( F.max('city_population').alias('largest_city_in_state'), F.avg('city_population').alias('average_population_in_state') )

By default aggregations produce columns of the form aggregation_name(target_column). However, column names in Foundry cannot contain parentheses or other non-alphanumeric characters. Alias each aggregation to a specific name instead.

Pivot Tables

Pivot tables in PySpark work very similarly to ordinary grouped aggregations.

Copied!
1 pivoted_df = df.groupBy('equipment').pivot('sensor').mean('value')
equipmentsensorvalue
Atemperature60
Atemperature40
Bspeed6
Aspeed3
equipmenttemperaturespeed
A503
Bnull7

Aggregation Functions

Learn more about PySpark aggregate functions. ↗

avg(column) / mean(column)

collect_list(column)

  • Combine all values into an array

collect_set(column)

  • Combine all values into an array with duplicates removed

count(column)

corr(x, y)

  • Pearson Correlation Coefficient for columns x and y.

covar_pop(col1, col2)

covar_samp(col1, col2)

countDistinct(column, *cols)

first(column, ignorenulls=False)

  • First value of the column in the group. Useful for Pivot tables for which we expect only one value to exist but must choose an aggregation anyway.

grouping(column)

grouping_id(*cols)

kurtosis(column)

last(column, ignorenulls=False)

max(column)

min(column)

skewness(column)

stddev(column)

stddev_pop(column)

  • Population standard deviation

stddev_samp(column)

  • Unbiased sample standard deviation

sum(column)

sumDistinct(column)

var_pop(column)

  • Population variance

var_samp(column)

  • Unbiased sample variance

variance(column)