Code Workbook allows users to use both SparkR and native R. SparkR provides a distributed data frame implementation that supports operations like selection, filtering, and aggregation on large datasets. While users may be more familiar with native R, we recommend first using SparkR to filter large datasets down to a manageable size before converting them to native R.
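For example, a common pattern (sketched below with an illustrative column name) is to filter the distributed SparkR DataFrame first, then collect the reduced result into a native R data frame with SparkR::collect.

# Filter the distributed DataFrame first (column name is illustrative)
df_small <- SparkR::filter(df, "numeric_col > 10")
# Collect the reduced result into a native R data.frame for local processing
local_df <- SparkR::collect(df_small)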
Read the full API documentation for SparkR to see all possible operations. Below, we outline syntax for common operations.
Filter expressions can be a SQL-like WHERE clause passed as a string.
df_filtered <- SparkR::filter(df, "numeric_col > 10")
You can also use column expressions similar to standard R syntax.
df_filtered <- SparkR::filter(df, df$numeric_col > 10)
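Column expressions can be combined with the standard & and | operators. A small sketch, assuming illustrative column names:

# Keep only rows matching both conditions (column names are illustrative)
df_filtered <- SparkR::filter(df, df$numeric_col > 10 & df$other_col < 5)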
You can also use SparkR::where with similar syntax.
df_filtered <- SparkR::where(df, "numeric_col > 10")
Subset columns using SparkR::select().
df_subset <- SparkR::select(df, "column1", "column2", "column3")
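SparkR::select also accepts column expressions, so you can derive values while subsetting. A minimal sketch, assuming illustrative column names:

# Select an existing column and a derived column in one call
df_subset <- SparkR::select(df, df$column1, df$column2 * 2)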
Rename a column with SparkR::withColumnRenamed().
df <- SparkR::withColumnRenamed(df, "old_column_name", "new_column_name")
Add new columns using SparkR::withColumn().
# Add two columns
df <- SparkR::withColumn(df, 'col1_plus_col2', df$col1 + df$col2)

# Multiply a column by a constant
df <- SparkR::withColumn(df, 'col1_times_60', df$col1 * 60)
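To add a literal constant as a column, wrap the value in SparkR::lit. A small sketch; the column name is illustrative:

# Add a column holding the constant value 1
df <- SparkR::withColumn(df, 'constant_col', SparkR::lit(1))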
Use SparkR::groupBy and SparkR::agg to compute aggregates. Calling SparkR::groupBy creates a grouped data object. Pass the grouped data object into SparkR::agg to get an aggregated dataframe.
df_grouped <- SparkR::groupBy(df, "group_col1", "group_col2")
df_agg <- SparkR::agg(df_grouped, average_col1 = SparkR::avg(df$col1), max_col = SparkR::max(df$col1))
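To count the rows in each group, you can also pass the grouped data object to SparkR::count. A sketch, reusing the grouping from above:

# Number of rows in each group
df_counts <- SparkR::count(df_grouped)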