Code Workbook currently supports three languages: Python, R, and SQL.
The currently supported versions of Python in Code Workbook are Python 3.8 and Python 3.9. Python 2 is not supported, and environments using Python 2 will fail to resolve. We strongly recommend using one of the newer available versions of Python, as Palantir Foundry discontinues support for Python versions that are considered end-of-life ↗ in the official Python developer documentation.
The currently supported versions of R are R 3.5, R 3.6, R 4.0, R 4.1, and R 4.2. R 3.3 and R 3.4 are not supported, and their respective environments will fail to initialize.
The SQL variant supported in Code Workbook is Spark SQL ↗.
To enable a specific language on a Code Workbook profile, see the Conda Environment section of the Configuring Code Workbook profiles documentation. Examples for each of the supported languages are provided below, in the respective introductions to Python, R, and SQL.
Palantir Foundry will no longer support Python 3.6 and Python 3.7 after February 1, 2024. For Python 3.8 and higher, Foundry will follow the deprecation timelines defined by the Python Software Foundation (as seen in the version end of life table ↗), meaning that a Python version will not be supported in Foundry after end of life. For more information, see the documentation on Python version support.
Specific configuration is necessary for supported languages to function, as discussed in the sections below.
R is not yet available for self-service. Two things must be true in order to have the ability to create an R transform in Code Workbook:

- R must be enabled for your environment (because R is not yet available for self-service, it cannot be enabled on your own).
- The package vector-spark-module-r must be present in the environment currently used in the workbook. This can be achieved in either of the following ways:
    - Use a profile that adds the vector-spark-module-r package to the profile's environment.
    - Add vector-spark-module-r to the environment using the Add package dropdown menu.

See Configuring Code Workbook profiles for more information.
For Python, the package vector-spark-module-py must be present in the environment currently in use in the workbook. This can be achieved in either of the following ways:

- Use a profile that adds the vector-spark-module-py package to the environment.
- Add vector-spark-module-py to the environment using the Add package dropdown menu.

See Configuring Code Workbook profiles for more information.
SQL transforms do not require any additional packages to function. As a result, SQL transforms will always be available by default for any given profile.
If you do not plan on using Python or R on a given profile, consider removing the associated vector-spark-module-py or vector-spark-module-r package to reduce the size of your environment. You can always add the packages back when you need them.
A Python transform is defined as a Python function with any number of inputs, at most one output, and optionally one or more visualizations. When a transform's alias is referenced as a function argument, Code Workbook automatically passes the output of that alias in as an input to the transform. For more information about transforms in Code Workbook, consult the Transforms overview documentation.
A simple Python transform might take a single PySpark DataFrame as input, transform the data using PySpark syntax, and return the transformed Spark DataFrame as output.
```python
def child(input_spark_dataframe):
    from pyspark.sql import functions as F
    return input_spark_dataframe.filter(F.col('A') == 'value').withColumn('new_column', F.lit(2))
```
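A transform can also take multiple inputs; each upstream alias simply becomes another function parameter. The following is a minimal sketch under the assumption that upstream nodes aliased customers and orders exist in the workbook and share a hypothetical customer_id column:

```python
from pyspark.sql import functions as F

def enriched_orders(customers, orders):
    # Each parameter name matches an upstream alias; Code Workbook passes the
    # output of that alias in as a Spark DataFrame.
    # 'customer_id' is a hypothetical join key used only for illustration.
    joined = orders.join(customers, on="customer_id", how="left")
    return joined.withColumn("processed", F.lit(True))
```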
Within a Python transform, converting between Spark and pandas DataFrames is straightforward.
```python
# Convert to PySpark
spark_df = spark.createDataFrame(pandas_df)

# Convert to pandas
pandas_df = spark_df.toPandas()
```
Converting to pandas means collecting data to the driver. As a result, the size of the data is constrained by the available driver memory on the Spark module. If you are working with a large dataset, you may want to first filter and aggregate your data using Spark, then collect it into a pandas DataFrame.
```python
from pyspark.sql import functions as F

def filtering_before_pandas(input_spark_dataframe):
    # Use PySpark to filter the data
    filtered_spark_df = input_spark_dataframe.select("name", "age").filter(F.col("age") <= 18)

    # Convert to a pandas df, which collects the data to the driver
    pandas_df = filtered_spark_df.toPandas()

    # Perform pandas operations
    mean_age = pandas_df["age"].mean()
    pandas_df["age_difference_to_mean"] = pandas_df["age"] - mean_age

    # Output the resulting DataFrame
    return pandas_df
```
To keep the order of a sorted pandas DataFrame after saving, save it as a Spark DataFrame with a single partition:
```python
import pyspark.pandas as p

return p.from_pandas(df).to_spark().coalesce(1)
```
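As a fuller sketch, the same pattern can be used inside a complete transform. This assumes a Spark version that includes the pyspark.pandas API and an upstream alias input_spark_dataframe with a hypothetical age column:

```python
import pyspark.pandas as p

def sorted_output(input_spark_dataframe):
    # Collect to pandas and sort (assumes the data fits in driver memory)
    pandas_df = input_spark_dataframe.toPandas().sort_values("age")
    # Save as a single-partition Spark DataFrame so the sorted order is preserved
    return p.from_pandas(pandas_df).to_spark().coalesce(1)
```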
An R transform is defined as an R function with any number of inputs, at most one output, and optionally one or more visualizations. When a transform's alias is referenced as a function argument, Code Workbook automatically passes the output of that alias in as an input to the transform. For more information about transforms in Code Workbook, consult the Transforms overview documentation.
A simple R transform might take one parent R data.frame as input, transform the data using R, and return one R data.frame as output.
```r
child <- function(r_dataframe) {
    library(tidyverse)
    new_df <- r_dataframe %>%
        dplyr::select(col_A, col_B) %>%
        dplyr::filter(col_A == TRUE) %>%
        dplyr::mutate(new_column = 1000)
    return(new_df)
}
```
Within an R transform, converting between Spark DataFrames and R data.frames is straightforward:
```r
# Convert from Spark DataFrame to R data.frame
new_r_df <- SparkR::collect(spark_df)

# Convert from R data.frame to Spark DataFrame
spark_df <- SparkR::as.DataFrame(r_df)
```
Note that converting to an R data.frame means collecting data to the driver. As a result, the size of the data is constrained by the available driver memory on the Spark module. If you are working with a large dataset, you may want to first filter and aggregate your data using Spark, then collect it into an R data.frame.
```r
output_dataset <- function(spark_df) {
    library(tidyverse)

    # Use SparkR to filter the data
    input_dataset_filtered <- SparkR::select(spark_df, 'column_A', 'column_B')

    # Convert to R data.frame
    local_df <- SparkR::collect(input_dataset_filtered)

    # Use tidyverse functions to transform your data
    local_df <- local_df %>%
        dplyr::filter(column_A == TRUE) %>%
        dplyr::mutate(new_column = 1000)

    # Output an R data.frame
    return(local_df)
}
```
When an imported dataset in Code Workbook is read in as an R data.frame, the dataset is converted from a Spark DataFrame to an R data.frame by collecting to the driver. If you are working with a large dataset, read it in as a Spark DataFrame, filter or aggregate it using SparkR, and then use SparkR::collect() to convert it to an R data.frame. Alternately, use Python or SQL to transform your data into something smaller prior to using R.

Long, Array, Map, Struct, and Datetime types are not convertible. Consider dropping these columns or casting them to other data types (such as String). You will receive a warning in the interface when attempting to read an input with these types as an R data.frame.

R in Code Workbook is single-threaded, meaning only one R job can run at a time on the same Spark module. If you initiate multiple R jobs at the same time, they will run serially; jobs that are queued will appear as Queueing in Code Workbook.
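One way to work around the non-convertible types described above is to cast or drop the affected columns in an upstream Python node before reading the result as an R data.frame. The sketch below assumes hypothetical columns event_time (a Datetime type) and tags (an Array type):

```python
from pyspark.sql import functions as F

def r_ready_input(input_spark_dataframe):
    # Cast the Datetime column to String and drop the Array column so a
    # downstream R node can read this output as an R data.frame.
    return (
        input_spark_dataframe
        .withColumn("event_time", F.col("event_time").cast("string"))
        .drop("tags")
    )
```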
The SQL variant supported in Code Workbook is Spark SQL ↗. The only supported input and output types are Spark DataFrames.
A simple SQL transform might join two input DataFrames on a join key.
```sql
SELECT table_b.col_A, table_b.col_B, table_a.*
FROM table_a
JOIN table_b
  ON table_a.col_C == table_b.col_C
```
To add a parent to a SQL node, referencing the alias within the code is not sufficient. You must add the parent through the UI by selecting the input bar, or create the child node using the + button.