Preview transforms

The Palantir extension for Visual Studio Code allows you to preview Python transforms directly from your local Visual Studio Code environment or a VS Code Workspace in the Palantir platform. This capability enables rapid testing of transforms without the need to exit the code editor. Currently, this feature is only available for Python transforms.

Initiate a preview

You can start a preview within your local Visual Studio Code environment or an in-platform VS Code Workspace in the following four ways:

- Select the Run preview command from the Command Palette.
- Select the Run preview icon (a fast-forward symbol) from the VS Code toolbar.
- Select Preview above the transform.
- Open the Preview panel and select the Preview button next to the code filename.

Preview process

The Palantir extension for Visual Studio Code runs local previews using the Preview Engine. The Preview Engine downloads parts of datasets and temporarily stores them on the user's machine, provided the user has the appropriate permissions for the data.

To use preview during local development, local preview must be enabled by your platform administrator from the Code Repositories settings page in Control Panel.

Upon opening a Palantir repository, the extension will configure the environment. Once the environment is set up and transforms are detected, you will be able to execute previews locally.

Inside Code Repositories, we use Code Assist to run preview. The following sections compare the two preview modes.

Comparing preview modes

Code Assist preview and Preview Engine preview use different execution models. Code Assist preview uses a preview version of the transforms library, which is a close re-implementation of the actual transforms library used during in-platform builds. This results in broader feature support at the cost of precision. There are subtle implementation differences between the preview and build versions of the transforms library which can lead to non-intuitive and sometimes misleading preview results.

On the other hand, Preview Engine uses the original transforms library to execute the user code. As a result, the underlying code can barely tell that the transformation is running in preview mode, which yields higher accuracy and performance. The main drawback is that support must be added separately for each library primitive lower in the architecture, so fewer features are supported at the time of writing.

Sample-less vs. sampled dataset loading in Preview Engine

Preview Engine features a sample-less dataset loading option. To understand its importance, consider how sampled input loading works in both Code Assist preview and Preview Engine preview. When an input dataset is requested, a subset of it is downloaded to disk before the preview runs. The subset is uniformly sampled from the input, and the number of rows is user-configurable, with a default of 10,000. In some use cases, this sampling is adequate and does not introduce statistical bias. However, for certain transformations, such as narrow filters or joins between multiple inputs, the result can be deceptively small: rows that match a filter expression are less likely to appear in the sample, and rows that match a join expression become exponentially less likely to appear as the number of joined inputs grows.
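The effect of sampling on joins can be made concrete with a back-of-the-envelope model. The sketch below assumes that every input is sampled independently and uniformly with the same fraction, and that each row of the full join result needs exactly one specific row from each joined input. Real joins with duplicate keys behave differently. The function name and the 1,000,000-row input size are illustrative assumptions.

```python
def expected_surviving_fraction(sample_fraction: float, joined_inputs: int) -> float:
    """Expected share of full-join result rows that survive independent sampling.

    Simplified model: a result row survives only if its matching row was
    sampled in every one of the joined inputs.
    """
    if not 0.0 <= sample_fraction <= 1.0:
        raise ValueError("sample_fraction must be between 0 and 1")
    if joined_inputs < 1:
        raise ValueError("joined_inputs must be at least 1")
    return sample_fraction ** joined_inputs


# The default of 10,000 rows taken from a hypothetical 1,000,000-row input
# corresponds to a 1% sample.
fraction = 10_000 / 1_000_000

for inputs in (1, 2, 3):
    share = expected_surviving_fraction(fraction, inputs)
    print(f"{inputs} joined input(s): about {share:.6%} of result rows survive")
```

Under these assumptions, each additional joined input multiplies the surviving share by the sample fraction again, which is why a preview over sampled inputs can return far fewer rows than the build would.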

With sample-less dataset loading, no pre-sampling takes place. Instead, Preview Engine relies on modern data processing engines, such as Spark or Polars, to push down predicates to the data-source level and download only the chunks of the dataset that are most likely to match the query. This means that filters or other narrowing expressions used anywhere within the transform code may be eligible to be pushed down, resulting in fully accurate preview results with little extra computation time.
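The idea behind predicate pushdown can be illustrated with a toy reader. This is a simplified model, not how Preview Engine, Spark, or Polars implement it: it assumes a dataset stored as chunks that each carry minimum and maximum statistics for one column, and it skips chunks whose statistics prove that no row can match. The `Chunk` class and `read_with_pushdown` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """A hypothetical stored chunk with min/max statistics for one column."""
    min_value: int
    max_value: int
    rows: tuple[int, ...]


def read_with_pushdown(chunks: list[Chunk], lower: int, upper: int) -> tuple[list[int], int]:
    """Return rows with lower <= value <= upper and the number of chunks read."""
    matched: list[int] = []
    chunks_read = 0
    for chunk in chunks:
        # Skip chunks whose statistics prove that no row can satisfy the predicate.
        if chunk.max_value < lower or chunk.min_value > upper:
            continue
        chunks_read += 1
        matched.extend(value for value in chunk.rows if lower <= value <= upper)
    return matched, chunks_read


chunks = [
    Chunk(0, 9, tuple(range(0, 10))),
    Chunk(10, 19, tuple(range(10, 20))),
    Chunk(20, 29, tuple(range(20, 30))),
]
rows, chunks_read = read_with_pushdown(chunks, lower=12, upper=14)
print(rows, chunks_read)
```

Only the chunks that may contain matching rows are read, yet the result is exact for the given predicate. That is the property that makes sample-less loading accurate without downloading the whole dataset.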

Some pipelines cannot take full advantage of predicate push-down, for example, pipelines that do not contain filter expressions. In these cases, the pipeline's author can introduce some conditional filter expressions in their code to speed up their preview runs during development.

There is one more caveat to keep in mind when deciding to go with sampled or sample-less dataset loading for a given input. Sampled inputs are cached locally, on disk, while caching is not supported for sample-less loading. This means if an input is not used in join expressions and the statistical properties of filters applied to this input are less relevant to the pipeline preview's correctness, sampled dataset loading is the better choice for speedier previews. In all other cases, sample-less dataset loading should be preferred.
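The rule of thumb above can be written down as a small helper. The function and parameter names are hypothetical, and the return values simply mirror the Sampled and Full dataset options offered in the Preview panel.

```python
def recommend_input_strategy(used_in_join: bool, filter_statistics_matter: bool) -> str:
    """Suggest an input strategy for a single preview input.

    Sampled loading is cached on disk and therefore faster on repeated runs,
    so it is suggested only when sampling cannot distort the result.
    """
    if not used_in_join and not filter_statistics_matter:
        return "Sampled"
    return "Full dataset"


# A lookup table that is joined against other inputs: prefer sample-less loading.
print(recommend_input_strategy(used_in_join=True, filter_statistics_matter=False))

# A standalone input whose filters are not critical to the preview's correctness.
print(recommend_input_strategy(used_in_join=False, filter_statistics_matter=False))
```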

To choose a strategy, select the Configure input strategy button in the Preview panel and choose between Sampled or Full dataset options.

Code-defined input filtering

Apart from the Sampled and Full dataset input strategy configurations discussed above, VS Code preview also supports Code-defined filters. This option allows you to specify your own custom filtering strategy implemented directly in your code. When applicable, custom filtering strategies leverage predicate pushdown to ensure that only the most relevant data is used in preview. Structured inputs are supported in both Spark and lightweight transforms; unstructured inputs, such as raw files, are supported in Spark transforms. You can select any eligible functions in your repository from the multi-select dropdown menu, in any order you prefer. Filters are applied in the order in which they appear in the selection box. The VS Code extension automatically discovers all eligible filters anywhere in the project codebase and shows them in the selection dropdown.

Configure code-defined input filtering.
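Because filters are applied in selection order, the same two filters can produce different inputs depending on how they are arranged. The sketch below illustrates this with raw-file style filters. `FakeFileStatus` is a hypothetical stand-in that models only a path, and `apply_in_order` is an illustrative helper, not the extension's implementation.

```python
from collections.abc import Callable, Iterable, Iterator
from dataclasses import dataclass
from functools import reduce
import itertools as it


@dataclass(frozen=True)
class FakeFileStatus:
    """Hypothetical stand-in for a file listing entry; only `path` is modeled."""
    path: str


FileFilter = Callable[[Iterator[FakeFileStatus]], Iterator[FakeFileStatus]]


def csv_only(files: Iterator[FakeFileStatus]) -> Iterator[FakeFileStatus]:
    """Keep only files whose path ends in .csv."""
    return (f for f in files if f.path.endswith(".csv"))


def first_two(files: Iterator[FakeFileStatus]) -> Iterator[FakeFileStatus]:
    """Keep only the first two files of the listing."""
    return it.islice(files, 2)


def apply_in_order(
    files: Iterator[FakeFileStatus], filters: Iterable[FileFilter]
) -> Iterator[FakeFileStatus]:
    """Feed the output of each filter into the next one, in the given order."""
    return reduce(lambda acc, file_filter: file_filter(acc), filters, files)


def listing() -> Iterator[FakeFileStatus]:
    """Produce a fresh iterator over a small example listing."""
    return iter([FakeFileStatus(p) for p in ("a.csv", "b.json", "c.csv", "d.csv")])


csv_then_limit = [f.path for f in apply_in_order(listing(), [csv_only, first_two])]
limit_then_csv = [f.path for f in apply_in_order(listing(), [first_two, csv_only])]
print(csv_then_limit, limit_then_csv)
```

Filtering before limiting keeps two CSV files, while limiting first leaves only the CSV files that happen to be among the first two entries. Placing the most selective filter first is usually the safer arrangement.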

For a Python function to be an eligible preview filter, it must satisfy all of the following rules:

- It is a valid Python function defined directly in the global module scope, which means it is:
  - Not a nested function
  - Not guarded by if, with, for, or other statements
  - Not part of a class
  - Not imported from somewhere else
  - Not created by assigning a function to a variable
- It is neither async nor private (its name does not start with _).
- It has no decorators applied to it.
- It is fully type-annotated with one of the following signatures:
  - (pyspark.sql.DataFrame) -> pyspark.sql.DataFrame: For Spark transforms
  - (polars.LazyFrame) -> polars.LazyFrame: For lightweight transforms
  - (collections.abc.Iterator[transforms.api._file_system.FileStatus]) -> collections.abc.Iterator[transforms.api._file_system.FileStatus]: For raw files

Here is a simple example list of eligible functions that can be used for preview filters:

from collections.abc import Iterator
import itertools as it

import polars as pl
from pyspark.sql import DataFrame
from pyspark.sql import functions as F
from transforms.api._file_system import FileStatus


def limit_files(files: Iterator[FileStatus]) -> Iterator[FileStatus]:
    """Limit a file system listing to its first 10 files (raw files)."""
    return it.islice(files, 10)


def water_pokemons_only(df: DataFrame) -> DataFrame:
    """Keep only Water-type Pokémon (Spark transforms)."""
    return df.filter(F.col("Type_1") == "Water")


def grass_pokemons_only_lightweight(df: pl.LazyFrame) -> pl.LazyFrame:
    """Keep only Grass-type Pokémon (lightweight transforms)."""
    return df.filter(pl.col("Type_1") == "Grass")

You can receive immediate feedback on whether a function is eligible as a code-defined preview filter through the CodeLens hint above the function.

Valid preview filter CodeLens
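To reason about the eligibility rules outside the editor, a rough approximation can be built with Python's standard `ast` module. This is only a sketch and not the extension's actual discovery logic: it checks the structural rules (top-level, synchronous, public, undecorated, fully annotated, one parameter) and that the parameter and return annotations are written identically, but it does not verify that the annotation is one of the three accepted types. The function name `eligible_filter_names` is hypothetical.

```python
import ast


def eligible_filter_names(source: str) -> list[str]:
    """Return names of top-level functions that pass a subset of the rules."""
    names = []
    for node in ast.parse(source).body:
        # Only plain `def` statements directly in the module body qualify. This
        # excludes nested, guarded, class-level, imported, async, and
        # variable-assigned functions.
        if not isinstance(node, ast.FunctionDef):
            continue
        if node.name.startswith("_") or node.decorator_list:
            continue
        args = node.args
        params = args.posonlyargs + args.args + args.kwonlyargs
        if len(params) != 1 or args.vararg or args.kwarg:
            continue
        annotation = params[0].annotation
        if annotation is None or node.returns is None:
            continue
        # Every accepted signature maps a type to the same type.
        if ast.dump(annotation) == ast.dump(node.returns):
            names.append(node.name)
    return names


SAMPLE = '''
from functools import cache
from pyspark.sql import DataFrame

def keep_all(df: DataFrame) -> DataFrame:
    return df

def _private(df: DataFrame) -> DataFrame:
    return df

async def not_sync(df: DataFrame) -> DataFrame:
    return df

def untyped(df):
    return df

@cache
def decorated(df: DataFrame) -> DataFrame:
    return df

if True:
    def guarded(df: DataFrame) -> DataFrame:
        return df

class Holder:
    def method(self, df: DataFrame) -> DataFrame:
        return df

assigned = lambda df: df
'''

print(eligible_filter_names(SAMPLE))
```

The sample source is only parsed, never executed, so the check runs without PySpark installed. Only `keep_all` passes; each of the other definitions violates one of the rules.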

Supported features by different preview methods

The following table shows the current support matrix of the different preview executors. Code Repositories preview is used not only in Code Repositories but also in the Remote preview mode of the Visual Studio Code extension. When previewing in Local mode, users can choose between three dataset loading modes: Full dataset (which is the same as sample-less), Sampled, and Code-defined filters (which applies user-defined filters on top of sample-less loading).

| Feature | Code Repositories (Code Assist) | Sample-less preview (Preview Engine) | Sampled preview (Preview Engine) |
| --- | --- | --- | --- |
| Debugging | Supported | Supported | Supported |
| Foundry datasets | Both tabular (with schema) and raw files | Only tabular datasets | Both tabular (with schema) and raw files |
| Transform generators | Supported | Supported | Supported |
| Data expectations | Spark and lightweight transforms | Supported for Spark transforms | Supported for Spark transforms |
| Lightweight transforms | Supported | Supported for Parquet datasets | Supported |
| Views and object materializations | Supported | Not supported | Supported |
| Incrementality | Minimal support | Supported | Minimal support |
| External transforms | Supported | Supported in Code Workspaces | Supported in Code Workspaces |
| Column statistics and filtering | Supported | Supported | Supported |
| Media sets | Supported | Not supported | Not supported |
| Models | Supported | Supported for Spark transforms [1] | Supported for Spark transforms [1] |
| Spark profiles | Supported | Supported for some Spark configurations | Supported for some Spark configurations |
| Cipher | Supported | Not supported | Not supported |
| Language models | Supported | Not supported | Not supported |
| Virtual tables | Supported | Not supported | Not supported |
| Spark sidecars | Not supported | Not supported | Not supported |
| Complex input sampling | Supported | Code-defined (tabular and raw files) | Not supported |
| Preview variables during debugging | Supported | Not supported | Not supported |

[1] Model output preview is supported in both VS Code workspaces and local development. Model input preview is only supported in VS Code workspaces.

In both local development and VS Code workspaces, if sample-less dataset loading is used for a transformation's preview but the transformation also makes use of unsupported features, the preview will fall back to sampled dataset loading. This behavior is indicated in the preview panel.

Spark profiles

Spark profiles allow users to quickly define and use custom Spark configuration values that specify the behavior of the Spark engine while previewing or building the transform. Review the documentation on Spark profiles for more details.

VS Code preview applies some configurations from the previewed transform's Spark profiles. This includes configurations that affect the runtime behavior of the execution engine, most often for maintaining backward compatibility during breaking changes. It is not possible to change the resources allocated for preview through Spark profiles; this can be changed separately on the Code Workspaces settings page.

Both built-in and user-defined Spark profiles are supported during preview. Options omitted from the list below are ignored:

spark.sql.legacy.timeParserPolicy
spark.sql.parquet.datetimeRebaseModeInRead
spark.sql.legacy.parquet.datetimeRebaseModeInRead
spark.sql.legacy.parquet.datetimeRebaseModeInWrite
spark.sql.analyzer.failAmbiguousSelfJoin
spark.sql.legacy.dataset.nameNonStructGroupingKeyAsValue
spark.sql.legacy.fromDayTimeString.enabled
spark.sql.legacy.typeCoercion.datetimeToString.enabled
spark.sql.legacy.followThreeValuedLogicInArrayExists
spark.sql.legacy.allowUntypedScalaUDF
spark.sql.legacy.allowNegativeScaleOfDecimal
spark.sql.legacy.allowHashOnMapType
spark.sql.legacy.avro.datetimeRebaseModeInRead
spark.sql.legacy.avro.datetimeRebaseModeInWrite
spark.sql.legacy.charVarcharAsString
spark.sql.optimizer.collapseProjectAlwaysInline
spark.foundry.sql.allowAddMonths
spark.sql.parquet.int96AsTimestamp
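The allow-list behavior described above can be sketched as a simple dictionary filter. The helper name `options_applied_in_preview` and the example profile are hypothetical. The set of option names is copied from the list above, and the sketch assumes that any other key is dropped silently.

```python
PREVIEW_SUPPORTED_SPARK_OPTIONS = frozenset({
    "spark.sql.legacy.timeParserPolicy",
    "spark.sql.parquet.datetimeRebaseModeInRead",
    "spark.sql.legacy.parquet.datetimeRebaseModeInRead",
    "spark.sql.legacy.parquet.datetimeRebaseModeInWrite",
    "spark.sql.analyzer.failAmbiguousSelfJoin",
    "spark.sql.legacy.dataset.nameNonStructGroupingKeyAsValue",
    "spark.sql.legacy.fromDayTimeString.enabled",
    "spark.sql.legacy.typeCoercion.datetimeToString.enabled",
    "spark.sql.legacy.followThreeValuedLogicInArrayExists",
    "spark.sql.legacy.allowUntypedScalaUDF",
    "spark.sql.legacy.allowNegativeScaleOfDecimal",
    "spark.sql.legacy.allowHashOnMapType",
    "spark.sql.legacy.avro.datetimeRebaseModeInRead",
    "spark.sql.legacy.avro.datetimeRebaseModeInWrite",
    "spark.sql.legacy.charVarcharAsString",
    "spark.sql.optimizer.collapseProjectAlwaysInline",
    "spark.foundry.sql.allowAddMonths",
    "spark.sql.parquet.int96AsTimestamp",
})


def options_applied_in_preview(profile: dict[str, str]) -> dict[str, str]:
    """Keep only the options that preview honors; everything else is ignored."""
    return {
        key: value
        for key, value in profile.items()
        if key in PREVIEW_SUPPORTED_SPARK_OPTIONS
    }


# Hypothetical profile mixing a behavioral option with a resource option.
example_profile = {
    "spark.sql.legacy.timeParserPolicy": "LEGACY",
    "spark.executor.memory": "8g",
}
print(options_applied_in_preview(example_profile))
```

In this example the time parser policy is kept, while the executor memory setting is dropped, consistent with preview resources being configured separately from Spark profiles.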

External transforms

External transforms in Code Workspaces enforce strict export controls. The Code Workspaces application maintains a historical record of a workspace's inputs, so previous inputs that contain additional security markings may stop a preview due to marking violations. Additionally, the application accounts for all previously incorporated container markings when a workspace computes its marking security checks and export controls to avoid the inappropriate exposure of marked data.

If a workspace contains markings that are incompatible with an external transform, restart the workspace without checkpoints to clear tracked markings. Review the external transforms documentation for additional information.
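The accumulation of markings can be pictured with a toy model. This is an illustrative assumption about the behavior described above, not the Code Workspaces implementation: the class, its methods, and the marking names are all hypothetical. The workspace remembers the union of the markings of every input it has ever loaded, and an external transform is blocked while any remembered marking falls outside the set it is allowed to export.

```python
class WorkspaceMarkingTracker:
    """Toy model of a workspace that remembers markings of all past inputs."""

    def __init__(self) -> None:
        self._tracked: set[str] = set()

    def record_input(self, markings: set[str]) -> None:
        """Remember the markings of a newly loaded input."""
        self._tracked |= markings

    def violations(self, allowed_markings: set[str]) -> set[str]:
        """Tracked markings that an external transform may not export."""
        return self._tracked - allowed_markings

    def restart_without_checkpoints(self) -> None:
        """Clear the history, as a restart without checkpoints would."""
        self._tracked.clear()


workspace = WorkspaceMarkingTracker()
workspace.record_input({"marking-a"})
workspace.record_input({"marking-b"})  # an input that is no longer in use

# Hypothetical external transform that may only export data carrying marking-a.
blocked_by = workspace.violations({"marking-a"})

workspace.restart_without_checkpoints()
workspace.record_input({"marking-a"})
blocked_after_restart = workspace.violations({"marking-a"})
print(blocked_by, blocked_after_restart)
```

In this model the earlier input keeps blocking the preview even after it is no longer used, and only clearing the tracked history removes the violation.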

Accurate incremental preview

The Python transforms VS Code integration allows for accurate previewing of incremental transformations. Incremental transformations follow complex evaluation logic to decide when a given transform can run incrementally and what sets of input and output rows should be read in that case. To better understand incremental resolution and evaluation logic, we recommend exploring it in a VS Code preview.

When you run a preview in VS Code, incremental resolution and evaluation use exactly the same logic as a build. A preview therefore produces the same results as a build triggered at the same point in time, apart from the effects of sampling.

Accurate incremental preview is only supported for Spark transforms using incremental V1 semantics. If V2 semantics are enabled by setting the v2_semantics=True argument, the setting is ignored and V1 semantics are applied instead.

Incremental preview in VS Code for lightweight transforms always results in a snapshot-mode run, even if the build will run incrementally. This behavior is the same in Code Repositories.

After an incremental preview run, the preview panel will show the incremental resolution results in the UI. If the transform has been run incrementally, a tag displaying Ran incrementally will appear, as shown in the example below.

Preview panel after an incremental run.

In some cases, a condition can prevent the transform from running incrementally, both during preview and during a build. This could be caused by a change to the semantic_version parameter, or certain transaction types on the inputs, among other reasons. The reason for the non-incremental (snapshot) run will be shown when hovering over the Ran as a snapshot tag, as shown below.

Preview panel after a snapshot-mode incremental run.