With Pipeline Builder, you can load, transform, and use geospatial data. If your geospatial workflow is not yet supported by Pipeline Builder's current capabilities, consult Foundry’s legacy geospatial documentation to transform your data in Code Repositories.
Pipeline Builder models geospatial data internally using the concept of a logical type, which is a base type (string, integer, boolean, array, struct) with additional constraints on the data represented. For example, the Geometry type is defined as a string which must be valid GeoJSON, while a GeoPoint must be a struct of longitude between -180 and 180 and latitude between -90 and 90, both inclusive. A full list of supported types can be found below.
All logical types in Pipeline Builder are inheritors of their base types; for instance, a geometry can be used as input to an expression which expects an input of type string, but not vice versa. To cast from a base type to a particular logical type which extends that base type, you can use the “Logical Type Cast” expression, which will apply the constraints associated with that logical type to the data and null any values which fail this validation. The ability for expressions to specify logical types as input and output ensures that when a geospatial-specific expression expects a GeoJSON string, a GeoJSON string will be received.
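The cast behavior described above (apply the logical type's constraints, null anything that fails) can be sketched in plain Python. This is only an illustration of the semantics, not Pipeline Builder's implementation: the function names are hypothetical, and the GeoJSON check is deliberately shallow (it accepts only the geometry object types from RFC 7946), so the validation Pipeline Builder actually applies may differ.

```python
import json
from typing import Optional

# Geometry object types defined by RFC 7946. Assumption: the real
# "valid GeoJSON" check may accept more or fewer shapes than this.
GEOJSON_GEOMETRY_TYPES = {
    "Point", "MultiPoint", "LineString", "MultiLineString",
    "Polygon", "MultiPolygon", "GeometryCollection",
}


def cast_to_geometry(value: Optional[str]) -> Optional[str]:
    """Hypothetical cast: return the string if it looks like a GeoJSON geometry, else None."""
    if value is None:
        return None
    try:
        parsed = json.loads(value)
    except (TypeError, ValueError):
        return None  # not parseable JSON, so the value is nulled
    if not isinstance(parsed, dict) or parsed.get("type") not in GEOJSON_GEOMETRY_TYPES:
        return None
    key = "geometries" if parsed["type"] == "GeometryCollection" else "coordinates"
    return value if isinstance(parsed.get(key), list) else None


def cast_to_geopoint(value: object) -> Optional[dict]:
    """Hypothetical cast: return a longitude/latitude struct if it is within bounds, else None."""
    if not isinstance(value, dict):
        return None
    lon, lat = value.get("longitude"), value.get("latitude")
    for v in (lon, lat):
        if isinstance(v, bool) or not isinstance(v, (int, float)):
            return None
    # Bounds are inclusive, matching the GeoPoint definition above.
    if -180 <= lon <= 180 and -90 <= lat <= 90:
        return {"longitude": float(lon), "latitude": float(lat)}
    return None


rows = ['{"type": "Point", "coordinates": [13.4, 52.5]}', "not json", '{"type": "Circle"}']
casted = [cast_to_geometry(r) for r in rows]  # only the first row survives the cast
```

Because the cast returns the original string unchanged on success, the result can still be passed to anything that expects a plain string, which mirrors the one-way inheritance described above.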
Pipeline Builder currently supports the following geospatial types:
GeoPoint: a struct of longitude and latitude, where longitude is a double between -180 and 180, and latitude is a double between -90 and 90, both inclusive. A GeoPoint must be a valid (x, y) coordinate according to the WGS 84 (EPSG:4326) coordinate reference system (CRS).
Bounding box: minLat, minLon, maxLat, maxLon, where each entry is a valid GeoPoint and where maxLat > minLat and maxLon > minLon.
Ontology GeoPoint: a value formatted as {lat},{lon}, where -90 <= lat <= 90 and -180 <= lon <= 180.
Pipeline Builder supports a variety of different transforms and expressions for geospatial data.
One expression takes a lat,lon pair, validates the bounds outlined above, and converts it into GeoPoint representation.
Another takes an x,y pair and a coordinate reference system, projects that (x,y) into WGS 84, then constructs a GeoPoint representation. It supports conversion from most coordinate systems in the EPSG database, including all UTM zones.
Additional expressions exist to translate between the above two types, as well as to convert them to H3 indices, MGRS, bounding boxes, and the Ontology GeoPoint format.
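As a rough illustration of the lat,lon construction and the Ontology GeoPoint format, the following Python sketch validates the bounds and converts between the two representations. The function names and the `GeoPoint` TypedDict are hypothetical, and treating the Ontology GeoPoint as a plain string is an assumption based on the {lat},{lon} format; Pipeline Builder's own expressions may behave differently on edge cases.

```python
from typing import Optional, TypedDict


class GeoPoint(TypedDict):
    """Illustrative stand-in for the GeoPoint struct (not a Foundry type)."""
    longitude: float
    latitude: float


def geopoint_from_lat_lon(lat: float, lon: float) -> Optional[GeoPoint]:
    """Validate the inclusive bounds and build a GeoPoint; return None on failure."""
    if -90 <= lat <= 90 and -180 <= lon <= 180:
        return {"longitude": float(lon), "latitude": float(lat)}
    return None


def to_ontology_geopoint(point: GeoPoint) -> str:
    """Format a GeoPoint as '{lat},{lon}' (latitude first)."""
    return f"{point['latitude']},{point['longitude']}"


def from_ontology_geopoint(text: str) -> Optional[GeoPoint]:
    """Parse '{lat},{lon}' back into a GeoPoint; return None if malformed or out of bounds."""
    parts = text.split(",")
    if len(parts) != 2:
        return None
    try:
        lat, lon = float(parts[0]), float(parts[1])
    except ValueError:
        return None
    return geopoint_from_lat_lon(lat, lon)


berlin = geopoint_from_lat_lon(52.5, 13.4)
as_text = to_ontology_geopoint(berlin)
```

Note the ordering difference: the struct is described as longitude and latitude, while the Ontology GeoPoint format puts latitude first, which is an easy place to introduce swapped coordinates.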
Once you have populated columns with Pipeline Builder’s geospatial types, you can take advantage of transforms that operate specifically on geospatial data. Most transforms (except for geo-joins) are currently supported in both streaming and batch workflows. Some highlights are listed below.
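Pipeline Builder supports the following geospatial joins: the geometry intersection join, the geometry distance join, and the geometry nearest neighbors join.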
Pipeline Builder's geometry intersection join requires two datasets, each of which must have a geometry typed column. The geometry intersection join does not accept Ontology GeoPoint or GeoPoint as an input type. Before applying the join, we recommend normalizing the geometry column and explicitly filtering out null values if they are not needed in the output. If there is non-determinism or another join in the pipeline, we recommend adding a checkpoint prior to the geo-join.
Pipeline Builder can join datasets of medium-sized geometries (approximately up to 34 points) with a scale of up to 1 million rows on either side, assuming a twofold increase in the number of output rows. For skewed data, the join can support up to 250 million rows on one side against 1.6 thousand rows on the other. Stability may degrade as the size of the geometries increases. The join can consistently support joining a dataset with one massive geometry (on the order of 40k points) against up to 500k rows. Any larger scale may succeed intermittently but is not officially supported.
Geometry intersection joins whose output row count is comparable to that of a cross join can degrade the stability of the join.
As an alternative to the geometry intersection join, the cross join configured with the “Geometries have intersection” filter may provide more stable memory usage. However, this approach could lead to a sharp increase in build times. Another alternative is using the geospatial-tools PySpark library in Code Repositories. Contact your Palantir representative for more information.
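To make the cross join alternative concrete, here is a small Python sketch of the semantics: pair every left row with every right row, drop rows with null geometries, and keep only pairs that intersect. It is a simplification under stated assumptions. Axis-aligned rectangles stand in for arbitrary geometries, and `Rect`, `intersects`, and `cross_join_with_intersection_filter` are hypothetical names rather than Pipeline Builder features.

```python
from itertools import product
from typing import List, NamedTuple, Optional, Tuple


class Rect(NamedTuple):
    """Axis-aligned rectangle used as a simplified stand-in for a geometry."""
    min_x: float
    min_y: float
    max_x: float
    max_y: float


def intersects(a: Rect, b: Rect) -> bool:
    """True when the two rectangles overlap or touch."""
    return (a.min_x <= b.max_x and b.min_x <= a.max_x
            and a.min_y <= b.max_y and b.min_y <= a.max_y)


Row = Tuple[str, Optional[Rect]]  # (row id, geometry or null)


def cross_join_with_intersection_filter(left: List[Row], right: List[Row]) -> List[Tuple[str, str]]:
    """Compare every left/right pair and keep the ids of pairs whose geometries intersect."""
    return [
        (left_id, right_id)
        for (left_id, left_geom), (right_id, right_geom) in product(left, right)
        if left_geom is not None and right_geom is not None
        and intersects(left_geom, right_geom)
    ]


left = [("a", Rect(0, 0, 2, 2)), ("b", Rect(5, 5, 6, 6)), ("c", None)]
right = [("x", Rect(1, 1, 3, 3)), ("y", Rect(10, 10, 11, 11))]
pairs = cross_join_with_intersection_filter(left, right)
```

The sketch evaluates len(left) * len(right) candidate pairs regardless of how many survive the filter, which is consistent with the build-time cost noted above.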
Pipeline Builder's geometry distance join requires two datasets, each of which must have a geometry typed column, a value for distance greater than zero, and a coordinate reference system string which will determine the units of the distance provided. For example, if "epsg:4326" is provided for the coordinate reference system, then the distance will be assumed to be in units of degrees. Similar to the intersection join, we recommend normalizing the geometry column, and explicitly filtering out null values if they are not needed in the output. If there is another join or non-determinism in the pipeline, add a checkpoint prior to the join.
Pipeline Builder can join datasets of small geometries (approximately up to 8 points each) with a scale of up to 1 million rows on either side, assuming a 2x increase in the number of rows as a result of the join. When the number of rows output is comparable to that of a cross join, stability may degrade.
As an alternative to the geometry distance join, a cross join configured with a geometry buffer and the "Geometries have intersection" filter may provide more stable memory usage when the increase in row count is large. However, in most cases this approach could sharply increase build times.
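The role of the coordinate reference system in the distance join can be illustrated with a minimal Python sketch. It assumes point geometries only, treats coordinates as planar values in the CRS's own units (degrees for "epsg:4326"), and uses an inclusive threshold; all three are simplifying assumptions, and `distance_join` is a hypothetical name, not a Pipeline Builder function.

```python
import math
from typing import List, Tuple

PointRow = Tuple[str, Tuple[float, float]]  # (row id, (x, y) in CRS units)


def distance_join(left: List[PointRow], right: List[PointRow], distance: float) -> List[Tuple[str, str]]:
    """Keep the ids of left/right pairs whose planar distance is within `distance` CRS units."""
    if distance <= 0:
        raise ValueError("distance must be greater than zero")
    return [
        (left_id, right_id)
        for left_id, (lx, ly) in left
        for right_id, (rx, ry) in right
        if math.hypot(lx - rx, ly - ry) <= distance
    ]


# With an EPSG:4326-style CRS, these coordinates and the threshold are in degrees.
left = [("a", (0.0, 0.0)), ("b", (10.0, 10.0))]
right = [("x", (0.3, 0.4)), ("y", (3.0, 4.0))]
within_one_degree = distance_join(left, right, 1.0)
```

Because a degree does not correspond to a fixed ground distance, the same threshold covers different real-world distances at different latitudes; choosing a projected CRS is one way to work in linear units instead.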
Pipeline Builder's geometry nearest neighbors join requires two datasets: a base dataset of geometries and a neighbors dataset of points. The k integer parameter configures the number of nearest neighbors to find for each base geometry. A coordinate reference system is required to determine how distances between base geometries and neighbor points are calculated and compared. The result will be the set of combined rows, each of which contains a GeoPoint that is one of the k closest points to the base geometry. Ties are broken arbitrarily, and results are returned in no particular order.
Note that this join has two requirements:
All GeoPoints in the neighbors dataset must be able to fit inside executor and driver memory. This is currently a hard requirement and limits the scalability of the join. Contact your Palantir representative if your use case requires distributing the neighbors dataset.
Foundry currently only accepts the GeoPoint logical type in the neighbors dataset to limit memory consumption. Contact your Palantir representative if non-point geometries are required on the neighbors side of the join.
In practice, Pipeline Builder supports modest values of k (< 5) with up to a few hundred thousand rows in the neighbors dataset and 1 million geometries in the base dataset. When both datasets have a few hundred thousand rows, Pipeline Builder can support much larger values of k. Finding up to several hundred nearest neighbors should finish quickly in such cases. Increasing the scale of the inputs beyond this point may succeed intermittently, but is not currently supported in general.
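The shape of the nearest neighbors join can be sketched with a brute-force Python example. It is not how Pipeline Builder implements the join: base geometries are reduced to single points, distance is planar in CRS units, and `nearest_neighbors_join` is a hypothetical name. What it does mirror is the requirement that the whole neighbors dataset is held in memory and scanned for each base row.

```python
import heapq
import math
from typing import List, Tuple

PointRow = Tuple[str, Tuple[float, float]]  # (row id, (x, y) in CRS units)


def nearest_neighbors_join(base: List[PointRow], neighbors: List[PointRow], k: int) -> List[Tuple[str, str]]:
    """For each base row, emit one (base id, neighbor id) pair per each of its k closest neighbors."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    joined = []
    for base_id, (bx, by) in base:
        # The full in-memory neighbors list is scanned for every base row.
        closest = heapq.nsmallest(
            k,
            neighbors,
            key=lambda row: math.hypot(row[1][0] - bx, row[1][1] - by),
        )
        joined.extend((base_id, neighbor_id) for neighbor_id, _ in closest)
    return joined


base = [("b1", (0.0, 0.0))]
neighbors = [("n1", (1.0, 0.0)), ("n2", (5.0, 0.0)), ("n3", (0.0, 2.0))]
result = nearest_neighbors_join(base, neighbors, k=2)
```

Each base row contributes at most k output rows, so the output size grows with k times the number of base rows, and the per-row scan cost grows with the size of the neighbors dataset.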
If your join is encountering stability issues, use the following steps to remediate:
Once you have finished transforming your data in Pipeline Builder, you can validate the results of these transforms visually on a map. In the regular preview pane, select the cells you would like to preview on a map (the cells must be from columns of one of the geospatial types mentioned above). Right-click and select Open Geo Preview.
A new preview tab will appear, displaying the selected cells plotted on a map.
Pipeline Builder’s geospatial capabilities are designed to integrate seamlessly with downstream data across the platform.