This page covers advanced details about how dataset projections work in practice to enable more optimized queries on datasets. To learn about projections at a higher level, see this page.
Internally, a projection is a copy of a dataset, optimized for some access pattern. Foundry stores projections on a dataset as child datasets. These are called "projection" datasets. The parent datasets are called "canonical" or "projected" datasets.
A Build keeps the projection up to date with the most recent data in the main dataset. If a projection is not up to date, it will still be used. However, it might not provide much benefit.
If the projected dataset receives a new SNAPSHOT
transaction, any downstream projections are entirely out of date and have no benefit until the projects rebuild. If a projected dataset receives an APPEND
transaction, downstream projections are only partially out of date relative to the new transaction. Foundry queries are rewritten to benefit from the projection if they can while still producing results that reflect the new data.
At a low level, a projection is either:
A projection dataset is stored as a Foundry dataset. This dataset is not visible as a resource but can be accessed via the link in the Projections
tab.
To keep them up to date, projections are built asynchronously through the normal Foundry build system. This lets users read projected datasets consistently and immediately after a build, but the projection datasets must be built periodically to keep them from becoming out of date.
To allow flexibility in allocating compute resources and controlling costs, Foundry will not automatically create these builds. To configure them, use the scheduler widget in the Projections
tab.
There is no universal rule for the appropriate build cadence. The primary determining factor is that queries need to be able to execute within their performance targets on the unprojected portion of the dataset. For example, if your pipeline writes 10 GB per hour, and you have determined that a filtered read should scan no more than 100 GB to meet your performance targets, you should make sure that the projections build at least every 10 hours.
Projections use an auto-scaling mechanism to find the right number of executors to build a projection. You do not need to manually adjust Spark profiles unless projection builds fail or take too long.
Foundry will attribute any costs associated to the projection (for example its storage and compute) to the project of the main dataset.
Compaction is the primary maintenance operation performed on projections. It refers to the process of taking large collections of small sorted files, and combining them into larger sorted files Compaction occurs automatically on projections as a part of the projection build process.
Compaction makes read performance independent of the number of input transactions on the main dataset. This allows projections to speed up reads of frequently incrementally written or streaming datasets. Projection builds might occasionally run for longer than average. This is usually due to a compaction.
If a projection is available to satisfy a query, it will always be chosen ahead of the main dataset, even if the main dataset is written in a way that would otherwise be more optimal to support a given query. This greatly simplifies the semantics around query planning.
For the following queries, here is the priority assigned to various projections during query planning:
x = 1 AND y = 2
, projections will be selected in
the following priority:
x
and y
x
and y
(and bucketed on any set of columns)x
x
(and bucketed on any set of columns)F
and joins on column J
, projections will be
preferred according to the following priority:
F
F
F
(and bucketed on any column other than column J
)J
(and locally sorted on anything other than column F
)These priorities reflect the view that filters are typically selective enough so that it is better to optimize for the filter versus the join, though this may not always be the case.