The Ontology Manager application has a dedicated pipeline graph that shows the status of various jobs in a Funnel pipeline. A green tick in the Object Storage V2 node in the graph indicates that the indexing is complete and the object type is ready to be queried from OSv2.
Data validations in OSv1 and OSv2 differ slightly.
OSv1 behavior is generally dictated by the underlying data store, given its tight coupling with the distributed document store and search engine. OSv2 applies stricter validations to ensure the quality of data going into the Ontology, and to provide more deterministic behavior and increased legibility across the system compared to OSv1.
Therefore, some indexing pipelines may encounter validation errors when using OSv2 that were previously accepted by OSv1. For a detailed list of such breaking changes, see the documentation on Ontology breaking changes between OSv1 and OSv2.
OSv2 manages all aspects of jobs, including job retries. If a job fails due to a transient error that might be resolved by rebuilding the job, OSv2 will automatically retry the job after approximately five minutes. If OSv2 detects that the job failed terminally (due to an invalid data format, for example), it will automatically retry only when new data is available. In cases where object types are backed by restricted view datasources, jobs are triggered when either the data or the policy changes.
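As a rough sketch of the retry policy described above (hypothetical function and type names, not a Funnel or OSv2 API), the decision can be modeled as:

```python
from enum import Enum

class FailureKind(Enum):
    TRANSIENT = "transient"  # e.g. a temporary infrastructure issue
    TERMINAL = "terminal"    # e.g. an invalid data format

RETRY_DELAY_SECONDS = 5 * 60  # "approximately five minutes" per the docs

def next_retry(failure: FailureKind, new_data_available: bool) -> str:
    """Describe when a failed job is retried under the policy above."""
    if failure is FailureKind.TRANSIENT:
        # Transient errors are retried automatically after ~5 minutes.
        return f"retry automatically in ~{RETRY_DELAY_SECONDS // 60} minutes"
    # Terminal failures are retried only once new data arrives.
    if new_data_available:
        return "retry now (new data available)"
    return "wait for new data"
```

For object types backed by restricted view datasources, a policy change would count as a trigger alongside new data.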
The index size is mainly limited by the storage space in the object databases into which a given object type is indexed. For example, in the OSv2 data store this would be the disk space of the search nodes.
If there is not enough disk space, indexing jobs will not succeed and will report the underlying problem in the pipeline graph in the Ontology Manager application. If you encounter disk space errors, contact your Palantir representative.
There are two phases when syncing object types backed by streaming datasources: internal stream creation and indexing. Streaming indexing jobs have indexing latency comparable to Funnel batch pipelines, using Spark to heavily parallelize the initial processing of historical streaming data; this indexing latency is comparable to that of user edits on live Ontology data. Internal stream creation is typically the limiting factor; it uses our streaming infrastructure to process the datasource on a per-record basis.
The most time-consuming part of Funnel streaming pipelines is Flink checkpointing, which allows for "exactly once" streaming consistency. The default checkpoint frequency is once every second, so that is the dominating latency between the data arriving in the input stream and being indexed into the Ontology. We perform continuous experiments to evaluate cost/performance/latency tradeoffs by reducing the checkpoint frequency or even removing it altogether.
Contact Palantir Support to configure the behavior where necessary.
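To illustrate why checkpointing dominates, here is a simplified latency model (an illustrative sketch with assumed numbers, not a measurement of the real pipeline): with exactly-once consistency, a record only becomes visible after the next checkpoint completes, so on average it waits about half the checkpoint interval.

```python
def expected_indexing_latency_s(checkpoint_interval_s: float,
                                processing_delay_s: float = 0.05) -> float:
    """Rough expected latency between a record arriving in the input
    stream and being indexed: on average half a checkpoint interval,
    plus an assumed fixed per-record processing delay."""
    return checkpoint_interval_s / 2 + processing_delay_s

# With the default one-second checkpoint interval, the expected wait is
# roughly half a second, dominated by checkpointing rather than processing.
```

Reducing the checkpoint frequency in this model trades lower checkpointing overhead for proportionally higher end-to-end latency.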
Indexing throughput is limited to 2 MB/s per object type into the Object Storage v2 object database. Contact Palantir Support if you need a higher indexing throughput.
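A quick back-of-the-envelope calculation (hypothetical helper, using the 2 MB/s cap stated above) shows how the per-object-type limit bounds backfill time:

```python
def backfill_time_hours(data_size_mb: float,
                        throughput_mb_s: float = 2.0) -> float:
    """Lower bound on the time to index a backlog of data at the
    per-object-type throughput cap (2 MB/s by default)."""
    return data_size_mb / throughput_mb_s / 3600

# For example, a 72,000 MB (72 GB) backlog at 2 MB/s takes at least 10 hours.
```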
No, Funnel streaming pipelines preserve the ordering of the input stream when indexing. Data should be written to the stream in order. This can be done in the upstream streaming pipeline by windowing the data by the event timestamp and specifying the primary keys such that the data is hash partitioned.
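The upstream ordering approach above can be sketched as follows (a minimal illustration with assumed record shapes and partition count, not the actual streaming API): hash-partitioning by primary key keeps all updates for a given key on one partition, and sorting each window by event timestamp writes records to the stream in order.

```python
import zlib
from operator import itemgetter

NUM_PARTITIONS = 4  # illustrative partition count

def partition_for(primary_key: str,
                  num_partitions: int = NUM_PARTITIONS) -> int:
    """Hash-partition by primary key so every update for a given key
    lands on the same partition and therefore stays in order."""
    return zlib.crc32(primary_key.encode("utf-8")) % num_partitions

def order_window(records):
    """Sort a window of records by event timestamp before writing them
    to the output stream (assumed shape: dicts with 'pk' and 'event_ts')."""
    return sorted(records, key=itemgetter("event_ts"))
```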
Funnel streaming pipelines support create, update, and delete workflows. You can find more documentation on how to set up deletion metadata in the change data capture documentation.
Currently, no. Resolving the entire object should be done in an upstream pipeline. If stateful streaming does not solve this problem for you due to scale issues, contact Palantir Support.
Yes, it is supported. Learn more about how to configure ontology types with streaming datasources in our documentation.
Ontology streaming is currently only supported by the Object Storage v2 object database. Contact Palantir Support if you need this functionality in other object databases.
No; because user edits are not supported for object types with stream datasources, a materialized dataset would be no different from the archive dataset in the stream. With the current architecture, a deduplicated view in a dataset cannot be provided.
Funnel streaming pipelines cannot be cancelled by users. Funnel always keeps streams alive because production object types require high availability. This setup may incur unwanted cost when prototyping. Contact Palantir Support if this becomes a significant deterrent for your use case. Alternatively, you can switch the object type to a batch one during the prototyping phase.
Funnel streaming pipelines use a heuristic to determine whether the pipeline is "up to date" with a replacement stream before switching over to the new one. You can change the datasource to point to a different stream or branch of your stream with the following steps:
Funnel streaming pipeline compute is calculated the same way as normal streaming resources. Review our streaming compute usage documentation for more information on streaming resource costs.
Retention windows were initially developed as a data size limiting mechanism during the beta release, and are therefore implemented only on a best-effort basis. This means object instances within the retention window will be queryable, while object instances outside of it will eventually be deleted. For example, if the retention window is set to two weeks, and an object instance of the stream datasource was last updated by the input stream three weeks ago, that object instance may be deleted from that object type. However, that object instance may also remain in the Ontology for many more days; removal is never guaranteed within any specific timeframe.
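The best-effort semantics can be summarized in a small sketch (hypothetical function, not a product API): falling outside the retention window makes an instance *eligible* for deletion, but eligibility does not guarantee removal by any deadline.

```python
from datetime import datetime, timedelta

RETENTION_WINDOW = timedelta(weeks=2)  # example retention setting

def eligible_for_deletion(last_updated: datetime, now: datetime,
                          window: timedelta = RETENTION_WINDOW) -> bool:
    """An object instance last updated outside the retention window MAY
    be deleted; deletion is best effort, so eligibility alone does not
    guarantee removal within any specific timeframe."""
    return now - last_updated > window
```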
The current mechanism for “cleaning up” old data from the Ontology is through pipeline replacement which, by default, runs every two weeks. On replacement, the Funnel streaming pipeline replays the stream from the beginning of the retention window, thereby removing older object instances from the Ontology. Contact Palantir Support if you have a need to delete old object instances more regularly.
If no retention window is set, then all data from the input stream source will be ingested into the Ontology.
In Ontology Manager, you must always explicitly specify your input data source as a stream; this applies for both object types with streaming datasources and restricted views backed by streams. Otherwise, your data source will fall back to indexing as a standard Funnel batch pipeline. Review our documentation on configuring streaming object types for more details.
Yes, querying the Ontology works the same way for streaming object types and batch object types.