In Pipeline Builder, it is often necessary to create unique identifiers (IDs) for records. Unique IDs facilitate tracking, processing, and analysis of the data by ensuring that each record can be individually identified and properly handled. This section explains why a monotonically increasing ID is not the best approach and why the preferred method for generating unique IDs is to concatenate string columns and then apply a SHA256 hash.
The best approach to generate unique IDs is to concatenate string columns from the input data and then create a SHA256 hash of the concatenated string.
To generate unique IDs using this method in Pipeline Builder, follow these steps within the Pipeline Builder transform path:
This method has several advantages:
By using the concatenation of string columns followed by a SHA256 hash, you can generate unique IDs that are scalable, secure, and consistent, making it an ideal choice for your data pipeline application.
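Outside Pipeline Builder, the same idea can be sketched in plain Python to show what the resulting IDs look like. This is only an illustration under stated assumptions: the function name `make_record_id`, the `|` separator, the `<null>` placeholder, and the sample column names are hypothetical and do not come from Pipeline Builder itself.

```python
import hashlib
from typing import Dict, List, Optional

# Assumed separator: keeps ("ab", "c") and ("a", "bc") from concatenating
# to the same string. Any value that cannot appear in the data would do.
SEPARATOR = "|"


def make_record_id(row: Dict[str, Optional[str]], key_columns: List[str]) -> str:
    """Concatenate the key columns as strings and return the SHA-256 hex digest."""
    # Assumed placeholder so that a missing value and an empty string
    # produce different inputs to the hash.
    parts = ["<null>" if row[col] is None else str(row[col]) for col in key_columns]
    concatenated = SEPARATOR.join(parts)
    return hashlib.sha256(concatenated.encode("utf-8")).hexdigest()


# Hypothetical sample records; the three columns together identify a row.
rows = [
    {"customer": "acme", "order_date": "2024-01-05", "sku": "A-100"},
    {"customer": "acme", "order_date": "2024-01-05", "sku": "A-101"},
]
key_columns = ["customer", "order_date", "sku"]
ids = [make_record_id(row, key_columns) for row in rows]
```

Because the ID depends only on the values of the chosen columns, the same record yields the same 64-character ID on every run, and records that differ in any key column yield different IDs.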
Although monotonically increasing IDs are not supported in Pipeline Builder, they are often used by data engineers who are familiar with Spark. Monotonically increasing IDs are generated sequentially, such as 1, 2, 3, and so on. This approach is inherently simple, but it has several disadvantages:
These disadvantages indicate that using monotonically increasing IDs is not the best approach for generating unique identifiers in a data pipeline application. Instead, as detailed in the previous section, we recommend using the concatenation of string columns followed by a SHA256 hash.
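One way to see the consistency concern is a small Python comparison. This is a simplified sketch, not a reproduction of how Spark assigns IDs: the helper names `sequential_ids` and `hashed_ids` and the sample data are hypothetical, and position-based numbering stands in for a monotonically increasing ID.

```python
import hashlib


def sequential_ids(rows):
    """Assign 1, 2, 3, ... by position, mimicking a monotonically increasing ID."""
    return {row: index for index, row in enumerate(rows, start=1)}


def hashed_ids(rows):
    """Assign each row the SHA-256 hex digest of its own content."""
    return {row: hashlib.sha256(row.encode("utf-8")).hexdigest() for row in rows}


# The same three records arriving in a different order on a second run.
first_build = ["alice", "bob", "carol"]
second_build = ["bob", "carol", "alice"]

# True: the position-based ID of each record depends on arrival order.
seq_changed = sequential_ids(first_build) != sequential_ids(second_build)

# False: the content-based ID of each record is unaffected by order.
hash_changed = hashed_ids(first_build) != hashed_ids(second_build)
```

In this sketch the sequential IDs differ between the two runs even though the data is unchanged, while the hashed IDs stay the same.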
If you do not have a set of columns that define a unique row in your data, you can use the hash of a random number to create the ID.
Be aware that IDs created this way will not be consistent across builds or previews. This method should be an absolute last resort, used only if a unique set of columns cannot be identified.
To create an ID in this way, follow the steps below within the Pipeline Builder transform path:
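For comparison, the random-number fallback can be sketched in plain Python outside Pipeline Builder. The function name `random_record_id` and the 128-bit size of the random number are assumptions made for this illustration only.

```python
import hashlib
import secrets


def random_record_id() -> str:
    """Return the SHA-256 hex digest of a freshly drawn random number."""
    random_number = secrets.randbits(128)  # 128 bits is an assumed size
    return hashlib.sha256(str(random_number).encode("utf-8")).hexdigest()


# Two "runs" over the same three records produce unrelated IDs,
# because nothing about the record itself feeds into the hash.
first_run = [random_record_id() for _ in range(3)]
second_run = [random_record_id() for _ in range(3)]
```

Each call draws a new random number, so the IDs are unique in practice but cannot be reproduced on a later run, which is why this approach is inconsistent across builds and previews.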