This guide is intended for pipeline developers who build transformations. The General recommendations may also be of interest to project managers or platform administrators, as they focus on higher-order principles of clean pipelines.
Pipeline development is software development
Many of the best practices from general software development apply equally to defining the transformations that make up your data pipeline. Below are a few common practices that exemplify this approach:
Use snake_case names: following this convention means anyone else who wants to reference your awesome dataset knows that it's awesome_dataset, not AwesomeDataset or Awesome_Dataset.
Overwrite datasets rather than deleting them: if you need to replace the contents of a dataset, use a SNAPSHOT transaction to overwrite the previous one. More challenges are likely to arise from trying to delete a dataset than from creating a new dataset in the same location.
Watch for circular dependencies: Foundry checks for circular dependencies on the branch you are developing on, but it does not run the check across all branches while you write code; for example, a cycle involving ontology writeback datasets that exist only on the master branch will not be flagged as you work. Foundry will still fail checks if circular dependencies are detected on other branches when you attempt to merge your feature branch into the branch that contains the circular dependency.
See Recommended Project structure for a description of an overarching model to organize the entire flow of data through multiple Projects.
Use a consistent folder structure: a typical Project will contain a /Documentation folder and an /Output folder. Different use-case or workflow Projects will need more specific names, but always consider that your Project structure and naming scheme are signposts for visitors.
Use a /scratch folder in your Project for experiments or throw-away work rather than building in your home folder.
Use descriptive dataset names: numbered names (dataset1, dataset2, and so on) or names that are a single letter will make it more difficult to read and refactor your code and much more difficult for a new developer to approach your work.
Be deliberate about stage-based naming: it can be tempting to name datasets /raw/my_important_dataset and then /clean/my_important_dataset; however, in many cases this naming pattern can create confusion in views where only the dataset name itself is prominently displayed. Remember, provenance is tracked and easily visible, so you don't need to embed this kind of "state" into the names of your datasets.
Explicitly cast column types: If you are working in a Datasource Project, explicitly cast the column types in the raw → clean transform, even if the schema inference from the data connection has chosen correct values. This will help catch breaking changes from the source system if a column type changes or an invalid value creates an incorrect inference during the sync.
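As a sketch of this practice (the dataset paths and column names here are hypothetical), an explicit cast in a raw → clean Python transform might look like the following:

```python
from pyspark.sql import functions as F
from transforms.api import transform_df, Input, Output


@transform_df(
    Output("/Project/clean/flights"),   # hypothetical output path
    raw=Input("/Project/raw/flights"),  # hypothetical input path
)
def clean_flights(raw):
    # Cast every column explicitly instead of trusting schema inference, so a
    # type change or bad value in the source system fails loudly at this step.
    return raw.select(
        F.col("flight_id").cast("string").alias("flight_id"),
        F.col("departure_time").cast("timestamp").alias("departure_time"),
        F.col("passenger_count").cast("integer").alias("passenger_count"),
        F.col("distance_km").cast("double").alias("distance_km"),
    )
```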
Use Timestamps for 'Time Only' data types: Spark doesn't have a time-only data type for fields with values like "10:59:00". To leverage the time functions that come with Spark's timestamp type, cast the values to seconds since midnight and add -2208988800 before casting to a timestamp; this offset anchors the values to the placeholder date 1900-01-01, so only the time component carries meaning. Alternatively, leave the values as strings and let users parse them as they need to.
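For illustration, a minimal PySpark sketch of this conversion (the column name time_str is a hypothetical assumption):

```python
from pyspark.sql import functions as F

# Offset in seconds from the Unix epoch (1970-01-01) back to 1900-01-01 00:00:00 UTC.
EPOCH_1900_OFFSET = -2208988800


def time_only_to_timestamp(df, col="time_str"):
    """Convert an 'HH:mm:ss' string into a timestamp on the placeholder date 1900-01-01."""
    parts = F.split(F.col(col), ":")
    seconds_since_midnight = (
        parts[0].cast("long") * 3600 + parts[1].cast("long") * 60 + parts[2].cast("long")
    )
    return df.withColumn(
        col + "_ts",
        (seconds_since_midnight + EPOCH_1900_OFFSET).cast("timestamp"),
    )
```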
Casting all numeric fields to numeric datatypes: Consider a column of aircraft IDs, like 545, 972, 314. It can be tempting to cast these to an integer column (after all, they look like integers and may even be integers in the source system). However, this has significant drawbacks:
the IDs are labels, not quantities, and downstream tools will apply numeric formatting to them (for example, rendering 545.0, adding thousands separators as in 1,234, or right justifying the values), which is misleading for identifiers. Keep identifier-like columns as strings.
Storing timestamps in different timezones: Spark timestamps are timezone agnostic. They are stored internally in UTC (also known as Zulu or GMT); displaying a timezone is expected to be done by the front end.
If you need to present a timestamp as a string in a specific timezone, use the from_unixtime() function ↗ to store a string in the appropriate timezone; if incoming values were recorded in local timezones, use the to_utc_timestamp() function ↗ to normalize them to UTC.
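As an illustrative sketch (assuming a DataFrame df with columns event_time_local and epoch_seconds, and America/New_York as the recorded timezone), normalization to UTC might look like this:

```python
from pyspark.sql import functions as F

# Normalize timestamps that were recorded in a known local timezone
# (assumed here to be America/New_York) to UTC for storage.
df = df.withColumn(
    "event_ts_utc",
    F.to_utc_timestamp(
        F.to_timestamp("event_time_local", "yyyy-MM-dd HH:mm:ss"),
        "America/New_York",
    ),
)

# If a rendered string is needed for display, format it at read time instead,
# for example from epoch seconds with from_unixtime().
df = df.withColumn("event_time_display", F.from_unixtime("epoch_seconds"))
```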
Commenting out code in commits: When making changes, it can be tempting to comment out old code and leave it for reference or in case you need to revert later. You can use comments while iterating, but do not commit code with statements commented out. Doing so builds cruft and reduces legibility; old code is easy to find in previous commits.
Comments with authorship details and dates: Authorship and dates are tracked automatically in the repository's commit history. Writing them into comments by hand means they quickly go stale and add cruft.
Over-verbose Commenting: Comments should share the rationale behind decisions rather than explain the logic itself. Strive to write “self-documenting” code; if a set of statements is difficult to understand, that is a clear sign to refactor and simplify.
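A small hypothetical illustration of the difference, assuming a PySpark DataFrame named flights:

```python
from pyspark.sql import functions as F

# Avoid: restates the code and duplicates authorship metadata that git already tracks.
# 2023-01-15, J. Doe: filter out rows where status equals "CANCELLED"
flights = flights.filter(F.col("status") != "CANCELLED")

# Better: records the rationale that a reader cannot recover from the code alone.
# Cancelled flights are re-sent by the source system under a new ID, so keeping
# them here would double-count flights downstream.
flights = flights.filter(F.col("status") != "CANCELLED")
```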
Protect the master branch: If you're developing with a team, or even just working on a long-lived individual project, protect the master branch and practice GitFlow ↗ or your preferred development workflow. The key concepts are simply to ensure that code moving to master is reviewed and tested.
Write commit messages: Commit messages are the log of all activity in the repository. Take the time to write a useful description of your change.
Prune your branches: In long-lived repositories, branches can accumulate. If development of a branch is abandoned, and especially once a branch has been merged into another, keep things tidy by deleting it. This makes it easier to see which branches are actively developed.
Upgrade your Repository: When prompted, follow the steps to upgrade the language bundles in your repository. This process opens a pull request against the active branch containing the upgrades. Feel free to run a build of your pipeline on the upgrade branch to confirm that none of the version bumps affect your code. Staying up to date with these upgrades helps you avoid edge cases that have already been encountered elsewhere and patched in the newer versions.
Practice Code Reviews: As you collaborate with teammates to develop transformations, adopt a code review practice as part of the pull request process. We've shared our thoughts on Code Review Best Practices ↗, and many of the concepts apply equally to reviewing data transformation code.
Share code between repositories: Repositories in Foundry operate at the Project level for a variety of reasons, but often there is logic that could be reused across pipelines. Sharing that logic has several advantages:
General code reuse, in accordance with the DRY (don't repeat yourself) principle.
Avoiding forked/inconsistent logic across different areas of the data foundation.
There may be pipelines that would ideally build on foundational pipelines but have much stricter SLAs or performance requirements. In this case, the solution is often to share the logic but not the transforms/datasets, so that the critical pipeline can rely on pre-filtered datasets, use fewer transforms, run on different build schedules, and so on.
Code Repositories are an excellent way to accomplish this. For example, when working with Python transforms, a repository can publish its libraries to a Conda channel and allow other repositories to consume them. See the documentation on Sharing Python Libraries.
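As a sketch (the library name my_shared_lib, the function clean_column_names, and the dataset paths are hypothetical), a repository that has added such a shared library as a dependency can simply import and reuse its functions:

```python
from transforms.api import transform_df, Input, Output

# Hypothetical library published to a Conda channel by a shared repository.
from my_shared_lib.cleaning import clean_column_names


@transform_df(
    Output("/Project/clean/orders"),   # hypothetical output path
    raw=Input("/Project/raw/orders"),  # hypothetical input path
)
def clean_orders(raw):
    # Reuse the shared normalization logic rather than re-implementing it here.
    return clean_column_names(raw)
```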
An additional advantage of shared repositories is semantic versioning. The shared repository can tag its commits with versions (for example, 1.0.0, 1.0.1, 2.0.0), and consuming repositories can choose how they pick up new versions. For example, a repository might choose to take the latest version (2.0.0 above) or to pin a specific version (say, 1.0.1) and defer picking up new versions until its owners decide to. The latter approach is particularly valuable when the pipeline is critical and the pipeline owners would like the chance to opt into and approve changes to the shared repository.
Releases: Along the same lines, if a team wants an explicit release schedule for its pipelines, one option (which avoids staging instances or long-lived develop branches) is to factor the logic out into functions in a shared repository and use semantic versioning to keep the consuming repository pinned to major releases such as 1.0.0 or 2.0.0. That way, the developers can continue to iterate on the logic and tag intermediate releases without those releases going live on master. Moreover, on a branch of the consuming repository, the developers can always pick up intermediate versions, as long as they do not merge them to master before the release date.
Unit testing is a popular way of improving and maintaining code quality. In unit testing ↗, small and discrete components ("units") of software are tested in an individual, independent, and automated fashion.
Python unit tests: You can enable pytest unit tests as part of the CI checks for your Python transforms repository by following the Python unit test instructions.
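For example, here is a minimal, hypothetical pytest sketch; in a real repository the function under test would typically live in a module shared with the transform code:

```python
import pytest
from pyspark.sql import DataFrame, SparkSession, functions as F


def drop_invalid_rows(df: DataFrame) -> DataFrame:
    # Logic under test: discard rows that are missing a primary key.
    return df.filter(F.col("id").isNotNull())


@pytest.fixture(scope="module")
def spark():
    # Local SparkSession for tests; no cluster is required in CI.
    return SparkSession.builder.master("local[1]").appName("unit-tests").getOrCreate()


def test_drop_invalid_rows(spark):
    df = spark.createDataFrame([(1, "ok"), (None, "missing id")], ["id", "value"])
    assert drop_invalid_rows(df).count() == 1
```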
Java unit tests: The steps for configuring unit tests for Java transforms can be found in the Java unit tests documentation.
Unit tests should:
Check your data health: Often, once a portion of a pipeline is completed, it is easy to set the schedule and put it out of mind. However, even if your logic is sound, the incoming data can change in ways that affect your build, leading to slower performance, increased data scale, or outright build failure. Configuring basic checks on dataset size and build time, even if you do not configure alerts, will provide a view over time of these key metrics so you can observe, for instance, the rate of increase in dataset size or the average build time for the dataset. Read more about the specific health checks available and how to configure them.
Extend health checks: In most cases, the default health check configurations should be sufficient. If you need further flexibility, however, consider adding one or more derived health check datasets to your pipeline. The transform for this dataset can perform arbitrary logic to determine the validity of its input dataset (the dataset you are validating), and then output data formatted so that a simple health check, like Allowed Value, can report if the dataset is valid.
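A minimal sketch of such a derived check dataset (the paths, column names, and validity rule are hypothetical):

```python
from pyspark.sql import functions as F
from transforms.api import transform_df, Input, Output


@transform_df(
    Output("/Project/health/orders_checks"),  # hypothetical output path
    orders=Input("/Project/clean/orders"),    # dataset being validated (hypothetical)
)
def orders_checks(orders):
    # Arbitrary validation logic reduced to a single summary row; a simple
    # Allowed Value health check on the is_valid column can then alert on it.
    summary = orders.agg(
        F.count("*").alias("row_count"),
        F.sum(F.when(F.col("order_id").isNull(), 1).otherwise(0)).alias("null_ids"),
    )
    return summary.withColumn(
        "is_valid",
        (F.col("row_count") > 0) & (F.col("null_ids") == 0),
    )
```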
Set a schedule for this dataset to build whenever there is an update on the input dataset and you will have an extra set of comprehensive health checks.
Review the scheduling best practices.