7 - Building your dataset

This content is also available at learn.palantir.com ↗ and is presented here for accessibility purposes.

📖 Task Introduction

Once you’ve developed, previewed, and committed your code, and CI checks pass without error, you can initiate a dataset build that generates the output dataset on your branch. Because Foundry datasets, like repositories, exist on branches, you can keep different versions of the data, each reflecting the code on its branch. As mentioned in the product documentation, you should use uniform branch names across all stages of your pipeline so that each downstream branch reads from the correct upstream branch.

Take a moment to review this documentation page on repositories, datasets, and fallback branches to learn how branch names impact upstream and downstream data transformations.
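To make the effect of branch names concrete, here is a minimal Python sketch of how an input branch might be chosen during a build. It is an illustration only, not Foundry's implementation: the function `resolve_input_branch`, the branch name `feature/clean-up`, and the single `master` fallback are all assumptions made for the example.

```python
from typing import Optional, Sequence, Set


def resolve_input_branch(
    build_branch: str,
    branches_with_data: Set[str],
    fallbacks: Sequence[str] = ("master",),  # assumed fallback chain
) -> Optional[str]:
    """Return the first candidate branch on which the input dataset exists."""
    # Try the branch being built first, then each fallback in order.
    for candidate in (build_branch, *fallbacks):
        if candidate in branches_with_data:
            return candidate
    return None  # no candidate branch holds the input dataset


# The upstream dataset exists on a branch with the same name, so that branch is read.
print(resolve_input_branch("feature/clean-up", {"master", "feature/clean-up"}))
# prints: feature/clean-up

# The upstream dataset exists only on master, so the build falls back to it.
print(resolve_input_branch("feature/clean-up", {"master"}))
# prints: master
```

In this sketch, a downstream branch only reads its matching upstream data when the names are identical, which is why uniform branch names matter.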

🔨 Task Instructions

  1. From the code editor of your code repository, select the Build button in the top right of the screen.

    Depending on the state of your repository, kicking off a build may trigger a new CI check. If it does, wait a few moments for it to complete. If you want to monitor the check process, you can click the View details link in the Build helper window at the bottom of your screen and then return to the Code tab when desired.

  2. Once the dataset build begins, you can view the progress bar in the Build helper window. For additional information on the build process, including Spark behavior, Ctrl+click the View build button in the helper window to monitor the build in the Job Tracker application. We’ll return to the Job Tracker in future tutorials.

  3. When the build has successfully finished, you can access the output dataset in a number of ways. Ctrl+click the output path on line 6 of your code to open the dataset in a new tab. If this isn’t working for you, refresh your browser; all transform outputs that have been successfully built are hyperlinked to their location in Foundry from the code editor.

    The dataset will open on your branch in the dataset application.

  4. Return to your code repository.

  5. Your repository uses a resource called the Shrinkwrap to map input/output paths to actual dataset resource IDs (RIDs). This enables you to move your input/output files around in Foundry without confusing your transform (which currently only lists the file paths).

    Between lines 4 and 5 in your code editor, click the hyperlinked text: “Replace paths with RIDs.” This will ensure your input and output values unequivocally indicate the precise dataset they’re referring to (rather than to a potentially stale file path).

  6. This change to your code must be committed to the branch. Click the Commit button and enter a short, meaningful commit message ↗, such as "refactor: replace paths with RIDs".
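The path-to-RID mapping described in step 5 can be pictured with a small Python sketch. This is a toy model under stated assumptions, not the real Shrinkwrap or any Foundry API: the `Shrinkwrap` class, the `replace_paths_with_rids` helper, the path `/Tutorial/datasets/flights_raw`, and the RID `ri.example.dataset.0001` are all invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Shrinkwrap:
    """Toy stand-in for the path-to-RID mapping; not the real Foundry resource."""
    path_to_rid: Dict[str, str] = field(default_factory=dict)

    def resolve(self, reference: str) -> str:
        # A reference that is already an RID needs no lookup.
        if reference.startswith("ri."):
            return reference
        # A path is only as good as the mapping entry recorded for it.
        return self.path_to_rid[reference]


def replace_paths_with_rids(source: str, shrinkwrap: Shrinkwrap) -> str:
    """Swap each quoted path in the source text for its mapped RID."""
    for path, rid in shrinkwrap.path_to_rid.items():
        source = source.replace(f'"{path}"', f'"{rid}"')
    return source


# Hypothetical path and RID, invented for illustration.
mapping = Shrinkwrap({"/Tutorial/datasets/flights_raw": "ri.example.dataset.0001"})
before = 'Input("/Tutorial/datasets/flights_raw")'
after = replace_paths_with_rids(before, mapping)
print(after)  # Input("ri.example.dataset.0001")
```

Once the code holds the RID itself, resolving the reference no longer depends on where the dataset currently lives, which is the point of the "Replace paths with RIDs" action.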