Create time series data with Pipeline Builder

The pipeline you will create with this guide will generate time series data that backs a time series sync, which you will then associate with time series properties on the Carrier, Route, and Airport object types. This pipeline involves a more complex setup than a standard mapping from a time series dataset to time series properties, as we will perform calculations on non-time series data to generate time series data. Review our Pipeline Builder documentation for general pipeline guidance.

The flight dataset we are working with includes the following columns that we will use to create time series properties:

  • flight_id: string | A unique string to identify the flight and every row in the dataset.
  • date: date | The date the flight took place.
  • destination_airport_id: string | The string to identify the destination airport.
  • airline_id: string | The ID of the airline carrier.
  • origin_airport_id: string | The ID of the origin airport.
  • dep_delay: integer | The number of minutes the departure was delayed.
  • arr_delay: integer | The number of minutes the arrival was delayed.
  • route_id: string | A unique string to identify the route.

The pipeline at the end of this guide will look like this:

Overview of final time series pipeline

Part I: Generate time series data

Using the same flights dataset that is used to back the Flights object type, we can perform aggregation transformations to generate time series data based on flight metrics. Note that this step is not necessary if you have time series data coming into Foundry from a historian or edge sensor; in that case, you can move on to creating a time series sync in Part II.

1. Apply transforms to Carrier and Route object types

From the flights dataset, apply transforms using the steps below. You will do this for both the Carrier and Route object types.

Aggregate the data

Use the aggregate transform to group by date and the ID of the object (in this case, route_id for the Route object type; you will separately need to repeat this with airline_id for the Carrier object type), then calculate average arrival delays, average departure delays, and daily flight counts.

The aggregate transform board in Pipeline Builder.

After aggregating, the dataset should preview with the following schema:

| route_id | date | daily_avg_dep_delay | daily_avg_arr_delay | daily_count_of_flights |
|---|---|---|---|---|
| ATL -> SFO | 2023-06-12 | 33.4545454545450000 | 40.0000000000000000 | 11 |
| ATL -> FLL | 2023-08-24 | 29.7272727272720000 | 19.4090909090909100 | 22 |
| ATL -> TVC | 2023-07-05 | -8.0000000000000000 | -8.0000000000000000 | 1 |
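
Pipeline Builder boards are configured visually, but if it helps to see the logic in code, a rough PySpark sketch of this aggregation might look like the following (the `flights` DataFrame name is illustrative):

```python
from pyspark.sql import functions as F

# Group the flights dataset by route and day, then compute the three metrics.
# Assumes a `flights` DataFrame with the columns described at the start of this guide.
route_daily = flights.groupBy("route_id", "date").agg(
    F.avg("dep_delay").alias("daily_avg_dep_delay"),
    F.avg("arr_delay").alias("daily_avg_arr_delay"),
    F.count("flight_id").alias("daily_count_of_flights"),
)
```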

Cast to a new data type

To use this new data as a time series, we must create a timestamp type column. To do this, use the cast transform to cast the date column to a timestamp type column. We will also soon apply an unpivot transform to merge daily_avg_dep_delay, daily_avg_arr_delay, and daily_count_of_flights values into one column. Since this function requires that all values be of the same data type, we must also cast our daily count of flights metric to a double type (the same data type as the average delay metrics).

The cast board in Pipeline Builder, showing a cast to timestamp and a cast to double
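
Continuing the illustrative sketch above, the two casts amount to something like this in PySpark:

```python
from pyspark.sql import functions as F

# Cast the date column to timestamp, and cast the flight count to double so
# that all three metrics share a data type ahead of the unpivot step.
route_daily = (
    route_daily
    .withColumn("date", F.col("date").cast("timestamp"))
    .withColumn("daily_count_of_flights",
                F.col("daily_count_of_flights").cast("double"))
)
```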

Unpivot to merge time series values

Since this dataset contains time series data in different columns, we must use an unpivot transform to merge it into one value column so the data can match the required schema for a time series sync, as shown below:

  • series ID: string | The identifier for the set of timestamp and value pairs referenced by a time series property (TSP); this must match the TSP's series ID.
  • timestamp: timestamp or long | The time at which the quantity is measured.
  • value: integer, float, double, string | The value of the quantity at the point that it is measured. A string type indicates a categorical time series; each categorical time series can have, at most, 127 unique variants.

The unpivot transform shown below places values for daily_avg_dep_delay, daily_avg_arr_delay, and daily_count_of_flights into the same series_value column. The original column names are output to the new series_name column, which will be used to build the series_id.

The unpivot transform board in Pipeline Builder

The dataset schema should now appear as follows:

| series_name | series_value | route_id | date |
|---|---|---|---|
| daily_avg_dep_delay | 33.4545454545450000 | ATL -> SFO | 2023-06-12T00:00:00.000Z |
| daily_avg_arr_delay | 40.0000000000000000 | ATL -> SFO | 2023-06-12T00:00:00.000Z |
| daily_count_of_flights | 11.0000000000000000 | ATL -> SFO | 2023-06-12T00:00:00.000Z |
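
For reference, Spark expresses an unpivot with the `stack` generator; this sketch continues the illustrative `route_daily` DataFrame from the earlier snippets:

```python
from pyspark.sql import functions as F

# stack(3, name1, value1, ...) emits one (series_name, series_value) row
# per metric column, keeping route_id and date on every row.
unpivoted = route_daily.select(
    "route_id",
    "date",
    F.expr(
        "stack(3, "
        "'daily_avg_dep_delay', daily_avg_dep_delay, "
        "'daily_avg_arr_delay', daily_avg_arr_delay, "
        "'daily_count_of_flights', daily_count_of_flights) "
        "AS (series_name, series_value)"
    ),
)
```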

Concatenate string values to create the series ID

Now, we can use the concatenate strings transform to create the series ID (the identifier for the associated time series values). Use the transform to combine the series_name (the metric each series measures) with the primary key of each object.

The concatenate strings transform board in Pipeline Builder.

| series_id | series_name | series_value | route_id | date |
|---|---|---|---|---|
| ATL -> SFO_daily_avg_dep_delay | daily_avg_dep_delay | 33.4545454545450000 | ATL -> SFO | 2023-06-12T00:00:00.000Z |
| ATL -> SFO_daily_avg_arr_delay | daily_avg_arr_delay | 40.0000000000000000 | ATL -> SFO | 2023-06-12T00:00:00.000Z |
| ATL -> SFO_daily_count_of_flights | daily_count_of_flights | 11.0000000000000000 | ATL -> SFO | 2023-06-12T00:00:00.000Z |

Select necessary columns

Using the select columns transform, we will only keep the columns that are required for the time series sync: series_id, series_value, and date. The flights backing dataset will hold time series values for all series, regardless of what they are measuring. Remember to repeat the steps in this section for the Carrier object type, using the airline_id column from the flights dataset.

The select columns transform board in Pipeline Builder

| series_id | series_value | date |
|---|---|---|
| ATL -> SFO_daily_avg_dep_delay | 33.4545454545450000 | 2023-06-12T00:00:00.000Z |
| ATL -> SFO_daily_avg_arr_delay | 40.0000000000000000 | 2023-06-12T00:00:00.000Z |
| ATL -> SFO_daily_count_of_flights | 11.0000000000000000 | 2023-06-12T00:00:00.000Z |
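
In PySpark terms, the concatenation and column selection above amount to the following (continuing the illustrative `unpivoted` DataFrame):

```python
from pyspark.sql import functions as F

# Build series_id as "<route_id>_<series_name>", for example
# "ATL -> SFO_daily_avg_dep_delay", then keep only the sync columns.
route_series = (
    unpivoted
    .withColumn("series_id",
                F.concat_ws("_", F.col("route_id"), F.col("series_name")))
    .select("series_id", "series_value", "date")
)
```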

2. Add a transform to aggregate and generate data for origin and destination airports

Now, you must repeat the aggregate and cast transform steps for both origin airports and destination airports.

Aggregate for the number of flights per day per airport

Use the aggregate transform to group by date and origin_airport_id, then calculate the average departure delay; the number of rows in each group is equivalent to the number of departing flights per day per airport. Repeat the aggregation grouped by date and destination_airport_id to calculate the average arrival delay and the daily count of arriving flights.

The aggregate transform board in Pipeline Builder, used to aggregate the number of flights per day

| date | origin_airport_id | daily_avg_dep_delay | daily_count_of_departing_flights |
|---|---|---|---|
| 2023-07-02 | 10299 | 9.3437500000000000 | 33 |
| 2023-09-06 | 10431 | -2.3333333333333333 | 6 |
| 2023-01-12 | 10620 | -7.0000000000000000 | 2 |
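
A rough PySpark sketch of the two airport aggregations (origin and destination) might look like this, again using the illustrative `flights` DataFrame:

```python
from pyspark.sql import functions as F

# Departure metrics, grouped by day and origin airport.
origin_daily = flights.groupBy("date", "origin_airport_id").agg(
    F.avg("dep_delay").alias("daily_avg_dep_delay"),
    F.count("flight_id").alias("daily_count_of_departing_flights"),
)

# Arrival metrics, grouped by day and destination airport.
destination_daily = flights.groupBy("date", "destination_airport_id").agg(
    F.avg("arr_delay").alias("daily_avg_arr_delay"),
    F.count("flight_id").alias("daily_count_of_arriving_flights"),
)
```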

Cast to timestamp

To use this new data as a time series, we must create a timestamp column. To do this, use the cast transform to cast the date column to a timestamp type column.

The cast transform board in Pipeline Builder, used to cast the data to a timestamp type

3. Create a join to combine destination and origin airports

Using the join board, create a left join that combines data from the destination airport and origin airport, resulting in complete time series properties for airport data. Be sure the following configurations are set for your join:

  • Match on date, and match origin_airport_id to destination_airport_id.
  • Auto-select columns from the left dataset.
  • As the right columns, select the two that represent the daily average arrival delay and the daily count of arriving flights.

The join board in Pipeline Builder, configured to create a left join for airport data.

| date | origin_airport_id | daily_avg_dep_delay | daily_count_of_departing_flights | daily_avg_arr_delay | daily_count_of_arriving_flights |
|---|---|---|---|---|---|
| 2023-07-02 | 10299 | 9.3437500000000000 | 33 | 18.5294117647058840 | 34 |
| 2023-09-06 | 10431 | -2.3333333333333333 | 6 | -8.0000000000000000 | 6 |
| 2023-01-12 | 10620 | -7.0000000000000000 | 2 | 56.5000000000000000 | 2 |
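
The join board's configuration corresponds roughly to this PySpark left join (using the illustrative DataFrames from the previous sketch):

```python
# Left join: keep every origin-airport row and attach that airport's
# arrival metrics for the same day, where they exist.
airport_daily = origin_daily.join(
    destination_daily,
    on=(origin_daily["date"] == destination_daily["date"])
       & (origin_daily["origin_airport_id"]
          == destination_daily["destination_airport_id"]),
    how="left",
).select(
    origin_daily["date"],
    origin_daily["origin_airport_id"],
    "daily_avg_dep_delay",
    "daily_count_of_departing_flights",
    "daily_avg_arr_delay",
    "daily_count_of_arriving_flights",
)
```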

4. Apply transforms to format data for a time series sync

Rename column

Now that we have joined the origin airport data with the destination airport data, we have both arrival and departure metrics for all airports. We no longer need to differentiate origin from destination, so we can use the rename columns transform to change origin_airport_id to simply airport_id.

The rename column transform board in Pipeline Builder

The data should preview as follows with the renamed column:

| date | airport_id | daily_avg_dep_delay | daily_count_of_departing_flights | daily_avg_arr_delay | daily_count_of_arriving_flights |
|---|---|---|---|---|---|
| 2023-07-02 | 10299 | 9.3437500000000000 | 33 | 18.5294117647058840 | 34 |
| 2023-09-06 | 10431 | -2.3333333333333333 | 6 | -8.0000000000000000 | 6 |
| 2023-01-12 | 10620 | -7.0000000000000000 | 2 | 56.5000000000000000 | 2 |

Cast to double

We will soon apply an unpivot transform. This function requires that all values be of the same data type, so we must use the cast transform board again to cast our daily count of flights metrics to a double data type so they are the same type as the average delay metrics.

Add flight numbers

To calculate the full daily flight count, we will use the add numbers transform to add together the daily count of arriving flights and the daily count of departing flights, as shown below.

The cast and add numbers transform boards in Pipeline Builder

| daily_count_of_flights | date | airport_id | daily_avg_dep_delay | daily_count_of_departing_flights | daily_avg_arr_delay | daily_count_of_arriving_flights |
|---|---|---|---|---|---|---|
| 67 | 2023-07-02 | 10299 | 9.3437500000000000 | 33 | 18.5294117647058840 | 34 |
| 12 | 2023-09-06 | 10431 | -2.3333333333333333 | 6 | -8.0000000000000000 | 6 |
| 4 | 2023-01-12 | 10620 | -7.0000000000000000 | 2 | 56.5000000000000000 | 2 |
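
Taken together, the rename, cast, and add steps above might read as follows in PySpark (illustrative only):

```python
from pyspark.sql import functions as F

airport_daily = (
    airport_daily
    # Origin vs. destination no longer matters after the join.
    .withColumnRenamed("origin_airport_id", "airport_id")
    # Match the double type of the average delay metrics before unpivoting.
    .withColumn("daily_count_of_departing_flights",
                F.col("daily_count_of_departing_flights").cast("double"))
    .withColumn("daily_count_of_arriving_flights",
                F.col("daily_count_of_arriving_flights").cast("double"))
    # Total daily flights = departures + arrivals.
    .withColumn("daily_count_of_flights",
                F.col("daily_count_of_departing_flights")
                + F.col("daily_count_of_arriving_flights"))
)
```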

Unpivot to merge series values

Since this dataset contains time series data in different columns, we must use an unpivot transform to merge it into one value column so the data can match the required schema for a time series sync, as shown below:

  • series ID: string | The identifier for the set of timestamp and value pairs referenced by a time series property (TSP); this must match the TSP's series ID.
  • timestamp: timestamp or long | The time at which the quantity is measured.
  • value: integer, float, double, string | The value of the quantity at the point that it is measured. A string type indicates a categorical time series; each categorical time series can have, at most, 127 unique variants.

The unpivot transform shown below places values for daily_avg_dep_delay, daily_avg_arr_delay, and daily_count_of_flights into the same series_value column. The original column names are output to the new series_name column, which will be used to build the series ID.

The unpivot transform board in Pipeline Builder, configured to create a series_name column output

The data should preview with the following schema:

| series_name | series_value | date | airport_id |
|---|---|---|---|
| daily_count_of_flights | 67 | 2023-07-02T00:00:00.000Z | 10299 |
| daily_avg_dep_delay | 9.3437500000000000 | 2023-07-02T00:00:00.000Z | 10299 |
| daily_avg_arr_delay | 18.5294117647058840 | 2023-07-02T00:00:00.000Z | 10299 |

Concatenate string values to create the series ID

Now, we can use the concatenate strings transform to create the series ID (the identifier for the associated time series values). Use the transform to combine the series_name (the metric each series measures) with the primary key of the Airport object (airport_id).

The concatenate strings board in Pipeline Builder, configured to combine the series_name with the airport_id

| series_id | series_name | series_value | date | airport_id |
|---|---|---|---|---|
| 10299_daily_count_of_flights | daily_count_of_flights | 67 | 2023-07-02T00:00:00.000Z | 10299 |
| 10299_daily_avg_dep_delay | daily_avg_dep_delay | 9.3437500000000000 | 2023-07-02T00:00:00.000Z | 10299 |
| 10299_daily_avg_arr_delay | daily_avg_arr_delay | 18.5294117647058840 | 2023-07-02T00:00:00.000Z | 10299 |

Select necessary columns

Using the select columns transform, we will only keep the columns that are required for the time series sync: series_id, series_value, and date. The flights backing dataset will hold time series values for all series, regardless of what they are measuring.

The select columns board in Pipeline Builder

The resulting dataset should look as follows:

| series_id | series_value | date |
|---|---|---|
| 10299_daily_count_of_flights | 67 | 2023-07-02T00:00:00.000Z |
| 10299_daily_avg_dep_delay | 9.3437500000000000 | 2023-07-02T00:00:00.000Z |
| 10299_daily_avg_arr_delay | 18.5294117647058840 | 2023-07-02T00:00:00.000Z |
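
The airport branch repeats the unpivot, concatenate, and select pattern from Part I; a compact PySpark sketch of all three steps, continuing the illustrative `airport_daily` DataFrame:

```python
from pyspark.sql import functions as F

airport_series = (
    airport_daily.select(
        "airport_id",
        "date",
        F.expr(
            "stack(3, "
            "'daily_avg_dep_delay', daily_avg_dep_delay, "
            "'daily_avg_arr_delay', daily_avg_arr_delay, "
            "'daily_count_of_flights', daily_count_of_flights) "
            "AS (series_name, series_value)"
        ),
    )
    # series_id is "<airport_id>_<series_name>", e.g. "10299_daily_avg_dep_delay".
    .withColumn("series_id",
                F.concat_ws("_", F.col("airport_id"), F.col("series_name")))
    .select("series_id", "series_value", "date")
)
```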

5. Union the time series properties into a backing dataset

Create a union with the type Union by name, using the transforms representing the Carrier, Route, and Airport time series properties.

Three time series properties nodes are selected for unioning from the Pipeline Builder graph

The union board in Pipeline Builder, configured to union the three TSP property sets by name

| series_id | series_value | date |
|---|---|---|
| 10299_daily_count_of_flights | 67 | 2023-07-02T00:00:00.000Z |
| 10299_daily_avg_dep_delay | 9.3437500000000000 | 2023-07-02T00:00:00.000Z |
| 10299_daily_avg_arr_delay | 18.5294117647058840 | 2023-07-02T00:00:00.000Z |
| CMH -> IAH_daily_avg_dep_delay | -8.0000000000000000 | 2023-03-21T00:00:00.000Z |
| 20304_daily_avg_arr_delay | 9.1250000000000000 | 2023-08-13T00:00:00.000Z |
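
Because all three branches share the series_id, series_value, and date schema, the union by name is straightforward. In PySpark terms (with `carrier_series` standing in for the Carrier branch you built by repeating the Route steps with airline_id):

```python
# Union by name aligns columns by header rather than by position.
time_series = (
    carrier_series
    .unionByName(route_series)
    .unionByName(airport_series)
)
```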

Part II: Create the time series sync

1. Remove null values

Apply a filter transform on the resulting dataset to remove any null values.

The filter transform board in Pipeline Builder, configured to remove null values.
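
In code, the null filter would amount to something like this (illustrative PySpark, continuing the `time_series` DataFrame from the union sketch):

```python
from pyspark.sql import functions as F

# Drop rows where any of the three sync columns is null.
time_series = time_series.filter(
    F.col("series_id").isNotNull()
    & F.col("series_value").isNotNull()
    & F.col("date").isNotNull()
)
```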

2. Configure the time series sync

Now, create a time series sync by selecting Add in the pipeline output section on the right side of the screen. Then, select Time series sync. Fill out the necessary data for the new time series sync, with the following considerations:

  • The title “[Example] Time series sync | Events” will correspond to the resulting resource in your Palantir filesystem folder.
  • Select the series_id column for the Series ID field.
  • Add the created date timestamp column in the Time field.
  • Add series_value to the Value field.

Now, save and build the pipeline. The output will be created in the same folder as the pipeline.

3. Use a time series sync to add properties to object types

Now that you have created a pipeline with a time series sync, you are ready to use the sync to add time series properties to the Route, Carrier, and Airport object types. Move on to our documentation on adding time series properties to object types for more guidance.