Data connectivity & integrationPipeline Builder TransformsKNN join

KNN join

Supported in: Batch

Return the K nearest rows from the right dataset for each row in the left dataset, based on the distance measure.

Transform categories: Join

Declared arguments

  • Condition for columns to select on the left - All columns in the left input schema will be tested to see if they match this condition. If they match, the column will be selected in the output.
    ColumnPredicate
  • Condition for columns to select on the right - All columns in the right input schema will be tested to see if they match this condition. If they match, the column will be selected in the output.
    ColumnPredicate
  • Distance measure expression. - Distance measure expression between columns in the left and right datasets. E.g. Levenshtein distance.
    Expression<Numeric>
  • K nearest - The number of nearest rows to return, i.e. if k=2 then the number of output rows will be doubled and the nearest 2 rows will be joined from the right.
    Literal<Integer>
  • Left dataset - Left dataset to use in join.
    Table
  • Rank column name - Name of the column to store the rank of the distance.
    Literal<String>
  • Right dataset - Right dataset to use in join.
    Table
  • optional Prefix for columns from right - Prefix to add to all columns on the right hand side.
    Literal<String>

Examples

Example 1: Base case

Argument values:

  • Condition for columns to select on the left:
    columnNameIsIn(
     columnNames: [tail_number, airline],
    )
  • Condition for columns to select on the right:
    columnNameIsIn(
     columnNames: [fuzzy_airline, home_airport],
    )
  • Distance measure expression.:
    alias(
     alias: distance,
     expression:
    levenshteinDistance(
     ignoreCase: true,
     left: airline,
     right: fuzzy_airline,
    ),
    )
  • K nearest: 2
  • Left dataset: ri.foundry.main.dataset.left
  • Rank column name: rank
  • Right dataset: ri.foundry.main.dataset.right
  • Prefix for columns from right: null

Inputs: ri.foundry.main.dataset.left

tail_numberairlinemilesfactor
XB-123foundry air1242
MT-222new airline11235
PA-452new air2122

ri.foundry.main.dataset.right

fuzzy_airlinehome_airport
airLHR
new airlineCPH
new planeJFK
old airIAD

Output:

rankdistancetail_numberairlinefuzzy_airlinehome_airport
13PA-452new airold airIAD
24PA-452new airairLHR
24PA-452new airnew airlineCPH
24PA-452new airnew planeJFK
10MT-222new airlinenew airlineCPH
24MT-222new airlinenew planeJFK
15XB-123foundry airold airIAD
28XB-123foundry airairLHR

Example 2: Base case

Argument values:

  • Condition for columns to select on the left:
    columnNameIsIn(
     columnNames: [tail_number, airline],
    )
  • Condition for columns to select on the right:
    columnNameIsIn(
     columnNames: [home_airport],
    )
  • Distance measure expression.:
    alias(
     alias: distance,
     expression:
    levenshteinDistance(
     ignoreCase: true,
     left: airline,
     right: airline,
    ),
    )
  • K nearest: 2
  • Left dataset: ri.foundry.main.dataset.left
  • Rank column name: rank
  • Right dataset: ri.foundry.main.dataset.right
  • Prefix for columns from right: null

Inputs: ri.foundry.main.dataset.left

tail_numberairlinemilesfactor
XB-123foundry air1242
MT-222new airline11235
PA-452new air2122

ri.foundry.main.dataset.right

airlinehome_airport
airLHR
new airlineCPH
new planeJFK
old airIAD

Output:

rankdistancetail_numberairlinehome_airport
13PA-452new airIAD
24PA-452new airLHR
24PA-452new airCPH
24PA-452new airJFK
10MT-222new airlineCPH
24MT-222new airlineJFK
15XB-123foundry airIAD
28XB-123foundry airLHR

Example 3: Base case

Description: If the distance measure returns null, this is considered the furthest distance. Argument values:

  • Condition for columns to select on the left:
    columnNameIsIn(
     columnNames: [tail_number, airline],
    )
  • Condition for columns to select on the right:
    columnNameIsIn(
     columnNames: [fuzzy_airline, home_airport],
    )
  • Distance measure expression.:
    alias(
     alias: distance,
     expression:
    levenshteinDistance(
     ignoreCase: true,
     left: airline,
     right: fuzzy_airline,
    ),
    )
  • K nearest: 2
  • Left dataset: ri.foundry.main.dataset.left
  • Rank column name: rank
  • Right dataset: ri.foundry.main.dataset.right
  • Prefix for columns from right: null

Inputs: ri.foundry.main.dataset.left

tail_numberairlinemilesfactor
XB-123foundry air1242
MT-222new airline11235
PA-452new air2122

ri.foundry.main.dataset.right

fuzzy_airlinehome_airport
airLHR
nullCPH
new planeJFK
old airIAD

Output:

rankdistancetail_numberairlinefuzzy_airlinehome_airport
13PA-452new airold airIAD
24PA-452new airairLHR
24PA-452new airnew planeJFK
14MT-222new airlinenew planeJFK
27MT-222new airlineold airIAD
15XB-123foundry airold airIAD
28XB-123foundry airairLHR

Example 4: Base case

Argument values:

  • Condition for columns to select on the left:
    columnNameIsIn(
     columnNames: [tail_number, airline],
    )
  • Condition for columns to select on the right:
    columnNameIsIn(
     columnNames: [fuzzy_airline, home_airport],
    )
  • Distance measure expression.:
    alias(
     alias: distance,
     expression:
    levenshteinDistance(
     ignoreCase: true,
     left: airline,
     right: fuzzy_airline,
    ),
    )
  • K nearest: 2
  • Left dataset: ri.foundry.main.dataset.left
  • Rank column name: rank
  • Right dataset: ri.foundry.main.dataset.right
  • Prefix for columns from right: right_

Inputs: ri.foundry.main.dataset.left

tail_numberairlinemilesfactor
XB-123foundry air1242
MT-222new airline11235
PA-452new air2122

ri.foundry.main.dataset.right

fuzzy_airlinehome_airport
airLHR
new airlineCPH
new planeJFK
old airIAD

Output:

rankdistancetail_numberairlineright_fuzzy_airlineright_home_airport
13PA-452new airold airIAD
24PA-452new airairLHR
24PA-452new airnew airlineCPH
24PA-452new airnew planeJFK
10MT-222new airlinenew airlineCPH
24MT-222new airlinenew planeJFK
15XB-123foundry airold airIAD
28XB-123foundry airairLHR