KNN join

Supported in: Batch

Return the 'k' nearest rows from the right dataset for each row in the left dataset, based on the distance measure.

Transform categories: Join

Declared arguments

Condition for columns to select on the left - Each column in the left input schema is tested against this condition. Columns that match are included in the output.
ColumnPredicate
Condition for columns to select on the right - Each column in the right input schema is tested against this condition. Columns that match are included in the output.
ColumnPredicate
Distance measure expression. - Expression that computes the distance between columns in the left and right datasets, for example Levenshtein distance.
Expression<Numeric>
K nearest - The number of nearest rows to return. For example, if k=2, the 2 nearest rows from the right are joined to each row on the left, so the number of output rows will be at least doubled. In case of ties, more rows may be returned.
Literal<Integer>
Left dataset - Left dataset to use in join.
Table
Rank column name - Name of the column to store the rank of the distance.
Literal<String>
Right dataset - Right dataset to use in join.
Table
optional Prefix for columns from right - Prefix to add to all columns from the right-hand side.
Literal<String>

Argument values:

Condition for columns to select on the left:
columnNameIsIn(
  columnNames: [tail_number, airline],
)
Condition for columns to select on the right:
columnNameIsIn(
  columnNames: [fuzzy_airline, home_airport],
)
Distance measure expression.:
alias(
  alias: distance,
  expression:
    levenshteinDistance(
      ignoreCase: true,
      left: airline,
      right: fuzzy_airline,
    ),
)
K nearest: 2
Left dataset: ri.foundry.main.dataset.left
Rank column name: rank
Right dataset: ri.foundry.main.dataset.right
Prefix for columns from right: null

Inputs:

ri.foundry.main.dataset.left

tail_number	airline	miles	factor
XB-123	foundry air	124	2
MT-222	new airline	1123	5
PA-452	new air	212	2

ri.foundry.main.dataset.right

Output:

rank	distance	tail_number	airline	fuzzy_airline	home_airport
1	3	PA-452	new air	old air	IAD
2	4	PA-452	new air	air	LHR
2	4	PA-452	new air	new airline	CPH
2	4	PA-452	new air	new plane	JFK
1	0	MT-222	new airline	new airline	CPH
2	4	MT-222	new airline	new plane	JFK
1	5	XB-123	foundry air	old air	IAD
2	8	XB-123	foundry air	air	LHR
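
The ranking behaviour in this example can be illustrated with a small plain-Python sketch. It is not the transform's implementation. The names `levenshtein` and `knn_join` are hypothetical, and the right-hand rows are inferred from the output table because the right input is not shown. It also assumes that tied distances share a rank, computed as one plus the number of strictly smaller distances. The documentation does not say which ranking scheme is used.

```python
def levenshtein(a, b, ignore_case=True):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if ignore_case:
        a, b = a.lower(), b.lower()
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def knn_join(left, right, distance, k, left_cols, right_cols):
    """For each left row, keep the right rows whose distance rank is <= k."""
    out = []
    for left_row in left:
        scored = [(distance(left_row, right_row), right_row) for right_row in right]
        for dist, right_row in scored:
            # Assumption: ties share a rank (1 + count of strictly smaller distances).
            rank = 1 + sum(1 for other, _ in scored if other < dist)
            if rank <= k:
                row = {"rank": rank, "distance": dist}
                row.update({c: left_row[c] for c in left_cols})
                row.update({c: right_row[c] for c in right_cols})
                out.append(row)
    return out


left = [
    {"tail_number": "XB-123", "airline": "foundry air", "miles": 124, "factor": 2},
    {"tail_number": "MT-222", "airline": "new airline", "miles": 1123, "factor": 5},
    {"tail_number": "PA-452", "airline": "new air", "miles": 212, "factor": 2},
]
# Assumption: right-hand rows reconstructed from the output table above.
right = [
    {"fuzzy_airline": "old air", "home_airport": "IAD"},
    {"fuzzy_airline": "air", "home_airport": "LHR"},
    {"fuzzy_airline": "new airline", "home_airport": "CPH"},
    {"fuzzy_airline": "new plane", "home_airport": "JFK"},
]

result = knn_join(
    left,
    right,
    distance=lambda l, r: levenshtein(l["airline"], r["fuzzy_airline"]),
    k=2,
    left_cols=["tail_number", "airline"],
    right_cols=["fuzzy_airline", "home_airport"],
)
for row in result:
    print(row)
```

Under these assumptions the sketch produces eight rows. PA-452 receives four of them because three right-hand rows tie at distance 4 and therefore all share rank 2.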

Argument values:

Condition for columns to select on the left:
columnNameIsIn(
  columnNames: [tail_number, airline],
)
Condition for columns to select on the right:
columnNameIsIn(
  columnNames: [home_airport],
)
Distance measure expression.:
alias(
  alias: distance,
  expression:
    levenshteinDistance(
      ignoreCase: true,
      left: airline,
      right: airline,
    ),
)
K nearest: 2
Left dataset: ri.foundry.main.dataset.left
Rank column name: rank
Right dataset: ri.foundry.main.dataset.right
Prefix for columns from right: null

Inputs:

ri.foundry.main.dataset.left

tail_number	airline	miles	factor
XB-123	foundry air	124	2
MT-222	new airline	1123	5
PA-452	new air	212	2

ri.foundry.main.dataset.right

Output:

rank	distance	tail_number	airline	home_airport
1	3	PA-452	new air	IAD
2	4	PA-452	new air	LHR
2	4	PA-452	new air	CPH
2	4	PA-452	new air	JFK
1	0	MT-222	new airline	CPH
2	4	MT-222	new airline	JFK
1	5	XB-123	foundry air	IAD
2	8	XB-123	foundry air	LHR

Description: If the distance measure returns null, this is considered the furthest distance.

Argument values:

Condition for columns to select on the left:
columnNameIsIn(
  columnNames: [tail_number, airline],
)
Condition for columns to select on the right:
columnNameIsIn(
  columnNames: [fuzzy_airline, home_airport],
)
Distance measure expression.:
alias(
  alias: distance,
  expression:
    levenshteinDistance(
      ignoreCase: true,
      left: airline,
      right: fuzzy_airline,
    ),
)
K nearest: 2
Left dataset: ri.foundry.main.dataset.left
Rank column name: rank
Right dataset: ri.foundry.main.dataset.right
Prefix for columns from right: null

Inputs:

ri.foundry.main.dataset.left

tail_number	airline	miles	factor
XB-123	foundry air	124	2
MT-222	new airline	1123	5
PA-452	new air	212	2

ri.foundry.main.dataset.right

Output:

rank	distance	tail_number	airline	fuzzy_airline	home_airport
1	3	PA-452	new air	old air	IAD
2	4	PA-452	new air	air	LHR
2	4	PA-452	new air	new plane	JFK
1	4	MT-222	new airline	new plane	JFK
2	7	MT-222	new airline	old air	IAD
1	5	XB-123	foundry air	old air	IAD
2	8	XB-123	foundry air	air	LHR
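
The "null is furthest" rule can be sketched as a ranking helper that sorts missing distances after every numeric one. The function name `rank_with_nulls_last` is hypothetical, and the tie handling (tied distances share a rank) is an assumption rather than documented behaviour. The sample distances assume that one right-hand row has a null `fuzzy_airline`, which is inferred from the rows missing in the output above.

```python
from typing import List, Optional, Sequence


def rank_with_nulls_last(distances: Sequence[Optional[float]]) -> List[int]:
    """Rank distances in ascending order, treating None as larger than any number."""
    def sort_key(d):
        # (1, 0) compares greater than every (0, value), so None is furthest.
        return (1, 0) if d is None else (0, d)

    keys = [sort_key(d) for d in distances]
    # Assumption: ties share a rank (1 + count of strictly smaller keys).
    return [1 + sum(1 for other in keys if other < key) for key in keys]


# Distances from "new airline" to: old air, air, <null>, new plane.
ranks = rank_with_nulls_last([7, 8, None, 4])
print(ranks)
```

With k=2, only the entries ranked 1 and 2 would be kept, so the row with the null distance is never selected while at least two non-null distances exist.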

Argument values:

Condition for columns to select on the left:
columnNameIsIn(
  columnNames: [tail_number, airline],
)
Condition for columns to select on the right:
columnNameIsIn(
  columnNames: [fuzzy_airline, home_airport],
)
Distance measure expression.:
alias(
  alias: distance,
  expression:
    levenshteinDistance(
      ignoreCase: true,
      left: airline,
      right: fuzzy_airline,
    ),
)
K nearest: 2
Left dataset: ri.foundry.main.dataset.left
Rank column name: rank
Right dataset: ri.foundry.main.dataset.right
Prefix for columns from right: right_

Inputs:

ri.foundry.main.dataset.left

tail_number	airline	miles	factor
XB-123	foundry air	124	2
MT-222	new airline	1123	5
PA-452	new air	212	2

ri.foundry.main.dataset.right

Output:

rank	distance	tail_number	airline	right_fuzzy_airline	right_home_airport
1	3	PA-452	new air	old air	IAD
2	4	PA-452	new air	air	LHR
2	4	PA-452	new air	new airline	CPH
2	4	PA-452	new air	new plane	JFK
1	0	MT-222	new airline	new airline	CPH
2	4	MT-222	new airline	new plane	JFK
1	5	XB-123	foundry air	old air	IAD
2	8	XB-123	foundry air	air	LHR
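
The effect of the prefix on output column names can be sketched as a simple rename of the selected right-hand columns. The helper `prefix_right_columns` is hypothetical and only mirrors the renaming visible in the output header above. It makes no claim about when the transform applies the prefix internally.

```python
def prefix_right_columns(right_row, prefix):
    """Return a copy of right_row with prefix prepended to every column name."""
    if not prefix:
        # A null (None) or empty prefix leaves the column names unchanged.
        return dict(right_row)
    return {f"{prefix}{name}": value for name, value in right_row.items()}


renamed = prefix_right_columns(
    {"fuzzy_airline": "old air", "home_airport": "IAD"}, "right_"
)
print(renamed)
```

A prefix of this kind is mainly useful when both inputs contain a column with the same name, since it keeps the two apart in the joined output.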