K-means clustering

Supported in: Batch

K-means clustering is an unsupervised machine learning algorithm that groups the vectors of a dataset into k clusters. The value of k is chosen by training one model for each candidate k between minimum k and maximum k and keeping the model with the best silhouette score. Number of k values sets how many candidate values are tried within this range, including both boundaries.
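
The selection step described above can be sketched in plain Python. This is a minimal illustration, not the transform's actual implementation: the function names (`kmeans`, `silhouette_score`, `best_k`), the farthest-point initialisation, and the Euclidean distance are all assumptions made for the example.

```python
import math

def dist(a, b):
    # Euclidean distance (assumed metric for this sketch).
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def assign(points, centroids):
    # Index of the nearest centroid for every point.
    return [min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
            for p in points]

def kmeans(points, k, iterations=50):
    # Farthest-point initialisation keeps the sketch deterministic (assumption).
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(dist(p, c) for c in centroids))
        centroids.append(list(far))
    for _ in range(iterations):
        labels = assign(points, centroids)
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assign(points, centroids)

def silhouette_score(points, labels):
    # Mean of (b - a) / max(a, b) over all points, where a is the mean distance
    # to the point's own cluster and b the mean distance to the nearest other one.
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette score needs at least two clusters")
    total = 0.0
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            continue  # a singleton cluster contributes a score of 0
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != l)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)

def best_k(points, candidate_ks):
    # Train one model per candidate k and keep the k with the highest score.
    return max(candidate_ks, key=lambda k: silhouette_score(points, kmeans(points, k)))

points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
chosen = best_k(points, [2, 3, 4])
```

With two well-separated groups of points, as in the usage lines at the end, splitting into more than two clusters lowers the silhouette score, so the smallest candidate is selected.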

Transform categories: Other

Declared arguments

  • Input dataset - Source dataset containing the vector column.
    Table
  • Maximum k - Maximum number of clusters.
    Literal<Integer>
  • Minimum k - Minimum number of clusters.
    Literal<Integer>
  • Number of k values - Number of k values to test between minimum k and maximum k. One clustering model is trained per k value, so pipeline execution may be slow when this is set to a high value. The best model is selected based on the silhouette score.
    Literal<Integer>
  • Vector column - Column containing the float vectors that will be used for the clustering.
    Column<Array<Float>>
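
The way minimum k, maximum k, and number of k values combine into a list of candidates can be sketched as follows. The documentation only states that the boundaries are included; the even spacing, the rounding, and the helper name `candidate_k_values` are assumptions made for illustration.

```python
def candidate_k_values(min_k, max_k, count):
    # Hypothetical helper: spreads `count` candidates evenly over
    # [min_k, max_k], always including both boundaries when count > 1.
    if count < 1 or min_k > max_k:
        raise ValueError("need count >= 1 and min_k <= max_k")
    if count == 1:
        return [min_k]
    step = (max_k - min_k) / (count - 1)
    # A set removes duplicates when the range is narrower than `count`.
    return sorted({round(min_k + i * step) for i in range(count)})

wide = candidate_k_values(3, 12, 4)
single = candidate_k_values(3, 3, 1)
```

Under these assumptions, a range of 3 to 12 with 4 values gives the candidates 3, 6, 9, and 12, and a range of 3 to 3 with 1 value gives only 3.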

Examples

Example 1: Base case

Argument values:

  • Input dataset: ri.foundry.main.dataset.a
  • Maximum k: 12
  • Minimum k: 3
  • Number of k values: 4
  • Vector column: feature_column

Input:

feature_column
[ 0.05, 3.1, 2.3 ]
[ 1.0, 3.1, 2.3 ]
[ 1.0, 3.5, 2.3 ]
[ 19.0, 12.3, -1.4 ]

Output:

feature_column | cluster_id
[ 1.0, 3.1, 2.3 ] | 0
[ 1.0, 3.5, 2.3 ] | 0
[ 19.0, 12.3, -1.4 ] | 1
[ 0.05, 3.1, 2.3 ] | 2

Example 2: Null case

Argument values:

  • Input dataset: ri.foundry.main.dataset.a
  • Maximum k: 12
  • Minimum k: 3
  • Number of k values: 4
  • Vector column: feature_column

Input:

feature_column
[ 0.05, 3.1, 2.3 ]
null
[ 1.0, 3.1, 2.3 ]
[ 1.0, 3.5, 2.3 ]
[ 19.0, 12.3, -1.4 ]

Output:

feature_column | cluster_id
[ 1.0, 3.1, 2.3 ] | 0
[ 1.0, 3.5, 2.3 ] | 0
[ 19.0, 12.3, -1.4 ] | 1
[ 0.05, 3.1, 2.3 ] | 2

Example 3: Edge case

Argument values:

  • Input dataset: ri.foundry.main.dataset.a
  • Maximum k: 3
  • Minimum k: 3
  • Number of k values: 1
  • Vector column: feature_column

Input:

feature_column
[ 0.05, 3.1, 2.3 ]
[ 1.0, 3.5, 2.3 ]
[ 19.0, 12.3, -1.4 ]

Output:

feature_column | cluster_id
[ 19.0, 12.3, -1.4 ] | 0
[ 0.05, 3.1, 2.3 ] | 1
[ 1.0, 3.5, 2.3 ] | 2