Chunking

This page outlines how to incorporate a basic chunking strategy into your semantic search workflows. Chunking, in this context, means breaking larger pieces of text into smaller ones. This is useful because embedding models have a maximum input length and, crucially, smaller pieces of text are more semantically distinct during searches. Chunking is often used when parsing large documents like PDFs.

The primary objective is to split long text into smaller "chunks," each with an associated Ontology object linked back to the original object.

Chunking example

As a starting point, we will show how a basic chunking strategy can be accomplished in Pipeline Builder without writing code. For more advanced strategies, we recommend using a code repository as part of your pipeline.

For illustrative purposes, we will use a simple two-row dataset with two columns, object_id and object_text. For ease of understanding, the object_text examples below are intentionally short.

| object_id | object_text    |
|-----------|----------------|
| abc       | gold ring lost |
| xyz       | fast cars zoom |

We start with the Chunk String board, which adds a column containing an array of object_text segmented into smaller pieces. The board supports various chunking options, such as overlap and separators, to help keep each semantic concept coherent and distinct.

The screenshot below of a Chunk String board shows a simple strategy that you can adapt to your own use case. This configuration attempts to return chunks of roughly 256 characters. The board splits text on the highest-priority separator until each chunk is at most the chunk size. If no highest-priority separators remain and some chunks are still too large, it moves to the next separator, continuing until either all chunks are at most the chunk size or there are no more separators to use. Finally, the board ensures that each chunk begins with an overlap covering the last 20 characters of the previous chunk.

Chunk string board

| object_id | object_text    | chunks             |
|-----------|----------------|--------------------|
| abc       | gold ring lost | [gold, ring, lost] |
| xyz       | fast cars zoom | [fast, cars, zoom] |
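If you later move this step into a code repository, the separator-priority strategy described above can be sketched in Python. This is a minimal illustration of the technique, not the Chunk String board's exact implementation; the function name and defaults are assumptions.

```python
def chunk_string(text, chunk_size=256, separators=("\n\n", "\n", " "), overlap=20):
    """Split `text` on separators in priority order until every chunk is
    at most `chunk_size` characters, then prefix each chunk with the last
    `overlap` characters of the chunk before it."""
    chunks = [text]
    for sep in separators:
        resplit = []
        for chunk in chunks:
            if len(chunk) <= chunk_size:
                resplit.append(chunk)             # already small enough
            else:
                resplit.extend(chunk.split(sep))  # try the next separator
        chunks = [c for c in resplit if c]        # drop empty pieces
    # add the overlap from the previous chunk to each subsequent chunk
    result = chunks[:1]
    for prev, cur in zip(chunks, chunks[1:]):
        result.append((prev[-overlap:] if overlap > 0 else "") + cur)
    return result
```

Note that a chunk still larger than chunk_size after all separators are exhausted is kept as-is, matching the behavior described above.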

Next, we want each element of the array to have its own row. We will use the Explode Array with Position board to transform our dataset into one with six rows. The new column in each row (shown below) is a struct (map) with two key-value pairs: the position in the array and the element at that position.

Explode chunks

| object_id | object_text    | chunks             | chunks_with_position         |
|-----------|----------------|--------------------|------------------------------|
| abc       | gold ring lost | [gold, ring, lost] | {position: 0, element: gold} |
| abc       | gold ring lost | [gold, ring, lost] | {position: 1, element: ring} |
| abc       | gold ring lost | [gold, ring, lost] | {position: 2, element: lost} |
| xyz       | fast cars zoom | [fast, cars, zoom] | {position: 0, element: fast} |
| xyz       | fast cars zoom | [fast, cars, zoom] | {position: 1, element: cars} |
| xyz       | fast cars zoom | [fast, cars, zoom] | {position: 2, element: zoom} |
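The explode-with-position transformation can be sketched in plain Python (the row layout here is an illustrative stand-in for the board's output, not a Pipeline Builder API):

```python
rows = [
    {"object_id": "abc", "chunks": ["gold", "ring", "lost"]},
    {"object_id": "xyz", "chunks": ["fast", "cars", "zoom"]},
]

# one output row per array element, carrying its index as "position"
exploded = [
    {"object_id": row["object_id"],
     "chunks_with_position": {"position": pos, "element": element}}
    for row in rows
    for pos, element in enumerate(row["chunks"])
]
```

Two input rows with three-element arrays yield six output rows, matching the table above.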

From there, we will pull out the position and the element into their own columns.

Get chunk position Get chunk

| object_id | object_text    | chunks             | chunks_with_position         | position | chunk |
|-----------|----------------|--------------------|------------------------------|----------|-------|
| abc       | gold ring lost | [gold, ring, lost] | {position: 0, element: gold} | 0        | gold  |
| abc       | gold ring lost | [gold, ring, lost] | {position: 1, element: ring} | 1        | ring  |
| abc       | gold ring lost | [gold, ring, lost] | {position: 2, element: lost} | 2        | lost  |
| xyz       | fast cars zoom | [fast, cars, zoom] | {position: 0, element: fast} | 0        | fast  |
| xyz       | fast cars zoom | [fast, cars, zoom] | {position: 1, element: cars} | 1        | cars  |
| xyz       | fast cars zoom | [fast, cars, zoom] | {position: 2, element: zoom} | 2        | zoom  |
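Projecting the struct's fields into flat columns amounts to reading each key out of the map. A minimal Python sketch, assuming the same illustrative row layout as before:

```python
exploded = [
    {"object_id": "abc", "chunks_with_position": {"position": 0, "element": "gold"}},
    {"object_id": "abc", "chunks_with_position": {"position": 1, "element": "ring"}},
]

# lift each struct field into its own top-level column
flattened = [
    {"object_id": row["object_id"],
     "position": row["chunks_with_position"]["position"],
     "chunk": row["chunks_with_position"]["element"]}
    for row in exploded
]
```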

To create a unique identifier for each chunk, we will convert the chunk position in its array to a string and then concatenate it to the original object ID. We will also drop the unnecessary columns.

Cast chunk position to string Create chunk id Drop unnecessary object_text, chunks, position, and chunks_with_position columns

| object_id | chunk | chunk_id |
|-----------|-------|----------|
| abc       | gold  | abc_0    |
| abc       | ring  | abc_1    |
| abc       | lost  | abc_2    |
| xyz       | fast  | xyz_0    |
| xyz       | cars  | xyz_1    |
| xyz       | zoom  | xyz_2    |
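The cast-and-concatenate step can be expressed as a one-line helper; `make_chunk_id` is a hypothetical name for illustration, not a Pipeline Builder board:

```python
def make_chunk_id(object_id: str, position: int) -> str:
    # cast the array position to a string and concatenate it to the
    # original object ID with an underscore separator
    return object_id + "_" + str(position)
```

Because each position is unique within an object's array, the result is a unique primary key per chunk.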

Now, we have six rows representing six different chunks, each with the object_id (for linking), the new chunk_id to serve as a new primary key, and the chunk text to be embedded as described in the semantic search workflow. The resulting table is as follows:

| object_id | chunk | chunk_id | embedding      |
|-----------|-------|----------|----------------|
| abc       | gold  | abc_0    | [-0.7,...,0.4] |
| abc       | ring  | abc_1    | [0.6,...,-0.2] |
| abc       | lost  | abc_2    | [-0.8,...,0.9] |
| xyz       | fast  | xyz_0    | [0.3,...,-0.5] |
| xyz       | cars  | xyz_1    | [-0.1,...,0.8] |
| xyz       | zoom  | xyz_2    | [0.2,...,-0.3] |