Time bounded drop duplicates

Supported in: Streaming

Drops duplicate rows from the input for the given column subset; previously seen rows expire after the configured amount of event time. Rows that arrive late by more than the configured amount of event time are always dropped. The input is partitioned by the specified keys, and deduplication is computed separately for each distinct key column value.

Transform categories: Other

Declared arguments

  • Dataset - The dataset whose rows will be deduplicated.
    Table
  • Key expiration time unit - Unit for the amount of time to wait for data to deduplicate over.
    Enum<Days, Hours, Milliseconds, Minutes, Seconds, Weeks>
  • Key expiration time value - Value for the amount of time to wait for data to deduplicate over.
    Literal<Long>
  • optional Column subset - If any columns are specified, only those columns are used to determine uniqueness; otherwise, the key columns that the stream is keyed by are implicitly used to determine uniqueness.
    Set<Column<AnyType>>
  • optional Eviction window slide milliseconds - Length of the tumbling eviction window, which sets the cadence at which stale state is evicted. State is considered stale once more than the specified timeout has elapsed in event time. Changing this value is considered a state break and will require a replay.
    Tuple<Literal<Long>, Enum<Days, Hours, Milliseconds, Minutes, Seconds, Weeks>>
  • optional Key by columns - Columns by which to partition the input. Deduplication is computed separately, in parallel, for each distinct key value.
    Set<Column<AnyType>>
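
The semantics above can be illustrated with a small simulation. This is a sketch, not the transform's actual implementation: it processes an ordered stream of dicts, tracks a watermark as the maximum event time seen, drops rows that arrive later than the expiration window behind that watermark, and evicts seen keys once they are older than the expiration (the real transform evicts on a tumbling-window cadence instead of on every row). The function name, row shape, and `event_time` field are illustrative assumptions.

```python
from datetime import datetime, timedelta

def time_bounded_drop_duplicates(rows, subset, expiration):
    """Simulate time-bounded deduplication over an ordered stream.

    rows: iterable of dicts, each with an 'event_time' datetime field
          (illustrative row shape, not the transform's real schema).
    subset: column names used to determine uniqueness.
    expiration: timedelta after which a seen key expires.
    """
    seen = {}                      # dedup key -> event time when last emitted
    max_event_time = datetime.min  # watermark proxy: max event time seen so far
    out = []
    for row in rows:
        t = row["event_time"]
        max_event_time = max(max_event_time, t)
        # Rows late by more than the expiration window are always dropped.
        if t < max_event_time - expiration:
            continue
        key = tuple(row[c] for c in subset)
        first = seen.get(key)
        if first is not None and t - first <= expiration:
            continue  # duplicate within the expiration window: drop
        seen[key] = t
        out.append(row)
        # Evict stale state (the real transform does this on a tumbling cadence).
        seen = {k: v for k, v in seen.items()
                if max_event_time - v <= expiration}
    return out
```

With a one-hour expiration, a repeated key within the hour is dropped, the same key two hours later is kept again (its state has expired), and a row more than an hour behind the watermark is dropped regardless of its key.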