Other

Collections

  • array(*cols)
  • array_contains(col, value)
  • size(col)
  • sort_array(col, asc=True)
  • struct(*cols)

Sorting

  • asc(col)
  • desc(col)

Binary

  • bitwiseNOT(col)
  • shiftLeft(col, numBits)
  • shiftRight(col, numBits)
  • shiftRightUnsigned(col, numBits)

Dealing with null values

  • coalesce(*cols)
  • isnan(col)
  • isnull(col)

Columns

  • col(col) or column(col)
  • create_map(*cols)
  • explode(col)
  • expr(str)
  • hash(*cols)
  • input_file_name()
  • posexplode(col)
  • sha1(col)
  • sha2(col, numBits)
  • soundex(col)
  • spark_partition_id()

JSON

  • from_json(col, schema, options={})
  • get_json_object(col, path)
  • json_tuple(col, *fields)
  • to_json(col, options={})

Checkpoints

  • checkpoint(eager=True)
    • You can set a custom checkpoint directory by calling setCheckpointDir(dir) on the Spark context, which is accessible through ctx.spark_session.sparkContext. Make sure to include ctx as an input parameter to the compute() function of your transform.
    • Keep in mind that the checkpoint directory only needs to be set once; subsequent attempts to set it to the same directory will result in RDD errors.
  • localCheckpoint(eager=True)

The checkpoint() function temporarily stores a DataFrame on disk, whereas localCheckpoint() stores it in executor memory; you do not need to set a directory when using localCheckpoint(). The eager parameter controls whether the DataFrame is checkpointed immediately (the default is True).