The following are some frequently asked questions about Dataset Preview.
For general information, see the Dataset Preview overview.
When uploading a CSV with nested double-quotes and embedded newline (\n) characters within quoted fields, the schema inference will fail, and you cannot use the Schema Editor to create a valid schema.
To troubleshoot, perform the following steps:
1. Set the dataFrameReaderClass in your schema JSON to com.palantir.foundry.spark.input.DataSourceDataFrameReader.
2. Update the "customMetadata" object so that it looks like the following:
"customMetadata": {
    "format": "csv",
    "options": {
        "header": true,
        "multiLine": true
    }
}
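These options mirror the header and multiLine options of Spark's CSV reader. As a rough illustration only (a minimal PySpark sketch outside Foundry, with a hypothetical file path and sample data), multiLine is what allows a quoted field containing an embedded newline to parse as a single value:

import os
import tempfile
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample file: the "comment" field contains an embedded newline.
path = os.path.join(tempfile.mkdtemp(), "example.csv")
with open(path, "w") as f:
    f.write('id,comment\n1,"line one\nline two"\n')

df = (
    spark.read
    .option("header", True)     # first line supplies the column names
    .option("multiLine", True)  # let quoted fields span line breaks
    .csv(path)
)
df.show()  # one row; the newline is preserved inside the "comment" value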
When a dataset is composed of multiple CSVs (for example, through a Data Connection APPEND transaction), and some of those CSVs contain more columns than others, the schema inference will fail. One option is to ignore jagged rows (such as rows that are missing certain columns). To do this, select Edit schema, expand the Parsing options section, and check Ignore jagged rows.
However, if you want to keep the jagged rows and specify a standardized schema for the dataset, then this section applies. If the conditions outlined in the Assumptions section below hold for your data, then the troubleshooting steps will produce a dataset with a standard schema defined by the user, in which jagged rows are autopopulated with null for the columns they are missing.
Symptoms of parsing failures:
You may encounter an error message such as the following: "Could not load preview: Encountered an error parsing the input CSV data. Make sure all data is correctly formatted."
You may also encounter the following error message after selecting Edit Schema and then Save and Validate: "Your dataset failed to validate on x rows."
Assumptions:
You can define the desired schema, including all column names and types that the dataset should possess.
Your schema enforces strict column ordering. For example, if you want the dataset to contain and show columns {a, b, c}, an underlying CSV can have a column structure that is an in-order prefix of the schema, such as {a, b} or {a}, but cannot have a column structure that reorders columns or skips a non-trailing column, such as {b, a, c} or {a, c}.
Here is an example of a case in which the troubleshooting steps would be applicable and useful:
CSVs are regularly added to a dataset through APPEND transactions. One day, a new column is added and is the new last column in the CSV. In the dataset, rows from all previously-appended CSVs should have the new column, with field values autopopulated with null rather than being considered jagged.
The troubleshooting steps do not replicate the functionality of the mergeSchema option available for raw Parquet datasets (which are parsed with com.palantir.foundry.spark.input.ParquetDataFrameReader as the dataFrameReaderClass). A user-written transform is required to replicate such functionality on a raw CSV dataset.
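If you do need mergeSchema-like behavior across CSVs with differing columns, the core of such a user-written transform could look like the following. This is a minimal sketch only; the sample DataFrames stand in for the CSV-backed inputs, and unionByName with allowMissingColumns=True requires Spark 3.1 or later:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-ins for rows appended before and after a new column "c" was added.
older = spark.createDataFrame([(1, 2)], ["a", "b"])
newer = spark.createDataFrame([(3, 4, 5)], ["a", "b", "c"])

# Union by column name; the missing column is filled with null, similar to mergeSchema.
merged = older.unionByName(newer, allowMissingColumns=True)
merged.show()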
To troubleshoot, perform the following steps:
1. Update the fieldSchemaList to ensure it includes all the columns that the dataset should possess. For example, if the dataset should have columns {a, b, c}, all with integer types, the fieldSchemaList may look like the following:
"fieldSchemaList": [
    {
        "type": "INTEGER",
        "name": "a",
        "nullable": null,
        "userDefinedTypeClass": null,
        "customMetadata": {},
        "arraySubtype": null,
        "precision": null,
        "scale": null,
        "mapKeyType": null,
        "mapValueType": null,
        "subSchemas": null
    },
    {
        "type": "INTEGER",
        "name": "b",
        "nullable": null,
        "userDefinedTypeClass": null,
        "customMetadata": {},
        "arraySubtype": null,
        "precision": null,
        "scale": null,
        "mapKeyType": null,
        "mapValueType": null,
        "subSchemas": null
    },
    {
        "type": "INTEGER",
        "name": "c",
        "nullable": null,
        "userDefinedTypeClass": null,
        "customMetadata": {},
        "arraySubtype": null,
        "precision": null,
        "scale": null,
        "mapKeyType": null,
        "mapValueType": null,
        "subSchemas": null
    }
],
"dataFrameReaderClass"
and its nested customMetadata
object so that the end of your schema JSON looks like the following:"dataFrameReaderClass": "com.palantir.foundry.spark.input.DataSourceDataFrameReader",
"customMetadata": {
"format": "csv",
"options": {
"multiLine": true,
"header": true,
"mode": "PERMISSIVE"
}
}
}
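The PERMISSIVE mode, combined with the explicit schema, is what pads jagged rows with null. As a rough illustration (a minimal PySpark sketch outside Foundry, with hypothetical sample rows), a row that is missing trailing columns still parses, with null in the missing positions:

from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("a", IntegerType()),
    StructField("b", IntegerType()),
    StructField("c", IntegerType()),
])

# Hypothetical jagged input: the second row is missing column "c".
rows = spark.sparkContext.parallelize(["1,2,3", "4,5"])
df = spark.read.csv(rows, schema=schema, mode="PERMISSIVE")
df.show()  # the second row parses as a=4, b=5, c=null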
When exporting datasets from the platform, some files may appear squeezed into a single column when opened in Excel. This issue has been observed in certain regions and is caused by the default delimiter used in Excel. To fix this issue, change the delimiter pattern in your export settings: