I need to transform a list of JSON objects into PySpark DataFrames. The JSON objects all share the same schema. The problem is that the values of the dicts inside the JSON have different data types.
Example: the field complex is an array of dicts, and each dict has the same four keys but with values of different types (integer, string, float, and a nested dict). See below for an example JSON.
If I use df = spark.createDataFrame(json_list) to create my DataFrame from the JSON objects, PySpark "deletes" some of the data because it cannot infer the schema correctly. PySpark decides
that the complex field should have the type ArrayType(MapType(StringType(), LongType())), which leads to all non-LongType values being nulled.
I tried to supply a schema myself, but a MapType forces me to pick one specific DataType for all the values of the nested map, and my values are not uniform:
from pyspark.sql.types import (ArrayType, LongType, MapType, StringType,
                               StructField, StructType)

myschema = StructType([
    StructField("Id", StringType(), True),
    StructField("name", StringType(), True),
    StructField("sentTimestamp", LongType(), True),
    StructField("complex", ArrayType(MapType(StringType(), StringType())), True)
])
With MapType(StringType(), StringType()), the value fields in the dicts that are not strings are nulled, as they cannot be mapped to the declared value type.
It seems that PySpark can only handle these dicts if all the values share the same data type.
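One workaround I have sketched along those lines (stringify_complex is my own hypothetical helper, not a library function) is to JSON-encode every value inside complex up front, so the map really is string-to-string and nothing gets nulled, at the cost of losing the original types:

```python
import json

def stringify_complex(record):
    # JSON-encode every value in the `complex` entries so the map is
    # uniformly string-to-string; nothing gets nulled, but consumers must
    # json.loads() the values back to recover ints/floats/nested dicts.
    out = dict(record)
    out["complex"] = [
        {k: json.dumps(v) for k, v in entry.items()}
        for entry in record["complex"]
    ]
    return out

sample = {"Id": "2345123", "complex": [{"key1": 1, "key2": "(1)", "key3": 0.5,
                                        "key4": {"innerkey1": "random"}}]}
flat = stringify_complex(sample)
# every value in flat["complex"][0] is now a string,
# e.g. flat["complex"][0]["key1"] == "1"
```

With that preprocessing, the MapType(StringType(), StringType()) schema above would fit, but I would rather keep the real types.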
How can I convert the JSON to a PySpark DataFrame without losing data?
[{
"Id": "2345123",
"name": "something",
"sentTimestamp": 1646732402,
"complex":
[
{
"key1": 1,
"key2": "(1)",
"key3": 0.5,
"key4":
{
"innerkey1": "random",
"innerkey2": 5.4,
"innerkey3": 1
}
},
{
"key1": 2,
"key2": "(2)",
"key3": 0.5,
"key4":
{
"innerkey1": "left",
"innerkey2": 7.8,
"innerkey3": 1
}
}
]
}]