I have a JSON string that I load into a Spark DataFrame. The JSON string can have between 0 and 3 key-value pairs.
When more than one kv pairs are sent, the product_facets is correctly formatted as an array like below:
{"id":1,
"productData":{
"product":{
"product_name":"xyz",
"product_facets":{"entry":[{"key":"test","value":"success"}, {"key": "test2","value" : "fail"}]}
}}}
I can now use the explode function:
sourceDF.filter($"someKey".contains("some_string"))
.select($"id", explode($"productData.product.product_facets.entry") as "kvPairs")
However when only one key value was sent, the source JSON string for entry is not formatted as an array with square braces:
{"id":1,
"productData":{
"product":{
"product_name":"xyz",
"product_facets":{"entry":{"key":"test","value":"success"}}
}}}
The schema for product tag looks like:
| |-- product: struct (nullable = true)
| | |-- product_facets: struct (nullable = true)
| | | |-- entry: string (nullable = true)
| | |-- product_name: string (nullable = true)
How can I change the entry to an array of key value pairs that is compatible with the explode function. My end goal is to pivot the keys into individual columns and I want to use group by on exploding the kv pairs. I tried using from_json but could not get it to work.
val schema =
StructType(
Seq(
StructField("entry", ArrayType(
StructType(
Seq(
StructField("key", StringType),
StructField("value",StringType)
)
)
))
)
)
sourceDF.filter($"someKey".contains("some_string"))
.select($"id", from_json($"productData.product.product_facets.entry", schema) as "kvPairsFromJson")
But the above does creates a new column kvPairsFromJson that looks like "[]" and using explode does nothing.
Any pointers on whats going on or if there is a better way to do this?