Suppose I have a PySpark DataFrame for which df.printSchema() prints:
root
 |-- shop_id: int (nullable = false)
 |-- items: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- item_id: int (nullable = false)
How can I convert it into a DataFrame with this schema?
root
 |-- shop_id: int (nullable = false)
 |-- item_id: int (nullable = false)
In other words, each row's shop_id should be paired with every item_id in that row's items array, and all of the resulting pairs should become rows of a single flat DataFrame.
A more visual explanation:
Before:
[
  {
    "shop_id": 42,
    "items": [{"item_id": 101}, {"item_id": 102}]
  },
  {
    "shop_id": 43,
    "items": [{"item_id": 203}]
  }
]
After:
[
  {"shop_id": 42, "item_id": 101},
  {"shop_id": 42, "item_id": 102},
  {"shop_id": 43, "item_id": 203}
]