I have a DataFrame whose schema contains a map column, as shown below:
root
|-- events: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
When I use map_values() to obtain the map's values and then explode them, I get the DataFrame below:
+--------------------+--------------------+
| map_data| map_values|
+--------------------+--------------------+
|[[{event_name=walk..|[{event_name=walk...|
|[[{event_name=walk..| 2019-02-17|
|[[{event_name=walk..| 08:00:00|
|[[{event_name=run...|[{event_name=walk...|
|[[{event_name=fly...| 2019-02-17|
|[[{event_name=run...| 09:00:00|
+--------------------+--------------------+
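This is my code to get to the DataFrame shown above:
from pyspark.sql import functions as F

# Collect the values of the events map into an array column.
events = event_data.withColumn(
    "map_data",
    F.map_values(event_data.events)
)
events.printSchema()

# Explode the array so that each value gets its own row.
# The parentheses allow the method chain to span several lines.
(
    events.select("map_data")
    .withColumn(
        "map_values",
        F.explode(events.map_data)
    )
    .show(10)
)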
Given where I started, I consider this a milestone. However, I would like my DataFrame to look like this:
+--------------------+-----------+--------+
| events | date | time |
+--------------------+-----------+--------+
|[{event_name=walk...| 2019-02-17|08:00:00|
|[{event_name=walk...| 2019-02-17|09:00:00|
+--------------------+-----------+--------+
I have been researching this and have seen that people use UDFs. However, I am sure there is a way to accomplish what I want purely with DataFrame and SQL functions.
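To make the target shape concrete, here is a minimal plain-Python sketch of the reshaping I am after. It is only a model, not Spark code. It assumes the map keys are named "events", "date", and "time" (the real key names are not shown in my schema), and map_to_columns is a hypothetical helper name. Each row's map becomes one record with one field per key, instead of one row per value as explode produces.

```python
# Plain-Python model of the desired reshaping (not Spark code).
# Assumption: the map keys are "events", "date" and "time".
rows = [
    {"events": "[{event_name=walk...", "date": "2019-02-17", "time": "08:00:00"},
    {"events": "[{event_name=walk...", "date": "2019-02-17", "time": "09:00:00"},
]


def map_to_columns(row_map, keys):
    """Look up each key in one row's map; missing keys become None."""
    return tuple(row_map.get(key) for key in keys)


columns = ("events", "date", "time")

# One output record per input row, one field per map key.
table = [map_to_columns(row_map, columns) for row_map in rows]

for record in table:
    print(record)
```

In PySpark, I would expect the analogous step to be a per-key lookup on the map column, something like event_data.events["date"], again assuming those key names.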
For more insight, here is what my rows look like without truncation, using .show(truncate=False):
+--------------------+--------------------+
| map_data| map_values|
+--------------------+--------------------+
|[[{event_name=walk..|[{event_name=walk, duration=0.47, x=0.39, y=0.14, timestamp=08:02:30.574892}, {event_name=walk, duration=0.77, x=0.15, y=0.08, timestamp=08:02:50.330245}, {event_name=run, duration=0.02, x=0.54, y=0.44, timestamp=08:02:22.737803}, {event_name=run, duration=0.01, x=0.43, y=0.56, timestamp=08:02:11.629404}, {event_name=run, duration=0.03, x=0.57, y=0.4, timestamp=08:02:22.660778}, {event_name=run, duration=0.02, x=0.49, y=0.49, timestamp=08:02:56.660186}]|
|[[{event_name=walk..| 2019-02-17|
|[[{event_name=walk..| 08:00:00|
Also, with the DataFrame I have now, my issue is finding out how to explode an array into multiple columns. I mention this because I can either work with that or use a more efficient process to create the DataFrame from the map I was given.