
I have a dataframe whose schema contains a map, like below:

root
 |-- events: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)

When I use map_values() and then explode the result to obtain those values, I get the dataframe below:

+--------------------+--------------------+
|            map_data|          map_values|
+--------------------+--------------------+
|[[{event_name=walk..|[{event_name=walk...|
|[[{event_name=walk..|          2019-02-17|
|[[{event_name=walk..|            08:00:00|
|[[{event_name=run...|[{event_name=walk...|
|[[{event_name=fly...|          2019-02-17|
|[[{event_name=run...|            09:00:00|
+--------------------+--------------------+

This is my code to get to the dataframe shown above:

import pyspark.sql.functions as F

events = event_data.withColumn(
   "map_data",
   F.map_values(event_data.events)
)
events.printSchema()
events.select("map_data").withColumn(
   "map_values",
   F.explode(events.map_data)
).show(10)

Compared to what I started with, this is progress; however, I would like my dataframe to look like this:

+--------------------+-----------+--------+
|          events    |     date  |   time |
+--------------------+-----------+--------+
|[{event_name=walk...| 2019-02-17|08:00:00|
|[{event_name=walk...| 2019-02-17|09:00:00|
+--------------------+-----------+--------+

I have been researching and have seen that people use UDFs for this; however, I am sure there is a way to accomplish what I want purely with DataFrame operations and SQL functions.

For more insight, here is how my rows look with .show(truncate=False):

+--------------------+--------------------+
|            map_data|          map_values|
+--------------------+--------------------+
|[[{event_name=walk..|[{event_name=walk, duration=0.47, x=0.39, y=0.14, timestamp=08:02:30.574892}, {event_name=walk, duration=0.77, x=0.15, y=0.08, timestamp=08:02:50.330245}, {event_name=run, duration=0.02, x=0.54, y=0.44, timestamp=08:02:22.737803}, {event_name=run, duration=0.01, x=0.43, y=0.56, timestamp=08:02:11.629404}, {event_name=run, duration=0.03, x=0.57, y=0.4, timestamp=08:02:22.660778}, {event_name=run, duration=0.02, x=0.49, y=0.49, timestamp=08:02:56.660186}]|
|[[{event_name=walk..|          2019-02-17|
|[[{event_name=walk..|            08:00:00|

Also, with the dataframe I have now, my remaining issue is figuring out how to turn an array into multiple columns. I mention this because I could either work with that, or use a more efficient process to create the dataframe directly from the map I was given.

  • Could you provide a complete view of the first events column using .show(truncate=False)? Commented Apr 29, 2020 at 23:01

1 Answer

I have found a solution to my problem. I needed to follow this approach (Create a dataframe from a hashmap with keys as column names and values as rows in Spark) and perform this series of computations on event_data, my initial dataframe.

This is how my dataframe looks now:

|25769803776|2019-03-19|[{event_name=walk, duration=0.47, x=0.39, y=0.14, timestamp=08:02:30.574892}, {event_name=walk, duration=0.77, x=0.15, y=0.08, timestamp=08:02:50.330245}, {event_name=run, duration=0.02, x=0.54, y=0.44, timestamp=08:02:22.737803}, {event_name=run, duration=0.01, x=0.43, y=0.56, timestamp=08:02:11.629404}, {event_name=run, duration=0.03, x=0.57, y=0.4, timestamp=08:02:22.660778}, {event_name=run, duration=0.02, x=0.49, y=0.49, timestamp=08:02:56.660186}]|08:02:00|
