Parsing a messy json inside a dataframe column

Ask Question

Asked 3 years, 7 months ago

Modified 3 years, 7 months ago

Viewed 61 times

PROBLEM DESCRIPTION

Let´s say i've a got a pyspark dataframe, one of its columns is a messy json (not in standard form, single level but with double quotes and #)

+---------+--------------------+--------------------+
|DeskIndex|            MetaData|    ContentMediaType|
+---------+--------------------+--------------------+
|    False|"{""#stateName"":...|          text/plain|
|    False|"{""#stateName"":...|          text/plain|
|    False|"{""channel"":""C...|          text/plain|
|    False|"{""channel"":""C...|          text/plain|
|    False|"{""#stateName"":...|          text/plain|
|    False|"{""#stateName"":...|          text/plain|
|    False|"{""#stateName"":...|application/qnd.l...|
|    False|"{""#stateName"":...|          text/plain|
|    False|"{""#stateName"":...|          text/plain|
|    False|"{""#blip.payload...|          text/plain|
|    False|"{""#stateName"":...|          text/plain|
|    False|"{""channel"":""C...|          text/plain|
|    False|"{""#stateName"":...|          text/plain|
|    False|"{""#blip.payload...|          text/plain|
|    False|"{""#blip.payload...|          text/plain|
|    False|"{""#stateName"":...|application/qnd.l...|
|    False|"{""#stateName"":...|application/qnd.l...|
|    False|"{""channel"":""C...|          text/plain|
|    False|"{""#stateName"":...|application/qnd.l...|
|    False|"{""#stateName"":...|application/qnd.l...|
+---------+--------------------+--------------------+

Example of row in metadata column:

 "{""#stateName"":""[K.9.2.3]bar"",""#stateId"":""thebar"",""#messageId"":""data:43ff-89e6-62aa235df798"",""#previousStateId"":""80b7-4a5978a03a47"",""#previousStateName"":""[K.9.2.2]foo "",""trace-id"":""84499a247d3%3A1"",""#uniqueId"":""42-9625-58da4a2cbd1f"",""date_created"":""2227"",""#tunnel.originalFrom"":""host-7"",""#tunnel.originalTo"":""[email protected]"",""$originator"":""[email protected]"",""$claims"":""[email protected]/[email protected]=MemberScheme=Transport"",""$internalId"":""58da4a2cbd1f""}"

Is there a way to break this json data in multiple columns?

WHAT I`VE TRIED SO FAR

1 - Most solutions i´ve found require me to declare a schema. But since i don´t the order of the data in each json, how should i work it out?

2- According to solution here, PySpark: Read nested JSON from a String Type Column and create columns,

under the following code i was able to separate, but all columns returned null

df0 = df1.select(F.get_json_object(F.col('MetaData'), "$.#stateName").alias("State Name"),
          F.get_json_object(F.col('MetaData'), "$.#stateId").alias("State ID"),
          F.get_json_object(F.col('MetaData'), "$.#messageId").alias("Message Id"),
          F.get_json_object(F.col('MetaData'), "$.#previousStateId").alias("previousStateId"),
          F.get_json_object(F.col('MetaData'), "$.#previousStateName").alias("previousStateName"))

df0.show()
+----------+--------+----------+---------------+-----------------+
|State Name|State ID|Message Id|previousStateId|previousStateName|
+----------+--------+----------+---------------+-----------------+
|      null|    null|      null|           null|             null|
|      null|    null|      null|           null|             null|
|      null|    null|      null|           null|             null|
|      null|    null|      null|           null|             null|
|      null|    null|      null|           null|             null|
|      null|    null|      null|           null|             null|
|      null|    null|      null|           null|             null|
|      null|    null|      null|           null|             null|
|      null|    null|      null|           null|             null|
|      null|    null|      null|           null|             null|
|      null|    null|      null|           null|             null|
|      null|    null|      null|           null|             null|
|      null|    null|      null|           null|             null|
|      null|    null|      null|           null|             null|
|      null|    null|      null|           null|             null|
|      null|    null|      null|           null|             null|
|      null|    null|      null|           null|             null|
|      null|    null|      null|           null|             null|
|      null|    null|      null|           null|             null|
|      null|    null|      null|           null|             null|
+----------+--------+----------+---------------+-----------------+

edited May 3, 2022 at 17:43

asked May 3, 2022 at 15:38

hrk_kr

114 bronze badges

If you can manage to remove the double quoutes " from the json elements then it is fairly simple to parse. You essentially need to use from_json and with MapType you can define the key-value pairs, which you can later explode to get individual columns per your use case.

Dipanjan Mallick
– Dipanjan Mallick

2022-05-03 17:10:13 +00:00
Commented May 3, 2022 at 17:10

Add a comment |

0 Your Answer

Sign up or log in

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.

Collectives™ on Stack Overflow

Parsing a messy json inside a dataframe column

0

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

0

Know someone who can answer? Share a link to this question via email, Twitter, or Facebook.

Your Answer

Sign up or log in

Post as a guest

Linked