1

PROBLEM DESCRIPTION

Let´s say i've a got a pyspark dataframe, one of its columns is a messy json (not in standard form, single level but with double quotes and #)

+---------+--------------------+--------------------+
|DeskIndex|            MetaData|    ContentMediaType|
+---------+--------------------+--------------------+
|    False|"{""#stateName"":...|          text/plain|
|    False|"{""#stateName"":...|          text/plain|
|    False|"{""channel"":""C...|          text/plain|
|    False|"{""channel"":""C...|          text/plain|
|    False|"{""#stateName"":...|          text/plain|
|    False|"{""#stateName"":...|          text/plain|
|    False|"{""#stateName"":...|application/qnd.l...|
|    False|"{""#stateName"":...|          text/plain|
|    False|"{""#stateName"":...|          text/plain|
|    False|"{""#blip.payload...|          text/plain|
|    False|"{""#stateName"":...|          text/plain|
|    False|"{""channel"":""C...|          text/plain|
|    False|"{""#stateName"":...|          text/plain|
|    False|"{""#blip.payload...|          text/plain|
|    False|"{""#blip.payload...|          text/plain|
|    False|"{""#stateName"":...|application/qnd.l...|
|    False|"{""#stateName"":...|application/qnd.l...|
|    False|"{""channel"":""C...|          text/plain|
|    False|"{""#stateName"":...|application/qnd.l...|
|    False|"{""#stateName"":...|application/qnd.l...|
+---------+--------------------+--------------------+

Example of row in metadata column:

 "{""#stateName"":""[K.9.2.3]bar"",""#stateId"":""thebar"",""#messageId"":""data:43ff-89e6-62aa235df798"",""#previousStateId"":""80b7-4a5978a03a47"",""#previousStateName"":""[K.9.2.2]foo "",""trace-id"":""84499a247d3%3A1"",""#uniqueId"":""42-9625-58da4a2cbd1f"",""date_created"":""2227"",""#tunnel.originalFrom"":""host-7"",""#tunnel.originalTo"":""[email protected]"",""$originator"":""[email protected]"",""$claims"":""[email protected]/[email protected]=MemberScheme=Transport"",""$internalId"":""58da4a2cbd1f""}" 

Is there a way to break this json data in multiple columns?

WHAT I`VE TRIED SO FAR

1 - Most solutions i´ve found require me to declare a schema. But since i don´t the order of the data in each json, how should i work it out?

2- According to solution here, PySpark: Read nested JSON from a String Type Column and create columns,

under the following code i was able to separate, but all columns returned null

df0 = df1.select(F.get_json_object(F.col('MetaData'), "$.#stateName").alias("State Name"),
          F.get_json_object(F.col('MetaData'), "$.#stateId").alias("State ID"),
          F.get_json_object(F.col('MetaData'), "$.#messageId").alias("Message Id"),
          F.get_json_object(F.col('MetaData'), "$.#previousStateId").alias("previousStateId"),
          F.get_json_object(F.col('MetaData'), "$.#previousStateName").alias("previousStateName"))

df0.show()
+----------+--------+----------+---------------+-----------------+
|State Name|State ID|Message Id|previousStateId|previousStateName|
+----------+--------+----------+---------------+-----------------+
|      null|    null|      null|           null|             null|
|      null|    null|      null|           null|             null|
|      null|    null|      null|           null|             null|
|      null|    null|      null|           null|             null|
|      null|    null|      null|           null|             null|
|      null|    null|      null|           null|             null|
|      null|    null|      null|           null|             null|
|      null|    null|      null|           null|             null|
|      null|    null|      null|           null|             null|
|      null|    null|      null|           null|             null|
|      null|    null|      null|           null|             null|
|      null|    null|      null|           null|             null|
|      null|    null|      null|           null|             null|
|      null|    null|      null|           null|             null|
|      null|    null|      null|           null|             null|
|      null|    null|      null|           null|             null|
|      null|    null|      null|           null|             null|
|      null|    null|      null|           null|             null|
|      null|    null|      null|           null|             null|
|      null|    null|      null|           null|             null|
+----------+--------+----------+---------------+-----------------+
1
  • If you can manage to remove the double quoutes " from the json elements then it is fairly simple to parse. You essentially need to use from_json and with MapType you can define the key-value pairs, which you can later explode to get individual columns per your use case. Commented May 3, 2022 at 17:10

0

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.