I'm loading a JSON file into PySpark:
df = spark.read.json("20220824211022.json")
df.show()
+--------------------+--------------------+--------------------+
| data| includes| meta|
+--------------------+--------------------+--------------------+
|[{961778216070344...|{[{2018-02-09T01:...|{1562543391161741...|
+--------------------+--------------------+--------------------+
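For reference, the nesting can be inspected with printSchema() (a sketch, using the same df as above):

df.printSchema()
# shows that data is an array of structs, while includes and meta are plain structs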
The two columns I'm interested in here are data and includes. For data, I ran the following:
from pyspark.sql import functions as F

df2 = df.withColumn("data", F.explode(F.col("data"))).select("data.*")
df2.show(2)
+-------------------+--------------------+-------------------+--------------+--------------------+
| author_id| created_at| id|public_metrics| text|
+-------------------+--------------------+-------------------+--------------+--------------------+
| 961778216070344705|2022-08-24T20:52:...|1562543391161741312| {0, 0, 0, 2}|With Kaskada, you...|
|1275784834321768451|2022-08-24T20:47:...|1562542031284555777| {2, 0, 0, 0}|Below is a protot...|
+-------------------+--------------------+-------------------+--------------+--------------------+
That output is something I can work with. However, I can't do the same with the includes column, because it has {} enclosing the [] (a struct wrapping an array).
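For context, applying the same explode directly to includes fails roughly like this (a sketch; the error message is paraphrased):

df.withColumn("includes", F.explode(F.col("includes")))
# AnalysisException: input to function explode should be array or map type, not struct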
Is there a way for me to deal with this using PySpark?
EDIT:
If you look at the includes section in the JSON file, it looks like this:
"includes": {"users": [{"id": "893899303" .... }, ...]},
So ideally, in the first table in my question, I'd want the includes column to become users, or at least to be able to drill down to users.
Answer:

df3 = df.withColumn("includes", F.explode(F.col("includes").getItem("users"))).select("includes.*")

What is select("includes.*") doing with the star? After getItem(), the column is a struct type, and select("includes.*") selects all the columns inside that struct.
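The nested array can also be reached with a dot path instead of getItem(); a minimal sketch using the same column names as above:

from pyspark.sql import functions as F

# Equivalent: resolve the users array via a dot path, explode it, then expand the struct's fields.
df3 = df.withColumn("includes", F.explode(F.col("includes.users"))).select("includes.*")
df3.show(2)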