I'm loading a JSON file into PySpark:

df = spark.read.json("20220824211022.json")
df.show()
+--------------------+--------------------+--------------------+
|                data|            includes|                meta|
+--------------------+--------------------+--------------------+
|[{961778216070344...|{[{2018-02-09T01:...|{1562543391161741...|
+--------------------+--------------------+--------------------+

The two columns I'm interested in here are data and includes. For data, I ran the following:

import pyspark.sql.functions as F

df2 = df.withColumn("data", F.explode(F.col("data"))).select("data.*")
df2.show(2)
+-------------------+--------------------+-------------------+--------------+--------------------+
|          author_id|          created_at|                 id|public_metrics|                text|
+-------------------+--------------------+-------------------+--------------+--------------------+
| 961778216070344705|2022-08-24T20:52:...|1562543391161741312|  {0, 0, 0, 2}|With Kaskada, you...|
|1275784834321768451|2022-08-24T20:47:...|1562542031284555777|  {2, 0, 0, 0}|Below is a protot...|
+-------------------+--------------------+-------------------+--------------+--------------------+

which is something I can work with. However, I can't do the same with the includes column, because its array ([]) is wrapped in an object ({}).

Is there a way for me to deal with this using PySpark?

EDIT:

If you were to look at the includes sections in the JSON file, it looks like:

"includes": {"users": [{"id": "893899303" .... }, ...]},

So ideally, in the first table in my question, I'd want includes to be users, or at least be able to drill down to users.

  • could you use df3 = df.withColumn("includes", F.explode(F.col("includes").getItem("users"))).select("includes.*")? Commented Aug 25, 2022 at 5:26
  • Thanks! That works. Not quite sure what it is doing, though. What is the select("includes.*") doing with the star? Commented Aug 25, 2022 at 7:58
  • It's because after you call getItem(), it returns a struct type. The select("includes.*") selects all the columns inside this struct. Please accept the answer if it helps. Commented Aug 25, 2022 at 8:19

1 Answer

As your includes column is a struct with the field "users" (Spark's JSON reader infers nested JSON objects as StructType), you can use .getItem() to get the array by field name and then explode it, that is:

df3 = df.withColumn("includes", F.explode(F.col("includes").getItem("users"))).select("includes.*")
