I'm loading a JSON file into PySpark:
df = spark.read.json("20220824211022.json")
df.show()
+--------------------+--------------------+--------------------+
| data| includes| meta|
+--------------------+--------------------+--------------------+
|[{961778216070344...|{[{2018-02-09T01:...|{1562543391161741...|
+--------------------+--------------------+--------------------+
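For reference, the nesting can be inspected with printSchema() (a sketch, using the same df as above):

df.printSchema()
# shows that data is an array of structs, while includes and meta are plain structs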
The two columns I'm interested in here are data and includes. For data, I ran the following:
from pyspark.sql import functions as F

df2 = df.withColumn("data", F.explode(F.col("data"))).select("data.*")
df2.show(2)
+-------------------+--------------------+-------------------+--------------+--------------------+
| author_id| created_at| id|public_metrics| text|
+-------------------+--------------------+-------------------+--------------+--------------------+
| 961778216070344705|2022-08-24T20:52:...|1562543391161741312| {0, 0, 0, 2}|With Kaskada, you...|
|1275784834321768451|2022-08-24T20:47:...|1562542031284555777| {2, 0, 0, 0}|Below is a protot...|
+-------------------+--------------------+-------------------+--------------+--------------------+
That output is something I can work with. However, I can't do the same with the includes column, because it has {} enclosing the [] (a struct wrapping an array).
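For context, applying the same explode directly to includes fails roughly like this (a sketch; the error message is paraphrased):

df.withColumn("includes", F.explode(F.col("includes")))
# AnalysisException: input to function explode should be array or map type, not struct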
Is there a way for me to deal with this using PySpark?
EDIT:
If you look at the includes section in the JSON file, it looks like this:
"includes": {"users": [{"id": "893899303" .... }, ...]},
So ideally, in the first table in my question, I'd want the includes column to become users, or at least to be able to drill down to users.
Answer:

df3 = df.withColumn("includes", F.explode(F.col("includes").getItem("users"))).select("includes.*")

What is select("includes.*") doing with the star? After getItem(), the column is a struct type, and select("includes.*") selects all the columns inside that struct.
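The nested array can also be reached with a dot path instead of getItem(); a minimal sketch using the same column names as above:

from pyspark.sql import functions as F

# Equivalent: resolve the users array via a dot path, explode it, then expand the struct's fields.
df3 = df.withColumn("includes", F.explode(F.col("includes.users"))).select("includes.*")
df3.show(2)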