I have a dataframe like this:
df = spark.createDataFrame([(0, ["B","C","D","E"]),(1,["E","A","C"]),(2, ["F","A","E","B"]),(3,["E","G","A"]),(4,["A","C","E","B","D"])], ["id","items"])
which creates a data frame df like this:
+---+---------------+
| id|          items|
+---+---------------+
|  0|   [B, C, D, E]|
|  1|      [E, A, C]|
|  2|   [F, A, E, B]|
|  3|      [E, G, A]|
|  4|[A, C, E, B, D]|
+---+---------------+
I would like to get a result like this:
+---+-----+
|all|count|
+---+-----+
|  F|    1|
|  E|    5|
|  B|    3|
|  D|    2|
|  C|    3|
|  A|    4|
|  G|    1|
+---+-----+
This essentially just finds all distinct elements in df["items"] and counts their frequency. If my data were of a more manageable size, I would just do this:
from pyspark.sql.functions import explode

all_items = df.select(explode("items").alias("all"))
result = all_items.groupby(all_items.all).count().distinct()
result.show()
But because my data has millions of rows and thousands of elements in each list, this is not an option. I was thinking of doing this row by row, so that I only ever work with two lists at a time. Because most elements are repeated across many rows (though the list within each row is a set), this approach should solve my problem. The problem is, I don't really know how to do this in Spark, as I've only just started learning it. Could anyone help, please?
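To make the row-by-row idea concrete, here is a rough sketch of what I mean (I'm guessing at the right tools here; treeAggregate and collections.Counter are just my attempt, and it assumes the number of distinct items fits in driver memory):

from collections import Counter

# Fold the rows into a single Counter, merging only two count dicts
# at any one step. treeAggregate combines partial results in a tree
# pattern rather than pulling every row to the driver at once.
counts = df.rdd.treeAggregate(
    Counter(),
    lambda acc, row: acc + Counter(row["items"]),  # add one row's items
    lambda a, b: a + b,                            # merge two partial counts
)
result = spark.createDataFrame(list(counts.items()), ["all", "count"])
result.show()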
explode or flatMap won't increase the amount of data, so there should be no problem with this approach (if the example pipeline is correct, distinct is obsolete). Performance of individual functions might be a concern, but otherwise it is the way to go.

"But because my data has millions of rows and thousands of elements in each list, this is not an option." Are you getting a specific error when you try to use explode? Explode is the appropriate option here, so if you're getting an error, we probably need to look at tuning other areas of your config.
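For reference, a flatMap-based sketch equivalent to the explode pipeline mentioned above (assuming items is an array of strings, as in the example):

from operator import add

# Emit one (item, 1) pair per list element, then sum per item.
# Like explode, this keeps one record per element; the shuffle
# carries partial counts rather than whole rows.
result = (
    df.rdd.flatMap(lambda row: [(item, 1) for item in row["items"]])
          .reduceByKey(add)
          .toDF(["all", "count"])
)
result.show()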