I'm new to Spark (and to Stack Overflow). For the following RDD and DataFrame (which hold the same data), I want to get the most viewed tags of playlists with more than N videos. My issue is that the tags are stored in an array, and I don't know where to start, as this seems advanced.
RDD
(id,playlist,tags,videos,views)
(1,playlist_1,[t1, t2, t3],9,200)
(2,playlist_2,[t4, t5, t7],64,793)
(3,playlist_3,[t4, t6, t3],51,114)
(4,playlist_4,[t1, t6, t2],8,115)
(5,playlist_5,[t1, t6, t2],51,256)
(2,playlist_6,[t4, t5, t2],66,553)
(3,playlist_7,[t4, t6, t2],77,462)
DataFrame
+---+------------+--------------+--------+-------+
| id| playlist | tags | videos | views |
+---+------------+--------------+--------+-------+
| 1 | playlist_1 | [t1, t2, t3] | 9 | 200 |
| 2 | playlist_2 | [t4, t5, t7] | 64 | 793 |
| 3 | playlist_3 | [t4, t6, t3] | 51 | 114 |
| 4 | playlist_4 | [t1, t6, t2] | 8 | 115 |
| 5 | playlist_5 | [t1, t6, t2] | 51 | 256 |
| 2 | playlist_6 | [t4, t5, t2] | 66 | 553 |
| 3 | playlist_7 | [t4, t6, t2] | 77 | 462 |
+---+------------+--------------+--------+-------+
Expected Result
Tags for playlists with more than N videos (here N = 65), with views summed per tag:
+-----+-------+
| tag | views |
+-----+-------+
| t2 | 1015 |
| t4 | 1015 |
| t5 | 553 |
| t6 | 462 |
+-----+-------+
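To make the arithmetic behind this table explicit, here is a minimal plain-Python sketch (not Spark code) that reproduces it from the sample rows. The function name `top_tags` and the tie-breaking by tag name are assumptions made for illustration. The comments note which Spark operations each step would roughly correspond to: `filter`, then `flatMap` on an RDD or `explode` on a DataFrame, then `reduceByKey` or `groupBy` with `sum`.

```python
from collections import defaultdict

# Sample rows from the question: (id, playlist, tags, videos, views)
rows = [
    (1, "playlist_1", ["t1", "t2", "t3"], 9, 200),
    (2, "playlist_2", ["t4", "t5", "t7"], 64, 793),
    (3, "playlist_3", ["t4", "t6", "t3"], 51, 114),
    (4, "playlist_4", ["t1", "t6", "t2"], 8, 115),
    (5, "playlist_5", ["t1", "t6", "t2"], 51, 256),
    (2, "playlist_6", ["t4", "t5", "t2"], 66, 553),
    (3, "playlist_7", ["t4", "t6", "t2"], 77, 462),
]


def top_tags(rows, n):
    """Sum views per tag over playlists with more than n videos.

    Hypothetical helper for illustration only.
    """
    totals = defaultdict(int)
    for _id, _playlist, tags, videos, views in rows:
        if videos > n:                # Spark analogue: filter
            for tag in tags:          # Spark analogue: flatMap / explode
                totals[tag] += views  # Spark analogue: reduceByKey / groupBy + sum
    # Sort by views descending; ties broken by tag name (an assumption).
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))


result = top_tags(rows, 65)
for tag, views in result:
    print(tag, views)
```

With N = 65 only playlist_6 and playlist_7 pass the filter, so t2 and t4 each get 553 + 462 = 1015, t5 gets 553, and t6 gets 462.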
The requirement (most viewed tags of playlists with over N videos) is somewhat vague; an example would help resolve any ambiguity.