I've been asked to do something in Apache Spark SQL (Java API), through dataframes, that I think would be very expensive if done naively (I'm still working on the naive approach, but I think it will cost a lot since it would need at least 4 joins).
I have the following dataframe:
+----+----+----+----+----+----------+------+
| C1| C2| C3| C4| C5|UNIQUE KEY|points|
+----+----+----+----+----+----------+------+
| A| A|null|null|null| 1234| 2|
| A|null|null| H|null| 1235| 3|
| A| B|null|null|null| 1236| 3|
| B|null|null|null| E| 1237| 1|
| C|null|null| G|null| 1238| 1|
| F|null| C| E|null| 1239| 2|
|null|null| D| E| G| 1240| 1|
+----+----+----+----+----+----------+------+
C1, C2, C3, C4 and C5 share the same domain of values, UNIQUE KEY is a unique key, and points is an integer that should be counted only once per distinct value among that row's C columns (e.g., the first row A,A,null,null,null,1234,2 is equivalent to A,null,null,null,null,1234,2 or to A,A,A,A,null,1234,2).
I was asked to "for each existing C value, get the total number of points".
So the output should be (for example, A = 2 + 3 + 3, from keys 1234, 1235 and 1236):
+----+------+
| C1|points|
+----+------+
| A| 8|
| B| 4|
| C| 3|
| D| 1|
| E| 4|
| F| 2|
| G| 2|
| H| 3|
+----+------+
I was going to split the dataframe into multiple small ones (one column for a C column and one for the points) through simple .select("C1","points"), .select("C2","points") and so on, then combine them and group by value (a sketch of what I mean is below). But I believe that would really cost a lot if the amount of data is big. I believe there should be some sort of trick through map/reduce, but I couldn't find one myself since I'm still new to all this world; I think I'm missing some concepts on how to apply a map/reduce.
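Something like this minimal sketch is what I had in mind (df is the dataframe shown above; the "value" alias and the dedup step are my own names/additions, to honour the "count once per row" rule):

    import static org.apache.spark.sql.functions.col;
    import static org.apache.spark.sql.functions.sum;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    // Stack the five (key, C value, points) projections on top of each other.
    Dataset<Row> pairs = df
            .select(col("UNIQUE KEY"), col("C1").as("value"), col("points"))
            .union(df.select(col("UNIQUE KEY"), col("C2").as("value"), col("points")))
            .union(df.select(col("UNIQUE KEY"), col("C3").as("value"), col("points")))
            .union(df.select(col("UNIQUE KEY"), col("C4").as("value"), col("points")))
            .union(df.select(col("UNIQUE KEY"), col("C5").as("value"), col("points")));

    Dataset<Row> result = pairs
            .where(col("value").isNotNull())
            // a value must be counted only once per row even if it fills
            // several C columns (like the A,A row), hence the dedup on key
            .dropDuplicates(new String[]{"UNIQUE KEY", "value"})
            .groupBy("value")
            .agg(sum("points").as("points"));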
I also thought about using the explode function: putting [C1, C2, C3, C4, C5] together in one column, then using explode so I get 5 rows for each row, and then just grouping by value... but I believe this would increase the amount of data at some point, and if we are talking about GBs it may not be feasible... I hope you can find the trick I'm looking for.
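A rough sketch of the explode idea, in case it helps (again, everything other than the column names above is my own naming):

    import static org.apache.spark.sql.functions.array;
    import static org.apache.spark.sql.functions.col;
    import static org.apache.spark.sql.functions.explode;
    import static org.apache.spark.sql.functions.sum;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    Dataset<Row> result = df
            // pack the five C columns into one array and explode it:
            // one output row per (original row, C slot) pair
            .withColumn("value",
                    explode(array(col("C1"), col("C2"), col("C3"), col("C4"), col("C5"))))
            .where(col("value").isNotNull())
            // count a value only once per original row
            .dropDuplicates(new String[]{"UNIQUE KEY", "value"})
            .groupBy("value")
            .agg(sum("points").as("points"));

This is exactly where my concern comes from: the explode materialises 5 rows for every input row before anything gets aggregated.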
Thanks for your time.