
I've been asked to do something in Apache Spark SQL (Java API), using DataFrames, that I think would be really expensive with a naive approach (I'm still working on the naive approach, but I believe it would cost a lot since it would need at least 4 joins).

I have the following DataFrame:

+----+----+----+----+----+----------+------+
|  C1|  C2|  C3|  C4|  C5|UNIQUE KEY|points|
+----+----+----+----+----+----------+------+
|   A|   A|null|null|null|      1234|     2|
|   A|null|null|   H|null|      1235|     3|
|   A|   B|null|null|null|      1236|     3|
|   B|null|null|null|   E|      1237|     1|
|   C|null|null|   G|null|      1238|     1|
|   F|null|   C|   E|null|      1239|     2|
|null|null|   D|   E|   G|      1240|     1|
+----+----+----+----+----+----------+------+

C1, C2, C3, C4 and C5 share the same domain of values, UNIQUE KEY is a unique key, and points is an integer that should be counted only once per distinct value appearing in the C columns of that row (e.g., the first row A,A,null,null,null,1234,2 is equivalent to A,null,null,null,null,1234,2 or to A,A,A,A,null,1234,2).

The task I was given is: "for each existing C value, get the total number of points".

So the output should be:

+----+------+
|  C1|points|
+----+------+
|   A|     8|
|   B|     4|
|   C|     3|
|   D|     1|
|   E|     4|
|   F|     2| 
|   G|     2|
|   H|     3|
+----+------+

I was going to split the DataFrame into multiple small ones (one C column plus the points column each) with simple .select("C1","points"), .select("C2","points") and so on, and then combine them. But I believe that would be really expensive if the amount of data is big. I suspect there is some map-reduce trick, but I couldn't find one myself since I'm still new to all this; I think I'm missing some concepts on how to apply map reduce here.
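Roughly, the naive version I had in mind looks like the sketch below (Java API; the "C" and "key" aliases and the variable names are just placeholders I made up):

import static org.apache.spark.sql.functions.*;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// one narrow frame per C column, stacked with union
Dataset<Row> stacked = null;
for (String c : new String[]{"C1", "C2", "C3", "C4", "C5"}) {
    Dataset<Row> part = df.select(col(c).as("C"), col("UNIQUE KEY").as("key"), col("points"));
    stacked = (stacked == null) ? part : stacked.union(part);
}

Dataset<Row> perValue = stacked
    .filter(col("C").isNotNull())
    .dropDuplicates(new String[]{"C", "key"}) // count each value only once per original row
    .groupBy(col("C"))
    .agg(sum(col("points")).as("points"));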

I also thought about using the explode function: I would put [C1, C2, C3, C4, C5] together into a single array column, explode it so I get 5 rows for each original row, and then just group by... but I believe this would increase the amount of data at some point, and if we are talking about GBs it may not be feasible. I hope you can find the trick I'm looking for.

Thanks for your time.

1 Answer

Using explode would probably be the way to go here. It won't significantly increase the amount of data and will be a lot more computationally efficient than using multiple joins (note that even a single join is an expensive operation by itself).

In this case, you can convert the columns to an array, retaining only the unique values for each separate row. This array can then be exploded and all nulls filtered away. At this point, a simple groupBy and sum will give you the wanted result.

In Scala:

import org.apache.spark.sql.functions._
import spark.implicits._ // for the $"..." column syntax, assuming a SparkSession named spark

df.select(explode(array_distinct(array("C1", "C2", "C3", "C4", "C5"))).as("C1"), $"points")
  .filter($"C1".isNotNull)
  .groupBy($"C1")
  .agg(sum($"points").as("points"))
  .sort($"C1") // not really necessary

This will give you the wanted result:

+----+------+
|  C1|points|
+----+------+
|   A|     8|
|   B|     4|
|   C|     3|
|   D|     1|
|   E|     4|
|   F|     2| 
|   G|     2|
|   H|     3|
+----+------+
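
Since the question mentions the Java API: assuming Spark 2.4+ (where array_distinct is available), the Java equivalent should be roughly the following untested sketch:

import static org.apache.spark.sql.functions.*;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

Dataset<Row> result = df
    .select(explode(array_distinct(array(col("C1"), col("C2"), col("C3"), col("C4"), col("C5")))).as("C1"),
            col("points"))
    .filter(col("C1").isNotNull())
    .groupBy(col("C1"))
    .agg(sum(col("points")).as("points"))
    .sort(col("C1")); // not really necessary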

3 Comments

Doesn't explode increase the data? I think the number of rows will increase, so what do you mean by "the amount of data"?
@Lamanus: The number of rows will temporarily increase (before the groupBy), but the number of columns is smaller (here, from 7 to 2) and the information contained in the table stays the same. Some tips on explode for large data can be found here: stackoverflow.com/questions/52777421/…
I like the idea of array_distinct (I didn't know something like that existed), and it turns out a lot better than I expected (I thought it would be around 4-5x the data), but I'm only getting about 2.3x in my case. I also watched the video suggested in stackoverflow.com/questions/52777421/ and I guess that's the best that can be reached. And yes, I agree it would be more computationally efficient. Thanks.
