
I've been asked to do something in Apache Spark SQL (Java API), using DataFrames, that I think would be really expensive with a naive approach (I'm still working on the naive approach, but I believe it would cost a lot since it would need at least 4 joins).

I have the following DataFrame:

+----+----+----+----+----+----------+------+
|  C1|  C2|  C3|  C4|  C5|UNIQUE KEY|points|
+----+----+----+----+----+----------+------+
|   A|   A|null|null|null|      1234|     2|
|   A|null|null|   H|null|      1235|     3|
|   A|   B|null|null|null|      1236|     3|
|   B|null|null|null|   E|      1237|     1|
|   C|null|null|   G|null|      1238|     1|
|   F|null|   C|   E|null|      1239|     2|
|null|null|   D|   E|   G|      1240|     1|
+----+----+----+----+----+----------+------+

C1, C2, C3, C4 and C5 share the same domain of values, UNIQUE KEY is a unique key, and points is an integer that should be counted only once per distinct value appearing in the C columns of that row (e.g., the first row A,A,null,null,null,1234,2 is equivalent to A,null,null,null,null,1234,2 or to A,A,A,A,null,1234,2).

The task I was given is: "for each existing C value, get the total number of points".

So the output should be:

+----+------+
|  C1|points|
+----+------+
|   A|     8|
|   B|     4|
|   C|     3|
|   D|     1|
|   E|     4|
|   F|     2| 
|   G|     2|
|   H|     3|
+----+------+

I was going to split the DataFrame into multiple small ones (one C column plus the points column each) with simple .select("C1","points"), .select("C2","points") and so on, and then combine them. But I believe that would be really expensive if the amount of data is big. I suspect there is some map-reduce trick, but I couldn't find one myself since I'm still new to all this; I think I'm missing some concepts on how to apply map reduce here.
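Roughly, the naive version I had in mind looks like the sketch below (Java API; the "C" and "key" aliases and the variable names are just placeholders I made up):

import static org.apache.spark.sql.functions.*;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// one narrow frame per C column, stacked with union
Dataset<Row> stacked = null;
for (String c : new String[]{"C1", "C2", "C3", "C4", "C5"}) {
    Dataset<Row> part = df.select(col(c).as("C"), col("UNIQUE KEY").as("key"), col("points"));
    stacked = (stacked == null) ? part : stacked.union(part);
}

Dataset<Row> perValue = stacked
    .filter(col("C").isNotNull())
    .dropDuplicates(new String[]{"C", "key"}) // count each value only once per original row
    .groupBy(col("C"))
    .agg(sum(col("points")).as("points"));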

I also thought about using the explode function: I would put [C1, C2, C3, C4, C5] together into a single array column, explode it so I get 5 rows for each original row, and then just group by... but I believe this would increase the amount of data at some point, and if we are talking about GBs it may not be feasible. I hope you can find the trick I'm looking for.

Thanks for your time.

1 Answer

Using explode would probably be the way to go here. It won't significantly increase the amount of data and will be a lot more computationally efficient than using multiple joins (note that even a single join is an expensive operation by itself).

In this case, you can convert the columns to an array, retaining only the unique values for each separate row. This array can then be exploded and all nulls filtered away. At this point, a simple groupBy and sum will give you the wanted result.

In Scala:

import org.apache.spark.sql.functions._
import spark.implicits._ // for the $"..." column syntax, assuming a SparkSession named spark

df.select(explode(array_distinct(array("C1", "C2", "C3", "C4", "C5"))).as("C1"), $"points")
  .filter($"C1".isNotNull)
  .groupBy($"C1")
  .agg(sum($"points").as("points"))
  .sort($"C1") // not really necessary

This will give you the wanted result:

+----+------+
|  C1|points|
+----+------+
|   A|     8|
|   B|     4|
|   C|     3|
|   D|     1|
|   E|     4|
|   F|     2| 
|   G|     2|
|   H|     3|
+----+------+
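
Since the question mentions the Java API: assuming Spark 2.4+ (where array_distinct is available), the Java equivalent should be roughly the following untested sketch:

import static org.apache.spark.sql.functions.*;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

Dataset<Row> result = df
    .select(explode(array_distinct(array(col("C1"), col("C2"), col("C3"), col("C4"), col("C5")))).as("C1"),
            col("points"))
    .filter(col("C1").isNotNull())
    .groupBy(col("C1"))
    .agg(sum(col("points")).as("points"))
    .sort(col("C1")); // not really necessary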

3 Comments

Doesn't explode increase the data? I think the number of rows will increase, so what do you mean by "the amount of data"?
@Lamanus: The number of rows will temporarily increase (before the groupBy), but the number of columns is smaller (here, from 7 to 2) and the information contained in the table stays the same. Some tips on explode for large data can be found here: stackoverflow.com/questions/52777421/…
I like the idea of array_distinct (I didn't know something like that existed), and it turns out a lot better than I expected (I thought it would be around 4-5x the data), but I'm only getting about 2.3x in my case. I also watched the video suggested in stackoverflow.com/questions/52777421/ and I guess that's the best that can be reached. And yes, I agree it would be more computationally efficient. Thanks.
