How to get incremental sublists or subarrays from an existing array in Spark?

Question

I am using Spark with Scala and have the following DataFrame. The list column holds an array of strings in each row, and I have also created a len column with the size of each array.

+---------------+-------------+
|           list|          len|
+---------------+-------------+
|      [a, b, c]|            3|
|[d, e, f, g, h]|            5|
+---------------+-------------+

I need to expand the list column so that each new row holds the array from the previous row without its first element, with len updated to match.

+---------------+-------------+
|           list|          len|
+---------------+-------------+
|      [a, b, c]|            3|
|         [b, c]|            2|
|[d, e, f, g, h]|            5|
|   [e, f, g, h]|            4|
|      [f, g, h]|            3|
|         [g, h]|            2|
+---------------+-------------+

I have used the following code to do this. I think I get the desired output but want to optimize it.

import org.apache.spark.sql.functions._
import spark.implicits._  // assumes a SparkSession named spark, as in spark-shell

val arrayData = Seq((3, List("a", "b", "c")), (5, List("d", "e", "f", "g", "h")))
val df = arrayData.toDF("len", "list")

df.select($"*", posexplode($"list").as(Seq("startIndex", "startValue")))
  .withColumn("newLength", col("len") - col("startIndex"))
  .withColumn("newList", when(col("startIndex") > 0,
                              slice($"list", col("startIndex") + 1, col("newLength")))
                         .otherwise(col("list")))

The above code gives the following output:

+---+---------------+----------+----------+---------+---------------+
|len|           list|startIndex|startValue|newLength|        newList|
+---+---------------+----------+----------+---------+---------------+
|  3|      [a, b, c]|         0|         a|        3|      [a, b, c]|
|  3|      [a, b, c]|         1|         b|        2|         [b, c]|
|  3|      [a, b, c]|         2|         c|        1|            [c]|
|  5|[d, e, f, g, h]|         0|         d|        5|[d, e, f, g, h]|
|  5|[d, e, f, g, h]|         1|         e|        4|   [e, f, g, h]|
|  5|[d, e, f, g, h]|         2|         f|        3|      [f, g, h]|
|  5|[d, e, f, g, h]|         3|         g|        2|         [g, h]|
|  5|[d, e, f, g, h]|         4|         h|        1|            [h]|
+---+---------------+----------+----------+---------+---------------+

Is there a better way to do this without the extra columns or posexplode? posexplode can be very memory-intensive. Any help is appreciated.

vilalabinot · Accepted Answer · 2022-08-10 08:21:44Z


Another way to do this is by operating with arrays:

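val result = df
  .withColumn("list",
    // array_repeat makes len copies of the array; transform cuts the i-th copy
    // down to its last i + 1 elements, and explode emits one row per sublist
    explode(expr("transform(array_repeat(list, len), (x, i) -> slice(x, -1 - i, i + 1))"))
  )
  .withColumn("len", size(col("list")))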

Result (I simulated a case with numbers):

+---------------+---+
|list           |len|
+---------------+---+
|[3]            |1  |
|[2, 3]         |2  |
|[1, 2, 3]      |3  |
|[8]            |1  |
|[7, 8]         |2  |
|[6, 7, 8]      |3  |
|[5, 6, 7, 8]   |4  |
|[4, 5, 6, 7, 8]|5  |
+---------------+---+
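
If the rows should come out longest sublist first, as in the question's table, one possible variation is to iterate over sequence(1, size(list)) instead of repeating the array. This is only a minimal sketch, assuming Spark 2.4 or later and arrays that are non-null and non-empty (sequence(1, 0) would count downwards and break the slice). The object SuffixSketch, the helper suffixes, and the minLen parameter are names made up for illustration; minLen = 2 drops the single-element lists that the question's desired table omits.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode, expr, size}

object SuffixSketch {
  // Hypothetical helper: one output row per suffix of `list`, longest first.
  // Assumes `list` is a non-null, non-empty array column.
  def suffixes(df: DataFrame, minLen: Int = 1): DataFrame =
    df.select(
        explode(expr(
          // i runs over 1..size(list); slice is 1-based and takes (start, length)
          "transform(sequence(1, size(list)), i -> slice(list, i, size(list) - i + 1))"
        )).as("list"))
      .withColumn("len", size(col("list")))
      .where(col("len") >= minLen)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("suffix-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(List("a", "b", "c"), List("d", "e", "f", "g", "h")).toDF("list")
    suffixes(df, minLen = 2).show(false)
    spark.stop()
  }
}
```

This still relies on explode, so the number of output rows is unchanged; what it avoids is materialising len full copies of each array before slicing.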

Good luck!


3 Comments

user3388770 Over a year ago

Thanks a lot, this works. One more question: if I want both lists in the row, for example [6, 7, 8] and [4, 5], can I use the following? (x, i) -> (slice(x, -1 - i, i + 1), slice(x, 1, -1 - i))

user3388770 Over a year ago

I am also running into java.lang.NoSuchMethodError: org.antlr.v4.runtime.atn.RuleTransition error while doing this.

vilalabinot Over a year ago

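You can do: (x, i) -> array(slice(x, -1 - i, i + 1), slice(x, 1, size(x) - i - 1)), then you will have an array of arrays, like: [[6,7,8], [4,5]]. The third argument of slice is a length and must not be negative, so the length of the leading part is computed from size(x).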
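
The two-part split discussed in these comments could be sketched as follows. The third argument of slice is a length and Spark rejects negative values, so this sketch derives the length of the leading part from size(x) rather than passing -1 - i. It assumes Spark 2.4 or later and non-null arrays; the object SplitSketch and the helper splits are hypothetical names used only for this example.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{explode, expr}

object SplitSketch {
  // Hypothetical helper: for each i in 0 until size(list), emit a row holding
  // [last i + 1 elements, remaining leading elements] as an array of arrays.
  def splits(df: DataFrame): DataFrame =
    df.select(
      explode(expr(
        """transform(array_repeat(list, size(list)), (x, i) ->
          |  array(slice(x, -1 - i, i + 1), slice(x, 1, size(x) - i - 1)))""".stripMargin
      )).as("parts"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("split-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(List(4, 5, 6, 7, 8)).toDF("list")
    splits(df).show(false)
    spark.stop()
  }
}
```

When i reaches size(list) - 1 the trailing part is the whole array and the leading part is an empty array, not null.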
