I am using Spark with Scala and have the following DataFrame. The list column holds arrays of strings, each with unique elements, and the len column holds the size of each array.

+---------------+-------------+
|           list|          len|
+---------------+-------------+
|      [a, b, c]|            3|
|[d, e, f, g, h]|            5|
+---------------+-------------+

I need to expand each row of this DataFrame into several rows, where each new row's array leaves out the first element of the array in the previous row and len is updated to match. The expected output is:

+---------------+-------------+
|           list|          len|
+---------------+-------------+
|      [a, b, c]|            3|
|         [b, c]|            2|
|[d, e, f, g, h]|            5|
|   [e, f, g, h]|            4|
|      [f, g, h]|            3|
|         [g, h]|            2|
+---------------+-------------+

I used the following code to do this. I think it produces the desired output, but I want to optimize it.

import org.apache.spark.sql.functions._
import spark.implicits._  // assumes a SparkSession named spark, as in spark-shell

val arrayData = Seq((3, List("a", "b", "c")), (5, List("d", "e", "f", "g", "h")))
val df = arrayData.toDF("len", "list")

df.select($"*", posexplode($"list").as(Seq("startIndex", "startValue")))
  .withColumn("newLength", col("len") - col("startIndex"))
  .withColumn("newList",
    when(col("startIndex") > 0, slice($"list", col("startIndex") + 1, col("newLength")))
      .otherwise(col("list")))

The code above gives the following output:

+---+---------------+----------+----------+---------+---------------+
|len|           list|startIndex|startValue|newLength|        newList|
+---+---------------+----------+----------+---------+---------------+
|  3|      [a, b, c]|         0|         a|        3|      [a, b, c]|
|  3|      [a, b, c]|         1|         b|        2|         [b, c]|
|  3|      [a, b, c]|         2|         c|        1|            [c]|
|  5|[d, e, f, g, h]|         0|         d|        5|[d, e, f, g, h]|
|  5|[d, e, f, g, h]|         1|         e|        4|   [e, f, g, h]|
|  5|[d, e, f, g, h]|         2|         f|        3|      [f, g, h]|
|  5|[d, e, f, g, h]|         3|         g|        2|         [g, h]|
|  5|[d, e, f, g, h]|         4|         h|        1|            [h]|
+---+---------------+----------+----------+---------+---------------+

Is there a better way to do this without multiple new columns or posexplode, which can be very memory-intensive? Any help is appreciated.

1 Answer

Another way to do this is by operating with arrays:

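// For each index i in 0 until len, take the last i + 1 elements of the array,
// then explode so that every suffix becomes its own row.
val result = df
  .withColumn("list",
    explode(expr("transform(array_repeat(list, len), (x, i) -> slice(x, -1 - i, i + 1))")))
  .withColumn("len", size(col("list")))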

Result (I simulated a case with numbers):

+---------------+---+
|list           |len|
+---------------+---+
|[3]            |1  |
|[2, 3]         |2  |
|[1, 2, 3]      |3  |
|[8]            |1  |
|[7, 8]         |2  |
|[6, 7, 8]      |3  |
|[5, 6, 7, 8]   |4  |
|[4, 5, 6, 7, 8]|5  |
+---------------+---+
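
If the rows should instead come out as in the question's expected table (longest array first, stopping at two elements), a variant along the following lines might work. This is only a minimal sketch: the object name SuffixSlices and the helper suffixes are hypothetical, it assumes Spark 2.4 or later (for transform, sequence and slice in SQL expressions), and it assumes every list is non-null and has at least two elements, since sequence(1, 0) would count downwards and hand slice an invalid start index of 0.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode, expr, size}

object SuffixSlices {
  // Hypothetical helper: emits one row per suffix of `list` that has at
  // least two elements, longest suffix first.
  // Assumption: `list` is non-null and contains at least two elements.
  def suffixes(df: DataFrame): DataFrame =
    df.withColumn(
        "list",
        // i runs over the 1-based start positions 1 .. size(list) - 1;
        // each slice keeps everything from position i to the end.
        explode(expr(
          "transform(sequence(1, size(list) - 1), i -> slice(list, i, size(list) - i + 1))"))
      )
      // Recompute the length from the sliced array.
      .withColumn("len", size(col("list")))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("suffix-slices")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(List("a", "b", "c"), List("d", "e", "f", "g", "h")).toDF("list")
    suffixes(df).show(false)

    spark.stop()
  }
}
```

Compared with the array_repeat version, this does not need an existing len column and does not copy the whole array once per index before slicing, though whether that matters in practice would need measuring on real data.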

Good luck!

3 Comments

Thanks a lot, this works. One more question: if I want both lists in the row, for example [6, 7, 8] and [4, 5], can I use the following? (x, i) -> (slice(x, -1 - i, i + 1), slice(x, 1, -1 - i))
I am also running into a java.lang.NoSuchMethodError: org.antlr.v4.runtime.atn.RuleTransition error while doing this.
You can do: (x, i) -> array(slice(x, -1 - i, i + 1), slice(x, 1, -1 - i)), then you will have an array of arrays, like: [[6, 7, 8], [4, 5]]
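
One caveat on the expression in the comments above: as far as I know, Spark's slice expects a non-negative length, so slice(x, 1, -1 - i) is likely to fail at runtime. A minimal sketch of the same idea with the length computed from size(x) is shown below. The object name SuffixPrefixPairs, the helper pairs, and the output column name pair are hypothetical, and the sketch assumes Spark 2.4 or later and a non-null list column.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{explode, expr}

object SuffixPrefixPairs {
  // Hypothetical helper: for each index i, pairs the last i + 1 elements of
  // `list` with the leading elements that precede them.
  // Assumption: `list` is non-null.
  def pairs(df: DataFrame): DataFrame =
    df.withColumn(
      "pair",
      explode(expr(
        """transform(
          |  array_repeat(list, size(list)),
          |  (x, i) -> array(slice(x, -1 - i, i + 1), slice(x, 1, size(x) - i - 1))
          |)""".stripMargin))
    )

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("suffix-prefix-pairs")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(List(4, 5, 6, 7, 8)).toDF("list")
    pairs(df).show(false)

    spark.stop()
  }
}
```

For the last index the leading part has length 0, so that row pairs the full list with an empty array; filter such rows out if they are not wanted.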
