I am using Spark with Scala and have the following DataFrame. The first column, list, holds a distinct array of strings in each row. I have also created a column, len, with the size of each array.
+---------------+-------------+
|           list|          len|
+---------------+-------------+
|      [a, b, c]|            3|
|[d, e, f, g, h]|            5|
+---------------+-------------+
I need to expand each row of the above DataFrame into several rows, where each new row holds the array from the previous row with its first element dropped. The len column is updated to match the new array size.
+---------------+-------------+
|           list|          len|
+---------------+-------------+
|      [a, b, c]|            3|
|         [b, c]|            2|
|[d, e, f, g, h]|            5|
|   [e, f, g, h]|            4|
|      [f, g, h]|            3|
|         [g, h]|            2|
+---------------+-------------+
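To make the target concrete, here is a minimal plain-Scala sketch (no Spark) of the expansion shown in the table above. The object name SuffixSketch, the helper suffixes, and the minLen parameter are hypothetical names for illustration. The default minLen = 2 is only an assumption read off the table, which stops at arrays of length 2.

```scala
object SuffixSketch {
  // Returns every suffix of `list` that has at least `minLen` elements,
  // paired with its length. `tails` yields the list itself, then the list
  // without its first element, and so on down to the empty list.
  def suffixes(list: Seq[String], minLen: Int = 2): List[(Seq[String], Int)] =
    list.tails
      .filter(_.length >= minLen)   // drops the empty suffix and any too-short ones
      .map(s => (s, s.length))
      .toList

  def main(args: Array[String]): Unit =
    suffixes(Seq("a", "b", "c")).foreach(println)
}
```

Calling suffixes(list, minLen = 1) would also keep the single-element suffixes.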
I have used the following code to do this. I think it gives the desired output, but I would like to optimize it.
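import org.apache.spark.sql.functions._
import spark.implicits._ // assumes an existing SparkSession named `spark`, as in spark-shell

val arrayData = Seq((3, List("a", "b", "c")), (5, List("d", "e", "f", "g", "h")))
val df = arrayData.toDF("len", "list")
// One row per array element, then keep the part of the array from that element onward.
val result = df.select($"*", posexplode($"list").as(Seq("startIndex", "startValue")))
  .withColumn("newLength", col("len") - col("startIndex"))
  .withColumn("newList",
    when(col("startIndex") > 0,
      slice($"list", col("startIndex") + 1, col("newLength"))) // slice positions are 1-based
      .otherwise(col("list")))
result.show()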
The code above gives the following output:
+---+---------------+----------+----------+---------+---------------+
|len|           list|startIndex|startValue|newLength|        newList|
+---+---------------+----------+----------+---------+---------------+
|  3|      [a, b, c]|         0|         a|        3|      [a, b, c]|
|  3|      [a, b, c]|         1|         b|        2|         [b, c]|
|  3|      [a, b, c]|         2|         c|        1|            [c]|
|  5|[d, e, f, g, h]|         0|         d|        5|[d, e, f, g, h]|
|  5|[d, e, f, g, h]|         1|         e|        4|   [e, f, g, h]|
|  5|[d, e, f, g, h]|         2|         f|        3|      [f, g, h]|
|  5|[d, e, f, g, h]|         3|         g|        2|         [g, h]|
|  5|[d, e, f, g, h]|         4|         h|        1|            [h]|
+---+---------------+----------+----------+---------+---------------+
Is there a better way to do this, without creating multiple new columns or using posexplode? posexplode can be very memory-intensive. Any help is appreciated.
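One direction that may be worth comparing is sketched below. It is an untested idea, not a confirmed answer: it builds all suffixes inside a single expression with the built-in sequence, transform, and slice functions, then uses explode (not posexplode) to turn them into rows. It assumes Spark 3.1 or later, where slice accepts Column arguments. The names SparkSuffixSketch and allSuffixes are hypothetical. Like the code in the question, it keeps the single-element suffixes. Whether it actually uses less memory is not something the sketch establishes, since every suffix is still materialized.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

object SparkSuffixSketch {
  // Expects a DataFrame with an array<string> column named "list".
  // Returns one row per suffix, with columns "list" and "len".
  def allSuffixes(df: DataFrame): DataFrame =
    df.filter(size(col("list")) > 0) // sequence(1, 0) would count downward, so skip empty arrays
      .select(
        explode(
          transform(
            sequence(lit(1), size(col("list"))), // 1-based start positions 1..n
            i => slice(col("list"), i, size(col("list")) - i + lit(1)) // from position i to the end
          )
        ).as("list")
      )
      .withColumn("len", size(col("list")))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("suffix-sketch").getOrCreate()
    import spark.implicits._

    val df = Seq(Seq("a", "b", "c"), Seq("d", "e", "f", "g", "h")).toDF("list")
    allSuffixes(df).show(false)
    spark.stop()
  }
}
```

To stop at arrays of length 2, as in the expected table, the sequence could presumably end at size(col("list")) - 1, with the filter changed to size(col("list")) > 1.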