I work on a function working with 4 imputs.
To do so, I would like to get a list summarizing these 4 elements. However I have two variables where the data is unique and two variables composed of lists. I can zip the two lists with arrays_zip, but I can't get an array list with the 4 elements :
+----+----+---------+---------+
| l1 | l2 | l3 | l4 |
+----+----+---------+---------+
| 1 | 5 | [1,2,3] | [2,2,2] |
| 2 | 9 | [8,2,7] | [1,7,7] |
| 3 | 3 | [8,4,9] | [5,1,3] |
| 4 | 1 | [5,5,3] | [8,4,3] |
What I want to get :
+----+----+---------+---------+------------------------------------------+
| l1 | l2 | l3 | l4 | l5 |
+----+----+---------+---------+------------------------------------------+
| 1 | 5 | [1,2,3] | [2,2,2] | [[1, 5, 1, 2],[1, 5, 2, 2],[1, 5, 3, 2]] |
| 2 | 9 | [8,2,7] | [1,7,7] | [[2, 9, 8, 1],[2, 9, 2, 7],[2, 9, 7, 7]] |
| 3 | 3 | [8,4,9] | [5,1,3] | [[3, 3, 8, 5],[3 ,3, 4, 1],[3, 3, 9, 3]] |
| 4 | 1 | [5,5,3] | [8,4,3] | [[4, 1, 5, 8],[4, 1, 5, 4],[4, 1, 3, 3]] |
My idea was to transform l1 and l2 in list with the l3 size, and apply then the arrays_zip. I did,'t found a consistent way to create this list.
As long as I obtained this list of list, I would apply a function as follow:
def is_good(data):
a,b,c,d = data
return a+b+c+d
is_good_udf = f.udf(lambda x: is_good(x), ArrayType(FloatType()))
spark.udf.register("is_good_udf ", is_good, T.FloatType())
My guess would be to build something like this, thanks to @kafels, where for each rows and each list of the list, it applies the function :
df.withColumn("tot", f.expr("transform(l5, y -> is_good_udf(y))"))
In order to obtain a list of results as [9, 10, 11] for the first row for instance.