How to zip/concat value and list in pyspark

Question

I am working on a function that takes 4 inputs.

To do so, I would like to build a list combining these 4 elements. However, two of the columns hold a single value and the other two hold lists. I can zip the two lists with arrays_zip, but I can't get an array of lists containing all 4 elements:

+----+----+---------+---------+
| l1 | l2 |   l3    |   l4    |
+----+----+---------+---------+
| 1  | 5  | [1,2,3] | [2,2,2] |
| 2  | 9  | [8,2,7] | [1,7,7] |
| 3  | 3  | [8,4,9] | [5,1,3] |
| 4  | 1  | [5,5,3] | [8,4,3] |
+----+----+---------+---------+

What I want to get:

+----+----+---------+---------+------------------------------------------+
| l1 | l2 |   l3    |   l4    |                       l5                 |
+----+----+---------+---------+------------------------------------------+
| 1  | 5  | [1,2,3] | [2,2,2] | [[1, 5, 1, 2],[1, 5, 2, 2],[1, 5, 3, 2]] |
| 2  | 9  | [8,2,7] | [1,7,7] | [[2, 9, 8, 1],[2, 9, 2, 7],[2, 9, 7, 7]] |
| 3  | 3  | [8,4,9] | [5,1,3] | [[3, 3, 8, 5],[3, 3, 4, 1],[3, 3, 9, 3]] |
| 4  | 1  | [5,5,3] | [8,4,3] | [[4, 1, 5, 8],[4, 1, 5, 4],[4, 1, 3, 3]] |
+----+----+---------+---------+------------------------------------------+

My idea was to turn l1 and l2 into lists of the same size as l3, and then apply arrays_zip. I didn't find a consistent way to create these lists.

Once I have this list of lists, I would apply a function as follows:

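import pyspark.sql.functions as f
import pyspark.sql.types as T

def is_good(data):
  a, b, c, d = data
  # Cast to float so the returned value matches the declared FloatType
  return float(a + b + c + d)

# The function returns a single number, so the return type is FloatType
is_good_udf = f.udf(is_good, T.FloatType())

# Register under a name without a trailing space so SQL expressions can find it
# (assumes an active SparkSession named spark)
spark.udf.register("is_good_udf", is_good, T.FloatType())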

My guess is to build something like this (thanks to @kafels), which applies the function to each inner list of each row:

df.withColumn("tot", f.expr("transform(l5, y -> is_good_udf(y))"))

The goal is to obtain a list of results such as [9, 10, 11] for the first row.
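For reference, the intended result can be written in plain Python for a single row. This is only a sketch of the expected values, not Spark code, and `zip_with_scalars` is a hypothetical helper name used for illustration:

```python
def zip_with_scalars(l1, l2, l3, l4):
    """Pair each (l3[i], l4[i]) with the scalar values l1 and l2."""
    return [[l1, l2, a, b] for a, b in zip(l3, l4)]


def is_good(data):
    # Same logic as the function above: sum the four elements
    a, b, c, d = data
    return a + b + c + d


# First row of the example table
l5 = zip_with_scalars(1, 5, [1, 2, 3], [2, 2, 2])
tot = [is_good(inner) for inner in l5]

print(l5)   # [[1, 5, 1, 2], [1, 5, 2, 2], [1, 5, 3, 2]]
print(tot)  # [9, 10, 11]
```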

Kafels · Accepted Answer · 2021-07-02 14:53:45Z

5

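You can use the expr function and apply the SQL TRANSFORM higher-order function. Columns l1 and l2 can be referenced directly inside the lambda, so there is no need to expand them into lists first: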

import pyspark.sql.functions as f


df = df.withColumn('l5', f.expr("""TRANSFORM(arrays_zip(l3, l4), el -> array(l1, l2, el.l3, el.l4))"""))

# +---+---+---------+---------+------------------------------------------+
# |l1 |l2 |l3       |l4       |l5                                        |
# +---+---+---------+---------+------------------------------------------+
# |1  |5  |[1, 2, 3]|[2, 2, 2]|[[1, 5, 1, 2], [1, 5, 2, 2], [1, 5, 3, 2]]|
# |2  |9  |[8, 2, 7]|[1, 7, 7]|[[2, 9, 8, 1], [2, 9, 2, 7], [2, 9, 7, 7]]|
# |3  |3  |[8, 4, 9]|[5, 1, 3]|[[3, 3, 8, 5], [3, 3, 4, 1], [3, 3, 9, 3]]|
# |4  |1  |[5, 5, 3]|[8, 4, 3]|[[4, 1, 5, 8], [4, 1, 5, 4], [4, 1, 3, 3]]|
# +---+---+---------+---------+------------------------------------------+
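If the end goal is only the per-element sum from `is_good`, the same pattern may be enough without a Python UDF. As far as I know, Python UDFs are not reliably callable from inside a SQL lambda such as `transform(l5, y -> is_good_udf(y))`, and some Spark versions reject it, so a pure SQL expression is the safer route. A minimal sketch that builds such an expression string; `zipped_sum_expr` is a hypothetical helper, not part of PySpark:

```python
def zipped_sum_expr(scalar_cols, array_cols, var="el"):
    """Build a Spark SQL expression string that adds the scalar columns
    to the element-wise values of the zipped array columns."""
    zipped = "arrays_zip({})".format(", ".join(array_cols))
    # Scalar columns are referenced directly; array values via the struct field
    terms = list(scalar_cols) + ["{}.{}".format(var, c) for c in array_cols]
    return "transform({}, {} -> {})".format(zipped, var, " + ".join(terms))


expr = zipped_sum_expr(["l1", "l2"], ["l3", "l4"])
print(expr)
# transform(arrays_zip(l3, l4), el -> l1 + l2 + el.l3 + el.l4)

# Assumed usage with the df and the `f` alias from the snippet above:
# df = df.withColumn("tot", f.expr(expr))
```

For the first row of the example this should give [9, 10, 11], matching the result asked for in the question. Logic that cannot be written in SQL would still need a UDF applied to the whole l5 column rather than inside the lambda.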

answered Jul 2, 2021 at 14:53


1 Comment

Kafels Over a year ago

@AlexGermain Take a look at the Databricks documentation
