I am working on a function that takes 4 inputs.

To do so, I would like to build a list that combines these 4 elements. However, two of my columns hold a single value per row and the other two hold lists. I can zip the two list columns with arrays_zip, but I can't get an array of lists containing all 4 elements:

+----+----+---------+---------+
| l1 | l2 |   l3    |   l4    |
+----+----+---------+---------+
| 1  | 5  | [1,2,3] | [2,2,2] |
| 2  | 9  | [8,2,7] | [1,7,7] |
| 3  | 3  | [8,4,9] | [5,1,3] |
| 4  | 1  | [5,5,3] | [8,4,3] |
+----+----+---------+---------+

What I want to get:


+----+----+---------+---------+------------------------------------------+
| l1 | l2 |   l3    |   l4    |                       l5                 |
+----+----+---------+---------+------------------------------------------+
| 1  | 5  | [1,2,3] | [2,2,2] | [[1, 5, 1, 2],[1, 5, 2, 2],[1, 5, 3, 2]] |
| 2  | 9  | [8,2,7] | [1,7,7] | [[2, 9, 8, 1],[2, 9, 2, 7],[2, 9, 7, 7]] |
| 3  | 3  | [8,4,9] | [5,1,3] | [[3, 3, 8, 5],[3, 3, 4, 1],[3, 3, 9, 3]] |
| 4  | 1  | [5,5,3] | [8,4,3] | [[4, 1, 5, 8],[4, 1, 5, 4],[4, 1, 3, 3]] |
+----+----+---------+---------+------------------------------------------+

My idea was to turn l1 and l2 into lists of the same size as l3, and then apply arrays_zip. I didn't find a consistent way to create these lists.
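To make the idea concrete, here is a minimal plain-Python sketch of it for a single row: repeat the two scalars to the length of l3, then zip everything together. The helper name build_l5 is hypothetical and only illustrates the intended result; it is not Spark code.

```python
def build_l5(l1, l2, l3, l4):
    """Repeat the scalars l1 and l2 to the length of l3, then zip all four."""
    n = len(l3)
    l1_rep = [l1] * n  # e.g. 1 -> [1, 1, 1]
    l2_rep = [l2] * n  # e.g. 5 -> [5, 5, 5]
    # zip yields tuples; convert each one to a list to match the desired l5
    return [list(t) for t in zip(l1_rep, l2_rep, l3, l4)]


first_row = build_l5(1, 5, [1, 2, 3], [2, 2, 2])
print(first_row)  # [[1, 5, 1, 2], [1, 5, 2, 2], [1, 5, 3, 2]]
```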

Once I have this list of lists, I would apply a function as follows:

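import pyspark.sql.functions as f
import pyspark.sql.types as T

def is_good(data):
  a, b, c, d = data
  # cast to float so the value matches the declared FloatType
  return float(a + b + c + d)

# is_good returns a single number, so the return type is FloatType, not an array
is_good_udf = f.udf(is_good, T.FloatType())

# register under a name without a trailing space so SQL expressions can find it
spark.udf.register("is_good_udf", is_good, T.FloatType())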

My guess would be to build something like this, thanks to @kafels, which applies the function to each inner list of each row:

df.withColumn("tot", f.expr("transform(l5, y -> is_good_udf(y))"))

The goal is to obtain a list of results such as [9, 10, 11] for the first row, for instance.
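As a sanity check on the expected values, here is a small plain-Python sketch that applies is_good to each inner list of the first row of l5. It only illustrates the per-element logic I want; it does not involve Spark or the UDF registration.

```python
def is_good(data):
    # unpack the four values of one inner list and add them up
    a, b, c, d = data
    return a + b + c + d


# l5 for the first row of the example table
l5_first_row = [[1, 5, 1, 2], [1, 5, 2, 2], [1, 5, 3, 2]]

# one result per inner list, which is what "tot" should contain
tot = [is_good(inner) for inner in l5_first_row]
print(tot)  # [9, 10, 11]
```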

1 Answer

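You can use the expr function and apply TRANSFORM to the zipped arrays; inside the lambda, l1 and l2 can be referenced directly, so they don't need to be turned into lists first: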

import pyspark.sql.functions as f


df = df.withColumn('l5', f.expr("""TRANSFORM(arrays_zip(l3, l4), el -> array(l1, l2, el.l3, el.l4))"""))

# +---+---+---------+---------+------------------------------------------+
# |l1 |l2 |l3       |l4       |l5                                        |
# +---+---+---------+---------+------------------------------------------+
# |1  |5  |[1, 2, 3]|[2, 2, 2]|[[1, 5, 1, 2], [1, 5, 2, 2], [1, 5, 3, 2]]|
# |2  |9  |[8, 2, 7]|[1, 7, 7]|[[2, 9, 8, 1], [2, 9, 2, 7], [2, 9, 7, 7]]|
# |3  |3  |[8, 4, 9]|[5, 1, 3]|[[3, 3, 8, 5], [3, 3, 4, 1], [3, 3, 9, 3]]|
# |4  |1  |[5, 5, 3]|[8, 4, 3]|[[4, 1, 5, 8], [4, 1, 5, 4], [4, 1, 3, 3]]|
# +---+---+---------+---------+------------------------------------------+
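If it helps to see what the expression does per row, the following is a rough plain-Python model of it, assuming l3 and l4 always have the same length (Python's zip truncates to the shorter list, whereas arrays_zip pads with nulls). The names rows and transform_row are hypothetical and exist only for this illustration.

```python
# The four rows of the example table: (l1, l2, l3, l4)
rows = [
    (1, 5, [1, 2, 3], [2, 2, 2]),
    (2, 9, [8, 2, 7], [1, 7, 7]),
    (3, 3, [8, 4, 9], [5, 1, 3]),
    (4, 1, [5, 5, 3], [8, 4, 3]),
]


def transform_row(l1, l2, l3, l4):
    # zip(l3, l4) plays the role of arrays_zip(l3, l4);
    # the comprehension plays the role of TRANSFORM(..., el -> array(...))
    return [[l1, l2, a, b] for a, b in zip(l3, l4)]


l5 = [transform_row(*row) for row in rows]
for value in l5:
    print(value)
```

Each printed line corresponds to the l5 value of one row in the output above.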

1 Comment

@AlexGermain Take a look at the Databricks documentation.
