
I have a DataFrame with one column of type integer.

I want to create a new column containing an array with n elements, where n is the number from the first column.

For example:

from pyspark.sql.types import StructType, StructField, IntegerType

x = spark.createDataFrame(
    [(1,), (2,), (3,)],
    StructType([StructField("myInt", IntegerType(), True)]),
)

+-----+
|myInt|
+-----+
|    1|
|    2|
|    3|
+-----+

I need the resulting data frame to look like this:

+-----+---------+
|myInt|    myArr|
+-----+---------+
|    1|      [1]|
|    2|   [2, 2]|
|    3|[3, 3, 3]|
+-----+---------+

Note: it doesn't actually matter what the values inside the arrays are; only the count matters.

It'd be fine if the resulting data frame looked like this:

+-----+------------------+
|myInt|             myArr|
+-----+------------------+
|    1|            [item]|
|    2|      [item, item]|
|    3|[item, item, item]|
+-----+------------------+

2 Answers


It is preferable to avoid UDFs where possible, because they are less efficient than built-in functions. You can use array_repeat instead.

import pyspark.sql.functions as F

x.withColumn('myArr', F.array_repeat(F.col('myInt'), F.col('myInt'))).show()

+-----+---------+
|myInt|    myArr|
+-----+---------+
|    1|      [1]|
|    2|   [2, 2]|
|    3|[3, 3, 3]|
+-----+---------+
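Per row, array_repeat builds an array holding the element value repeated count times. A pure-Python sketch of that per-row semantics (the function name here is illustrative, not the Spark API; my assumption is that a non-positive count yields an empty array, as Spark's built-in does):

```python
def array_repeat(value, count):
    # Pure-Python model of Spark's array_repeat: a list holding
    # `value` repeated `count` times (empty when count <= 0).
    return [value] * max(count, 0)

print(array_repeat(2, 2))  # [2, 2]
```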

1 Comment

Note that I had some issues with it in Spark 2.4.4, but it works fine in Spark 3.0.1.

Use udf:

from pyspark.sql.functions import udf

@udf("array<int>")
def rep_(x):
    # Build a list containing x, x times.
    return [x for _ in range(x)]

x.withColumn("myArr", rep_("myInt")).show()
# +-----+---------+
# |myInt|    myArr|
# +-----+---------+
# |    1|      [1]|
# |    2|   [2, 2]|
# |    3|[3, 3, 3]|
# +-----+---------+
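The function the UDF wraps is plain Python, so its per-row logic can be checked without a Spark session (a quick sketch; rep_ here is the undecorated function):

```python
def rep_(x):
    # Same body the UDF wraps: build a list containing x, x times.
    return [x for _ in range(x)]

print(rep_(3))  # [3, 3, 3]
```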
