
I have a dataframe -

values = [('A',8),('B',7)]
df = sqlContext.createDataFrame(values,['col1','col2'])
df.show()
+----+----+
|col1|col2|
+----+----+
|   A|   8|
|   B|   7|
+----+----+

I want the list of even numbers from 0 till col2.

from pyspark.sql.functions import udf, col

#Returns even numbers
def make_list(col):
    return list(map(int,[x for x in range(col+1) if x % 2 == 0]))
make_list = udf(make_list)

df = df.withColumn('list',make_list(col('col2')))
df.show()
+----+----+---------------+
|col1|col2|           list|
+----+----+---------------+
|   A|   8|[0, 2, 4, 6, 8]|
|   B|   7|   [0, 2, 4, 6]|
+----+----+---------------+
df.printSchema()
root
 |-- col1: string (nullable = true)
 |-- col2: long (nullable = true)
 |-- list: string (nullable = true)

I get the list I want, but it is of string type rather than an array of integers, as you can see in the printSchema output above.

Without an integer array type, I cannot explode this dataframe. Any ideas on how to get a list of integers?

3 Comments

  • If you don't specify the return type of the udf, it will default to StringType.
  • By the way, if your end goal is to explode the list, you can also try a variation of the code from this question.
  • Thank you so much Pault for your efforts. I will explore the link. I asked this question as I wanted to solve this problem: stackoverflow.com/questions/54320724/…

2 Answers


You need to specify the return type of the udf; to get a list of int, use ArrayType(IntegerType()):

from pyspark.sql.functions import udf, col
from pyspark.sql.types import ArrayType, IntegerType

# specify the return type as ArrayType(IntegerType())
make_list_udf = udf(make_list, ArrayType(IntegerType()))

df = df.withColumn('list',make_list_udf(col('col2')))
df.show()
+----+----+---------------+
|col1|col2|           list|
+----+----+---------------+
|   A|   8|[0, 2, 4, 6, 8]|
|   B|   7|   [0, 2, 4, 6]|
+----+----+---------------+

df.printSchema()
root
 |-- col1: string (nullable = true)
 |-- col2: long (nullable = true)
 |-- list: array (nullable = true)
 |    |-- element: integer (containsNull = true)
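
With the column typed as a real array, exploding it (the stated end goal) works directly. A minimal sketch; the column name num is just for illustration:

from pyspark.sql.functions import explode

# one output row per even number in each list
df.withColumn('num', explode(col('list'))).show()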

Or, if you are using Spark 2.4+, you can use the new sequence function:

values = [('A',8),('B',7)]
df = sqlContext.createDataFrame(values,['col1','col2'])

from pyspark.sql.functions import sequence, lit, col
df.withColumn('list', sequence(lit(0), col('col2'), step=lit(2))).show()
+----+----+---------------+
|col1|col2|           list|
+----+----+---------------+
|   A|   8|[0, 2, 4, 6, 8]|
|   B|   7|   [0, 2, 4, 6]|
+----+----+---------------+
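
If the end goal is one row per even number, sequence and explode can also be combined in a single select. A sketch assuming Spark 2.4+; the alias n is just illustrative:

from pyspark.sql.functions import sequence, lit, col, explode

# build the even-number sequence and explode it in one pass
df.select('col1', 'col2',
          explode(sequence(lit(0), col('col2'), step=lit(2))).alias('n')).show()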

3 Comments

  • I have some thoughts on how to do this using only the API functions, but in this case I think the udf is preferred.
  • Feel free to post an API version if you have one in mind; it may not be what the OP is asking for, but it can still help.
  • Actually, I found a reasonable way without a udf.

As it turns out, there is a closed-form expression for the number you get by joining the digits of your desired list column.

We can implement this function and then use some string manipulation and regular expressions to get the desired output using only the API functions. Even though it's more complicated, this should still be faster than using a udf.

import pyspark.sql.functions as f

def getEvenNumList(x):
    # n = number of even values after the leading 0
    n = f.floor(x/2)
    return f.split(
        f.concat(
            f.lit("0,"),  # prepend the leading 0
            f.regexp_replace(
                # closed form for the even numbers joined together as digits
                (2./81.*(-9*n+f.pow(10, (n+1))-10)).cast('int').cast('string'),
                r"(?<=\d)(?=\d)",  # insert a comma between adjacent digits
                ","
            )
        ),
        ","
    ).cast("array<int>")

df = df.withColumn("list", getEvenNumList(f.col("col2")))
df.show()
#+----+----+---------------+
#|col1|col2|           list|
#+----+----+---------------+
#|   A|   8|[0, 2, 4, 6, 8]|
#|   B|   7|   [0, 2, 4, 6]|
#+----+----+---------------+

df.printSchema()
#root
# |-- col1: string (nullable = true)
# |-- col2: long (nullable = true)
# |-- list: array (nullable = true)
# |    |-- element: integer (containsNull = true)

Explanation

The number of elements in your desired list is one plus the floor of col2 divided by 2. (The plus 1 accounts for the leading 0.) Ignore the 0 for now and let n be the floor of col2 divided by 2.

If you joined the numbers in your list together (as you can using str.join), the resulting number would be given by the expression:

2*sum(i*10**(n-i) for i in range(1,n+1))

Using Wolfram Alpha, you can compute a closed form for this sum: 2/81 * (-9*n + 10**(n+1) - 10), which is exactly what appears in the code above.
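
As a quick sanity check in plain Python (not part of the Spark code), the brute-force sum and the closed form agree for n = 4, which corresponds to col2 = 8:

n = 4  # floor(8 / 2) for col2 = 8

# digits of [2, 4, 6, 8] joined together
brute = 2 * sum(i * 10**(n - i) for i in range(1, n + 1))
# same value via the closed form, using exact integer arithmetic
closed = 2 * (10**(n + 1) - 9 * n - 10) // 81

print(brute, closed)  # 2468 2468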

Once you have that number, you can convert it into a string and add in the leading 0.

Finally, I added a comma as a separator between each of the digits, split the result, and cast it into an array of integers.
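
In plain Python, the same pipeline looks roughly like this (an illustrative analogue of the Spark expression, not Spark code):

n = 4  # floor(col2 / 2) for col2 = 8

num = 2 * (10**(n + 1) - 9 * n - 10) // 81  # 2468: the evens joined as digits
s = "0," + ",".join(str(num))               # "0,2,4,6,8"
result = [int(d) for d in s.split(",")]     # [0, 2, 4, 6, 8]
print(result)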

2 Comments

  • I have to say this is really smart. It's going to take people some time to digest, though. In Spark 2.4, you can actually use sequence to do this and avoid a udf. Nonetheless, a good solution before Spark 2.4.
  • @Psidom actually this breaks when n > 4. sequence is the optimal solution for 2.4+; otherwise use a udf.
