
I have a dataframe -

values = [('A',8),('B',7)]
df = sqlContext.createDataFrame(values,['col1','col2'])
df.show()
+----+----+
|col1|col2|
+----+----+
|   A|   8|
|   B|   7|
+----+----+

I want the list of even numbers from 0 till col2.

from pyspark.sql.functions import udf, col

#Returns even numbers
def make_list(col):
    return list(map(int,[x for x in range(col+1) if x % 2 == 0]))
make_list = udf(make_list)

df = df.withColumn('list',make_list(col('col2')))
df.show()
+----+----+---------------+
|col1|col2|           list|
+----+----+---------------+
|   A|   8|[0, 2, 4, 6, 8]|
|   B|   7|   [0, 2, 4, 6]|
+----+----+---------------+
df.printSchema()
root
 |-- col1: string (nullable = true)
 |-- col2: long (nullable = true)
 |-- list: string (nullable = true)

I get the list I want, but it is of string type rather than an array of integers, as you can see in the printSchema output above.

Without an integer array type, I cannot explode this dataframe. Any ideas on how to get a list of integers?

3 Comments

  • If you don't specify the return type of the udf, it will default to StringType.
  • By the way, if your end goal is to explode the list, you can also try a variation of the code from this question.
  • Thank you so much Pault for your efforts. I will explore the link. I asked this question as I wanted to solve this problem: stackoverflow.com/questions/54320724/…

2 Answers


You need to specify the return type of the udf; to get a list of int, use ArrayType(IntegerType()):

from pyspark.sql.functions import udf, col
from pyspark.sql.types import ArrayType, IntegerType

# specify the return type as ArrayType(IntegerType())
make_list_udf = udf(make_list, ArrayType(IntegerType()))

df = df.withColumn('list',make_list_udf(col('col2')))
df.show()
+----+----+---------------+
|col1|col2|           list|
+----+----+---------------+
|   A|   8|[0, 2, 4, 6, 8]|
|   B|   7|   [0, 2, 4, 6]|
+----+----+---------------+

df.printSchema()
root
 |-- col1: string (nullable = true)
 |-- col2: long (nullable = true)
 |-- list: array (nullable = true)
 |    |-- element: integer (containsNull = true)
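
With the column typed as a real array, exploding it (the stated end goal) works directly. A minimal sketch; the column name num is just for illustration:

from pyspark.sql.functions import explode

# one output row per even number in each list
df.withColumn('num', explode(col('list'))).show()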

Or, if you are using Spark 2.4+, you can use the new sequence function:

values = [('A',8),('B',7)]
df = sqlContext.createDataFrame(values,['col1','col2'])

from pyspark.sql.functions import sequence, lit, col
df.withColumn('list', sequence(lit(0), col('col2'), step=lit(2))).show()
+----+----+---------------+
|col1|col2|           list|
+----+----+---------------+
|   A|   8|[0, 2, 4, 6, 8]|
|   B|   7|   [0, 2, 4, 6]|
+----+----+---------------+
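
If the end goal is one row per even number, sequence and explode can also be combined in a single select. A sketch assuming Spark 2.4+; the alias n is just illustrative:

from pyspark.sql.functions import sequence, lit, col, explode

# build the even-number sequence and explode it in one pass
df.select('col1', 'col2',
          explode(sequence(lit(0), col('col2'), step=lit(2))).alias('n')).show()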

3 Comments

  • I have some thoughts on how to do this using only the API functions, but in this case I think the udf is preferred.
  • Feel free to post an API version if you have one in mind; it may not be what the OP is asking for, but it can still help.
  • Actually, I found a reasonable way without a udf.

As it turns out, there is a closed-form expression for the number you get by joining the digits of your desired list column.

We can implement this function and then use some string manipulation and regular expressions to get the desired output using only the API functions. Even though it's more complicated, this should still be faster than using a udf.

import pyspark.sql.functions as f

def getEvenNumList(x):
    # n = number of even values after the leading 0
    n = f.floor(x/2)
    return f.split(
        f.concat(
            f.lit("0,"),  # prepend the leading 0
            f.regexp_replace(
                # closed form for the even numbers joined together as digits
                (2./81.*(-9*n+f.pow(10, (n+1))-10)).cast('int').cast('string'),
                r"(?<=\d)(?=\d)",  # insert a comma between adjacent digits
                ","
            )
        ),
        ","
    ).cast("array<int>")

df = df.withColumn("list", getEvenNumList(f.col("col2")))
df.show()
#+----+----+---------------+
#|col1|col2|           list|
#+----+----+---------------+
#|   A|   8|[0, 2, 4, 6, 8]|
#|   B|   7|   [0, 2, 4, 6]|
#+----+----+---------------+

df.printSchema()
#root
# |-- col1: string (nullable = true)
# |-- col2: long (nullable = true)
# |-- list: array (nullable = true)
# |    |-- element: integer (containsNull = true)

Explanation

The number of elements in your desired list is one plus the floor of col2 divided by 2. (The plus 1 accounts for the leading 0.) Ignore the 0 for now and let n be the floor of col2 divided by 2.

If you joined the numbers in your list together (as you can using str.join), the resulting number would be given by the expression:

2*sum(i*10**(n-i) for i in range(1,n+1))

Using Wolfram Alpha, you can compute a closed form for this sum: 2/81 * (-9*n + 10**(n+1) - 10), which is exactly what appears in the code above.
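
As a quick sanity check in plain Python (not part of the Spark code), the brute-force sum and the closed form agree for n = 4, which corresponds to col2 = 8:

n = 4  # floor(8 / 2) for col2 = 8

# digits of [2, 4, 6, 8] joined together
brute = 2 * sum(i * 10**(n - i) for i in range(1, n + 1))
# same value via the closed form, using exact integer arithmetic
closed = 2 * (10**(n + 1) - 9 * n - 10) // 81

print(brute, closed)  # 2468 2468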

Once you have that number, you can convert it into a string and add in the leading 0.

Finally, I added a comma as a separator between each of the digits, split the result, and cast it into an array of integers.
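
In plain Python, the same pipeline looks roughly like this (an illustrative analogue of the Spark expression, not Spark code):

n = 4  # floor(col2 / 2) for col2 = 8

num = 2 * (10**(n + 1) - 9 * n - 10) // 81  # 2468: the evens joined as digits
s = "0," + ",".join(str(num))               # "0,2,4,6,8"
result = [int(d) for d in s.split(",")]     # [0, 2, 4, 6, 8]
print(result)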

2 Comments

  • I have to say this is really smart. It's going to take people some time to digest, though. In Spark 2.4, you can actually use sequence to do this and avoid a udf. Nonetheless, a good solution before Spark 2.4.
  • @Psidom actually this breaks when n > 4. sequence is the optimal solution for 2.4+; otherwise use a udf.
