I have the following PySpark DataFrame:

+------+----------------+
|    id|            data|
+------+----------------+
|     1|    [10, 11, 12]|
|     2|    [20, 21, 22]|
|     3|    [30, 31, 32]|
+------+----------------+
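
For reference, a DataFrame like this can be created as follows (a minimal sketch; the name df_test matches my snippet below, and the exact schema is an assumption based on the output above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sketch of the input DataFrame: an id column plus an array column
# named "data", as shown in the output above.
df_test = spark.createDataFrame(
    [(1, [10, 11, 12]), (2, [20, 21, 22]), (3, [30, 31, 32])],
    ["id", "data"],
)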

In the end, I want to have the following DataFrame:

+--------+----------------------------------+
|      id|                              data|
+--------+----------------------------------+
| [1,2,3]|[[10,20,30],[11,21,31],[12,22,32]]|
+--------+----------------------------------+

In order to do this, I first extract the data arrays as follows:

# Collect every "data" array to the driver
tmp_array = df_test.select("data").rdd.flatMap(lambda x: x).collect()
# Hardcoded for the three rows shown above
a0 = tmp_array[0]
a1 = tmp_array[1]
a2 = tmp_array[2]
# Transpose: element i of each array ends up in the same tuple
samples = zip(a0, a1, a2)
samples1 = sc.parallelize(samples)

This way, samples1 is an RDD with the content:

[[10,20,30],[11,21,31],[12,22,32]]
  • Question 1: Is that a good way to do it?

  • Question 2: How to include that RDD back into the dataframe?
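
For reference, one way to get such an RDD back into a DataFrame is spark.createDataFrame (a minimal sketch, assuming the samples1 RDD from above; note this yields one row per tuple, while the answers below show how to pack everything into a single row):

# Wrap each zipped tuple in a one-element tuple so it becomes a row
# with a single array column (assumes samples1 from the snippet above).
df_samples = spark.createDataFrame(
    samples1.map(lambda t: (list(t),)),
    ["data"],
)
df_samples.show()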

  • Does your dataframe only have 2 rows? Commented Apr 12, 2018 at 15:47
  • Usually I have more than 2 Commented Apr 12, 2018 at 16:02
  • Ok and so you want to get back just one row? The length of data is a constant right (I don't see how it works otherwise)? Commented Apr 12, 2018 at 16:03
  • The length of data could be different, but for now I will consider that it is a constant Commented Apr 12, 2018 at 16:06

2 Answers


Here is a way to get your desired output without serializing to an RDD or using a udf. You will need two constants:

  • The number of rows in your DataFrame (df.count())
  • The length of data (given)

Use pyspark.sql.functions.collect_list() and pyspark.sql.functions.array() in a double list comprehension to pick out the elements of "data" in the order you want using pyspark.sql.Column.getItem():

import pyspark.sql.functions as f

dataLength = 3
numRows = df.count()

df.select(
    f.collect_list("id").alias("id"),
    f.array(
        [
            # Inner array i collects element i of every row's "data"
            f.array(
                [f.collect_list("data").getItem(j).getItem(i)
                 for j in range(numRows)]
            )
            for i in range(dataLength)
        ]
    ).alias("data")
).show(truncate=False)
#+---------+------------------------------------------------------------------------------+
#|id       |data                                                                          |
#+---------+------------------------------------------------------------------------------+
#|[1, 2, 3]|[WrappedArray(10, 20, 30), WrappedArray(11, 21, 31), WrappedArray(12, 22, 32)]|
#+---------+------------------------------------------------------------------------------+
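
The comments under the question mention that the length of data may vary. Here is a sketch of an alternative without the two hardcoded constants, assuming Spark 2.1+ (for posexplode); the sort_array calls are there because collect_list alone does not guarantee ordering:

from pyspark.sql import functions as f

# Pair every array element with its position, then regroup by position.
exploded = df.select("id", f.posexplode("data").alias("pos", "val"))

data = (
    exploded.groupBy("pos")
    .agg(f.sort_array(f.collect_list(f.struct("id", "val"))).alias("pairs"))
    # pairs.val pulls the val field out of each struct in the array
    .select("pos", f.col("pairs.val").alias("vals"))
    .agg(f.sort_array(f.collect_list(f.struct("pos", "vals"))).alias("rows"))
    .select(f.col("rows.vals").alias("data"))
)

ids = df.agg(f.sort_array(f.collect_list("id")).alias("id"))
ids.crossJoin(data).show(truncate=False)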

1 Comment

This is awesome!

You can simply use a udf for the zip, but before that you will have to use the collect_list function:

from pyspark.sql import functions as f
from pyspark.sql import types as t

def zipUdf(array):
    # list() is needed on Python 3, where zip returns an iterator
    # that Spark cannot serialize
    return list(zip(*array))

zipping = f.udf(zipUdf, t.ArrayType(t.ArrayType(t.IntegerType())))

df.select(
    f.collect_list(df.id).alias('id'),
    zipping(f.collect_list(df.data)).alias('data')
).show(truncate=False)

which would give you

+---------+------------------------------------------------------------------------------+
|id       |data                                                                          |
+---------+------------------------------------------------------------------------------+
|[1, 2, 3]|[WrappedArray(10, 20, 30), WrappedArray(11, 21, 31), WrappedArray(12, 22, 32)]|
+---------+------------------------------------------------------------------------------+
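
Note: the list() around zip in the udf matters on Python 3, where zip returns an iterator; without it Spark fails with net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for builtins.iter), the error mentioned in the comments below.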

7 Comments

Usually I have more than 2. Thanks
You can generalize this to more than 2 rows by changing your udf to return zip(*array)
@RameshMaharjan I hope you don't mind, but I edited the answer to just have the final answer, which works for all cases.
I wanted to do it this way as well but I wanted to put your name so I just added it in @pault thanks a lot
@RameshMaharjan I have the following error: net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for builtins.iter). Do you know what it could be?