
I am importing data from a CSV file with columns Reading A and Reading B and storing it in a PySpark DataFrame. My objective is to create a new column named Reading whose value is an array containing the values of Reading A and Reading B. How can I achieve this in PySpark?

        +---+-----------+-----------+
        | id|  Reading A|  Reading B|
        +---+-----------+-----------+
        |01 |  0.123    |   0.145   |
        |02 |  0.546    |   0.756   |
        +---+-----------+-----------+

        Desired Output:
        +---+------------------+
        | id|    Reading       |
        +---+------------------+
        |01 |  [0.123, 0.145]  |
        |02 |  [0.546, 0.756]  |
        +---+------------------+


1 Answer


Try this:

import pyspark.sql.functions as f

# withColumn returns a new DataFrame, so assign the result back
df = df.withColumn('reading', f.array(f.col("reading a"), f.col("reading b")))


3 Comments

After using this solution I get no error from Spark, but no new column is added to the existing df. I have also tried the following, with the same result: df.withColumn('reading', f.array([f.lit(df.readingA), f.lit(df.readingB)]))
Yes, the above code adds a new column to the existing df. Don't you want that?
The solution is working fine now. Earlier it just returned an empty DataFrame, so I posted because the new column was not reflected even though I got no error.

