
I am importing data from a CSV file with columns Reading A and Reading B and storing it in a PySpark DataFrame. My objective is to create a new column named Reading whose value is an array containing the values of Reading A and Reading B. How can I achieve this in PySpark?

        +---+-----------+-----------+
        | id|  Reading A|  Reading B|
        +---+-----------+-----------+
        |01 |  0.123    |   0.145   |
        |02 |  0.546    |   0.756   |
        +---+-----------+-----------+

        Desired Output:
        +---+------------------+
        | id|    Reading       |
        +---+------------------+
        |01 |  [0.123, 0.145]  |
        |02 |  [0.546, 0.756]  |
        +---+------------------+


1 Answer


Try this:

import pyspark.sql.functions as f

# withColumn returns a new DataFrame, so assign the result back
df = df.withColumn('reading', f.array(f.col("reading a"), f.col("reading b")))


3 Comments

After using this solution I get no error from Spark, but no new column is added to the existing df. I have also tried the following, with the same result: df.withColumn('reading', f.array([f.lit(df.readingA), f.lit(df.readingB)]))
Yes, the above code adds a new column to the existing df. Don't you want that?
The solution is working fine now. Earlier it just returned an empty DataFrame, so I posted because the new column was not reflected even though I got no error.

