Transform rows and columns without using pandas

Question

I have a DataFrame with only two columns. I am trying to turn the values of one column into headers and the values of the other column into the single row beneath them. I tried pivot and similar approaches, but could not get them to work.

df_pivot_test = sc.parallelize([('a',1), ('b',1), ('c',2), ('d',2), ('e',10)]).toDF(["id","score"])

id  score
a   1
b   1
c   2
d   2
e   10

I am trying to convert this into:

a   b   c   d   e
1   1   2   2   10

Any thoughts on how to do this? I know it can be done by converting to a pandas DataFrame with .toPandas(), but I don't want to do that: we have billions of rows and would run into memory issues.

notNull · Accepted Answer · 2019-09-30 18:59:55Z


You can combine groupBy and pivot to get the desired result.

Try one of these approaches:

from pyspark.sql.functions import expr, lit

# with a literal value in the groupBy clause; lit(1) adds a column named "1", which is dropped afterwards
df_pivot_test.groupBy(lit(1)).pivot("id").agg(expr("first(score)")).drop("1").show()

(or)

# without any column in groupby clause
df_pivot_test.groupBy().pivot("id").agg(expr("first(score)")).show()

Result:

+---+---+---+---+---+
|  a|  b|  c|  d|  e|
+---+---+---+---+---+
|  1|  1|  2|  2| 10|
+---+---+---+---+---+
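
To make the reshaping concrete, here is a minimal plain-Python sketch (no Spark involved) of what the pivot computes on the sample data: every distinct id becomes a column, and the first score seen for that id fills the single output row. The helper name pivot_first is hypothetical and only for illustration; it is not how Spark implements pivot internally.

```python
def pivot_first(rows):
    """Collapse (id, score) pairs into one wide row, keeping the first score per id."""
    wide = {}
    for key, score in rows:
        # setdefault keeps the value already stored, so only the first score per id survives
        wide.setdefault(key, score)
    # order the columns by id, as in the result table above
    return dict(sorted(wide.items()))


rows = [("a", 1), ("b", 1), ("c", 2), ("d", 2), ("e", 10)]
wide = pivot_first(rows)
print(wide)
```

In this sketch "first" means first in input order. On a distributed DataFrame, first(score) gives no such ordering guarantee when an id occurs more than once, so a deterministic aggregate such as max or min may be a safer choice in that case.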
