Pivot one column into multiple columns in PySpark/Python

Question

I found a situation similar to mine at this link, but it uses SQL Server rather than PySpark/Python: Pivoting Multiple Columns Based On a Single Column

I have a dataset as below:

ID       Date             Class
1       2021/01/01        math, english
1       2021/01/02        math, english
1       2021/01/03        chinese
1       2021/01/04        math, chemistry
1       2021/01/05        math, english
1       2021/01/06        Chinese
2       2021/01/01        PE
2       2021/01/02        math, chinese
2       2021/01/03        math, english
2       2021/01/04        math, chinese
.......

The desired output should be:

ID       Date_1             schedule_1       Date_2       schedule_2     Date_3      schedule_3
1       2021/01/01        math, english      2021/01/03    chinese      2021/01/05   math, chemistry... 
1       2021/01/02        math, english      2021/01/06    chinese...
1       2021/01/05        math, english....
2       2021/01/01        PE                 2021/01/02    math, chinese     2021/01/03  math, english
2                                            2021/01/04    math, chinese

I am planning to use pivot and groupBy. This is my current code, which does not work, and I have no idea how to fix it:

line2 = line\
.select("ID")\
.groupBy("ID","Class")\
    .pivot("Date","Class")\
    .agg(expr("coalesce(first(Class), \" \")"))

Any help or ideas or suggestions will be appreciated.

mck · Accepted Answer · 2021-02-13 17:33:02Z

This is a bit tricky and needs some more wrangling:

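import pyspark.sql.functions as F
from pyspark.sql import Window

# rn: occurrence number of each (ID, Class) pair by date -> output row within an ID
# mindate: first date on which that Class value appears for the ID
# rn2: dense rank of mindate within an ID -> output column group (1, 2, 3, ...)
df2 = df.withColumn(
    'rn',
    F.row_number().over(Window.partitionBy('ID', 'Class').orderBy('Date'))
).withColumn(
    'mindate',
    F.min('Date').over(Window.partitionBy('ID', 'Class'))
).withColumn(
    'rn2',
    F.dense_rank().over(Window.partitionBy('ID').orderBy('mindate'))
).groupBy('ID', 'rn').pivot('rn2').agg(
    # keep Date and Class together as one struct per pivoted cell
    F.first(F.struct('Date', 'Class'))
).orderBy('ID', 'rn')

# df2.columns is ID, rn, then one struct column per rn2 value;
# 'name.*' expands each struct into its Date and Class fields
df3 = df2.select(
    'ID',
    *[f'{c}.*' for c in df2.columns[2:]]
)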

df3.show(truncate=False)
+---+----------+-------------+----------+-------------+----------+---------------+
|ID |Date      |Class        |Date      |Class        |Date      |Class          |
+---+----------+-------------+----------+-------------+----------+---------------+
|1  |2021/01/01|math, english|2021/01/03|chinese      |2021/01/04|math, chemistry|
|1  |2021/01/02|math, english|2021/01/06|chinese      |null      |null           |
|1  |2021/01/05|math, english|null      |null         |null      |null           |
|2  |2021/01/01|PE           |2021/01/02|math, chinese|2021/01/03|math, english  |
|2  |null      |null         |2021/01/04|math, chinese|null      |null           |
+---+----------+-------------+----------+-------------+----------+---------------+
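To see why each row lands where it does, here is a plain-Python sketch (no Spark needed) that mirrors the three window columns on the sample rows from the question. The function name grid_positions is made up for illustration, only the rows actually listed in the question are used, and the last "Chinese" value is lowercased to match the output above. It is a walkthrough of the idea, not part of the answer's code.

```python
from collections import defaultdict

# Sample rows from the question: (ID, Date, Class).
# Assumption: the capitalised "Chinese" on 2021/01/06 is treated as "chinese".
rows = [
    (1, '2021/01/01', 'math, english'),
    (1, '2021/01/02', 'math, english'),
    (1, '2021/01/03', 'chinese'),
    (1, '2021/01/04', 'math, chemistry'),
    (1, '2021/01/05', 'math, english'),
    (1, '2021/01/06', 'chinese'),
    (2, '2021/01/01', 'PE'),
    (2, '2021/01/02', 'math, chinese'),
    (2, '2021/01/03', 'math, english'),
    (2, '2021/01/04', 'math, chinese'),
]


def grid_positions(rows):
    """Map (ID, rn, rn2) -> (Date, Class), mimicking the window columns."""
    # Dates of every (ID, Class) group; YYYY/MM/DD strings sort chronologically.
    dates = defaultdict(list)
    for id_, date, cls in rows:
        dates[(id_, cls)].append(date)

    # Dense rank of each group's first date within its ID (the 'rn2' column).
    first_dates = defaultdict(set)
    for (id_, _cls), ds in dates.items():
        first_dates[id_].add(min(ds))
    rank = {
        id_: {d: i + 1 for i, d in enumerate(sorted(ds))}
        for id_, ds in first_dates.items()
    }

    positions = {}
    for (id_, cls), ds in dates.items():
        rn2 = rank[id_][min(ds)]
        # Row number within the (ID, Class) group ordered by date (the 'rn' column).
        for rn, date in enumerate(sorted(ds), start=1):
            positions[(id_, rn, rn2)] = (date, cls)
    return positions


positions = grid_positions(rows)
for key in sorted(positions):
    print(key, positions[key])
```

Each key (ID, rn, rn2) is one cell of the final table: rn picks the row and rn2 picks the Date/Class column pair. Keys that never occur, such as (2, 2, 1), are the cells shown as null.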

edited Feb 13, 2021 at 17:33

answered Feb 13, 2021 at 7:59


4 Comments

yokielove Over a year ago

Hi mck, thank you for the code. I tried it, but it looks like it just lines up every possible combination of classes as columns. I edited my question a bit; I hope it is clearer now. Thank you.

mck Over a year ago

@yokielove I rewrote my answer completely. See if that works better?

yokielove Over a year ago

Thank you so much! I understand most of the code, just one small question: what does the line [f'{c}.*' for c in df2.columns[2:]] mean?

mck Over a year ago

@yokielove that is a format string (f-string) inside a list comprehension: it substitutes the name of each pivoted column and appends .* to it, so that select expands each struct into its fields.
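As a small illustration of that comment, the sketch below shows what the list comprehension produces, assuming df2.columns looks like ['ID', 'rn', '1', '2', '3'] after the pivot. The second helper is a hypothetical variant, not from the answer: it builds selectExpr strings that give each field a unique name (Date_1, schedule_1, ...) as in the question's desired output, instead of the repeated Date/Class headers. Both function names are invented for this example.

```python
def struct_selectors(columns, skip=2):
    """Same as the answer's comprehension: one 'name.*' selector per pivoted column."""
    return [f'{c}.*' for c in columns[skip:]]


def renamed_exprs(columns, skip=2):
    """Hypothetical variant: SQL expressions that pull out each struct field
    under a unique alias. Backticks quote the numeric column names."""
    exprs = []
    for c in columns[skip:]:
        exprs.append(f'`{c}`.Date AS Date_{c}')
        exprs.append(f'`{c}`.Class AS schedule_{c}')
    return exprs


# Assumed shape of df2.columns after groupBy('ID', 'rn').pivot('rn2')
pivoted_columns = ['ID', 'rn', '1', '2', '3']

print(struct_selectors(pivoted_columns))  # ['1.*', '2.*', '3.*']
print(renamed_exprs(pivoted_columns))

# With the real DataFrame from the answer, this could be used as:
# df3 = df2.selectExpr('ID', *renamed_exprs(df2.columns))
```

The first two columns (ID and rn) are skipped because they are the grouping keys, not structs, so there is nothing to expand in them.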
