0

I found a similiar situation with mine in this line, but he is using SQL server, not pyspark/python: Pivoting Multiple Columns Based On a Single Column

I have a dateset as below:

ID       Date             Class
1       2021/01/01        math, english
1       2021/01/02        math, english
1       2021/01/03        chinese
1       2021/01/04        math, chemistry
1       2021/01/05        math, english
1       2021/01/06        Chinese
2       2021/01/01        PE
2       2021/01/02        math, chinese
2       2021/01/03        math, english
2       2021/01/04        math, chinese
.......

the desire output should be:

ID       Date_1             schedule_1       Date_2       schedule_2     Date_3      schedule_3
1       2021/01/01        math, english      2021/01/03    chinese      2021/01/05   math, chemistry... 
1       2021/01/02        math, english      2021/01/06    chinese...
1       2021/01/05        math, english....
2       2021/01/01        PE                 2021/01/02    math, chinese     2021/01/03  math, english
2                                            2021/01/04    math, chinese

I am planning using pivot and groupby, this is my current code, which is not working, and I have no idea on how to solve it.

line2 = line\
.select("ID")\
.groupBy("ID","Class")\
    .pivot("Date","Class")\
    .agg(expr("coalesce(first(Class), \" \")"))

Any help or ideas or suggestions will be appreciated.

1 Answer 1

1

A bit tricky and need some more wrangling:

import pyspark.sql.functions as F
from pyspark.sql import Window

df2 = df.withColumn(
    'rn',
    F.row_number().over(Window.partitionBy('ID', 'Class').orderBy('Date'))
).withColumn(
    'mindate',
    F.min('Date').over(Window.partitionBy('ID', 'Class'))
).withColumn(
    'rn2',
    F.dense_rank().over(Window.partitionBy('ID').orderBy('mindate'))
).groupBy('ID', 'rn').pivot('rn2').agg(
    F.first(F.struct('Date', 'Class'))
).orderBy('ID', 'rn')

df3 = df2.select(
    'ID',
    *[f'{c}.*' for c in df2.columns[2:]]
)

df3.show(truncate=False)
+---+----------+-------------+----------+-------------+----------+---------------+
|ID |Date      |Class        |Date      |Class        |Date      |Class          |
+---+----------+-------------+----------+-------------+----------+---------------+
|1  |2021/01/01|math, english|2021/01/03|chinese      |2021/01/04|math, chemistry|
|1  |2021/01/02|math, english|2021/01/06|chinese      |null      |null           |
|1  |2021/01/05|math, english|null      |null         |null      |null           |
|2  |2021/01/01|PE           |2021/01/02|math, chinese|2021/01/03|math, english  |
|2  |null      |null         |2021/01/04|math, chinese|null      |null           |
+---+----------+-------------+----------+-------------+----------+---------------+
Sign up to request clarification or add additional context in comments.

4 Comments

Hi MCK, thank you for the code, I tried this code, but it looks like it just line up all possible combination of class as columns. I edited my questions a bit, hope that is more clearly. Thank you
@yokielove rewrote my answer completely. See if that works better...?
Thank you so much! I understand most of code, just one small question: what is this line [f'{c}.' for c in df2.columns[2:]] meaning or definition?
@yokielove that is a format string, which substitutes the name of each column and adds a .* to it in order to expand the structs

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.