0

Hi I am looking to convert the dataframe by grouping the columns separately and adding as new columns to existing dataframe.

ID  qty year    day week
1   2   2016    5   9
2   4   2016    5   9
3   0   2016    5   9
4   4   2016    5   9
5   0   2016    5   9
6   2   2016    5   9
1   0   2016    6   9
2   2   2016    6   9
3   0   2016    6   9
4   0   2016    6   9
5   0   2016    6   9
6   0   2016    6   9
1   0   2016    0   10
2   2   2016    0   10
3   2   2016    0   10
4   2   2016    0   10
5   6   2016    0   10
6   0   2016    0   10
1   0   2016    1   10
2   0   2016    1   10
3   2   2016    1   10
4   0   2016    1   10
5   0   2016    1   10
6   0   2016    1   10
1   0   2016    2   10
2   6   2016    2   10
3   0   2016    2   10
4   4   2016    2   10
5   0   2016    2   10
6   2   2016    2   10
1   0   2016    3   10
2   0   2016    3   10
3   0   2016    3   10
4   4   2016    3   10
5   0   2016    3   10
6   0   2016    3   10
1   0   2016    4   10
2   0   2016    4   10
3   2   2016    4   10
4   4   2016    4   10
5   0   2016    4   10
6   2   2016    4   10
1   4   2016    5   10
2   0   2016    5   10
3   0   2016    5   10
4   8   2016    5   10
5   0   2016    5   10
6   0   2016    5   10
1   0   2016    6   10
2   0   2016    6   10
3   0   2016    6   10
4   6   2016    6   10
5   2   2016    6   10
6   6   2016    6   10
1   0   2020    0   8
2   2   2020    0   8
3   0   2020    0   8
4   0   2020    0   8
5   0   2020    0   8
6   2   2020    0   8
1   0   2020    1   8
2   0   2020    1   8
3   0   2020    1   8
4   0   2020    1   8
5   0   2020    1   8
6   0   2020    1   8
1   0   2020    2   8
2   0   2020    2   8
3   0   2020    2   8
4   0   2020    2   8
5   0   2020    2   8
6   0   2020    2   8
1   0   2020    3   8
2   0   2020    3   8
3   0   2020    3   8
4   0   2020    3   8
5   0   2020    3   8
6   0   2020    3   8
1   0   2020    4   8
2   0   2020    4   8
3   2   2020    4   8
4   0   2020    4   8
5   0   2020    4   8
6   0   2020    4   8
5   2   2020    5   8
6   4   2020    5   8
3   4   2020    6   8
3   4   2020    0   9

I am trying to convert it something like this

ID  qty year    day week    total_Sales_year    total_sales_by_day  total_sales_week
1   2   2016    5   9   78  12  14
2   4   2016    5   9   78  12  14
3   0   2016    5   9   78  12  14
4   4   2016    5   9   78  12  14
5   0   2016    5   9   78  12  14
6   2   2016    5   9   78  12  14
1   0   2016    6   9   78  2   14
2   2   2016    6   9   78  2   14
3   0   2016    6   9   78  2   14
4   0   2016    6   9   78  2   14
5   0   2016    6   9   78  2   14
6   0   2016    6   9   78  2   14
1   0   2016    0   10  78  2   64
2   2   2016    0   10  78  12  64
3   2   2016    0   10  78  12  64
4   2   2016    0   10  78  12  64
5   6   2016    0   10  78  12  64
6   0   2016    0   10  78  12  64
1   0   2016    1   10  78  2   64
2   0   2016    1   10  78  2   64
3   2   2016    1   10  78  2   64
4   0   2016    1   10  78  2   64
5   0   2016    1   10  78  2   64
6   0   2016    1   10  78  2   64
1   0   2016    2   10  78  12  64
2   6   2016    2   10  78  12  64
3   0   2016    2   10  78  12  64
4   4   2016    2   10  78  12  64
5   0   2016    2   10  78  12  64
6   2   2016    2   10  78  12  64
1   0   2016    3   10  78  4   64
2   0   2016    3   10  78  4   64
3   0   2016    3   10  78  4   64
4   4   2016    3   10  78  4   64
5   0   2016    3   10  78  4   64
6   0   2016    3   10  78  4   64
1   0   2016    4   10  78  8   64
2   0   2016    4   10  78  8   64
3   2   2016    4   10  78  8   64
4   4   2016    4   10  78  8   64
5   0   2016    4   10  78  8   64
6   2   2016    4   10  78  8   64
1   4   2016    5   10  78  12  64
2   0   2016    5   10  78  12  64
3   0   2016    5   10  78  12  64
4   8   2016    5   10  78  12  64
5   0   2016    5   10  78  12  64
6   0   2016    5   10  78  12  64
1   0   2016    6   10  78  14  64
2   0   2016    6   10  78  14  64
3   0   2016    6   10  78  14  64
4   6   2016    6   10  78  14  64
5   2   2016    6   10  78  14  64
6   6   2016    6   10  78  14  64
1   0   2020    0   8   20  4   20
2   2   2020    0   8   20  4   20
3   0   2020    0   8   20  4   20
4   0   2020    0   8   20  4   20
5   0   2020    0   8   20  4   20
6   2   2020    0   8   20  4   20
1   0   2020    1   8   20  0   20
2   0   2020    1   8   20  0   20
3   0   2020    1   8   20  0   20
4   0   2020    1   8   20  0   20
5   0   2020    1   8   20  0   20
6   0   2020    1   8   20  0   20
1   0   2020    2   8   20  0   20
2   0   2020    2   8   20  0   20
3   0   2020    2   8   20  0   20
4   0   2020    2   8   20  0   20
5   0   2020    2   8   20  0   20
6   0   2020    2   8   20  0   20
1   0   2020    3   8   20  0   20
2   0   2020    3   8   20  0   20
3   0   2020    3   8   20  0   20
4   0   2020    3   8   20  0   20
5   0   2020    3   8   20  0   20
6   0   2020    3   8   20  0   20
1   0   2020    4   8   20  2   20
2   0   2020    4   8   20  2   20
3   2   2020    4   8   20  2   20
4   0   2020    4   8   20  2   20
5   0   2020    4   8   20  2   20
6   0   2020    4   8   20  2   20
5   2   2020    5   8   20  6   20
6   4   2020    5   8   20  6   20
3   4   2020    6   8   20  4   20
3   4   2020    0   9   20  4   20

I tried doing something like this but this will not get expected dataframe.

df.groupBy("year").agg("qty")

but this will only give one column and won't be able to tie it back to ID Column.

1 Answer 1

1

You can use unbounded Window with only partitionBy clause as shown below or you can specify rowsBetween(Window.unboundedPreceding,Window.unboundedFollowing). I have tested on your data and it works fine. Let me know if it works.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w1=Window().partitionBy("year")
w2=Window().partitionBy("year","week")
w3=Window().partitionBy("year","week","day")

df.withColumn("total_Sales_year", F.sum("qty").over(w1))\
  .withColumn("total_sales_by_day", F.sum("qty").over(w3))\
  .withColumn("total_sales_week", F.sum("qty").over(w2)).show()

+---+---+----+---+----+----------------+------------------+----------------+
| ID|qty|year|day|week|total_Sales_year|total_sales_by_day|total_sales_week|
+---+---+----+---+----+----------------+------------------+----------------+
|  1|  2|2016|  5|   9|              78|                12|              14|
|  2|  4|2016|  5|   9|              78|                12|              14|
|  3|  0|2016|  5|   9|              78|                12|              14|
|  4|  4|2016|  5|   9|              78|                12|              14|
|  5|  0|2016|  5|   9|              78|                12|              14|
|  6|  2|2016|  5|   9|              78|                12|              14|
|  1|  0|2016|  6|   9|              78|                 2|              14|
|  2|  2|2016|  6|   9|              78|                 2|              14|
|  3|  0|2016|  6|   9|              78|                 2|              14|
|  4|  0|2016|  6|   9|              78|                 2|              14|
|  5|  0|2016|  6|   9|              78|                 2|              14|
|  6|  0|2016|  6|   9|              78|                 2|              14|
|  1|  0|2016|  0|  10|              78|                12|              64|
|  2|  2|2016|  0|  10|              78|                12|              64|
|  3|  2|2016|  0|  10|              78|                12|              64|
|  4|  2|2016|  0|  10|              78|                12|              64|
|  5|  6|2016|  0|  10|              78|                12|              64|
|  6|  0|2016|  0|  10|              78|                12|              64|
|  1|  0|2016|  1|  10|              78|                 2|              64|
|  2|  0|2016|  1|  10|              78|                 2|              64|
+---+---+----+---+----+----------------+------------------+----------------+
only showing top 20 rows

Additionally, if you want a column total_sales_year_to_day you could use this:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w1=Window().partitionBy("year")
w2=Window().partitionBy("year","week")
w3=Window().partitionBy("year","week","day")
w4=Window().partitionBy("year","week","day").orderBy("day")
w5=Window().partitionBy("year").orderBy("week","day")

df.withColumn("total_Sales_year", F.sum("qty").over(w1))\
  .withColumn("total_sales_by_day", F.sum("qty").over(w3))\
  .withColumn("total_sales_week", F.sum("qty").over(w2))\
  .withColumn("rownum", F.row_number().over(w4))\
  .withColumn("newday", F.when(F.col("rownum")!=1, F.lit(0)).otherwise(F.col("total_sales_by_day")))\
  .withColumn("total_sales_year_to_day", F.sum("newday").over(w5)).drop("rownum","newday").show()





+---+---+----+---+----+----------------+------------------+----------------+-----------------------+
| ID|qty|year|day|week|total_Sales_year|total_sales_by_day|total_sales_week|total_sales_year_to_day|
+---+---+----+---+----+----------------+------------------+----------------+-----------------------+
|  1|  2|2016|  5|   9|              78|                12|              14|                     12|
|  2|  4|2016|  5|   9|              78|                12|              14|                     12|
|  3|  0|2016|  5|   9|              78|                12|              14|                     12|
|  4|  4|2016|  5|   9|              78|                12|              14|                     12|
|  5|  0|2016|  5|   9|              78|                12|              14|                     12|
|  6|  2|2016|  5|   9|              78|                12|              14|                     12|
|  1|  0|2016|  6|   9|              78|                 2|              14|                     14|
|  2|  2|2016|  6|   9|              78|                 2|              14|                     14|
|  3|  0|2016|  6|   9|              78|                 2|              14|                     14|
|  4|  0|2016|  6|   9|              78|                 2|              14|                     14|
|  5|  0|2016|  6|   9|              78|                 2|              14|                     14|
|  6|  0|2016|  6|   9|              78|                 2|              14|                     14|
|  1|  0|2016|  0|  10|              78|                12|              64|                     26|
|  2|  2|2016|  0|  10|              78|                12|              64|                     26|
|  3|  2|2016|  0|  10|              78|                12|              64|                     26|
|  4|  2|2016|  0|  10|              78|                12|              64|                     26|
|  5|  6|2016|  0|  10|              78|                12|              64|                     26|
|  6|  0|2016|  0|  10|              78|                12|              64|                     26|
|  1|  0|2016|  1|  10|              78|                 2|              64|                     28|
|  2|  0|2016|  1|  10|              78|                 2|              64|                     28|
+---+---+----+---+----+----------------+------------------+----------------+-----------------------+
only showing top 20 rows
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.