I have a data frame as below:

import pandas as pd

dummy = pd.DataFrame([[1047, 2021, 0.38], [1056, 2021, 0.19]], columns=['reco', 'user', 'score'])
dummy

   reco  user  score
0  1047  2021   0.38
1  1056  2021   0.19

I want the output to look like this:

user    score   reco
2021    [0.38, 0.19]    [1047, 1056]

I want to group by user, with each list ordered by score in descending order and each reco aligned with its corresponding score.

I tried collect_list, but the order changes; I want to keep the same ordering.
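
For reference, a minimal sketch of that attempt (assuming a SparkSession is available as spark, as in the answer below); a plain groupBy with collect_list gives no guarantee about element order within the lists:

from pyspark.sql import functions as F

df = spark.createDataFrame(dummy)

# Plain groupBy + collect_list: element order within each list is not guaranteed
df.groupBy("user") \
  .agg(F.collect_list("score").alias("score"),
       F.collect_list("reco").alias("reco")) \
  .show()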

1 Answer

You can preserve the ordering by applying collect_list over a window function. In this case the window is partitioned by user and ordered by score descending. Note that collect_list needs a frame spanning the whole partition (ranged_spec below): with an ordered window, the default frame only extends up to the current row, so the first row would otherwise collect a single-element list.

import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql import Window as W

dummy = pd.DataFrame([[1047, 2021, 0.38], [1056, 2021, 0.19]], columns=['reco', 'user', 'score'])

df = spark.createDataFrame(dummy)

# Window partitioned by user, ordered by score descending
window_spec = W.partitionBy("user").orderBy(F.desc("score"))
# Same window with an explicit frame over the entire partition,
# so collect_list gathers every row, not just rows up to the current one
ranged_spec = window_spec.rowsBetween(W.unboundedPreceding, W.unboundedFollowing)

df.withColumn("reco", F.collect_list("reco").over(ranged_spec))\
  .withColumn("score", F.collect_list("score").over(ranged_spec))\
  .withColumn("rn", F.row_number().over(window_spec))\
  .where("rn == 1")\
  .drop("rn").show()

Output

+------------+----+------------+
|        reco|user|       score|
+------------+----+------------+
|[1047, 1056]|2021|[0.38, 0.19]|
+------------+----+------------+
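
As a side note, the same ordered lists can be built without a window by collecting (score, reco) structs and sorting the resulting array. A minimal sketch, assuming Spark 2.4+ (where sort_array handles arrays of structs):

# Collect (score, reco) pairs, sort them by score descending,
# then split the sorted structs back into separate array columns
df.groupBy("user") \
  .agg(F.sort_array(F.collect_list(F.struct("score", "reco")), asc=False).alias("tmp")) \
  .select("user",
          F.col("tmp.score").alias("score"),
          F.col("tmp.reco").alias("reco")) \
  .show()

Because score is the first field in the struct, sort_array orders the pairs by score, and each reco stays paired with its score.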