I have a data frame as below:

import pandas as pd

dummy = pd.DataFrame([[1047, 2021, 0.38], [1056, 2021, 0.19]], columns=['reco', 'user', 'score'])
dummy

   reco  user  score
0  1047  2021   0.38
1  1056  2021   0.19

I want the output to look like this:

user    score   reco
2021    [0.38, 0.19]    [1047, 1056]

I want to group by user, with each list ordered by score in descending order and each reco aligned with its corresponding score.

I tried collect_list, but the order changes; I want to keep the same ordering.
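
For reference, a minimal sketch of that attempt (assuming a SparkSession is available as spark, as in the answer below); a plain groupBy with collect_list gives no guarantee about element order within the lists:

from pyspark.sql import functions as F

df = spark.createDataFrame(dummy)

# Plain groupBy + collect_list: element order within each list is not guaranteed
df.groupBy("user") \
  .agg(F.collect_list("score").alias("score"),
       F.collect_list("reco").alias("reco")) \
  .show()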

1 Answer

You can preserve the ordering by applying collect_list over a window function. In this case the window is partitioned by user and ordered by score descending. Note that collect_list needs a frame spanning the whole partition (ranged_spec below): with an ordered window, the default frame only extends up to the current row, so the first row would otherwise collect a single-element list.

import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql import Window as W

dummy = pd.DataFrame([[1047, 2021, 0.38], [1056, 2021, 0.19]], columns=['reco', 'user', 'score'])

df = spark.createDataFrame(dummy)

# Window partitioned by user, ordered by score descending
window_spec = W.partitionBy("user").orderBy(F.desc("score"))
# Same window with an explicit frame over the entire partition,
# so collect_list gathers every row, not just rows up to the current one
ranged_spec = window_spec.rowsBetween(W.unboundedPreceding, W.unboundedFollowing)

df.withColumn("reco", F.collect_list("reco").over(ranged_spec))\
  .withColumn("score", F.collect_list("score").over(ranged_spec))\
  .withColumn("rn", F.row_number().over(window_spec))\
  .where("rn == 1")\
  .drop("rn").show()

Output

+------------+----+------------+
|        reco|user|       score|
+------------+----+------------+
|[1047, 1056]|2021|[0.38, 0.19]|
+------------+----+------------+
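
As a side note, the same ordered lists can be built without a window by collecting (score, reco) structs and sorting the resulting array. A minimal sketch, assuming Spark 2.4+ (where sort_array handles arrays of structs):

# Collect (score, reco) pairs, sort them by score descending,
# then split the sorted structs back into separate array columns
df.groupBy("user") \
  .agg(F.sort_array(F.collect_list(F.struct("score", "reco")), asc=False).alias("tmp")) \
  .select("user",
          F.col("tmp.score").alias("score"),
          F.col("tmp.reco").alias("reco")) \
  .show()

Because score is the first field in the struct, sort_array orders the pairs by score, and each reco stays paired with its score.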