I have two rows that share the same id and product but differ in the class and cost columns:

id  product  class   cost
1   table    large   5.12
1   table    medium  2.20

I'm trying to get the following:

id  product  class          cost
1   table    large, medium  7.32

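I'm currently using the following code to get this:

from pyspark.sql import functions as F

# Collect every class value per (id, product) group and sum the costs.
df.groupBy("id", "product").agg(
    F.collect_list("class"),
    F.sum("cost").alias("Sum"),
)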

The issue with this snippet is that, when grouping, it seems to take only the first value it finds in class, and the sum doesn't seem to be correct either (I'm not sure whether that is because it takes the first value and adds it once for each time class appears for that id across the rows). So I'm getting something like this:

id  product  class         cost
1   table    large, large  10.24

This is another snippet I used, so that I could keep all my other fields while summing the cost:

from pyspark.sql import Window

df.withColumn("total", F.sum("cost").over(Window.partitionBy("id")))

Would applying the F.array_join() function give the same result?

1 Answer

You need to use the array_join function to join the results of collect_list with commas (,).

df = df.groupBy('id', 'product').agg(
    F.array_join(F.collect_list('class'), ',').alias('class'),
    F.sum('cost').alias('cost')
)
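
As a sanity check on the logic, independent of Spark, here is a minimal plain-Python sketch of what this grouped aggregation computes for the sample rows in the question. The helper name group_and_aggregate and the ", " separator are assumptions made for illustration; it is a model of the behaviour, not Spark code.

```python
from collections import defaultdict

# Sample rows from the question: (id, product, class, cost).
rows = [
    (1, "table", "large", 5.12),
    (1, "table", "medium", 2.20),
]


def group_and_aggregate(rows, sep=", "):
    """Plain-Python model of groupBy(id, product) with
    array_join(collect_list(class), sep) and sum(cost).
    Hypothetical helper, for illustration only."""
    classes = defaultdict(list)
    totals = defaultdict(float)
    for id_, product, class_, cost in rows:
        key = (id_, product)
        classes[key].append(class_)  # like collect_list: keeps duplicates
        totals[key] += cost          # like sum
    return [
        (key[0], key[1], sep.join(classes[key]), round(totals[key], 2))
        for key in classes
    ]


# Distinct input rows give both class values and the expected total.
result = group_and_aggregate(rows)
print(result)

# If the same row reaches the aggregation twice, the output matches
# the "large, large" / 10.24 result reported in the question.
duplicated = group_and_aggregate([rows[0], rows[0]])
print(duplicated)
```

Note that 10.24 is exactly 2 × 5.12, which is what the aggregation would return if both input rows carried class large and cost 5.12. That is an inference from the numbers, not something confirmed in this thread, but it suggests inspecting the rows of df just before the groupBy. Also keep in mind that Spark does not guarantee the order of the elements returned by collect_list after a shuffle.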

1 Comment

I updated my original post. I also tried what you posted, but for some reason, with my data, it only takes the first value it encounters. So instead of listing (large, medium) as in the example, it lists (large, large) for every sum it computes, even when the class is actually different. Is that because of the logic of the code?
