
I am working with a Spark DataFrame and I am trying to create a new table with an aggregation using groupBy. My data example: [sample data shown in an image]

and this is the desired result: [desired result shown in an image]

I tried this code:

data.groupBy("id1").agg(countDistinct("id2").alias("id2"), sum("value").alias("value"))

Can anyone help, please? Thank you.
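
For context, a minimal, self-contained version of what I tried (the sample rows and the name data here are placeholders, since my real data is only shown in the images):

from pyspark.sql import SparkSession
from pyspark.sql.functions import countDistinct, sum

spark = SparkSession.builder.getOrCreate()

# Placeholder rows standing in for the data in the image
data = spark.createDataFrame(
    [("id11", "id21", 1), ("id11", "id22", 2), ("id12", "id21", 2)],
    ["id1", "id2", "value"],
)

# Distinct count of id2 and sum of value per id1
data.groupBy("id1").agg(
    countDistinct("id2").alias("id2"),
    sum("value").alias("value"),
).show()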

2 Answers


Try using the code below -

from pyspark.sql.functions import *

df = spark.createDataFrame([('id11', 'id21', 1), ('id11', 'id22', 2), ('id11', 'id23', 3), ('id12', 'id21', 2), ('id12', 'id23', 1), ('id13', 'id23', 2), ('id13', 'id21', 8)], ["id1", "id2","value"])

Aggregated Data -

df.groupBy("id1").agg(count("id2"),sum("value")).show()

Output -

+----+----------+----------+
| id1|count(id2)|sum(value)|
+----+----------+----------+
|id11|         3|         6|
|id12|         2|         3|
|id13|         2|        10|
+----+----------+----------+
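
If you specifically need the number of distinct id2 values per id1 (as in your original attempt), countDistinct works the same way - a minimal sketch using the same df as above:

df.groupBy("id1").agg(countDistinct("id2").alias("id2_distinct"), sum("value").alias("value_sum")).show()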



Here's a solution showing how to aggregate multiple columns after a groupBy in PySpark:

import pyspark.sql.functions as F
from pyspark.sql.functions import col

df.groupBy("id1").agg(F.count(col("id2")).alias('id2_count'),                     
                      F.sum(col('value')).alias("value_sum")).show()
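
As an alternative (a minimal sketch assuming the same df), agg also accepts a dictionary mapping column names to aggregate function names; the result columns are then named automatically, e.g. count(id2) and sum(value):

df.groupBy("id1").agg({"id2": "count", "value": "sum"}).show()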

