
I am working with a Spark DataFrame and I am trying to create a new table with an aggregation using groupBy. My data example: [sample data shown in an image]

and this is the desired result: [desired result shown in an image]

I tried this code:

data.groupBy("id1").agg(countDistinct("id2").alias("id2"), sum("value").alias("value"))

Can anyone help, please? Thank you.
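
For context, a minimal, self-contained version of what I tried (the sample rows and the name data here are placeholders, since my real data is only shown in the images):

from pyspark.sql import SparkSession
from pyspark.sql.functions import countDistinct, sum

spark = SparkSession.builder.getOrCreate()

# Placeholder rows standing in for the data in the image
data = spark.createDataFrame(
    [("id11", "id21", 1), ("id11", "id22", 2), ("id12", "id21", 2)],
    ["id1", "id2", "value"],
)

# Distinct count of id2 and sum of value per id1
data.groupBy("id1").agg(
    countDistinct("id2").alias("id2"),
    sum("value").alias("value"),
).show()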

2 Answers


Try using the code below -

from pyspark.sql.functions import *

df = spark.createDataFrame([('id11', 'id21', 1), ('id11', 'id22', 2), ('id11', 'id23', 3), ('id12', 'id21', 2), ('id12', 'id23', 1), ('id13', 'id23', 2), ('id13', 'id21', 8)], ["id1", "id2","value"])

Aggregated Data -

df.groupBy("id1").agg(count("id2"),sum("value")).show()

Output -

+----+----------+----------+
| id1|count(id2)|sum(value)|
+----+----------+----------+
|id11|         3|         6|
|id12|         2|         3|
|id13|         2|        10|
+----+----------+----------+
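
If you specifically need the number of distinct id2 values per id1 (as in your original attempt), countDistinct works the same way - a minimal sketch using the same df as above:

df.groupBy("id1").agg(countDistinct("id2").alias("id2_distinct"), sum("value").alias("value_sum")).show()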



Here's a solution showing how to aggregate multiple columns after a groupBy in PySpark:

import pyspark.sql.functions as F
from pyspark.sql.functions import col

df.groupBy("id1").agg(F.count(col("id2")).alias('id2_count'),                     
                      F.sum(col('value')).alias("value_sum")).show()
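
As an alternative (a minimal sketch assuming the same df), agg also accepts a dictionary mapping column names to aggregate function names; the result columns are then named automatically, e.g. count(id2) and sum(value):

df.groupBy("id1").agg({"id2": "count", "value": "sum"}).show()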

