PySpark DataFrame groupby into list of values?

Question

Say I have the following DataFrame:

+-------------+----------+------+
|employee_name|department|salary|
+-------------+----------+------+
|        James|     Sales|  3000|
|      Michael|     Sales|  4600|
|       Robert|     Sales|  4100|
|        Maria|   Finance|  3000|
|        James|     Sales|  3000|
|        Scott|   Finance|  3300|
|          Jen|   Finance|  3900|
|         Jeff| Marketing|  3000|
|        Kumar| Marketing|  2000|
|         Saif|     Sales|  4100|
+-------------+----------+------+

How can I group by department and collect the values of every other column into lists, as follows?

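+----------+-------------------------------------+------------------------------+
|department|employee_name                        |salary                        |
+----------+-------------------------------------+------------------------------+
|Sales     |[James, Michael, Robert, James, Saif]|[3000, 4600, 4100, 3000, 4100]|
|Finance   |[Maria, Scott, Jen]                  |[3000, 3300, 3900]            |
|Marketing |[Jeff, Kumar]                        |[3000, 2000]                  |
+----------+-------------------------------------+------------------------------+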

spark.apache.org/docs/latest/api/sql/index.html#collect_list
– David דודו Markovitz, Commented Mar 17, 2022 at 20:22

notNull · Accepted Answer · 2022-03-17 20:23:59Z

4

Use collect_list with a groupBy clause:

from pyspark.sql.functions import col, collect_list

df.groupBy(col("department")).agg(
    collect_list(col("employee_name")).alias("employee_name"),
    collect_list(col("salary")).alias("salary"),
)


wwnde · Accepted Answer · 2022-03-17 21:28:36Z

2

Let's try with minimal typing:

from pyspark.sql.functions import collect_list
df.groupby('department').agg(*[collect_list(c).alias(c) for c in df.drop('department').columns]).show()

