
I created a PySpark DataFrame using the following code:

testlist = [
             {"category":"A","name":"A1"}, 
             {"category":"A","name":"A2"}, 
             {"category":"B","name":"B1"},
             {"category":"B","name":"B2"}
]

spark_df = spark.createDataFrame(testlist)

Result:

category    name
A           A1
A           A2
B           B1
B           B2

I want to make it appear as follows:

category    name
A           A1, A2
B           B1, B2

I tried the following code, which does not work:

spark_df.groupby('category').agg('name', lambda x:x + ', ')

Can anyone help identify what I am doing wrong and the best way to make this happen?

3 Answers


One option is to use pyspark.sql.functions.collect_list() as the aggregate function.

from pyspark.sql.functions import collect_list
grouped_df = spark_df.groupby('category').agg(collect_list('name').alias("name"))

This will collect the values for name into a list and the resultant output will look like:

grouped_df.show()
#+---------+---------+
#|category |name     |
#+---------+---------+
#|A        |[A1, A2] |
#|B        |[B1, B2] |
#+---------+---------+

Update 2019-06-10: If you wanted your output as a concatenated string, you can use pyspark.sql.functions.concat_ws to concatenate the values of the collected list, which will be better than using a udf:

from pyspark.sql.functions import concat_ws

grouped_df.withColumn("name", concat_ws(", ", "name")).show()
#+---------+-------+
#|category |name   |
#+---------+-------+
#|A        |A1, A2 |
#|B        |B1, B2 |
#+---------+-------+
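If you don't need the intermediate array column, you can also collect and concatenate in a single aggregation. A minimal sketch, using the same spark_df from the question:

from pyspark.sql.functions import collect_list, concat_ws

# collect the names per category and join them with ", " in one pass
spark_df.groupby('category') \
    .agg(concat_ws(", ", collect_list('name')).alias("name")) \
    .show()

This should give the same A1, A2 / B1, B2 output as above.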

Original Answer: If you wanted your output as a concatenated string, you'd have to use a udf. For example, you can first do the groupBy() as above and then apply a udf to join the collected list:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

concat_list = udf(lambda lst: ", ".join(lst), StringType())

grouped_df.withColumn("name", concat_list("name")).show()
#+---------+-------+
#|category |name   |
#+---------+-------+
#|A        |A1, A2 |
#|B        |B1, B2 |
#+---------+-------+

2 Comments

Hi pault, thanks for your help. I am wondering how you got the answer. I tried exactly the same thing and I get the error 'NoneType does not have attribute withColumn'. Please let me know if I should correct something. Thanks!
Are you sure you didn't set grouped_df equal to the output of .show()? That error means the df you are calling withColumn() on is actually None.
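For reference, here is a minimal sketch of the mistake described in the comment above (the variable names are just illustrative):

# Wrong: .show() prints the DataFrame and returns None,
# so grouped_df is None afterwards and grouped_df.withColumn(...) fails
grouped_df = spark_df.groupby('category').agg(collect_list('name').alias("name")).show()

# Right: keep the DataFrame, then call .show() separately
grouped_df = spark_df.groupby('category').agg(collect_list('name').alias("name"))
grouped_df.show()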

UNIQUE values

If you want unique values then use collect_set instead of collect_list

from pyspark.sql.functions import collect_set
grouped_df = spark_df.groupby('category').agg(collect_set('name').alias("unique_name"))
grouped_df.show()
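As a minimal sketch of the difference, assuming a hypothetical input that contains a duplicate name:

from pyspark.sql.functions import collect_set

# hypothetical data with a duplicated name in category A
dup_df = spark.createDataFrame([
    {"category": "A", "name": "A1"},
    {"category": "A", "name": "A1"},
    {"category": "A", "name": "A2"}
])

# collect_set drops the duplicate, so category A ends up with just A1 and A2
# (the order inside the resulting array is not guaranteed)
dup_df.groupby('category').agg(collect_set('name').alias("unique_name")).show()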



Another option is this

>>> df.rdd.reduceByKey(lambda x,y: x+','+y).toDF().show()
+---+-----+
| _1|   _2|
+---+-----+
|  A|A1,A2|
|  B|B1,B2|
+---+-----+
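If you want to keep the original column names instead of _1 and _2, you can pass them to toDF (a minimal sketch based on the snippet above):

>>> df.rdd.reduceByKey(lambda x,y: x+','+y).toDF(["category", "name"]).show()

This produces the same two rows, but with category and name as the headers.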

2 Comments

Hi Bala, thanks for your help. I tried your script on my data, which has more than two instances of each key. It gives me the error: ValueError: too many values to unpack (expected 2)
reduceByKey works with more than one instance of a key. Can you give us sample data with the exact pattern you are working with?
