In Spark 2.4+ you can get similar behavior to MySQL's GROUP_CONCAT() and Redshift's LISTAGG() with the help of collect_list() and array_join(), without the need for any UDFs.
Here's a demonstration in PySpark, though the code should be very similar for Scala too:
from pyspark.sql.functions import array_join, sort_array, collect_list

friends = spark.createDataFrame(
    [
        ('jacques', 'nicolas'),
        ('jacques', 'georges'),
        ('jacques', 'francois'),
        ('bob', 'amelie'),
        ('bob', 'zoe'),
    ],
    schema=['username', 'friend'],
)
(
    friends
    .groupBy('username')
    .agg(
        array_join(
            sort_array(
                collect_list('friend'),
                asc=False,
            ),
            delimiter=', ',
        ).alias('friends')
    )
    .show(truncate=False)
)
The equivalent in Spark SQL:
SELECT
    username,
    array_join(
        sort_array(
            collect_list(friend),
            false
        ),
        ', '
    ) AS friends
FROM friends
GROUP BY username;
Here's the output:
+--------+--------------------------+
|username|friends |
+--------+--------------------------+
|jacques |nicolas, georges, francois|
|bob |zoe, amelie |
+--------+--------------------------+
Note that if you want the joined elements to appear in a particular order, you should rely on one of the array sorting functions like sort_array or array_sort (they differ, notably in how they order NULL elements) rather than on ORDER BY. That's because collect_list does not guarantee any particular order: it collects elements in whatever order the partitions happen to be processed, which is non-deterministic.
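To illustrate the difference between the two sorting functions, here's a small Spark SQL sketch (assuming Spark 2.4+; the exact NULL placement is the behavior documented for these built-ins, so verify against your Spark version):

```sql
-- sort_array: NULLs first when ascending, NULLs last when descending
SELECT sort_array(array(2, NULL, 1));        -- [NULL, 1, 2]
SELECT sort_array(array(2, NULL, 1), false); -- [2, 1, NULL]

-- array_sort: ascending only (without a comparator), NULLs placed last
SELECT array_sort(array(2, NULL, 1));        -- [1, 2, NULL]
```

If NULLs can appear in your collected arrays, this placement affects what ends up at the start or end of the joined string, so pick the function accordingly.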