I have a PySpark DataFrame with multiple columns, for example the one below.
from pyspark.sql import Row
l = [('Jack',"a","p"),('Jack',"b","q"),('Bell',"c","r"),('Bell',"d","s")]
rdd = sc.parallelize(l)
score_rdd = rdd.map(lambda x: Row(name=x[0], letters1=x[1], letters2=x[2]))
score_card = sqlContext.createDataFrame(score_rdd)
+----+--------+--------+
|name|letters1|letters2|
+----+--------+--------+
|Jack|       a|       p|
|Jack|       b|       q|
|Bell|       c|       r|
|Bell|       d|       s|
+----+--------+--------+
Now I want to group by "name" and, within each group, concatenate the values of each of the other columns. I know how to do it, but if there are thousands of columns my code becomes very ugly. Here is my solution.
import pyspark.sql.functions as f
t = score_card.groupby("name").agg(
    f.concat_ws("", f.collect_list("letters1")).alias("letters1"),
    f.concat_ws("", f.collect_list("letters2")).alias("letters2")
)
Here is the output I get when I save it in a CSV file.
+----+--------+--------+
|name|letters1|letters2|
+----+--------+--------+
|Jack|      ab|      pq|
|Bell|      cd|      rs|
+----+--------+--------+
But my main concern is about these two lines of code:
f.concat_ws("", f.collect_list("letters1")).alias("letters1"),
f.concat_ws("", f.collect_list("letters2")).alias("letters2")
If there are thousands of columns then I will have to repeat the above code thousands of times. Is there a simpler solution for this so that I don't have to repeat f.concat_ws() for every column?
I have searched everywhere and haven't been able to find a solution.