PySpark - Group by Array column

Question

I am very new to pySpark. Appreciate your help.. I have a dataframe

test["1"]={"vars":["x1","x2"]}
test["2"]={"vars":["x2"]}
test["3"]={"vars":["x3"]}
test["4"]={"vars":["x2","x3"]}
pdDF = pd.DataFrame(test).transpose()
sparkDF=spark.createDataFrame(pdDF) 

+--------+
|    vars|
+--------+
|[x1, x2]|
|    [x2]|
|    [x3]|
|[x2, x3]|
+--------+

I am looking for a way to group column "vars" by values in the list and count I am looking for next result:


+-----+---+
|count|var|
+-----+---+
|    1| x1|
|    3| x2|
|    2| x3|
+-----+---+

Can somebody advise how to achieve this?

Thanks in advance!

Prathik Kini · Accepted Answer · 2019-10-10 08:48:08Z

2

from pyspark.sql.functions import explode
values = [(["x1","x2"],),(["x2"],),(["x3"],),(["x2","x3"],)]
df = sqlContext.createDataFrame(values,['vars'])
df.show()

+--------+
|    vars|
+--------+
|[x1, x2]|
|    [x2]|
|    [x3]|
|[x2, x3]|
+--------+

newdf=df.withColumn("vars2", explode(df.vars))
newdf.groupBy('vars2').count().show()

+-----+-----+
|vars2|count|
+-----+-----+
|   x2|    3|
|   x3|    2|
|   x1|    1|
+-----+-----+

answered Oct 10, 2019 at 8:48

Prathik Kini

1,8101 gold badge19 silver badges37 bronze badges

Sign up to request clarification or add additional context in comments.

Collectives™ on Stack Overflow

PySpark - Group by Array column

1 Answer 1

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Related