
I am looking for the count of distinct values in the array in each row, using a PySpark DataFrame:

input:

col1
[1,1,1]
[3,4,5]
[1,2,1,2]

output:
1
3
2  

I used the code below, but it gives me the length of each array instead:
output:
3
3
4

How do I achieve this using a PySpark DataFrame in Python?

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

slen = udf(lambda s: len(s), IntegerType())
count = df.withColumn("Count", slen(df.col1))
count.show()

Thanks in advance!

1 Answer


For Spark 2.4+ you can use array_distinct and then take the size of the result to get the count of distinct values in each array. A UDF will be very slow and inefficient for big data; always try to use Spark's built-in functions first.

https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.array_distinct

(welcome to SO)

df.show()

+------------+
|        col1|
+------------+
|   [1, 1, 1]|
|   [3, 4, 5]|
|[1, 2, 1, 2]|
+------------+

from pyspark.sql import functions as F

df.withColumn("count", F.size(F.array_distinct("col1"))).show()

+------------+-----+
|        col1|count|
+------------+-----+
|   [1, 1, 1]|    1|
|   [3, 4, 5]|    3|
|[1, 2, 1, 2]|    2|
+------------+-----+
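
For anyone who wants to run this end to end, here is a minimal, self-contained sketch. It assumes a local SparkSession, and the session setup and the pre-2.4 UDF fallback are illustrative additions, not part of the answer above:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.appName("distinct-array-counts").getOrCreate()

# Sample data from the question.
df = spark.createDataFrame(
    [([1, 1, 1],), ([3, 4, 5],), ([1, 2, 1, 2],)],
    ["col1"],
)

# Spark 2.4+: deduplicate the array, then take its size; this stays in the JVM.
df.withColumn("count", F.size(F.array_distinct("col1"))).show()

# Fallback for Spark < 2.4: deduplicate inside a UDF. This is what the
# question's UDF was missing: len(s) counts all elements, while len(set(s))
# counts distinct ones. Every row round-trips through Python, though, so
# prefer the built-in version where available.
distinct_len = F.udf(lambda s: len(set(s)), IntegerType())
df.withColumn("count", distinct_len("col1")).show()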