
I've been thinking about the following problem but haven't reached a solution: I have a DataFrame df with a single column A, whose elements have type Array[String]. I'm trying to get all the distinct arrays in A, ignoring the order of the strings within each array.

For example, if the dataframe is the following:

df.select("A").show()

+--------+
|A       |
+--------+
|[a,b,c] |
|[d,e]   |
|[f]     |
|[e,d]   |
|[c,a,b] |
+--------+

I would like to get the dataframe

+--------+
|[a,b,c] |
|[d,e]   |
|[f]     |
+--------+

I've tried distinct(), dropDuplicates() and other functions, but it doesn't work.

I would appreciate any help. Thank you in advance.


2 Answers


You can use the collect_list function to collect all the arrays in that column, then a udf function to sort each array and return the distinct arrays of the collected list, and finally the explode function to spread the distinct arrays back into separate rows.

import scala.collection.mutable
import org.apache.spark.sql.functions._
import spark.implicits._  // for the $-notation; assumes a SparkSession named spark

// Sort each collected array so order no longer matters, then keep the distinct ones
def distinctCollectUDF = udf((a: mutable.WrappedArray[mutable.WrappedArray[String]]) => a.map(array => array.sorted).distinct)

df.select(distinctCollectUDF(collect_list("A")).as("A")).withColumn("A", explode($"A")).show(false)

You should have your desired result.


2 Comments

It might be faster to have a udf to sort the array, then call distinct on the dataframe.
@ayplam Yes, maybe. I haven't tested that, but that's definitely another solution.
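The idea from the comment above (normalize each array by sorting it, then deduplicate, without collecting everything into one row) can be sketched in plain Scala; in Spark this corresponds roughly to `df.withColumn("A", sort_array($"A")).dropDuplicates()` using the built-in sort_array function. The sample data below is taken from the question:

```scala
// Plain-Scala sketch of "sort each array, then distinct".
// In a DataFrame this would be sort_array followed by dropDuplicates.
val rows = Seq(
  Seq("a", "b", "c"),
  Seq("d", "e"),
  Seq("f"),
  Seq("e", "d"),
  Seq("c", "a", "b")
)

// Sorting makes order-insensitive duplicates identical; distinct removes them
val deduped = rows.map(_.sorted).distinct
println(deduped)  // keeps one representative per unordered set
```

Note that this keeps the sorted form of each array rather than the original ordering, which matches the udf answer's behavior.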

You might try using the contains method.
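The answer doesn't spell out how contains would be applied; one plausible reading (a sketch, not the answerer's confirmed intent) is to keep an array only if a sorted-equivalent has not already been seen, checking membership with contains:

```scala
// Hedged sketch: accumulate arrays, skipping any whose sorted form is
// already contained among the sorted forms of the kept arrays.
val rows = Seq(
  Seq("a", "b", "c"),
  Seq("d", "e"),
  Seq("f"),
  Seq("e", "d"),
  Seq("c", "a", "b")
)

val kept = rows.foldLeft(Vector.empty[Seq[String]]) { (acc, arr) =>
  if (acc.map(_.sorted).contains(arr.sorted)) acc else acc :+ arr
}
println(kept)  // preserves the first-seen ordering of each array
```

Unlike the sort-then-distinct approach, this keeps each array in its original element order, but it is quadratic in the number of rows, so it only suits small collected data.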

