
I've been thinking about the following problem but haven't reached a solution: I have a DataFrame df with a single column A, whose elements have type Array[String]. I'm trying to get all the distinct arrays in A, ignoring the order of the strings within each array.

For example, if the dataframe is the following:

df.select("A").show()

+--------+
|A       |
+--------+
|[a,b,c] |
|[d,e]   |
|[f]     |
|[e,d]   |
|[c,a,b] |
+--------+

I would like to get the dataframe

+--------+
|[a,b,c] |
|[d,e]   |
|[f]     |
+--------+

I've tried distinct(), dropDuplicates() and other functions, but it doesn't work.

I would appreciate any help. Thank you in advance.


2 Answers


You can use the collect_list function to collect all the arrays in that column, then a udf function to sort each array and return the distinct arrays of the collected list, and finally the explode function to spread the distinct arrays back into separate rows.

import scala.collection.mutable
import org.apache.spark.sql.functions._
import spark.implicits._  // for the $-notation; assumes a SparkSession named spark

// Sort each collected array so order no longer matters, then keep the distinct ones
def distinctCollectUDF = udf((a: mutable.WrappedArray[mutable.WrappedArray[String]]) => a.map(array => array.sorted).distinct)

df.select(distinctCollectUDF(collect_list("A")).as("A")).withColumn("A", explode($"A")).show(false)

You should have your desired result.


2 Comments

It might be faster to have a udf to sort the array, then call distinct on the dataframe.
@ayplam Yes, maybe. I haven't tested that, but that's definitely another solution.
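The idea from the comment above (normalize each array by sorting it, then deduplicate, without collecting everything into one row) can be sketched in plain Scala; in Spark this corresponds roughly to `df.withColumn("A", sort_array($"A")).dropDuplicates()` using the built-in sort_array function. The sample data below is taken from the question:

```scala
// Plain-Scala sketch of "sort each array, then distinct".
// In a DataFrame this would be sort_array followed by dropDuplicates.
val rows = Seq(
  Seq("a", "b", "c"),
  Seq("d", "e"),
  Seq("f"),
  Seq("e", "d"),
  Seq("c", "a", "b")
)

// Sorting makes order-insensitive duplicates identical; distinct removes them
val deduped = rows.map(_.sorted).distinct
println(deduped)  // keeps one representative per unordered set
```

Note that this keeps the sorted form of each array rather than the original ordering, which matches the udf answer's behavior.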

You might try using the contains method.
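The answer doesn't spell out how contains would be applied; one plausible reading (a sketch, not the answerer's confirmed intent) is to keep an array only if a sorted-equivalent has not already been seen, checking membership with contains:

```scala
// Hedged sketch: accumulate arrays, skipping any whose sorted form is
// already contained among the sorted forms of the kept arrays.
val rows = Seq(
  Seq("a", "b", "c"),
  Seq("d", "e"),
  Seq("f"),
  Seq("e", "d"),
  Seq("c", "a", "b")
)

val kept = rows.foldLeft(Vector.empty[Seq[String]]) { (acc, arr) =>
  if (acc.map(_.sorted).contains(arr.sorted)) acc else acc :+ arr
}
println(kept)  // preserves the first-seen ordering of each array
```

Unlike the sort-then-distinct approach, this keeps each array in its original element order, but it is quadratic in the number of rows, so it only suits small collected data.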

