I have an array column that I search for certain texts, and I build a DataFrame from the matching rows. Which of the two options below is the better approach?

Option 1

val texts = Seq("text1", "text2", "text3")
 val df = mainDf.select(col("*"))
      .withColumn("temptext", explode($"textCol"))
      .where($"temptext".isin(texts: _*))

Because this adds an extra column "temptext" and the explode produces duplicate rows, I then have to clean up:

  val tempDf = df.drop("temptext").dropDuplicates("Root.Id")  // dropDuplicates does not work here because "Root.Id" is a nested field
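
One possible workaround for the nested-field problem, as a minimal sketch: copy the nested field into a temporary top-level column, de-duplicate on that, then drop it again. The helper name dropDuplicatesByNested and the temporary column name are hypothetical and not part of Spark.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

object NestedDedup {
  // Hypothetical helper: dropDuplicates resolves top-level column names only,
  // so the nested field is first copied into a temporary top-level column.
  def dropDuplicatesByNested(df: DataFrame, nestedPath: String): DataFrame = {
    val tmp = "__dedup_key" // assumption: no existing column has this name
    df.withColumn(tmp, col(nestedPath))
      .dropDuplicates(tmp)
      .drop(tmp)
  }
}
```

With the DataFrame from Option 1 this would be called as NestedDedup.dropDuplicatesByNested(df.drop("temptext"), "Root.Id"). Which of the duplicate rows survives is not defined, which should not matter here because the exploded rows differ only in the dropped column.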

vs

Option 2

val df = mainDf.select(col("*"))
  .where(array_contains($"textCol", "text1") ||
      array_contains($"textCol", "text2") ||
      array_contains($"textCol", "text3"))

What I actually want is a generic API. With option 2, every new text means adding another array_contains($"textCol", "text4") clause and creating a new API each time.

With option 1, exploding the array creates duplicate rows, and the temporary column has to be dropped afterwards.
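
For what it is worth, the predicate in option 2 does not have to be written out by hand. A minimal sketch, assuming the texts arrive as a non-empty Seq[String]: build one array_contains check per text and OR them together. The object and helper names (AnyTextFilter, containsAny) are made up for illustration, and the sample data is invented.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{array_contains, col}

object AnyTextFilter {
  // Hypothetical helper: one array_contains check per text, combined with OR.
  def containsAny(arrayCol: Column, texts: Seq[String]): Column = {
    require(texts.nonEmpty, "texts must not be empty")
    texts.map(t => array_contains(arrayCol, t)).reduce(_ || _)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("any-text-filter").getOrCreate()
    import spark.implicits._

    // Invented sample data standing in for mainDf.
    val mainDf = Seq(Seq("text1"), Seq("text4", "text1"), Seq("text5")).toDF("textCol")
    val texts = Seq("text1", "text2", "text3")

    // Same filter as option 2, but driven by the texts sequence.
    mainDf.where(containsAny(col("textCol"), texts)).show(false)

    spark.stop()
  }
}
```

This keeps option 2's behaviour (no explode, no duplicate rows, no temporary column) while letting callers pass any list of texts.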

  • Use array_contains. Commented Apr 24, 2020 at 7:51
  • @Yogesh Is there a way to pass multiple values to array_contains? I have not found one on the web yet. Commented Apr 24, 2020 at 7:57
  • What do you mean by multiple values in array_contains? Commented Apr 24, 2020 at 8:03
  • With Seq("text1", "text2", "text3"), return true if any one of them is in the array. Is there a way to pass them like array_contains($"textCol", ("text1", "text2", "text3")), similar to an IN clause? Commented Apr 24, 2020 at 8:08
  • Got it. That is not possible with array_contains; you can use any of the methods mentioned in the answers. Commented Apr 24, 2020 at 8:48

1 Answer

Use the arrays_overlap or array_intersect function (both available from Spark 2.4) and pass the texts as a single array of strings, instead of chaining array_contains calls.

Example:

1. Filter based on the texts variable:

val df=Seq((Seq("text1")),(Seq("text4","text1")),(Seq("text5"))).
toDF("textCol")

df.show()
//+--------------+
//|       textCol|
//+--------------+
//|       [text1]|
//|[text4, text1]|
//|       [text5]|
//+--------------+

val texts = Array("text1","text2","text3")

// using arrays_overlap (column name written as textCol so it also resolves with spark.sql.caseSensitive=true)
df.filter(arrays_overlap(col("textCol"), lit(texts))).show(false)
//+--------------+
//|textCol       |
//+--------------+
//|[text1]       |
//|[text4, text1]|
//+--------------+

// using array_intersect: keep rows whose intersection with texts is non-empty
df.filter(size(array_intersect(col("textCol"), lit(texts))) > 0).show(false)
//+--------------+
//|textCol       |
//+--------------+
//|[text1]       |
//|[text4, text1]|
//+--------------+

2. Add the texts variable to the DataFrame as a column:

val texts = "text1,text2,text3"

val df=Seq((Seq("text1")),(Seq("text4","text1")),(Seq("text5"))).
toDF("textCol").
withColumn("texts", split(lit(texts), ","))  // split the comma-separated string into an array column

df.show(false)
//+--------------+---------------------+
//|textCol       |texts                |
//+--------------+---------------------+
//|[text1]       |[text1, text2, text3]|
//|[text4, text1]|[text1, text2, text3]|
//|[text5]       |[text1, text2, text3]|
//+--------------+---------------------+

// using array_intersect
df.filter("""size(array_intersect(textCol, texts)) > 0""").show(false)
//+--------------+---------------------+
//|textCol       |texts                |
//+--------------+---------------------+
//|[text1]       |[text1, text2, text3]|
//|[text4, text1]|[text1, text2, text3]|
//+--------------+---------------------+

// using arrays_overlap
df.filter("""arrays_overlap(textCol, texts)""").show(false)
//+--------------+---------------------+
//|textCol       |texts                |
//+--------------+---------------------+
//|[text1]       |[text1, text2, text3]|
//|[text4, text1]|[text1, text2, text3]|
//+--------------+---------------------+
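
Since the question asks for a generic API, the arrays_overlap variant could be wrapped roughly as follows. This is only a sketch: the object and function names (OverlapFilter, filterByAnyText) are hypothetical, and it assumes Spark 2.4 or later and a non-empty list of texts.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{array, arrays_overlap, col, lit}

object OverlapFilter {
  // Hypothetical helper: keep rows whose array column shares at least one
  // element with `texts`.
  def filterByAnyText(df: DataFrame, arrayColName: String, texts: Seq[String]): DataFrame = {
    require(texts.nonEmpty, "texts must not be empty")
    // Build an array<string> literal from the texts.
    val wanted = array(texts.map(t => lit(t)): _*)
    df.filter(arrays_overlap(col(arrayColName), wanted))
  }
}
```

Usage would look like OverlapFilter.filterByAnyText(mainDf, "textCol", Seq("text1", "text2", "text3")). One thing to keep in mind: arrays_overlap returns null rather than false when there is no common element and one of the arrays contains a null, and filter drops such rows just as it drops false ones.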