
I need to achieve something similar to "Checking if values in List is part of String in spark", i.e. there is a DataFrame with a string column such as:

abcd_some long strings
goo bar baz

and an Array of desired words like ["some", "bar"].

A UDF with this code would work just fine; however, I would like to have something more efficient. Is there a way to express the FILTER my_col CONTAINS ONE OF [items] using the SQL DSL? Perhaps by dynamically constructing a regex?

NOTE: this is not an exact match but a regular 'CONTAINS' / LIKE '%thing%' substring match. Otherwise the isin operator would work.

edit

Generating some SQL code dynamically is probably the most efficient way.

import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.col

def orFilterGeneratorMultiContains(filterPredicates: Seq[String], column: String): Column = {
  // Hard-coded for the first two items only
  col(column).contains(filterPredicates(0)) or col(column).contains(filterPredicates(1)) // TODO iterate
}

def filterToDesiredApps(filterPredicates: Seq[String], column: String)(df: DataFrame): DataFrame = {
  df.filter(orFilterGeneratorMultiContains(filterPredicates, column))
}

So I still need to figure out how to properly iterate the expression.

edit 2

However, this turns out to be a bit tricky:

import org.apache.spark.sql.functions.col

val column = col("foo")
val interstingTHings = Seq("bar", "baz", "thing3")

interstingTHings.foldLeft(column) { (filteredOrColumnExpression, predicateItem) =>
  // TODO how to properly nest the OR operator?
  // filteredOrColumnExpression.contains(predicateItem) // generates: Contains(Contains(Contains('foo, bar), baz), thing3)
  filteredOrColumnExpression or filteredOrColumnExpression.contains(predicateItem) // generates: ((('foo || Contains('foo, bar)) || Contains(('foo || Contains('foo, bar)), baz)) || Contains((('foo || Contains('foo, bar)) || Contains(('foo || Contains('foo, bar)), baz)), thing3)) 
  // TODO but what I really need is:
  //   column.contains("bar") or column.contains("baz") or column.contains("thing3")
}.explain(true)

as it does not generate the intended flat chain of OR conditions: each step applies contains to the accumulated expression instead of the original column.

2
  • Are you trying to test all items? Could you please provide sample input and desired output? Commented Jan 13, 2019 at 12:12
  • Indeed, all items. But the list of items will be rather small. Commented Jan 13, 2019 at 13:00

2 Answers

3

Wouldn't rlike work in this case?

df.filter(col("foo").rlike(interestingThings.mkString("|")))
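
One caveat with joining the items by "|": any item containing regex metacharacters (".", "+", "|" and so on) would be interpreted as a pattern rather than as literal text. A minimal sketch of a safer pattern builder, assuming the items should always be matched literally; the names AnyOfPattern, anyOf and containsAny are hypothetical and only for illustration. The containsAny helper uses plain java.util.regex with find(), which is the "match anywhere in the string" behaviour rlike is expected to have.

```scala
import java.util.regex.Pattern

// Hypothetical helper object, not part of Spark.
object AnyOfPattern {
  // Quote each item so regex metacharacters are matched literally,
  // then join the quoted alternatives with '|'.
  def anyOf(items: Seq[String]): String =
    items.map(item => Pattern.quote(item)).mkString("|")

  // Plain-JVM check: true if any item occurs anywhere in s.
  // Note: an empty item list yields the empty pattern, which matches every string.
  def containsAny(s: String, items: Seq[String]): Boolean =
    Pattern.compile(anyOf(items)).matcher(s).find()
}
```

With Spark this would presumably be used as df.filter(col("foo").rlike(AnyOfPattern.anyOf(interestingThings))). An empty list should be handled separately, since the resulting empty pattern would keep every row.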

2

You have the right idea, but I think you want to use || not or. Something like:

def orFilterGeneratorMultiContains(filterPredicates:Seq[String], column:String): Column = {
  val coi = col(column)
  filterPredicates.map(coi.contains).reduce(_ || _)
}
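
For completeness, here is a small end-to-end sketch of the map/reduce approach above, assuming a local SparkSession and a column named foo. The object name MultiContains, the helper containsAny and the third sample row are made up for illustration. It uses reduceOption with a lit(false) fallback so that an empty item list filters out every row instead of throwing, which is a design choice on my part and not something the question asked for.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{col, lit}

// Hypothetical wrapper object for the example.
object MultiContains {
  // Builds: c.contains(p1) || c.contains(p2) || ...
  // Falls back to a constant false when no predicates are given.
  def containsAny(column: String, predicates: Seq[String]): Column = {
    val c = col(column)
    predicates
      .map(p => c.contains(p))
      .reduceOption(_ || _)
      .getOrElse(lit(false))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("multi-contains")
      .getOrCreate()
    import spark.implicits._

    // First two rows are from the question; the third is an invented non-matching row.
    val df = Seq("abcd_some long strings", "goo bar baz", "no match here").toDF("foo")

    // Keeps only the rows that contain "some" or "bar".
    df.filter(containsAny("foo", Seq("some", "bar"))).show(false)

    spark.stop()
  }
}
```

Calling explain(true) on the filtered DataFrame should show a flat chain of Contains predicates joined by OR, rather than the nested expression produced by the foldLeft attempt in the question.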

1 Comment

Indeed. interstingTHings.map(column.contains).reduce(_ or _).explain(true) is what I want. And, as far as I know, || and or are equivalent in the Spark Scala SQL DSL.
