I have a column of raw strings that I classify with SQL LIKE-style patterns:

val originalSqlLikePatternMap = Map("item (%) is blacklisted%" -> "BLACK_LIST",
      "%Testing%" -> "TESTING",
  "%purchase count % is too low %" -> "TOO_LOW_PURCHASE_COUNT")

val javaPatternMap = originalSqlLikePatternMap.map(v => v._1.replaceAll("%", ".*") -> v._2)

val df = Seq(
  "Testing(2,4, (4,6,7) foo, Foo purchase count 1 is too low", 
  "Foo purchase count (12, 4) is too low ", "#!@", "item (mejwnw) is blacklisted",
   "item (1) is blacklisted, #!@" 
).toDF("raw_type")

val converter = (value: String) => javaPatternMap.find(v => value.matches(v._1)).map(_._2).getOrElse("Unknown")
val converterUDF = udf(converter)

val result = df.withColumn("updatedType", converterUDF($"raw_type"))

But it gives:

+---------------------------------------------------------+----------------------+
|raw_type                                                 |updatedType           |
+---------------------------------------------------------+----------------------+
|Testing(2,4, (4,6,7) foo, Foo purchase count 1 is too low|TESTING               |
|Foo purchase count (12, 4) is too low                    |TOO_LOW_PURCHASE_COUNT|
|#!@                                                      |Unknown               |
|item (mejwnw) is blacklisted                             |BLACK_LIST            |
|item (1) is blacklisted, #!@                             |BLACK_LIST            |
+---------------------------------------------------------+----------------------+

But I want "Testing(2,4, (4,6,7) foo, Foo purchase count 1 is too low" to give two values, "TESTING, TOO_LOW_PURCHASE_COUNT", like this:

+---------------------------------------------------------+--------------------------------+
|raw_type                                                 |updatedType                     |
+---------------------------------------------------------+--------------------------------+
|Testing(2,4, (4,6,7) foo, Foo purchase count 1 is too low|TESTING, TOO_LOW_PURCHASE_COUNT |
|Foo purchase count (12, 4) is too low                    |TOO_LOW_PURCHASE_COUNT          |
|#!@                                                      |Unknown                         |
|item (mejwnw) is blacklisted                             |BLACK_LIST                      |
|item (1) is blacklisted, #!@                             |BLACK_LIST, Unknown             |
+---------------------------------------------------------+--------------------------------+

Can someone tell me what I am doing wrong here?

1 Answer

OK, so a couple of things here:

  1. Regarding find: for your desired output you need to check each row against every regex, so find is not the right choice. It stops at the first pattern that matches; its documentation describes the result as:

    the first value produced by the iterator satisfying a predicate, if any.

  2. Take care with the regex: you've left a space after "low", and that's why the first row is not matching. You may also want to reconsider simply replacing % with .* in a pattern like this one:

    %purchase count % is too low %
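Both points can be checked without Spark. The following is a minimal sketch in plain Scala; the object name and the helpers `toRegex`, `firstLabel`, and `allLabels` are made up for illustration, and the patterns and the first row are copied from the question.

```scala
object FindVsAllMatches {
  // Patterns copied from the question, including the trailing space after "low".
  val likePatterns: Seq[(String, String)] = Seq(
    "item (%) is blacklisted%" -> "BLACK_LIST",
    "%Testing%" -> "TESTING",
    "%purchase count % is too low %" -> "TOO_LOW_PURCHASE_COUNT")

  // Same patterns, with the trailing space after "low" removed.
  val fixedPatterns: Seq[(String, String)] =
    likePatterns.init :+ ("%purchase count % is too low%" -> "TOO_LOW_PURCHASE_COUNT")

  // First row of the question's DataFrame (no trailing space after "low").
  val firstRow = "Testing(2,4, (4,6,7) foo, Foo purchase count 1 is too low"

  // Same naive conversion as in the question: every % becomes .*
  def toRegex(like: String): String = like.replaceAll("%", ".*")

  // What the question does: stop at the first pattern that matches.
  def firstLabel(patterns: Seq[(String, String)], value: String): Option[String] =
    patterns.find { case (like, _) => value.matches(toRegex(like)) }.map(_._2)

  // What the desired output needs: keep the label of every matching pattern.
  def allLabels(patterns: Seq[(String, String)], value: String): Seq[String] =
    patterns.collect { case (like, label) if value.matches(toRegex(like)) => label }
}
```

With the original patterns, `allLabels(likePatterns, firstRow)` still yields only the TESTING label, because the third pattern demands a space after "low". With `fixedPatterns` it yields both labels, while `firstLabel` would still stop at TESTING.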

So, with those changes, your code will look something like this:

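    import org.apache.spark.sql.functions.udf
    import spark.implicits._ // assumes a SparkSession named spark, as in spark-shell

    val originalSqlLikePatternMap = Map(
      "item (%) is blacklisted%" -> "BLACK_LIST",
      "%Testing%" -> "TESTING",
      // No trailing space after "low", so the first row matches as well.
      "%purchase count % is too low%" -> "TOO_LOW_PURCHASE_COUNT")

    // Turn each SQL LIKE wildcard into .* and compile the result to a Regex.
    val javaPatternMap = originalSqlLikePatternMap.map(v => v._1.replaceAll("%", ".*").r -> v._2)

    val df = Seq(
      "Testing(2,4, (4,6,7) foo, Foo purchase count 1 is too low",
      "Foo purchase count (12, 4) is too low ", "#!@", "item (mejwnw) is blacklisted",
      "item (1) is blacklisted, #!@"
    ).toDF("raw_type")

    val converter = (value: String) => {
      // Test every pattern (not only the first match) and keep the labels that hit.
      val res = javaPatternMap.map(v => {
        v._1.findFirstIn(value) match {
          case Some(_) => v._2
          case None => ""
        }
      })
        .filter(_.nonEmpty).mkString(", ")

      // Fall back to "Unknown" only when no pattern matched at all.
      if (res.isEmpty) "Unknown" else res
    }

    val converterUDF = udf(converter)

    val result = df.withColumn("updatedType", converterUDF($"raw_type"))

    result.show(false)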

Output:

+---------------------------------------------------------+-------------------------------+
|raw_type                                                 |updatedType                    |
+---------------------------------------------------------+-------------------------------+
|Testing(2,4, (4,6,7) foo, Foo purchase count 1 is too low|TESTING, TOO_LOW_PURCHASE_COUNT|
|Foo purchase count (12, 4) is too low                    |TOO_LOW_PURCHASE_COUNT         |
|#!@                                                      |Unknown                        |
|item (mejwnw) is blacklisted                             |BLACK_LIST                     |
|item (1) is blacklisted, #!@                             |BLACK_LIST                     |
+---------------------------------------------------------+-------------------------------+
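On the remark about reconsidering the plain % replacement: after `replaceAll("%", ".*")`, the parentheses in "item (%) is blacklisted%" act as a regex capture group rather than literal characters, so the pattern would also accept "item mejwnw is blacklisted". One possible way around this, sketched below, is to quote the literal parts with `java.util.regex.Pattern.quote`. The object and the `likeToRegex` helper are hypothetical names, and the sketch only handles the % wildcard (not SQL's `_` or escape characters).

```scala
import java.util.regex.Pattern

object LikeToRegex {
  // Hypothetical helper: split the LIKE pattern on %, quote every literal
  // piece so that characters such as ( ) . are matched literally, and join
  // the pieces back together with the regex wildcard .*
  def likeToRegex(like: String): String =
    like
      .split("%", -1) // limit -1 keeps trailing empty pieces, i.e. a trailing %
      .map(part => if (part.isEmpty) "" else Pattern.quote(part))
      .mkString(".*")
}
```

The result is meant for full-string matching, for example `value.matches(LikeToRegex.likeToRegex(pattern))`, which is closer to how LIKE treats a pattern than an unanchored search.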

Hope this helps!

3 Comments

Currently the UDF is written so that if there are no matches, the value Unknown is returned. If there are multiple regex matches, as you asked, they are all returned, as for the top row. Unknown is thus only the default value. To change that, you would change the None case to case None => "Unknown".
So if there are multiple matches it returns all of them, but if an unknown part follows other matches, it won't be reported?
Yes, because there is no rule specific to unknown values in javaPatternMap.
