Say I have a list of keywords in Scala:

val keywords = List("pineapple", "lemon")

And a dataframe like so

+---+-------------------------------------------+
|ID |Body                                       |
+---+-------------------------------------------+
|123|I contain both keywords pineapple and lemon|
|456|I sadly don't contain anything...          |
|789|Pineapple's are delicious                  |
+---+-------------------------------------------+

How can I transform this dataframe to have a new column with the keywords that Body contains? The result I'm looking for is something like

+---+-------------------------------------------+------------------+
|ID |Body                                       |Contains_Keywords |
+---+-------------------------------------------+------------------+
|123|I contain both keywords pineapple and lemon|[pineapple, lemon]|
|456|I sadly don't contain anything...          |[]                |
|789|Pineapple's are delicious                  |[pineapple]       |
+---+-------------------------------------------+------------------+

2 Answers

First, create a DataFrame with the sample data and define the keyword list.

scala> val df = Seq(
      (123,"I contain both keywords pineapple and lemon"),
      (456,"I sadly don't contain anything"),
      (789,"Pineapple's are delicious")).toDF("id","body")

df: org.apache.spark.sql.DataFrame = [id: int, body: string]
scala> val keywords = List("pineapple", "lemon")
keywords: List[String] = List(pineapple, lemon)

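Use typedLit to add the keyword list to the DataFrame as an array column, then use the filter higher-order function (available in Spark SQL from 2.4) to keep only the keywords that occur in the lowercased body column. Note that instr does a plain substring match, so a keyword is also found when it appears inside a longer word.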

scala> df
.withColumn("keywords",typedLit(keywords))
.withColumn("Contains_Keywords",expr("filter(keywords,keyword -> instr(lower(body),keyword) > 0)"))
.show(false)

Final output

+---+-------------------------------------------+------------------+------------------+
|id |body                                       |keywords          |Contains_Keywords |
+---+-------------------------------------------+------------------+------------------+
|123|I contain both keywords pineapple and lemon|[pineapple, lemon]|[pineapple, lemon]|
|456|I sadly don't contain anything             |[pineapple, lemon]|[]                |
|789|Pineapple's are delicious                  |[pineapple, lemon]|[pineapple]       |
+---+-------------------------------------------+------------------+------------------+
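To make the per-row logic easier to follow, here is a minimal plain-Scala sketch of what the filter expression computes for a single body value. It does not use Spark at all; the object name KeywordFilterSketch and the method containedKeywords are made up for illustration and are not part of the answer's code.

```scala
// Plain-Scala model of the per-row logic; no Spark required.
// KeywordFilterSketch and containedKeywords are hypothetical names.
object KeywordFilterSketch {
  val keywords: List[String] = List("pineapple", "lemon")

  // Mirrors filter(keywords, keyword -> instr(lower(body), keyword) > 0):
  // keep each keyword that occurs as a substring of the lowercased body.
  def containedKeywords(body: String): List[String] = {
    val lowered = body.toLowerCase
    keywords.filter(keyword => lowered.contains(keyword))
  }
}
```

Calling KeywordFilterSketch.containedKeywords on each of the three sample bodies gives the same lists as the Contains_Keywords column above. If the helper keywords column is not wanted in the Spark result, it can be removed afterwards with .drop("keywords").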

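You can convert the keywords list to a DataFrame, then join on an rlike condition. It's a good idea to add the word-boundary anchor \b before and after each keyword to prevent partial matches, e.g. apple matching pineapple. In the Scala string below the anchor is written as \\\\b because the backslash is unescaped twice: once by the Scala compiler and, with the default parser settings, once more by the Spark SQL string-literal parser. The left join keeps rows that match no keyword, and collect_list skips the resulting nulls, so those rows end up with an empty list.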

val result = df.as("df")
    .join(keywords.toDF("keywords").as("keywords"), 
          expr("lower(df.body) rlike '\\\\b' || keywords.keywords || '\\\\b'"), 
          "left"
         )
    .groupBy("id", "body")
    .agg(collect_list("keywords").as("Contains_keywords"))

result.show(false)
+---+-------------------------------------------+------------------+
|id |body                                       |Contains_keywords |
+---+-------------------------------------------+------------------+
|123|I contain both keywords pineapple and lemon|[pineapple, lemon]|
|789|Pineapple's are delicious                  |[pineapple]       |
|456|I sadly don't contain anything             |[]                |
+---+-------------------------------------------+------------------+
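The difference between a substring match and a word-boundary match can be checked without Spark. The following is a small sketch using java.util.regex, the same regex flavour that rlike relies on. The object and method names are hypothetical, and the use of Pattern.quote to escape regex metacharacters in a keyword is an addition that the join above does not perform.

```scala
import java.util.regex.Pattern

// Plain-Scala comparison of the two matching strategies; names are hypothetical.
object WordBoundarySketch {
  // Substring test, comparable to instr(lower(body), keyword) > 0.
  def substringMatch(body: String, keyword: String): Boolean =
    body.toLowerCase.contains(keyword)

  // Whole-word test, comparable to lower(body) rlike '\bkeyword\b'.
  // Pattern.quote is an assumption added here so that keywords containing
  // regex metacharacters are treated literally.
  def wholeWordMatch(body: String, keyword: String): Boolean = {
    val pattern = Pattern.compile("\\b" + Pattern.quote(keyword) + "\\b")
    pattern.matcher(body.toLowerCase).find()
  }
}
```

For the body "Pineapple's are delicious", substringMatch finds apple while wholeWordMatch does not, and both find pineapple, because the apostrophe counts as a word boundary.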
