How to transform DF to add column with list of string contained within another column

Question

Say I have a list of keywords in Scala:

val keywords = List("pineapple", "lemon")

And a dataframe like so

+---+-------------------------------------------+
|ID |Body                                       |
+---+-------------------------------------------+
|123|I contain both keywords pineapple and lemon|
|456|I sadly don't contain anything...          |
|789|Pineapple's are delicious                  |
+---+-------------------------------------------+

How can I transform this dataframe to have a new column with the keywords that Body contains? The result I'm looking for is something like

+---+-------------------------------------------+------------------+
|ID |Body                                       |Contains_Keywords |
+---+-------------------------------------------+------------------+
|123|I contain both keywords pineapple and lemon|[pineapple, lemon]|
|456|I sadly don't contain anything...          |[]                |
|789|Pineapple's are delicious                  |[pineapple]       |
+---+-------------------------------------------+------------------+

s.polam · Accepted Answer · 2021-03-24 04:23:54Z

See the code below.

First, create a dataframe with the required sample data.

scala> val df = Seq(
      (123,"I contain both keywords pineapple and lemon"),
      (456,"I sadly don't contain anything"),
      (789,"Pineapple's are delicious")).toDF("id","body")

df: org.apache.spark.sql.DataFrame = [id: int, body: string]

scala> val keywords = List("pineapple", "lemon")
keywords: List[String] = List(pineapple, lemon)

Use typedLit to add the keywords to the dataframe as an array column, then use the filter higher-order function to keep only the keywords that appear in the (lowercased) body column.

scala> df
.withColumn("keywords",typedLit(keywords))
.withColumn("Contains_Keywords",expr("filter(keywords,keyword -> instr(lower(body),keyword) > 0)"))
.show(false)

Final output

+---+-------------------------------------------+------------------+------------------+
|id |body                                       |keywords          |Contains_Keywords |
+---+-------------------------------------------+------------------+------------------+
|123|I contain both keywords pineapple and lemon|[pineapple, lemon]|[pineapple, lemon]|
|456|I sadly don't contain anything             |[pineapple, lemon]|[]                |
|789|Pineapple's are delicious                  |[pineapple, lemon]|[pineapple]       |
+---+-------------------------------------------+------------------+------------------+
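
The same filter can also be written with the Scala DataFrame API instead of a SQL string, which avoids the helper keywords column. This is a minimal sketch, assuming Spark 3.0 or later (where functions.filter accepts a Scala lambda). The local SparkSession setup is only there to make the snippet self-contained and is not part of the original answer.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, filter, lower, typedLit}

// Local session for this sketch only; in spark-shell, `spark` already exists
// and these two lines can be skipped.
val spark = SparkSession.builder().master("local[*]").appName("contains-keywords").getOrCreate()
import spark.implicits._

val df = Seq(
  (123, "I contain both keywords pineapple and lemon"),
  (456, "I sadly don't contain anything"),
  (789, "Pineapple's are delicious")
).toDF("id", "body")

val keywords = List("pineapple", "lemon")

// Keep each keyword that occurs as a substring of the lowercased body.
// typedLit turns the Scala list into an array literal, so no extra column is needed.
val result = df.withColumn(
  "Contains_Keywords",
  filter(typedLit(keywords), keyword => lower(col("body")).contains(keyword))
)

result.show(false)
```

As in the answer above, this is a substring match against the lowercased body, so the keywords themselves are assumed to be lowercase.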

mck · Accepted Answer · 2021-03-24 06:54:29Z

You can convert the keywords list to a dataframe, then join on an rlike condition. It's a good idea to add \\\\b before and after each keyword to mark word boundaries (this is the regex \b, escaped once for the Scala string and once more for the Spark SQL string literal), so that you prevent partial matches, e.g. apple matching pineapple.

val result = df.as("df")
    .join(keywords.toDF("keywords").as("keywords"), 
          expr("lower(df.body) rlike '\\\\b' || keywords.keywords || '\\\\b'"), 
          "left"
         )
    .groupBy("id", "body")
    .agg(collect_list("keywords").as("Contains_keywords"))

result.show(false)
+---+-------------------------------------------+------------------+
|id |body                                       |Contains_keywords |
+---+-------------------------------------------+------------------+
|123|I contain both keywords pineapple and lemon|[pineapple, lemon]|
|789|Pineapple's are delicious                  |[pineapple]       |
|456|I sadly don't contain anything             |[]                |
+---+-------------------------------------------+------------------+
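
To see why the word boundaries matter, here is a small plain-Scala sketch (no Spark needed) that compares a substring check with a \b-delimited regex. Spark's rlike uses Java regular expressions, so the matching should behave the same way. The helper names substringMatches and wholeWordMatches are made up for illustration, and Pattern.quote is an extra precaution that the join above does not apply.

```scala
import java.util.regex.Pattern

// Hypothetical helper: substring semantics, like instr(lower(body), keyword) > 0.
def substringMatches(body: String, keywords: List[String]): List[String] =
  keywords.filter(k => body.toLowerCase.contains(k))

// Hypothetical helper: whole-word semantics, like rlike with \b on both sides.
def wholeWordMatches(body: String, keywords: List[String]): List[String] =
  keywords.filter { k =>
    // Pattern.quote treats any regex metacharacters in the keyword literally.
    Pattern.compile("\\b" + Pattern.quote(k) + "\\b").matcher(body.toLowerCase).find()
  }

val keywords = List("apple", "lemon")
val body = "Pineapple's are delicious"

println(substringMatches(body, keywords)) // List(apple)
println(wholeWordMatches(body, keywords)) // List()
```

With the keyword pineapple instead of apple, both helpers match this body, because the apostrophe after "Pineapple" counts as a word boundary.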
