
I have a column in a Spark DataFrame which contains text.

I want to extract all the words that start with the special character '@', and I am using regexp_extract on each row of that text column. If the text contains multiple words starting with '@', it returns only the first one.

I am looking to extract multiple words that match my pattern in Spark.

data_frame.withColumn("Names", regexp_extract($"text", "(?<=^|(?<=[^a-zA-Z0-9-_\\.]))@([A-Za-z]+[A-Za-z0-9_]+)", 1)).show()

Sample input: @always_nidhi @YouTube no i dnt understand bt i loved the music nd their dance awesome all the song of this mve is rocking

Sample output: @always_nidhi,@YouTube
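For reference, regexp_extract is a thin wrapper over java.util.regex, and a Matcher surfaces only one match per find() call; collecting every match requires looping, which is what the answers below do. A minimal sketch outside Spark (using the simplified `@\w+` pattern from the accepted answer):

```scala
import java.util.regex.Pattern

val text = "@always_nidhi @YouTube no i dnt understand bt i loved the music"
val m = Pattern.compile("@\\w+").matcher(text)

// A single find() gives only the first match -- this is all regexp_extract returns:
val first = if (m.find()) m.group() else ""

// Looping over find() collects the rest:
m.reset()
var all = Seq.empty[String]
while (m.find()) all = all :+ m.group()

println(first)              // @always_nidhi
println(all.mkString(","))  // @always_nidhi,@YouTube
```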

Comments:
  • As per the function definition, regexp_extract returns only the first match; it does not iterate over the whole text to find all possible matches. You need to write your own UDF to iterate over all matches and return the result as a list. Commented Dec 26, 2017 at 23:55
  • Hi @AmitKumar, can you please help me with it? I am new to Scala and Spark and looking to learn. Commented Dec 27, 2017 at 6:43

4 Answers


You can create a UDF function in Spark as below:

import java.util.regex.Pattern
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.functions.lit

// UDF that collects every match of the pattern (not just the first)
// and joins them with commas
def regexp_extractAll = udf((job: String, exp: String, groupIdx: Int) => {
  val pattern = Pattern.compile(exp)
  val m = pattern.matcher(job)
  var result = Seq[String]()
  while (m.find) {
    result = result :+ m.group(groupIdx)
  }
  result.mkString(",")
})

And then call the udf as below:

data_frame.withColumn("Names", regexp_extractAll(new Column("text"), lit("@\\w+"), lit(0))).show()

The above gives you output as below:

+--------------------+
|               Names|
+--------------------+
|@always_nidhi,@Yo...|
+--------------------+

I have used the regex as per the output you posted in the question. You can modify it to suit your needs.


1 Comment

I used your suggestion and got the answer using Spark SQL in Scala. First I registered my DataFrame as a temp table, then saved my UDF in Spark SQL, ran the query, and voilà, I got my answer. Thanks for your help. You are a champ :)

You can use Java regex to extract those words. Below is the working code.

val sparkConf = new SparkConf().setAppName("myapp").setMaster("local[*]")
val sc = new SparkContext(sparkConf)

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
import org.apache.spark.sql.functions.{col, udf}
import java.util.regex.Pattern

// User-defined function to extract all matches
def toExtract(str: String) = {
  val pattern = Pattern.compile("@\\w+")
  val tmplst = scala.collection.mutable.ListBuffer.empty[String]
  val matcher = pattern.matcher(str)
  while (matcher.find()) {
    tmplst += matcher.group()
  }
  tmplst.mkString(",")
}

val Extract = udf(toExtract _)
val values = List("@always_nidhi @YouTube no i dnt understand bt i loved the music nd their dance awesome all the song of this mve is rocking")
val df = sc.parallelize(values).toDF("words")
df.select(Extract(col("words"))).show()

Output

+--------------------+
|          UDF(words)|
+--------------------+
|@always_nidhi,@Yo...|
+--------------------+
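If an array column is preferable to a comma-joined string, the same loop can return a Seq[String]; Spark maps a UDF returning Seq[String] to an ArrayType(StringType) column. A sketch of the core function, kept Spark-free so it is easy to test in isolation (wrap it with udf, exactly as toExtract is wrapped above — extractMentions is a hypothetical name for this variant):

```scala
import java.util.regex.Pattern
import scala.collection.mutable.ListBuffer

// Same matching loop as toExtract above, but returning the matches as a
// sequence instead of a joined string. To use it on a DataFrame column:
//   val extractAll = udf(extractMentions _)
//   df.select(extractAll(col("words"))).show(false)
def extractMentions(str: String): Seq[String] = {
  val buf = ListBuffer.empty[String]
  val m = Pattern.compile("@\\w+").matcher(str)
  while (m.find()) buf += m.group()
  buf.toList
}
```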



In Spark 3.1+ it's possible using regexp_extract_all

Test with your input:

import spark.implicits._
val df = Seq(
    ("@always_nidhi @YouTube no"),
    ("@always_nidhi"),
    ("no")
).toDF("text")

val col_re_list = expr("regexp_extract_all(text, '(?<=^|(?<=[^a-zA-Z0-9-_\\\\.]))@([A-Za-z]+[A-Za-z0-9_]+)', 0)")
df.withColumn("Names", array_join(col_re_list, ", ")).show(false)

// +-------------------------+-----------------------+
// |text                     |Names                  |
// +-------------------------+-----------------------+
// |@always_nidhi @YouTube no|@always_nidhi, @YouTube|
// |@always_nidhi            |@always_nidhi          |
// |no                       |                       |
// +-------------------------+-----------------------+
  • array_join is used because you wanted the results in string format, while regexp_extract_all returns an array.
  • if you use \ for escaping in your pattern, you will need to use \\\\ instead of \, until regexp_extract_all is available directly without expr.
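The quadruple-backslash rule follows from two layers of unescaping: the Scala compiler halves backslashes in the string literal once, and the SQL string-literal parser inside expr halves them again before the regex engine sees the pattern. A plain-string sketch of the two layers (no Spark involved; the SQL unescape is simulated here with a simple replace):

```scala
// Layer 1: the Scala compiler turns the 4 backslashes written in source
// into 2 literal backslash characters.
val inExprString = "\\\\."  // what expr(...) receives: the 3 chars \ \ .
println(inExprString.length)  // 3

// Layer 2: the SQL string-literal parser halves them again, leaving \.
// which the regex engine reads as an escaped literal dot.
val afterSqlParse = inExprString.replace("\\\\", "\\")  // simulated unescape
println(afterSqlParse)  // \.
```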



I took the suggestion of Amit Kumar, created a UDF, and then ran it in Spark SQL:

select Words(status) as people from dataframe

Words is my UDF (registered with spark.udf.register) and status is my DataFrame column.

