
I have a column in a Spark DataFrame which contains text.

I want to extract all the words that start with the special character '@', and I am using regexp_extract on each row of that text column. If the text contains multiple words starting with '@', it returns only the first one.

I am looking to extract multiple words that match my pattern in Spark.

data_frame.withColumn("Names", regexp_extract($"text", "(?<=^|(?<=[^a-zA-Z0-9-_\\.]))@([A-Za-z]+[A-Za-z0-9_]+)", 1)).show()

Sample input: @always_nidhi @YouTube no i dnt understand bt i loved the music nd their dance awesome all the song of this mve is rocking

Sample output: @always_nidhi,@YouTube
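For reference, regexp_extract is a thin wrapper over java.util.regex, and a Matcher surfaces only one match per find() call; collecting every match requires looping, which is what the answers below do. A minimal sketch outside Spark (using the simplified `@\w+` pattern from the accepted answer):

```scala
import java.util.regex.Pattern

val text = "@always_nidhi @YouTube no i dnt understand bt i loved the music"
val m = Pattern.compile("@\\w+").matcher(text)

// A single find() gives only the first match -- this is all regexp_extract returns:
val first = if (m.find()) m.group() else ""

// Looping over find() collects the rest:
m.reset()
var all = Seq.empty[String]
while (m.find()) all = all :+ m.group()

println(first)              // @always_nidhi
println(all.mkString(","))  // @always_nidhi,@YouTube
```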

Comments:
  • As per the function definition, regexp_extract returns only the first match; it does not iterate over the whole text to find all possible matches. You need to write your own UDF to iterate over all matches and return the result as a list. Commented Dec 26, 2017 at 23:55
  • Hi @AmitKumar, can you please help me with it? I am new to Scala and Spark and looking to learn. Commented Dec 27, 2017 at 6:43

4 Answers


You can create a UDF function in Spark as below:

import java.util.regex.Pattern
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.functions.lit

// UDF that collects every match of the pattern (not just the first)
// and joins them with commas
def regexp_extractAll = udf((job: String, exp: String, groupIdx: Int) => {
  val pattern = Pattern.compile(exp)
  val m = pattern.matcher(job)
  var result = Seq[String]()
  while (m.find) {
    result = result :+ m.group(groupIdx)
  }
  result.mkString(",")
})

And then call the udf as below:

data_frame.withColumn("Names", regexp_extractAll(new Column("text"), lit("@\\w+"), lit(0))).show()

The above gives you output as below:

+--------------------+
|               Names|
+--------------------+
|@always_nidhi,@Yo...|
+--------------------+

I have used the regex as per the output you posted in the question. You can modify it to suit your needs.


1 Comment

I used your suggestion and got the answer using Spark SQL in Scala. First I registered my DataFrame as a temp table, then saved my UDF in Spark SQL, ran the query, and voilà, I got my answer. Thanks for your help. You are a champ :)

You can use Java regex to extract those words. Below is the working code.

val sparkConf = new SparkConf().setAppName("myapp").setMaster("local[*]")
val sc = new SparkContext(sparkConf)

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
import org.apache.spark.sql.functions.{col, udf}
import java.util.regex.Pattern

// User-defined function to extract all matches
def toExtract(str: String) = {
  val pattern = Pattern.compile("@\\w+")
  val tmplst = scala.collection.mutable.ListBuffer.empty[String]
  val matcher = pattern.matcher(str)
  while (matcher.find()) {
    tmplst += matcher.group()
  }
  tmplst.mkString(",")
}

val Extract = udf(toExtract _)
val values = List("@always_nidhi @YouTube no i dnt understand bt i loved the music nd their dance awesome all the song of this mve is rocking")
val df = sc.parallelize(values).toDF("words")
df.select(Extract(col("words"))).show()

Output

+--------------------+
|          UDF(words)|
+--------------------+
|@always_nidhi,@Yo...|
+--------------------+
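If an array column is preferable to a comma-joined string, the same loop can return a Seq[String]; Spark maps a UDF returning Seq[String] to an ArrayType(StringType) column. A sketch of the core function, kept Spark-free so it is easy to test in isolation (wrap it with udf, exactly as toExtract is wrapped above — extractMentions is a hypothetical name for this variant):

```scala
import java.util.regex.Pattern
import scala.collection.mutable.ListBuffer

// Same matching loop as toExtract above, but returning the matches as a
// sequence instead of a joined string. To use it on a DataFrame column:
//   val extractAll = udf(extractMentions _)
//   df.select(extractAll(col("words"))).show(false)
def extractMentions(str: String): Seq[String] = {
  val buf = ListBuffer.empty[String]
  val m = Pattern.compile("@\\w+").matcher(str)
  while (m.find()) buf += m.group()
  buf.toList
}
```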



In Spark 3.1+ it's possible using regexp_extract_all

Test with your input:

import spark.implicits._
val df = Seq(
    ("@always_nidhi @YouTube no"),
    ("@always_nidhi"),
    ("no")
).toDF("text")

val col_re_list = expr("regexp_extract_all(text, '(?<=^|(?<=[^a-zA-Z0-9-_\\\\.]))@([A-Za-z]+[A-Za-z0-9_]+)', 0)")
df.withColumn("Names", array_join(col_re_list, ", ")).show(false)

// +-------------------------+-----------------------+
// |text                     |Names                  |
// +-------------------------+-----------------------+
// |@always_nidhi @YouTube no|@always_nidhi, @YouTube|
// |@always_nidhi            |@always_nidhi          |
// |no                       |                       |
// +-------------------------+-----------------------+
  • array_join is used because you wanted the results in string format, while regexp_extract_all returns an array.
  • if you use \ for escaping in your pattern, you will need to use \\\\ instead of \, until regexp_extract_all is available directly without expr.
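The quadruple-backslash rule follows from two layers of unescaping: the Scala compiler halves backslashes in the string literal once, and the SQL string-literal parser inside expr halves them again before the regex engine sees the pattern. A plain-string sketch of the two layers (no Spark involved; the SQL unescape is simulated here with a simple replace):

```scala
// Layer 1: the Scala compiler turns the 4 backslashes written in source
// into 2 literal backslash characters.
val inExprString = "\\\\."  // what expr(...) receives: the 3 chars \ \ .
println(inExprString.length)  // 3

// Layer 2: the SQL string-literal parser halves them again, leaving \.
// which the regex engine reads as an escaped literal dot.
val afterSqlParse = inExprString.replace("\\\\", "\\")  // simulated unescape
println(afterSqlParse)  // \.
```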



I took the suggestion of Amit Kumar, created a UDF, and then ran it in Spark SQL:

select Words(status) as people from dataframe

Words is my UDF (registered with spark.udf.register) and status is my DataFrame column.

