Regex on io.Text RDD using Scala

Question

I have a problem. I need to extract some data from a file like this:

(3269,
<page>
<title>Anarchism</title>
<ns>0</ns>
<id>12</id>
<revision>...
)
(194712,
<page>
<title>AssistiveTechnology</title>
<ns>0</ns>
<id>23</id>.. 
) etc...

This file was generated using:

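import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val conf = new Configuration
conf.set("textinputformat.record.delimiter", "</page>")
val raw = sc.newAPIHadoopFile("sample.bz2", classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
val rdd = raw.map { case (k, v) => (k.get(), new String(v.copyBytes())) }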

I need to obtain the title content. I'm using a regex, but the output file is always empty. My code looks like this:

val xx = rdd.map(x => x._2).filter(x => x.matches(".*<title>([A-Za-z]+)<\\/title>.*"))

I also tried this pattern:

".*<title>([A-Za-z]+)</title>.*"

And this:

val reg = ".*<title>([\\w]+)</title>.*".r
val xx = rdd.map(x => x._2).filter(x => reg.pattern.matcher(x).matches)

I build the .jar with sbt and run it with spark-submit.

BTW, using spark-shell it works :S

I need your help please. Thanks.

I tested in spark-shell using a text file containing the same data that newAPIHadoopFile returned. But used this way, it doesn't work, not even in spark-shell.
– Boris Perez, Commented Mar 27, 2017 at 0:05
Using this form in spark-shell: val regexpr = """[a-zA-Z]+""".r val separated = yy.map(line => regexpr.findAllIn(line).toList) I got this error: org.apache.spark.SparkException: Task not serializable
– Boris Perez, Commented Mar 27, 2017 at 0:08
I got it partly working with val filtrado = col3.filter(x => x.matches("[\\n\\s\\W\\w]+<title>([A-Za-z]+)</title>[\\n\\w\\s\\W]+")) but it returned the whole page section, not only the part I need.
– Boris Perez, Commented Mar 27, 2017 at 1:34
OK, I got tired of that s..., so I finally decided to use substring and indexOf, and it worked. No more regex for me, thanks.
– Boris Perez, Commented Mar 27, 2017 at 2:44
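
The substring/indexOf code mentioned in the last comment was never posted. The following is only a minimal sketch of what such an approach might look like, assuming each record is a String containing at most one `<title>...</title>` pair. The object name `TitleExtract` and the function `titleOf` are hypothetical, introduced here for illustration.

```scala
// Hypothetical helper, not the commenter's actual code.
object TitleExtract {
  private val Open = "<title>"
  private val Close = "</title>"

  // Returns the text between <title> and </title>, or None if either tag is missing.
  def titleOf(record: String): Option[String] = {
    val start = record.indexOf(Open)
    if (start < 0) None
    else {
      val from = start + Open.length
      val end = record.indexOf(Close, from)
      if (end < 0) None else Some(record.substring(from, end))
    }
  }
}

// Possible use with the (Long, String) RDD from the question:
//   val titles = rdd.map(_._2).flatMap(r => TitleExtract.titleOf(r).toList)
```

Returning an Option lets records without a title (for example a file header or trailer) be dropped by flatMap instead of failing the job.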

Nikolay Smirnov · Accepted Answer · 2017-04-01 07:44:15Z

You could use Scala's built-in XML support. Something like:

import scala.xml._
rdd.map(x => (XML.loadString(x._2) \ "title").text)
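
If XML parsing is not an option, a regex can still work. Because the records were split on `</page>`, each one probably lacks its closing tag, so a strict XML parser may reject them. The patterns in the question most likely fail for two reasons: `String.matches` must match the entire string, and `.` does not match line breaks by default, so `.*` cannot span a multi-line record. Also, `filter` keeps the whole record rather than the captured group. The sketch below uses an unanchored search instead; the names `TitleRegex` and `extractTitle` are hypothetical.

```scala
import scala.util.matching.Regex

// Hypothetical helper for illustration.
object TitleRegex {
  // Unanchored pattern: no leading/trailing ".*" is needed, so line breaks
  // elsewhere in the record do not matter.
  val TitlePattern: Regex = "<title>([^<]+)</title>".r

  // Returns only the captured title text, or None when no title is present.
  def extractTitle(record: String): Option[String] =
    TitlePattern.findFirstMatchIn(record).map(_.group(1))
}

// Possible use with the (Long, String) RDD from the question:
//   val titles = rdd.map(_._2).flatMap(r => TitleRegex.extractTitle(r).toList)
```

If a full-string match is preferred, prefixing the original pattern with the `(?s)` flag, as in `"(?s).*<title>([A-Za-z]+)</title>.*"`, lets `.` match line breaks as well.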
