I have a problem. I need to extract some data from a file like this:

(3269,
<page>
<title>Anarchism</title>
<ns>0</ns>
<id>12</id>
<revision>...
)
(194712,
<page>
<title>AssistiveTechnology</title>
<ns>0</ns>
<id>23</id>.. 
) etc...

This file was generated using:

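import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val conf = new Configuration
conf.set("textinputformat.record.delimiter", "</page>")
val rdd = sc.newAPIHadoopFile("sample.bz2", classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
rdd.map { case (k, v) => (k.get(), new String(v.copyBytes())) }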

I need to obtain the content of the title element. I'm using a regex, but the output file remains empty. My code looks like this:

val xx = rdd.map(x => x._2).filter(x => x.matches(".*<title>([A-Za-z]+)<\\/title>.*"))

I also tried this pattern:

".*<title>([A-Za-z]+)</title>.*"

And this version:

val reg = ".*<title>([\\w]+)</title>.*".r
val xx = rdd.map(x => x._2).filter(x => reg.pattern.matcher(x).matches)

I build the .jar with sbt and run it with spark-submit.

BTW, it works in spark-shell :S

I need your help please. Thanks.

  • I tested in spark-shell using a txt file containing the same data that newAPIHadoopFile returned... used this way, it does work, even outside spark-shell. Commented Mar 27, 2017 at 0:05
  • Using this form in spark-shell: val regexpr = """[a-zA-Z]+""".r; val separated = yy.map(line => regexpr.findAllIn(line).toList) I got this error: org.apache.spark.SparkException: Task not serializable Commented Mar 27, 2017 at 0:08
  • I got it to match in a certain way... val filtrado = col3.filter(x => x.matches("[\\n\\s\\W\\w]+<title>([A-Za-z]+)</title>[\\n\\w\\s\\W]+")) but it returned the whole page section and not only what I need. Commented Mar 27, 2017 at 1:34
  • OK... I got tired of that s... so finally I decided to use substring and indexOf, and it worked. No more regex for me, thanks. Commented Mar 27, 2017 at 2:44
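
For reference, the substring/indexOf approach mentioned in the last comment could look roughly like the sketch below. The names TitleSubstring and extractTitle are made up for illustration, and the sketch assumes that only the first <title> element of each record is wanted.

```scala
object TitleSubstring {
  private val Open = "<title>"
  private val Close = "</title>"

  // Hypothetical helper: returns the text between the first <title> and the
  // next </title>, or None when either tag is missing from the record.
  def extractTitle(record: String): Option[String] = {
    val start = record.indexOf(Open)
    if (start < 0) None
    else {
      val from = start + Open.length
      val end = record.indexOf(Close, from)
      if (end < 0) None else Some(record.substring(from, end))
    }
  }
}
```

On an RDD of (offset, text) pairs this would presumably be applied as rdd.map(_._2).flatMap(r => TitleSubstring.extractTitle(r)), which also drops records that have no title.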

1 Answer

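You could use Scala's built-in XML support. Since "</page>" is the record delimiter, Hadoop strips it from every record, so it has to be appended again before parsing. Records that still are not a single well-formed <page> element (for example, one carrying a dump header) may make loadString throw and would need to be filtered out first. Something like:

import scala.xml._
// Re-append the closing tag that was consumed as the record delimiter.
rdd.map(x => (XML.loadString(x._2 + "</page>") \ "title").text)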
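
If you would rather stay with a regex, the likely reason the filter returns nothing is that String.matches must match the whole record, and "." does not match line breaks by default, so ".*<title>...</title>.*" can never cover a multi-line record. Note also that filter only keeps or drops records; it never extracts the capture group. Here is a minimal sketch of an unanchored search that returns only the title. The names TitleRegex and extractTitle are hypothetical, and the character class [^<]+ is my assumption so that titles containing spaces or digits are accepted too.

```scala
import scala.util.matching.Regex

object TitleRegex {
  // Unanchored pattern: it is searched for inside the record rather than
  // matched against the whole string, so line breaks around it do not matter.
  val TitlePattern: Regex = "<title>([^<]+)</title>".r

  // Hypothetical helper: returns the first captured title, or None when the
  // record contains no <title> element.
  def extractTitle(record: String): Option[String] =
    TitlePattern.findFirstMatchIn(record).map(_.group(1))
}
```

It could then be used as rdd.map(_._2).flatMap(r => TitleRegex.extractTitle(r)). Keeping the pattern in a top-level object means the closure does not have to capture an enclosing instance, which is one common source of "Task not serializable" errors, though I have not verified that it was the cause here.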
