I have a problem. I need to extract some data from a file like this:
(3269,
<page>
<title>Anarchism</title>
<ns>0</ns>
<id>12</id>
<revision>...
)
(194712,
<page>
<title>AssistiveTechnology</title>
<ns>0</ns>
<id>23</id>..
) etc...
This file was generated using:
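import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val conf = new Configuration
conf.set("textinputformat.record.delimiter", "</page>")
val rdd = sc.newAPIHadoopFile("sample.bz2", classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
rdd.map { case (k, v) => (k.get(), new String(v.copyBytes())) }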
I need to obtain the content of the title element. I'm using a regex, but the output file is always empty. My code is like this:
val xx = rdd.map(x => x._2).filter(x => x.matches(".*<title>([A-Za-z]+)<\\/title>.*"))
I also tried this pattern:
".*<title>([A-Za-z]+)</title>.*"
And using this:
val reg = ".*<title>([\\w]+)</title>.*".r
val xx = rdd.map(x => x._2).filter(x => reg.pattern.matcher(x).matches)
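One possible cause, which I have not confirmed on the cluster: each record spans several lines, and in Java regexes `.` does not match line breaks unless the `(?s)` (DOTALL) flag is set, so `matches` can never cover the whole record. A minimal sketch of that behaviour without Spark, where the `DotAllDemo` object and the sample `record` string are made up for illustration:

```scala
object DotAllDemo {
  // Hypothetical record shaped like the ones above: the text spans several lines.
  val record: String = "\n<page>\n<title>Anarchism</title>\n<ns>0</ns>\n<id>12</id>\n"

  // Pattern from the question: '.' stops at line breaks, so a full match fails.
  val withoutDotAll: Boolean = record.matches(".*<title>([A-Za-z]+)</title>.*")

  // Same pattern with the (?s) flag: '.' now also matches line breaks.
  val withDotAll: Boolean = record.matches("(?s).*<title>([A-Za-z]+)</title>.*")

  def main(args: Array[String]): Unit = {
    println(s"without (?s): $withoutDotAll")
    println(s"with (?s): $withDotAll")
  }
}
```

If this is what is happening, the filter would reject every record, which would explain the empty output.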
I build the .jar with sbt and run it with spark-submit.
BTW, using spark-shell it works :S
I need your help please. Thanks.
val regexpr = """[a-zA-Z]+""".r
val separated = yy.map(line => regexpr.findAllIn(line).toList)
Error I got:
org.apache.spark.SparkException: Task not serializable
val filtrado = col3.filter(x => x.matches("[\\n\\s\\W\\w]+<title>([A-Za-z]+)</title>[\\n\\w\\s\\W]+"))
but... it returned the whole page section and not only what I need
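Since `filter` only keeps or drops whole records, getting just the title probably needs a `map`/`flatMap` that pulls out the capture group. A rough sketch of that idea on plain strings, assuming titles contain no `<` character; `TitleExtractor`, `extractTitle` and the sample records are hypothetical names and data, not part of my job:

```scala
import scala.util.matching.Regex

object TitleExtractor {
  // Unanchored search: no leading/trailing ".*" is needed, so line breaks do not matter.
  val titlePattern: Regex = "<title>([^<]+)</title>".r

  // Returns Some(title) when the record has a <title> element, None otherwise.
  def extractTitle(record: String): Option[String] =
    titlePattern.findFirstMatchIn(record).map(_.group(1))

  def main(args: Array[String]): Unit = {
    // Hypothetical records shaped like the sample at the top of the question.
    val records = Seq(
      "\n<page>\n<title>Anarchism</title>\n<ns>0</ns>\n<id>12</id>\n",
      "\n<page>\n<title>AssistiveTechnology</title>\n<ns>0</ns>\n<id>23</id>\n",
      "\n</mediawiki>\n"
    )
    // Assumed RDD equivalent for (offset, text) pairs:
    //   rdd.flatMap { case (_, text) => TitleExtractor.extractTitle(text) }
    val titles = records.flatMap(r => extractTitle(r))
    titles.foreach(println)
  }
}
```

Keeping the regex in a top-level `object` may also help with the "Task not serializable" error, on the assumption that it comes from the closure capturing an enclosing class instance that is not serializable.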