I have an XML document that I'm trying to process with the Scala XML API. I use XPath-style queries to retrieve data from the XML tags. I want to retrieve the <price> value from <market>, selecting on two attributes, _id and type. I want to combine the two conditions with && so that I get a unique value for each price tag, e.g. where market _id = 1 && type = "A".

For reference, the XML is below:

<publisher>
    <book _id = "0"> 
        <author _id="0">Dev</author>
        <publish_date>24 Feb 1995</publish_date>
        <description>Data Structure - C</description>
        <market _id="0" type="A">
            <price>45.95</price>            
        </market>
        <market _id="0" type="B">
            <price>55.95</price>
        </market>
    </book>
    <book _id="1"> 
        <author _id = "1">Ram</author>
        <publish_date>02 Jul 1999</publish_date>
        <description>Data Structure - Java</description>
        <market _id="1" type="A">
            <price>145.95</price>           
        </market>   
        <market _id="1" type="B">
            <price>155.95</price>           
        </market>
    </book>
</publisher>

The following code runs without errors:

import scala.xml._

object XMLtoCSV extends App {

  val xmlLoad = XML.loadFile("C:/Users/sharprao/Desktop/FirstTry.xml")  

  val price = (((xmlLoad \ "book" filter { _ \ "@_id" exists (_.text == "0")}) \ "market" filter { _ \ "@_id" exists (_.text == "0")}) \ "price").text  // wanted: 45.95, actual: 45.9555.95
  val price1 = (((xmlLoad \ "book" filter { _ \ "@_id" exists (_.text == "1")}) \ "market" filter { _ \ "@_id" exists (_.text == "1")}) \ "price").text  // wanted: 155.95, actual: 145.95155.95

  println("price = " + price)
  println("price1 = " + price1)
} 

The output is:

price = 45.9555.95
price1 = 145.95155.95

The code above returns both prices for each book, because I can't work out how to combine the two attribute conditions with &&.

  1. Please advise which Scala function, other than filter, I can use.
  2. Also, how can I get all the attribute names?
  3. If possible, please let me know where I can read about these APIs.

Thanks in advance.

3 Answers

2

You could write a custom predicate to check multiple attributes:

def checkMarket(marketId: String, marketType: String)(node: Node): Boolean = {
  node.attribute("_id").exists(_.text == marketId) &&
  node.attribute("type").exists(_.text == marketType)
}

Then use it as a filter:

val price1 = (((xmlLoad \ "book" filter (_ \ "@_id" exists (_.text == "0"))) \ "market" filter checkMarket("0", "A")) \ "price").text
// 45.95

val price2 = (((xmlLoad \ "book" filter (_ \ "@_id" exists (_.text == "1"))) \ "market" filter checkMarket("1", "B")) \ "price").text
// 155.95
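
If a named function per tag is not wanted, the && condition can also go inline in the filter. The sketch below additionally shows one generic predicate that takes any number of attribute name/value pairs, so a single helper covers tags with different attribute sets, and it lists a node's attribute names through `attributes.asAttrMap`. The names `AttrFilterSketch` and `hasAttrs` are mine, and the XML is a trimmed inline copy of the question's file so the example is self-contained:

```scala
import scala.xml._

object AttrFilterSketch {
  // Trimmed copy of the question's XML, inlined so the sketch runs on its own.
  val xmlLoad: Elem = XML.loadString(
    """<publisher>
      |  <book _id="0">
      |    <market _id="0" type="A"><price>45.95</price></market>
      |    <market _id="0" type="B"><price>55.95</price></market>
      |  </book>
      |  <book _id="1">
      |    <market _id="1" type="A"><price>145.95</price></market>
      |    <market _id="1" type="B"><price>155.95</price></market>
      |  </book>
      |</publisher>""".stripMargin)

  // Hypothetical helper: true when every listed attribute has the expected value.
  // Note that \@ returns "" for a missing attribute.
  def hasAttrs(expected: (String, String)*)(node: Node): Boolean =
    expected.forall { case (name, value) => (node \@ name) == value }

  // Inline && condition, no named predicate needed.
  val inlinePrice: String =
    ((xmlLoad \ "book" \ "market")
      .filter(m => (m \@ "_id") == "0" && (m \@ "type") == "A") \ "price").text

  // The same kind of lookup through the generic helper.
  val genericPrice: String =
    ((xmlLoad \ "book" \ "market")
      .filter(hasAttrs("_id" -> "1", "type" -> "B")) \ "price").text

  // Attribute names of the first <market> element.
  val marketAttrNames: Set[String] =
    (xmlLoad \ "book" \ "market").head.attributes.asAttrMap.keySet
}
```

Because `hasAttrs` accepts a variable number of pairs, a tag with six attributes needs six pairs rather than a new function.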

3 Comments

I appreciate your solution, but can we do it without writing a function? Is there a built-in Scala function that fits this scenario?
One more thing: I shared only a sample XML with you, but my real XML is very big. It has almost 200 tags, which would mean writing 200 functions, because the attributes differ from tag to tag, with one to six attributes each. I think I would have to write 6 functions and change the parameters.
@PardeepSharma Ask another question with a sample of some of the tags.
1

This would be the way to write it if you are interested in getting a CSV file of your data:

(xmlLoad \ "book").flatMap { bk =>
  (bk \ "market").flatMap { mkt =>
    (mkt \ "price").map { p =>
      Seq(
        bk \@ "_id",
        mkt \@ "_id",
        mkt \@ "type",
        p.text.toFloat
      )
    }
  }
}.map { cols =>
  cols.mkString("\t")
}.foreach { 
  println
}

It will output the following:

0       0       A       45.95
0       0       B       55.95
1       1       A       145.95
1       1       B       155.95

A common pattern to recognize when writing Scala is that most flatMap ... flatMap ... map chains can be rewritten as for-comprehensions:

for {
  book <- xmlLoad \ "book"
  market <- book \ "market"
  price <- market \ "price"
} {
  // No yield: the body runs only for its println side effect.
  val cols = Seq(
    book \@ "_id",
    market \@ "_id",
    market \@ "type",
    price.text.toFloat
  )
  println(cols.mkString("\t"))
}
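
The same for-comprehension shape also takes a guard, which is one way to express the && condition from the question. This is only a sketch: `GuardSketch` and `prices` are names I made up, and the XML is a trimmed inline copy of the question's file:

```scala
import scala.xml._

object GuardSketch {
  // Trimmed copy of the question's XML, inlined so the sketch runs on its own.
  val xmlLoad: Elem = XML.loadString(
    """<publisher>
      |  <book _id="0">
      |    <market _id="0" type="A"><price>45.95</price></market>
      |    <market _id="0" type="B"><price>55.95</price></market>
      |  </book>
      |  <book _id="1">
      |    <market _id="1" type="A"><price>145.95</price></market>
      |    <market _id="1" type="B"><price>155.95</price></market>
      |  </book>
      |</publisher>""".stripMargin)

  // Hypothetical helper: price texts of the <market> elements whose
  // _id and type attributes both match.
  def prices(marketId: String, marketType: String): Seq[String] =
    for {
      book   <- xmlLoad \ "book"
      market <- book \ "market"
      if (market \@ "_id") == marketId && (market \@ "type") == marketType
      price  <- market \ "price"
    } yield price.text
}
```

Calling `GuardSketch.prices("1", "A")` should give a one-element sequence holding "145.95"; an id/type pair that matches nothing gives an empty sequence.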

-1

I used Spark, and with HiveContext I was able to evaluate the XPath.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object xPathReader extends App{

    System.setProperty("hadoop.home.dir","D:\\IBM\\DB\\Hadoop\\winutils")   // Path for my winutils.exe

    val sparkConf = new SparkConf().setAppName("XMLParcing").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)
    val hiveContext = new HiveContext(sc)
    val myXmlPath = "D:\\IBM\\DB\\xml"
    val xmlRDDList = XmlFileUtil.withCharset(sc, myXmlPath, "UTF-8", "publisher") // XmlFileUtil is a Java wrapper I wrote, because XmlFile.withCharset is private and not callable from here.

    import hiveContext.implicits._

    val xmlDf = xmlRDDList.toDF("tempXMLTable")
    xmlDf.registerTempTable("tempTable")

    hiveContext.sql("select xpath_string(tempXMLTable,\"/book/@_id\") as BookId, xpath_float(tempXMLTable,\"/book/market[@_id='1' and @type='B']/price\") as Price from tempTable").show()      

    /*  Output
        +------+------+
        |BookId| Price|
        +------+------+
        |     0| 55.95|
        |     1|155.95|
        +------+------+
    */
}
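
For comparison, the same kind of XPath predicate (`and` inside the brackets) can be tried without Spark, using the `javax.xml.xpath` API that ships with the JDK. This is a separate sketch, not part of the Spark job above; `XPathSketch` and `evalXPath` are hypothetical names, and the XML is a trimmed inline copy of the question's file:

```scala
import java.io.StringReader
import javax.xml.xpath.XPathFactory
import org.xml.sax.InputSource

object XPathSketch {
  // Trimmed copy of the question's XML, inlined so the sketch runs on its own.
  val xml: String =
    """<publisher>
      |  <book _id="0">
      |    <market _id="0" type="A"><price>45.95</price></market>
      |    <market _id="0" type="B"><price>55.95</price></market>
      |  </book>
      |  <book _id="1">
      |    <market _id="1" type="A"><price>145.95</price></market>
      |    <market _id="1" type="B"><price>155.95</price></market>
      |  </book>
      |</publisher>""".stripMargin

  // Evaluates an XPath expression against the inline XML and returns
  // the string value of the first matching node ("" if nothing matches).
  def evalXPath(expr: String): String =
    XPathFactory.newInstance().newXPath()
      .evaluate(expr, new InputSource(new StringReader(xml)))

  // Both attribute conditions sit in one predicate, joined with `and`.
  val price: String =
    evalXPath("/publisher/book/market[@_id='1' and @type='B']/price")
}
```

Unlike the Spark version, the path here starts at `/publisher` because the whole document is parsed at once rather than split into row-tag fragments.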

5 Comments

This had nothing to do with the original question which was about parsing the XML with scala-xml, not XPath in Spark.
I have provided an alternative; I didn't say this is the answer to my question.
Because XmlFile.withCharset belongs to a private object, we were not able to call it directly, hence I implemented XmlFileUtil: public class XmlFileUtil { public static RDD<String> withCharset(SparkContext context, String location, String charset, String rowTag) { return XmlFile.withCharset(context, location, charset, rowTag); } }
Interesting, you should ask a new question about that
Thank you @ashawley, I just wanted to share another approach. Sure, I'll ask another question and put these comments there.
