0

I have a XML file in S3 contains the Schema for my table called sample:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<DATA  CHARSET="UTF8" DELIMITER="\t">
<COLUMNS>
<COLUMN DATA_PRECISION="10" DATA_SCALE="0" DATA_TYPE="NUMBER"  ID="APPLICATION_ID" />
<COLUMN DATA_LENGTH="40" DATA_TYPE="VARCHAR2" ID="DESCRIPTIVE_FLEXFIELD_NAME"/>
<COLUMN DATA_LENGTH="30" DATA_TYPE="VARCHAR2" ID="LANGUAGE"/>
<COLUMN DATA_PRECISION="15" DATA_SCALE="0" DATA_TYPE="NUMBER" ID="CREATED_BY" />
<COLUMN DATA_PRECISION="10" DATA_SCALE="0" DATA_TYPE="NUMBER" ID="LAST_UPDATE_LOGIN" />

</COLUMNS>
</DATA>

I already write script as sample.filter( x => x.contains("DATA_TYPE") || x.contains("ID")), and I would need to get each pair of the value for (ID,DATA_TYPE) so the final output should be like

("APPLICATION_ID","NUMBER"),("DESCRIPTIVE_FLEXFIELD_NAME","VARCHAR2"),etc.

Anyone can help me on this?

Thanks!!!

2 Answers 2

1

You can use xml reader in spark as

val df = sqlContext.read
  .format("com.databricks.spark.xml")
  .option("rootTag", "DATA")
  .option("rowTag", "COLUMNS")
  .load("path to the xml file")

And apply some inbuilt functions such as explode and select as

import org.apache.spark.sql.functions._
df.select(explode(col("COLUMN")))
  .select(col("col.*"))
  .filter(col("_DATA_TYPE").isNotNull || col("_ID").isNotNull)
  .select("_DATA_TYPE", "_ID")

which should give you

+----------+--------------------------+
|_DATA_TYPE|_ID                       |
+----------+--------------------------+
|NUMBER    |APPLICATION_ID            |
|VARCHAR2  |DESCRIPTIVE_FLEXFIELD_NAME|
|VARCHAR2  |LANGUAGE                  |
|NUMBER    |CREATED_BY                |
|NUMBER    |LAST_UPDATE_LOGIN         |
+----------+--------------------------+

Now you can convert it to rdd and parse as you want or leave it as dataframe and apply other operations on it.

Sign up to request clarification or add additional context in comments.

Comments

1

While this can be done with spark-xml in an easier and more robust way. You can also use the scala.xml API to parse:

import scala.xml._

val rdd = sc.textFile("file.xml").filter( x => x.contains("DATA_TYPE") || x.contains("ID"))

rdd.map(XML.loadString(_) )
   .map(node => ( (node \\ "@ID").text , (node \\ "@DATA_TYPE").text  ) )
   .collect.foreach(println)

// (APPLICATION_ID,NUMBER)
// (DESCRIPTIVE_FLEXFIELD_NAME,VARCHAR2)
// (LANGUAGE,VARCHAR2)
// (CREATED_BY,NUMBER)
// (LAST_UPDATE_LOGIN,NUMBER)

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.