
I am trying to process an XML file using Scala and Spark.

I have this schema:

root
 |-- IdKey: long (nullable = true)
 |-- Value: string (nullable = true)
 |-- CDate: date (nullable = true)

And I want to process this XML file:

<Item>
    <CDate>2018-05-08T00:00::00</CDate>
    <ListItemData>
        <ItemData>
            <IdKey>2</IdKeyData>
            <Value>1</Value>
        </ItemData>
        <ItemData>
            <IdKey>61</IdKeyData>
            <Value>2</Value>
        </ItemData>
    <ListItemData>
</Item>

I am using this code:

sqlContext.read.format("com.databricks.spark.xml")
  .option("rowTag", "Item")
  .schema(schema)
  .load(xmlFile)

But in my result the CDate column is always null:

+-----+-----+-----+
|IdKey|Value|CDate|
+-----+-----+-----+
|61   |1    |null |
|2    |2    |null |
+-----+-----+-----+

Is it possible to parse the XML file with this schema? I want to obtain these values:

+-----+-----+--------------------+
|IdKey|Value|CDate               |
+-----+-----+--------------------+
|61   |1    |2018-05-08T00:00::00|
|2    |2    |2018-05-08T00:00::00|
+-----+-----+--------------------+

Thanks

  • Is your XML data valid XML? I don't think it is valid XML data. Commented May 8, 2018 at 16:55
  • I forgot to close a tag. But the original XML is correct. Thanks! Commented May 9, 2018 at 7:23

2 Answers


Your XML is invalid as posted. A valid XML file should look like this in your case:

<Item>
    <CDate>2018-05-08T00:00::00</CDate>
    <ListItemData>
    <ItemData>
        <IdKey>2</IdKey>
        <Value>1</Value>
    </ItemData>
    <ItemData>
        <IdKey>61</IdKey>
        <Value>2</Value>
    </ItemData>
    </ListItemData>
</Item>

If you have this corrected XML data, then you can create a schema as:

import org.apache.spark.sql.types._

val innerSchema = StructType(
  StructField("ItemData",
    ArrayType(
      StructType(
        StructField("IdKey", LongType, true) ::
          StructField("Value", LongType, true) :: Nil
      )
    ), true) :: Nil
)
val schema = StructType(
  StructField("CDate", StringType, true) ::
  StructField("ListItemData", innerSchema, true) :: Nil
)

Apply this schema to read the XML file:

import org.apache.spark.sql.functions.explode
import spark.implicits._

val df = spark.sqlContext.read.format("com.databricks.spark.xml")
  .option("rowTag", "Item")
  .schema(schema)
  .load(xmlFile)
  // Select the nested field and explode it to flatten the result
  .withColumn("ItemData", explode($"ListItemData.ItemData"))
  .select("CDate", "ItemData.*") // select the required columns

Now you can get the required output

+--------------------+-----+-----+
|CDate               |IdKey|Value|
+--------------------+-----+-----+
|2018-05-08T00:00::00|2    |1    |
|2018-05-08T00:00::00|61   |2    |
+--------------------+-----+-----+

You can also let Spark infer the schema itself and get the same result:

val df = spark.sqlContext.read.format("com.databricks.spark.xml")
  .option("rowTag", "Item")
  //.schema(schema)
  .load(xmlFile)
  .withColumn("ItemData", explode($"ListItemData.ItemData"))
  .select("CDate", "ItemData.*")
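
Since the schema in the question declares CDate as a date, a minimal follow-up sketch (assuming Spark 2.2+ for the two-argument to_date, and that the first ten characters of CDate are always the yyyy-MM-dd part) could cast the string column afterwards:

import org.apache.spark.sql.functions.{substring, to_date}

// Take the yyyy-MM-dd prefix of the string and cast it to a proper date,
// which also sidesteps the stray "::" in the sample timestamp.
val withDate = df.withColumn("CDate", to_date(substring($"CDate", 1, 10), "yyyy-MM-dd"))
withDate.printSchema() // CDate should now show up as date (nullable = true)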

Hope this helps!


3 Comments

Thanks @Shankar, I am testing this solution. It seems correct.
Do you know if it is possible to configure more than one rowTag? For cases with different XML tags.
I am not sure if this can be done; you can see the options here: github.com/databricks/spark-xml#features
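
One rough workaround, if spark-xml only accepts a single rowTag per read (a sketch using a hypothetical second tag named OtherItem), is to read the file once per tag and combine the results:

val items = spark.sqlContext.read.format("com.databricks.spark.xml")
  .option("rowTag", "Item")
  .load(xmlFile)

val others = spark.sqlContext.read.format("com.databricks.spark.xml")
  .option("rowTag", "OtherItem") // hypothetical second row tag
  .load(xmlFile)

// If both reads end up with the same columns, the frames can be combined
val combined = items.union(others)
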
You can do something like this. It will output ( 2018-05-08T00:00::00   2 1   61 2   ,2018-05-08T00:00::00), and then you can format it as you want; I think it will help.

import scala.collection.immutable
import scala.collection.mutable.ListBuffer
import scala.xml.{Elem, NodeSeq}

object XMLDemo extends App {
  val xmlElem: Elem =
    <Item>
      <CDate>2018-05-08T00:00::00</CDate>
      <ListItemData>
        <ItemData>
          <IdKeyData>2</IdKeyData>
          <Value>1</Value>
        </ItemData>
        <ItemData>
          <IdKeyData>61</IdKeyData>
          <Value>2</Value>
        </ItemData>
      </ListItemData>
    </Item>

  val lb: ListBuffer[String] = ListBuffer()

  // Pull out the CDate node from anywhere in the document
  val date: NodeSeq = xmlElem \\ "CDate"

  // Collect the text content of the whole element
  val r: immutable.Seq[String] = xmlElem.map(_.text)

  println(r.mkString(" ").replaceAll(" ", "").replaceAll("\n", " "), date.text)
}
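
A slightly more structured variant of the same idea (a sketch, assuming it is placed inside the same XMLDemo object so it can reuse xmlElem and date) pairs each ItemData with the shared date instead of flattening everything into one string:

  // Pair every (IdKeyData, Value) with the shared CDate text
  val rows: Seq[(String, String, String)] =
    (xmlElem \ "ListItemData" \ "ItemData").map { item =>
      ((item \ "IdKeyData").text, (item \ "Value").text, date.text)
    }
  println(rows) // one (IdKeyData, Value, CDate) tuple per ItemData element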

