Why is my data frame loaded with blank values when reading XML using spark.xml? (Python/Databricks)

Question

I am reading an XML file using spark.xml in Python and ran into a seemingly very specific problem.
I was able to narrow to down the part of the XML that is producing the problem, but not why it is happening.

Here is the code to read in the XML file:

src = sqlContext.read.format("com.databricks.spark.xml").options(rowTag="root").load("file.xml")

Here is the schema read in from the XML using the spark.xml library, with rowTag = "root":

    root
 |-- AID: long (nullable = true)
 |-- RID: long (nullable = true)
 |-- XmlData: struct (nullable = true)
 |    |-- NC: struct (nullable = true)
 |    |    |-- RS: struct (nullable = true)
 |    |    |    |-- RD: struct (nullable = true)
 |    |    |    |    |-- CR: struct (nullable = true)
 |    |    |    |    |    |-- Addr1: string (nullable = true)
 |    |    |    |    |    |-- Addr2: string (nullable = true)
 |    |    |    |    |    |-- City: string (nullable = true)
 |    |    |    |    |    |-- InFile: string (nullable = true)
 |    |    |    |    |    |-- Name: string (nullable = true)
 |    |    |    |    |    |-- Phone: long (nullable = true)
 |    |    |    |    |    |-- State: string (nullable = true)
 |    |    |    |    |    |-- Zip: long (nullable = true)
 |    |    |    |    |-- SC: struct (nullable = true)
 |    |    |    |    |    |-- Class: string (nullable = true)
 |    |    |    |    |    |-- ClassType: string (nullable = true)
 |    |    |    |    |    |-- Message: string (nullable = true)
 |    |    |    |    |    |-- SC: long (nullable = true)
 |    |    |    |    |-- NC: long (nullable = true)
 |    |    |    |    |-- CRR: string (nullable = true)
 |    |    |    |    |-- RM: struct (nullable = true)
 |    |    |    |    |    |-- Addr1: string (nullable = true)
 |    |    |    |    |    |-- City: string (nullable = true)
 |    |    |    |    |    |-- MemberNo: string (nullable = true)
 |    |    |    |    |    |-- Name: string (nullable = true)
 |    |    |    |    |    |-- State: string (nullable = true)
 |    |    |    |    |    |-- Zip: long (nullable = true)
 |    |    |    |    |-- TL: array (nullable = true)
 |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |-- _AvgDays: long (nullable = true)
 |    |    |    |    |    |    |-- _Comment: string (nullable = true)
 |    |    |    |    |    |    |-- _Current: long (nullable = true)
 |    |    |    |    |    |    |-- _HC: long (nullable = true)
 |    |    |    |    |    |    |-- _IndType: string (nullable = true)
 |    |    |    |    |    |    |-- _LS: long (nullable = true)
 |    |    |    |    |    |    |-- _OpenDate: long (nullable = true)
 |    |    |    |    |    |    |-- _PD120Day: long (nullable = true)
 |    |    |    |    |    |    |-- _PD30Day: long (nullable = true)
 |    |    |    |    |    |    |-- _PD60Day: long (nullable = true)
 |    |    |    |    |    |    |-- _PD90Day: long (nullable = true)
 |    |    |    |    |    |    |-- _Region: string (nullable = true)
 |    |    |    |    |    |    |-- _ReportDate: string (nullable = true)
 |    |    |    |    |    |    |-- _SourceID: long (nullable = true)
 |    |    |    |    |    |    |-- _TD: long (nullable = true)
 |    |    |    |    |    |    |-- _VALUE: string (nullable = true)
 |    |    |    |    |-- Trends: array (nullable = true)
 |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |-- _CurrentPct: double (nullable = true)
 |    |    |    |    |    |    |-- _LineCnt: long (nullable = true)
 |    |    |    |    |    |    |-- _PD120DayPct: double (nullable = true)
 |    |    |    |    |    |    |-- _PD30DayPct: double (nullable = true)
 |    |    |    |    |    |    |-- _PD60DayPct: double (nullable = true)
 |    |    |    |    |    |    |-- _PD90DayPct: double (nullable = true)
 |    |    |    |    |    |    |-- _PD: string (nullable = true)
 |    |    |    |    |    |    |-- _TD: long (nullable = true)
 |    |    |    |    |    |    |-- _VALUE: string (nullable = true)

There is more to this schema, but for some reason the XML parser does not get past this point. I am thinking there is an issue with the 'Trends' tag in the XML, but I cannot find it.

Here is a sample of a 'Trends' tag entry:

<Trends PD="4205" LineCnt="0" TD="0" CurrentPct="0" PD30DayPct="0" PD60DayPct="0" PD90DayPct="0" PD120DayPct="0" />

I understand that a closed tag such as this one could not be read in previous versions of the XML parser, but it works fine while using it in Databricks, and my other files with closed tags are read properly.

Here is the final result to further exemplify what I am explaining (I have 13 records in my XML):

+---------+--------+-------+
|AID      |RID     |XmlData|
+---------+--------+-------+
|     null|    null|   null|
|     null|    null|   null|
|     null|    null|   null|
|     null|    null|   null|
|     null|    null|   null|
|     null|    null|   null|
|     null|    null|   null|
|     null|    null|   null|
|     null|    null|   null|
|     null|    null|   null|
|     null|    null|   null|
|     null|    null|   null|
|     null|    null|   null|
|     null|    null|   null|
+---------+--------+-------+

(This top level view is fine - my code later iterates through the structs/arrays in 'XMLData' - but this needs to be populated first of course)

Does anyone know why this would cause the schema to stop being loaded to completion when parsing?

Please note that I cannot explicitly define a schema. That would defeat the purpose of the project that I am working on - I must be able to infer the schema dynamically. Which, again, works fine for other files I am using.

Did you find the solution for this issue? I am also having same issue here. I was able to find the cause but not sure how to resolve it. The issue here is with inconsistency in values of the Array type field. LEts say, one attribute in one of the array type field has some value = "123" and the same attribute has value="", then you will see all rows of your dataframe as null. I verified it at my end with multiple use case like this. I am sure your xml is also having similar kind of inconsistency in data. Please let me know. — CodeHunter
– CodeHunter, Commented Jun 17, 2019 at 23:08

CodeHunter · Accepted Answer · 2019-06-18 15:57:13Z

Cause:

So I was able to figure out why was this happening. You are likely to see this issue when the xml you are trying to convert to dataframe have inconsistent values. For example, something like below will have this issue:

<?xml version="1.0"?>
<Company>
  <Employee id="1">
      <Email>[email protected]</Email>
      <Measures id="id32" type="AttributesInContext">
<Dimensions value="7in" title="Height"></Dimensions>
<Dimensions value="" title="Weight"></Dimensions></Measures>
  </Employee>
  <Employee id="2">
      <Email>[email protected]</Email>
      <Measures id="id33" type="AttributesInContext">
<Dimensions value="6in" title="Height"></Dimensions>
<Dimensions value="" title="Weight"></Dimensions></Measures>
  </Employee>
  <Employee id="3">
      <Email>[email protected]</Email>
      <Measures id="id34" type="AttributesInContext">
<Dimensions value="4in" title="Height"></Dimensions>
<Dimensions value="" title="Weight"></Dimensions></Measures>
  </Employee>
</Company>

Here, since you have value="" in each of the rowTag entry, we will likely to get null in the dataframe for all rows as it is not able to infer the data type. However, if you replace all value="" fields with some actual value, you will see that this issue do not occur.

Solution:

As per databricks link, this issue can be resolved by using option("nullValue", "") option while reading the xml file. So your command will look like this (I tried this in scala, should be similar in python):

var xmldf = sparkSession.read.format("com.databricks.spark.xml")
        .option("rowTag", rootTag).option("nullValue", "").load("/path/to/xml")

Thanks! I had a complicated string-replace work-around, I'm glad there is a simpler option. Much appreciated.

Collectives™ on Stack Overflow

Why is my data frame loaded with blank values when reading XML using spark.xml? (Python/Databricks)

1 Answer 1

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related