I am reading an XML file using spark.xml in Python and ran into a seemingly very specific problem.
I was able to narrow to down the part of the XML that is producing the problem, but not why it is happening.
Here is the code to read in the XML file:
src = sqlContext.read.format("com.databricks.spark.xml").options(rowTag="root").load("file.xml")
Here is the schema read in from the XML using the spark.xml library, with rowTag = "root":
root
|-- AID: long (nullable = true)
|-- RID: long (nullable = true)
|-- XmlData: struct (nullable = true)
| |-- NC: struct (nullable = true)
| | |-- RS: struct (nullable = true)
| | | |-- RD: struct (nullable = true)
| | | | |-- CR: struct (nullable = true)
| | | | | |-- Addr1: string (nullable = true)
| | | | | |-- Addr2: string (nullable = true)
| | | | | |-- City: string (nullable = true)
| | | | | |-- InFile: string (nullable = true)
| | | | | |-- Name: string (nullable = true)
| | | | | |-- Phone: long (nullable = true)
| | | | | |-- State: string (nullable = true)
| | | | | |-- Zip: long (nullable = true)
| | | | |-- SC: struct (nullable = true)
| | | | | |-- Class: string (nullable = true)
| | | | | |-- ClassType: string (nullable = true)
| | | | | |-- Message: string (nullable = true)
| | | | | |-- SC: long (nullable = true)
| | | | |-- NC: long (nullable = true)
| | | | |-- CRR: string (nullable = true)
| | | | |-- RM: struct (nullable = true)
| | | | | |-- Addr1: string (nullable = true)
| | | | | |-- City: string (nullable = true)
| | | | | |-- MemberNo: string (nullable = true)
| | | | | |-- Name: string (nullable = true)
| | | | | |-- State: string (nullable = true)
| | | | | |-- Zip: long (nullable = true)
| | | | |-- TL: array (nullable = true)
| | | | | |-- element: struct (containsNull = true)
| | | | | | |-- _AvgDays: long (nullable = true)
| | | | | | |-- _Comment: string (nullable = true)
| | | | | | |-- _Current: long (nullable = true)
| | | | | | |-- _HC: long (nullable = true)
| | | | | | |-- _IndType: string (nullable = true)
| | | | | | |-- _LS: long (nullable = true)
| | | | | | |-- _OpenDate: long (nullable = true)
| | | | | | |-- _PD120Day: long (nullable = true)
| | | | | | |-- _PD30Day: long (nullable = true)
| | | | | | |-- _PD60Day: long (nullable = true)
| | | | | | |-- _PD90Day: long (nullable = true)
| | | | | | |-- _Region: string (nullable = true)
| | | | | | |-- _ReportDate: string (nullable = true)
| | | | | | |-- _SourceID: long (nullable = true)
| | | | | | |-- _TD: long (nullable = true)
| | | | | | |-- _VALUE: string (nullable = true)
| | | | |-- Trends: array (nullable = true)
| | | | | |-- element: struct (containsNull = true)
| | | | | | |-- _CurrentPct: double (nullable = true)
| | | | | | |-- _LineCnt: long (nullable = true)
| | | | | | |-- _PD120DayPct: double (nullable = true)
| | | | | | |-- _PD30DayPct: double (nullable = true)
| | | | | | |-- _PD60DayPct: double (nullable = true)
| | | | | | |-- _PD90DayPct: double (nullable = true)
| | | | | | |-- _PD: string (nullable = true)
| | | | | | |-- _TD: long (nullable = true)
| | | | | | |-- _VALUE: string (nullable = true)
There is more to this schema, but for some reason the XML parser does not get past this point. I am thinking there is an issue with the 'Trends' tag in the XML, but I cannot find it.
Here is a sample of a 'Trends' tag entry:
<Trends PD="4205" LineCnt="0" TD="0" CurrentPct="0" PD30DayPct="0" PD60DayPct="0" PD90DayPct="0" PD120DayPct="0" />
I understand that a closed tag such as this one could not be read in previous versions of the XML parser, but it works fine while using it in Databricks, and my other files with closed tags are read properly.
Here is the final result to further exemplify what I am explaining (I have 13 records in my XML):
+---------+--------+-------+
|AID |RID |XmlData|
+---------+--------+-------+
| null| null| null|
| null| null| null|
| null| null| null|
| null| null| null|
| null| null| null|
| null| null| null|
| null| null| null|
| null| null| null|
| null| null| null|
| null| null| null|
| null| null| null|
| null| null| null|
| null| null| null|
| null| null| null|
+---------+--------+-------+
(This top level view is fine - my code later iterates through the structs/arrays in 'XMLData' - but this needs to be populated first of course)
Does anyone know why this would cause the schema to stop being loaded to completion when parsing?
Please note that I cannot explicitly define a schema. That would defeat the purpose of the project that I am working on - I must be able to infer the schema dynamically. Which, again, works fine for other files I am using.