I am trying to read an XML file into a DataFrame in PySpark.

Code:

df_xml = spark.read.format("com.databricks.spark.xml").option("rootTag", "dataset").option("rowTag", "AUTHOR").load(FilePath)

When I display the DataFrame, it shows a single column, _corrupt_record:

Below is the XML file content:

<?xml version='1.0' encoding='UTF-8'?>

<dataset>
 
 <AUTHOR AUTHOR_UID = 1>
    <FIRST_NAME>Fiona</FIRST_NAME>
    <MIDDLE_NAME/>
    <LAST_NAME>Macdonald</LAST_NAME>
 </AUTHOR>
 <AUTHOR AUTHOR_UID = 2>
    <FIRST_NAME>Gian</FIRST_NAME>
    <MIDDLE_NAME>Paolo</MIDDLE_NAME>
    <LAST_NAME>Faleschini</LAST_NAME>
 </AUTHOR>
 <AUTHOR AUTHOR_UID = 3>
    <FIRST_NAME>Laura</FIRST_NAME>
    <MIDDLE_NAME>K</MIDDLE_NAME>
    <LAST_NAME>Egendorf</LAST_NAME>
 </AUTHOR>
 <AUTHOR AUTHOR_UID = 4>
    <FIRST_NAME>Jan</FIRST_NAME>
    <MIDDLE_NAME/>
    <LAST_NAME>Grover</LAST_NAME>
 </AUTHOR>

1 Answer

That XML is not well-formed:

  • The AUTHOR_UID attribute value must be enclosed in quotes
  • The dataset tag is not closed

The example below is valid:

<?xml version='1.0' encoding='UTF-8'?>

<dataset>
 
 <AUTHOR AUTHOR_UID = '1'>
    <FIRST_NAME>Fiona</FIRST_NAME>
    <MIDDLE_NAME/>
    <LAST_NAME>Macdonald</LAST_NAME>
 </AUTHOR>
 <AUTHOR AUTHOR_UID = '2'>
    <FIRST_NAME>Gian</FIRST_NAME>
    <MIDDLE_NAME>Paolo</MIDDLE_NAME>
    <LAST_NAME>Faleschini</LAST_NAME>
 </AUTHOR>
 <AUTHOR AUTHOR_UID = '3'>
    <FIRST_NAME>Laura</FIRST_NAME>
    <MIDDLE_NAME>K</MIDDLE_NAME>
    <LAST_NAME>Egendorf</LAST_NAME>
 </AUTHOR>
 <AUTHOR AUTHOR_UID = '4'>
    <FIRST_NAME>Jan</FIRST_NAME>
    <MIDDLE_NAME/>
    <LAST_NAME>Grover</LAST_NAME>
 </AUTHOR>
 
 </dataset>
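
If you want to confirm which file is the problem before handing it to Spark, a quick well-formedness check with Python's standard library can help. This is only a minimal sketch: `is_well_formed`, `BROKEN`, and `FIXED` are hypothetical names, and the two strings are trimmed-down stand-ins for the files above, not the real data.

```python
import xml.etree.ElementTree as ET

# Trimmed, hypothetical versions of the two files discussed above.
BROKEN = """<dataset>
 <AUTHOR AUTHOR_UID = 1>
    <FIRST_NAME>Fiona</FIRST_NAME>
 </AUTHOR>
</dataset>"""

FIXED = """<dataset>
 <AUTHOR AUTHOR_UID = '1'>
    <FIRST_NAME>Fiona</FIRST_NAME>
 </AUTHOR>
</dataset>"""


def is_well_formed(xml_text):
    """Return (True, None) if xml_text parses, else (False, error message)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        return False, str(err)
    return True, None


for name, text in (("broken", BROKEN), ("fixed", FIXED)):
    ok, error = is_well_formed(text)
    print(name, "->", "well-formed" if ok else f"not well-formed: {error}")
```

The parser rejects the unquoted attribute value and accepts the quoted one, which matches the first point in the list above. The error message includes the line and column of the first offending token, which makes it easier to locate the problem in a larger file.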

1 Comment

Thanks Luiz, it seems the AUTHOR_UID value not being in quotes was causing the problem. The dataset tag was closed in the file; I only shared a few records, so it was missing from the example.
