I'm struggling to work out the parsing logic for a dataframe in which XML data is embedded in a JSON object. I have read the JSON object successfully and stored it in a dataframe as shown below; it contains a column, Guest_data, which holds XML:
| Country | Guest_data |
|---|---|
| Romania | xml 1 |
| Hungary | xml 2 |
| Ukraine | xml 3 |
I was also able to read the XML file separately with the xpath and explode functions and store the result in a separate dataframe. The three XML values look like this:
XML format 1
<?xml version="1.0" encoding="utf-8"?> <visitors> <visitor id="AA11" age="68" sex="F" /> <visitor id="BB22" age="34" sex="M" /> <visitor id="CC33" age="23" sex="M" /> </visitors>
XML format 2
<?xml version="1.0" encoding="utf-8"?> <visitors> <visitor id="FF77" age="27" sex="F" /> <visitor id="YY99" age="32" sex="M" /> </visitors>
XML format 3
<?xml version="1.0" encoding="utf-8"?> <visitors> <visitor id="DD55" age="68" sex="F" /> <visitor id="LL99" age="34" sex="M" /> <visitor id="SS77" age="47" sex="M" /> <visitor id="TT00" age="30" sex="M" /> </visitors>
The result I want to achieve is:
| Country | id | age | sex |
|---|---|---|---|
| Romania | AA11 | 68 | F |
| Romania | BB22 | 34 | M |
| Romania | CC33 | 23 | M |
| Hungary | FF77 | 27 | F |
| Hungary | YY99 | 32 | M |
| Ukraine | DD55 | 68 | F |
| Ukraine | LL99 | 34 | M |
| Ukraine | SS77 | 47 | M |
| Ukraine | TT00 | 30 | M |
I want to build a dataframe with the data above so that I can compute the average visitor age per country and run some more SQL queries.
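To make the intended reshaping concrete, here is a minimal sketch in plain Python (standard library only, not Spark). It assumes each row is a `(Country, Guest_data)` pair where `Guest_data` is an XML string shaped like the samples above. The function names `flatten_guest_data` and `average_age_by_country` are hypothetical, chosen for illustration. The same per-row logic is what a Spark solution would need to perform, whether through xpath plus explode or through an XML data source.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def flatten_guest_data(rows):
    """Turn (country, xml_text) pairs into one dict per <visitor> element."""
    flat = []
    for country, xml_text in rows:
        # Parse from bytes so the utf-8 XML declaration is handled by the parser.
        root = ET.fromstring(xml_text.encode("utf-8"))
        for visitor in root.iter("visitor"):
            flat.append({
                "Country": country,
                "id": visitor.get("id"),
                "age": int(visitor.get("age")),  # cast so ages can be averaged
                "sex": visitor.get("sex"),
            })
    return flat


def average_age_by_country(flat_rows):
    """Group the flattened rows by country and average the age column."""
    ages = defaultdict(list)
    for row in flat_rows:
        ages[row["Country"]].append(row["age"])
    return {country: sum(vals) / len(vals) for country, vals in ages.items()}


# Sample input: the Hungary row, using XML format 2 from the question.
sample = [(
    "Hungary",
    '<?xml version="1.0" encoding="utf-8"?> <visitors> '
    '<visitor id="FF77" age="27" sex="F" /> '
    '<visitor id="YY99" age="32" sex="M" /> </visitors>',
)]

flat = flatten_guest_data(sample)
for row in flat:
    print(row)
print(average_age_by_country(flat))
```

For the Hungary sample this yields two flattened rows (FF77 and YY99) and an average age of 29.5, which matches the Hungary rows in the desired table.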
You could use the spark-xml data source to parse the "nested" XML, as shown here: docs.databricks.com/data/data-sources/xml.html#parse-nested-xml