I have a dataframe that consists of rows of data, and a column of XML that needs to be parsed. I'm able to parse that XML with the following code, taken from this Stack Overflow solution:
import xml.etree.ElementTree as ET
import pyspark.sql.functions as F

@F.udf('array<struct<id:string, age:string, sex:string>>')
def parse_xml(s):
    root = ET.fromstring(s)
    return list(map(lambda x: x.attrib, root.findall('visitor')))

df2 = df.select(
    F.explode(parse_xml('visitors')).alias('visitors')
).select('visitors.*')
df2.show()
This creates a new dataframe containing only the parsed XML data.
How can I modify this so that the result also includes a column from the original dataframe, allowing the two to be joined later?
For instance, if the original dataframe looks like:
+----+---+----------------------+
|id |a |xml |
+----+---+----------------------+
|1234|. |<row1, row2> |
|2345|. |<row3, row4>, <row5> |
|3456|. |<row6> |
+----+---+----------------------+
How can I include the ID in each of the rows of the newly-created dataframe?
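To illustrate the behavior I'm after outside of Spark, here is a plain-Python sketch (with made-up sample XML and a hypothetical `source_id` field name) of what I want each parsed row to look like, with the originating row's ID carried alongside the visitor attributes:

```python
import xml.etree.ElementTree as ET

# Made-up sample rows: (id, xml) pairs standing in for the dataframe above.
rows = [
    ('1234', '<visitors><visitor id="v1" age="30" sex="F"/></visitors>'),
    ('2345', '<visitors><visitor id="v2" age="41" sex="M"/>'
             '<visitor id="v3" age="22" sex="F"/></visitors>'),
]

def parse_with_id(source_id, xml):
    """Parse one XML string, tagging every visitor with the source row's id."""
    root = ET.fromstring(xml)
    # carry the original id alongside each parsed visitor's attributes
    return [{'source_id': source_id, **v.attrib} for v in root.findall('visitor')]

# Flatten all rows, one output dict per visitor element.
flat = [visitor for source_id, xml in rows for visitor in parse_with_id(source_id, xml)]
for visitor in flat:
    print(visitor)
```

In Spark terms, I want the exploded dataframe to have a `source_id`-style column like this, so it can be joined back to the original dataframe.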