I have a dataframe in which one column holds XML that needs to be parsed. I'm able to parse that XML with the following code, taken from this Stack Overflow solution:

import xml.etree.ElementTree as ET
import pyspark.sql.functions as F

@F.udf('array<struct<id:string, age:string, sex:string>>')
def parse_xml(s):
    root = ET.fromstring(s)
    return list(map(lambda x: x.attrib, root.findall('visitor')))
    
df2 = df.select(
    F.explode(parse_xml('visitors')).alias('visitors')
).select('visitors.*')

df2.show()

This creates a new dataframe that contains only the parsed XML data.

How can I modify this code so that the result also carries a column from the original dataframe, so that the two can be joined later?

For instance, if the original dataframe looks like:

+----+---+----------------------+
|id  |a  |xml                   |
+----+---+----------------------+
|1234|.  |<row1, row2>          |
|2345|.  |<row3, row4>, <row5>  |
|3456|.  |<row6>                |
+----+---+----------------------+

How can I include the ID in each of the rows of the newly-created dataframe?

1 Answer

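You also need to select the id column when you construct df2. I think you can do something like:

df2 = df.select('id',
    F.explode(parse_xml('visitors')).alias('visitors')
).select('id','visitors.*')

One caveat: the struct returned by parse_xml in the question also has a field named id, so the result would contain two columns called id. If that gets in the way, alias one of them, for example F.col('id').alias('row_id') (row_id is just a placeholder name).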

Here is a small self-contained example that demonstrates the idea:

import pyspark.sql.functions as F
# Assumes an active SparkSession named `spark`.
df = spark.createDataFrame(
    [(1, ["xml1", "xml2", "xml3"]), (2, ["xml4", "xml5", "xml6"]), (3, ["xml7", "xml8", "xml9"])],
    ["id", "xml"],
)
df.show()
# Selecting "id" next to explode() repeats the id for every array element.
df_exploded_with_id = df.select("id", F.explode(F.col("xml")))
df_exploded_with_id.show()

Output:

+---+------------------+
| id|               xml|
+---+------------------+
|  1|[xml1, xml2, xml3]|
|  2|[xml4, xml5, xml6]|
|  3|[xml7, xml8, xml9]|
+---+------------------+

+---+----+
| id| col|
+---+----+
|  1|xml1|
|  1|xml2|
|  1|xml3|
|  2|xml4|
|  2|xml5|
|  2|xml6|
|  3|xml7|
|  3|xml8|
|  3|xml9|
+---+----+
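
To tie this back to the XML case, here is a rough plain-Python sketch (no Spark needed) of what the UDF plus explode does to each row once the id is selected alongside it. The sample XML strings, the row_id key, and the helper names parse_visitors and explode_with_id are all made up for illustration; the real layout of your xml column may differ.

```python
import xml.etree.ElementTree as ET


def parse_visitors(xml_string):
    """Return one attribute dict per <visitor> element, like the UDF body."""
    root = ET.fromstring(xml_string)
    return [dict(v.attrib) for v in root.findall("visitor")]


def explode_with_id(rows):
    """Mimic select('id', explode(...)): repeat the row's id per parsed element."""
    out = []
    for row_id, xml_string in rows:
        for visitor in parse_visitors(xml_string):
            # row_id is a hypothetical name chosen to avoid clashing with
            # the visitor's own "id" attribute.
            out.append({"row_id": row_id, **visitor})
    return out


# Hypothetical sample data: (id, xml) pairs standing in for dataframe rows.
rows = [
    (1234, '<visitors><visitor id="1" age="30" sex="M"/>'
           '<visitor id="2" age="25" sex="F"/></visitors>'),
    (3456, '<visitors><visitor id="3" age="41" sex="F"/></visitors>'),
]

exploded = explode_with_id(rows)
for record in exploded:
    print(record)
```

Each output record keeps the id of the row it came from, which is what makes the later join possible.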
