I have a dataframe that consists of rows of data, and a column of XML that needs to be parsed. I'm able to parse that XML with the following code, taken from this Stack Overflow solution:
import xml.etree.ElementTree as ET
import pyspark.sql.functions as F

@F.udf('array<struct<id:string, age:string, sex:string>>')
def parse_xml(s):
    root = ET.fromstring(s)
    return list(map(lambda x: x.attrib, root.findall('visitor')))

df2 = df.select(
    F.explode(parse_xml('visitors')).alias('visitors')
).select('visitors.*')
df2.show()
This creates a new dataframe containing only the parsed XML data.
How can I modify this so that the result also includes a column from the original dataframe, allowing the two to be joined later?
For instance, if the original dataframe looks like:
+----+---+----------------------+
|id |a |xml |
+----+---+----------------------+
|1234|. |<row1, row2> |
|2345|. |<row3, row4>, <row5> |
|3456|. |<row6> |
+----+---+----------------------+
How can I include the ID in each of the rows of the newly-created dataframe?
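To illustrate the behavior I'm after outside of Spark, here is a plain-Python sketch (with made-up sample XML and a hypothetical `source_id` field name) of what I want each parsed row to look like, with the originating row's ID carried alongside the visitor attributes:

```python
import xml.etree.ElementTree as ET

# Made-up sample rows: (id, xml) pairs standing in for the dataframe above.
rows = [
    ('1234', '<visitors><visitor id="v1" age="30" sex="F"/></visitors>'),
    ('2345', '<visitors><visitor id="v2" age="41" sex="M"/>'
             '<visitor id="v3" age="22" sex="F"/></visitors>'),
]

def parse_with_id(source_id, xml):
    """Parse one XML string, tagging every visitor with the source row's id."""
    root = ET.fromstring(xml)
    # carry the original id alongside each parsed visitor's attributes
    return [{'source_id': source_id, **v.attrib} for v in root.findall('visitor')]

# Flatten all rows, one output dict per visitor element.
flat = [visitor for source_id, xml in rows for visitor in parse_with_id(source_id, xml)]
for visitor in flat:
    print(visitor)
```

In Spark terms, I want the exploded dataframe to have a `source_id`-style column like this, so it can be joined back to the original dataframe.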