0

I'm currently ingesting an XML file with lxml and then creating a pandas dataframe from the root element. I'm essentially using this example. I'm doing this so I can do some math / undertake some modelling on the data.

Next, I'd like to write the data back to the XML document. Elsewhere in my script I've used root.insert, since it lets me insert at a particular position index and keep the XML document neat and coherent.

Is there a way I can write out each row of the dataframe using something like root.insert(position, data) for each row in the dataframe, where the dataframe's column header is the tag?

Example XML

<Root_Data>

  <SomeData></SomeData>
  <SomeOtherData></SomeOtherData>   
   
  <Weather>
    <WxId>1</WxId>
    <Temp>20</Temp>
    <WindSpeed>15</WindSpeed>
  </Weather>

  # We will insert more weather here - I can find this position index. Assume it is 3.

  <SomeMoreData></SomeMoreData>
</Root_Data>

Pandas dataframe:

ID Temp Windspeed
2  25   30
3  30   15
4  15   25

I'd offer some code I've tried so far, but I've come up empty-handed on how to insert rows from a dataframe into the XML document without manually constructing the XML as strings myself (not great, since headers might change, which is why I'd like to use the column headers as the tags).

Expected Result

<Root_Data>

  <SomeData></SomeData>
  <SomeOtherData></SomeOtherData>   
   
  <Weather>
    <WxId>1</WxId>
    <Temp>20</Temp>
    <WindSpeed>15</WindSpeed>
  </Weather>
  <Weather>
    <WxId>2</WxId>
    <Temp>25</Temp>
    <WindSpeed>30</WindSpeed>
  </Weather>
  <Weather>
    <WxId>3</WxId>
    <Temp>30</Temp>
    <WindSpeed>15</WindSpeed>
  </Weather>
  <Weather>
    <WxId>4</WxId>
    <Temp>15</Temp>
    <WindSpeed>25</WindSpeed>
  </Weather>

  <SomeMoreData></SomeMoreData>
</Root_Data>

Example code so far:

from lxml import etree
import pandas as pd

tree = etree.parse('example.xml')
root = tree.getroot()

# Load into dataframe: one row per child of the root
df_cols = ["WxId", "Temp", "WindSpeed"]
rows = []
for node in root:
    # In the example XML, WxId is a child element rather than an attribute,
    # so every column is read with findtext (None when the tag is missing)
    rows.append({col: node.findtext(col) for col in df_cols})

out_df = pd.DataFrame(rows, columns=df_cols)
out_df = out_df[~out_df['Temp'].isnull()]  # Proxy for good / bad data. Remove nulls.

# Now, write from data frame back to root so we can structure the XML before writing to file.
# ? Unknown method

2 Answers

1

Another approach, in case your columns are not known in advance or may grow in the future:

import pandas as pd
from lxml import etree

df = pd.read_csv('./123.csv')

root = etree.Element("root")
for rows in range(0,df.shape[0]):
    Tag = etree.Element('weather')
    for cols in range(0,df.shape[1]):
        # The Series name is the column header, which becomes the child tag
        etree.SubElement(Tag,df.iloc[rows:,cols].head().name).text = str(df.iloc[rows, cols])
    # Append Element "Tag" to the Main Root here
    root.append(Tag)

print(etree.tostring(root,encoding='unicode'))
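
To place the new elements at a specific index in an existing tree rather than in a fresh root, the same loop can be combined with root.insert. This is a minimal sketch, assuming the dataframe columns already carry the desired tag names and that the insert position is known (3 here, as in the question). The helper name insert_rows and the shortened XML literal are made up for illustration.

```python
import pandas as pd
from lxml import etree


def insert_rows(root, df, position, row_tag="Weather"):
    """Insert one <row_tag> element per dataframe row, starting at `position`.

    Hypothetical helper for illustration; not part of lxml or pandas.
    """
    for offset, (_, row) in enumerate(df.iterrows()):
        elem = etree.Element(row_tag)
        for col in df.columns:
            # The column header becomes the child tag name
            etree.SubElement(elem, col).text = str(row[col])
        # Advance the index so rows keep their dataframe order
        root.insert(position + offset, elem)
    return root


# Shortened stand-in for the question's document
root = etree.fromstring(
    "<Root_Data><SomeData/><SomeOtherData/>"
    "<Weather><WxId>1</WxId><Temp>20</Temp><WindSpeed>15</WindSpeed></Weather>"
    "<SomeMoreData/></Root_Data>"
)
df = pd.DataFrame({"WxId": [2, 3, 4], "Temp": [25, 30, 15], "WindSpeed": [30, 15, 25]})

insert_rows(root, df, position=3)
print([child.tag for child in root])
```

Note that str() on a float column would write values such as 25.0, so columns may need casting before insertion if integer text is expected.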

5 Comments

Can I ask: your example works, but all the elements end up on a single line without formatting. Is it possible to format the elements being written so there are line breaks and indents like in the example? I assume it pertains to this portion of the code: etree.SubElement(Tag,df.iloc[rows:,cols].head().name).text
When you export to XML, use pretty_print for indentation. For printing to stdout, e.g. print(etree.tostring(root,encoding='unicode',pretty_print=True))
Yeah, that doesn't seem to fix it, unfortunately. I'm using: outfile = 'test.xml'; tree.write(outfile, xml_declaration=True, standalone='yes', encoding='utf-8', pretty_print=True)
If you are modifying the existing structure a lot, check this: stackoverflow.com/questions/7903759/… It requires remove_blank_text=True on the parser.
Thanks. I was looking at another similar post but wasn't getting anywhere. I wound up writing the file with messy tags, then importing it again through the parser: tree --> file --> import file again --> parser --> tree --> write tree. I couldn't figure out how to go from tree --> parser --> write tree without going via a file first.
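
The file round trip described in the last comment can probably be avoided by doing the same thing in memory. This is a minimal sketch of two options, assuming lxml 4.5 or later for etree.indent; the tiny sample tree is made up for illustration.

```python
from lxml import etree

# A tree with mixed formatting: one indented child, one appended without whitespace
root = etree.fromstring("<root>\n  <a>1</a>\n</root>")
etree.SubElement(root, "b").text = "2"

# Option 1: re-indent the tree in place (requires lxml >= 4.5)
etree.indent(root, space="  ")
pretty = etree.tostring(root, encoding="unicode")

# Option 2: serialize to bytes and re-parse with remove_blank_text=True,
# so that pretty_print can lay out the whitespace itself; no file is involved
parser = etree.XMLParser(remove_blank_text=True)
reparsed = etree.fromstring(etree.tostring(root), parser)
pretty_again = etree.tostring(reparsed, encoding="unicode", pretty_print=True)

print(pretty)
```

Either result can then be written out with tree.write as before.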
1

You can use to_xml to convert your dataframe to xml:

xdata = (df.rename(columns={'ID': 'WxId'})
           .to_xml(index=False, root_name='Root_Data', row_name='Weather'))
>>> print(xdata)
<?xml version='1.0' encoding='utf-8'?>
<Root_Data>
  <Weather>
    <WxId>2</WxId>
    <Temp>25</Temp>
    <Windspeed>30</Windspeed>
  </Weather>
  <Weather>
    <WxId>3</WxId>
    <Temp>30</Temp>
    <Windspeed>15</Windspeed>
  </Weather>
  <Weather>
    <WxId>4</WxId>
    <Temp>15</Temp>
    <Windspeed>25</Windspeed>
  </Weather>
</Root_Data>

Now you can use lxml to insert the data before the first Weather child, after the last one, or anywhere else in your original XML file. Note that xdata is a string, so it has to be parsed into elements before it can be inserted.
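
Since to_xml returns a string, root.insert will not accept it directly. This is a minimal sketch of one way to bridge the gap, assuming pandas 1.3 or later (where to_xml was added) with lxml installed, an insert position of 3 as in the question, and a shortened copy of the question's XML.

```python
import pandas as pd
from lxml import etree

df = pd.DataFrame({"ID": [2, 3, 4], "Temp": [25, 30, 15], "Windspeed": [30, 15, 25]})
xdata = df.rename(columns={"ID": "WxId"}).to_xml(
    index=False, root_name="Root_Data", row_name="Weather"
)

# Shortened stand-in for the original document
root = etree.fromstring(
    "<Root_Data><SomeData/><SomeOtherData/>"
    "<Weather><WxId>1</WxId><Temp>20</Temp><WindSpeed>15</WindSpeed></Weather>"
    "<SomeMoreData/></Root_Data>"
)

# xdata starts with an XML declaration, so lxml needs bytes rather than str
new_root = etree.fromstring(xdata.encode("utf-8"))

# Inserting an element moves it out of new_root, so iterate over a copy of the list
for offset, weather in enumerate(list(new_root)):
    root.insert(3 + offset, weather)

print(etree.tostring(root, pretty_print=True, encoding="unicode"))
```

The tag names come straight from the dataframe, so the Windspeed column would need renaming as well if it has to match the existing WindSpeed tag.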

FYI, you can use pd.read_xml to convert your xml to a dataframe.
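
A minimal sketch of that direction, assuming pandas 1.3 or later with lxml installed. The XML literal stands in for the question's file, and the xpath value is an assumption about which elements should become rows.

```python
from io import StringIO

import pandas as pd

xml = """<Root_Data>
  <SomeData></SomeData>
  <Weather>
    <WxId>1</WxId>
    <Temp>20</Temp>
    <WindSpeed>15</WindSpeed>
  </Weather>
  <SomeMoreData></SomeMoreData>
</Root_Data>"""

# xpath restricts the rows to <Weather> elements; each child tag becomes a column
df = pd.read_xml(StringIO(xml), xpath=".//Weather")
print(df)
```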

1 Comment

So I'm trying the following two lines and getting an error: xdata = out_df.to_xml(index=False, root_name='Root_Data', row_name='Weather') followed by root.insert(insertPosition, xdata). Error: TypeError: Argument 'element' has incorrect type (expected lxml.etree._Element, got str). Any ideas?
