Parsing XML into a dataframe

Question

I am having some trouble parsing some XML. This is what the XML looks like.

<listing>
   <seller_info>
       <seller_name> cubsfantony</seller_name>
       <seller_rating> 848</seller_rating>
   </seller_info>
   <payment_types>Visa/MasterCard, Money Order/Cashiers Checks, Personal Checks, See item description for payment methods accepted
   </payment_types>
   <shipping_info>Buyer pays fixed shipping charges, Will ship to United States only
   </shipping_info>
   <buyer_protection_info>
   </buyer_protection_info>
   <auction_info>
     <current_bid>$620.00 </current_bid>
     <time_left> 4 days, 14 hours +  </time_left>
     <high_bidder> 
        <bidder_name> [email protected] </bidder_name>
        <bidder_rating>-2 </bidder_rating>
     </high_bidder>
     <num_items>1 </num_items>
     <num_bids>  12</num_bids>
     <started_at>$1.00 </started_at>
     <bid_increment> </bid_increment>
     <location> USA/Chicago</location>
     <opened> Nov-27-00 04:57:50 PST</opened>
     <closed> Dec-02-00 04:57:50 PST</closed>
     <id_num> 511601118</id_num>
     <notes>  </notes>
   </auction_info>
   <bid_history>
       <highest_bid_amount>$620.00   </highest_bid_amount>
       <quantity> 1</quantity>
   </bid_history>
   <item_info>
      <memory> 256MB PC133 SDram</memory>
      <hard_drive> 30 GB 7200 RPM IDE Hard Drive</hard_drive>
      <cpu>Pentium III 933 System  </cpu>
      <brand> </brand>
      <description> NEW Pentium III 933 System - 133 MHz BUS Speed Pentium Motherboard.....
      </description>
   </item_info>
</listing>

This is my code. I want to take the text between the tags and put it into a pandas DataFrame. The full XML contains about 20 listings. For now, I'm just trying to extract the text by tag name, but I'm not sure how to go about it.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os

from lxml import etree


ebay = etree.parse('ebay.xml') 
tree = ebay.getroot()


for child in tree:
    for element in child:
        person_dict = {}
        for more in element:
            if more.text != None:
                person_dict[more] = more.text.strip

titipata · Accepted Answer · 2018-03-09 06:54:53Z

Here is an example of parsing a single listing. If you have multiple listings, you can go through all of them with a for loop.

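from lxml import etree

# parse() returns an ElementTree; getroot() gives the <listing> element
listing = etree.parse('ebay.xml').getroot()

d = {}
for e in listing:                # sections such as <seller_info>
    for c in e:                  # fields inside each section
        if len(c) == 0:          # leaf element: store its text directly
            if c.tag is not None:
                d[c.tag] = c.text
        else:                    # nested element such as <high_bidder>
            for ce in c:
                if ce.tag is not None:
                    d[ce.tag] = ce.text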
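
For a file with several listings, the same idea can be wrapped in a function and called once per listing. The following is only a sketch under some assumptions: the listings are taken to sit under a hypothetical <root> element (the real root tag of ebay.xml is not shown in the question), the XML is an inline string rather than the file, and the second listing's values are made up. It also uses iter() to visit every descendant instead of the nested loops above, so top-level leaf elements are picked up as well. listing_to_dict is a name introduced here for illustration.

```python
from lxml import etree

# Hypothetical two-listing document; the <root> wrapper and the second
# listing's values are invented for illustration.
XML = """
<root>
  <listing>
    <seller_info>
      <seller_name> cubsfantony</seller_name>
      <seller_rating> 848</seller_rating>
    </seller_info>
    <auction_info>
      <current_bid>$620.00 </current_bid>
      <high_bidder>
        <bidder_rating>-2 </bidder_rating>
      </high_bidder>
    </auction_info>
  </listing>
  <listing>
    <seller_info>
      <seller_name> example_seller</seller_name>
      <seller_rating> 10</seller_rating>
    </seller_info>
    <auction_info>
      <current_bid>$5.00 </current_bid>
    </auction_info>
  </listing>
</root>
"""


def listing_to_dict(listing):
    """Flatten one <listing> element into {tag: text} for its leaf elements."""
    d = {}
    for node in listing.iter():
        # Skip comments (non-string tags) and container elements with children
        if isinstance(node.tag, str) and len(node) == 0:
            d[node.tag] = node.text
    return d


root = etree.fromstring(XML)
# One dictionary per listing
rows = [listing_to_dict(listing) for listing in root.iter("listing")]
```

Each entry in rows is a dictionary like d above. Tags missing from a listing (bidder_rating in the second one here) are simply absent from that dictionary.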

From here, you can append d to a list and then use pandas to convert the list into a DataFrame.
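
The conversion step might look like the sketch below. The list of dictionaries is hard-coded here so the snippet runs on its own; the second row is made up and the variable names are only illustrative.

```python
import pandas as pd

# Hypothetical per-listing dictionaries shaped like d; the second row is invented
rows = [
    {"seller_name": " cubsfantony", "seller_rating": " 848", "current_bid": "$620.00 "},
    {"seller_name": " example_seller", "seller_rating": " 10"},
]

# Each dictionary becomes one row; keys become columns,
# and keys missing from a row are filled with NaN
df = pd.DataFrame(rows)
print(df)
```

Passing a list of dictionaries to pd.DataFrame means listings do not all need the same set of tags.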

The dictionary d for the sample listing looks like this:

{'bid_increment': ' ',
 'bidder_name': ' [email protected] ',
 'bidder_rating': '-2 ',
 'brand': ' ',
  ...
 'seller_name': ' cubsfantony',
 'seller_rating': ' 848',
 'started_at': '$1.00 ',
 'time_left': ' 4 days, 14 hours +  '}
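
The values above keep the whitespace from the XML, and blank elements such as bid_increment come back as ' '. One way to tidy them before building the DataFrame is sketched below. clean is a hypothetical helper, and mapping blank values to None is an assumption about what you want, not something the question specifies.

```python
def clean(d):
    """Strip whitespace from each value; map None or blank text to None."""
    out = {}
    for tag, text in d.items():
        stripped = (text or "").strip()
        out[tag] = stripped or None
    return out


# A few entries copied from the output above
sample = {
    "bid_increment": " ",
    "bidder_rating": "-2 ",
    "seller_rating": " 848",
    "time_left": " 4 days, 14 hours +  ",
}
cleaned = clean(sample)
```

The values are still strings after this step; converting columns such as seller_rating to numbers is easier once the data is in a DataFrame.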

edited Mar 9, 2018 at 6:54

answered Mar 8, 2018 at 23:30
