I am having trouble parsing some XML. This is what the XML looks like:

<listing>
   <seller_info>
       <seller_name> cubsfantony</seller_name>
       <seller_rating> 848</seller_rating>
   </seller_info>
   <payment_types>Visa/MasterCard, Money Order/Cashiers Checks, Personal Checks, See item description for payment methods accepted
   </payment_types>
   <shipping_info>Buyer pays fixed shipping charges, Will ship to United States only
   </shipping_info>
   <buyer_protection_info>
   </buyer_protection_info>
   <auction_info>
     <current_bid>$620.00 </current_bid>
     <time_left> 4 days, 14 hours +  </time_left>
     <high_bidder> 
        <bidder_name> [email protected] </bidder_name>
        <bidder_rating>-2 </bidder_rating>
     </high_bidder>
     <num_items>1 </num_items>
     <num_bids>  12</num_bids>
     <started_at>$1.00 </started_at>
     <bid_increment> </bid_increment>
     <location> USA/Chicago</location>
     <opened> Nov-27-00 04:57:50 PST</opened>
     <closed> Dec-02-00 04:57:50 PST</closed>
     <id_num> 511601118</id_num>
     <notes>  </notes>
   </auction_info>
   <bid_history>
       <highest_bid_amount>$620.00   </highest_bid_amount>
       <quantity> 1</quantity>
   </bid_history>
   <item_info>
      <memory> 256MB PC133 SDram</memory>
      <hard_drive> 30 GB 7200 RPM IDE Hard Drive</hard_drive>
      <cpu>Pentium III 933 System  </cpu>
      <brand> </brand>
      <description> NEW Pentium III 933 System - 133 MHz BUS Speed Pentium Motherboard.....
      </description>
   </item_info>
</listing>

This is my code. I want to take the text between the tags and put it into a pandas DataFrame. The full XML contains about 20 listings. For now, I'm just trying to work out how to extract the text by tag name, but I'm not sure how to go about it.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os

from lxml import etree


ebay = etree.parse('ebay.xml') 
tree = ebay.getroot()


for child in tree:
    for element in child:
        person_dict = {}
        for more in element:
            if more.text != None:
                person_dict[more] = more.text.strip

1 Answer

Here is an example of how to parse a single listing. If you have multiple listings, you can use a for loop to go through all of them.

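from lxml import etree

# Parse the file and take its root element
# (assumed here to be a single <listing>).
listing = etree.parse('ebay.xml').getroot()

d = {}
for e in listing:          # sections such as <seller_info>, <auction_info>
    for c in e:            # fields inside each section
        if len(c) == 0:
            # Leaf element: store its text under its tag name.
            d[c.tag] = c.text
        else:
            # Nested element such as <high_bidder>: store each child's text.
            for ce in c:
                d[ce.tag] = ce.text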

From here, you can append d to a list and then use pandas to convert the list of dictionaries into a DataFrame.

The output looks like this:

{'bid_increment': ' ',
 'bidder_name': ' [email protected] ',
 'bidder_rating': '-2 ',
 'brand': ' ',
  ...
 'seller_name': ' cubsfantony',
 'seller_rating': ' 848',
 'started_at': '$1.00 ',
 'time_left': ' 4 days, 14 hours +  '}
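
To go from one dictionary to a DataFrame covering every listing, the loop-and-append step described above could be sketched roughly as follows. This is only an illustration under some assumptions: the wrapper element name <root>, the shortened two-listing sample, the bidder name, and the helper listing_to_dict are all made up for the example and are not from the original file. Unlike the snippet above, it also strips whitespace and keeps leaf elements that sit directly under a listing, such as <payment_types>. Tags that repeat inside one listing would overwrite each other, so they would need distinct keys.

```python
import pandas as pd

try:
    from lxml import etree
except ImportError:  # the standard library exposes the same calls used below
    import xml.etree.ElementTree as etree

# Hypothetical, shortened sample; the <root> wrapper is an assumption.
SAMPLE = """
<root>
  <listing>
    <seller_info>
      <seller_name> cubsfantony</seller_name>
      <seller_rating> 848</seller_rating>
    </seller_info>
    <payment_types>Visa/MasterCard, Money Order/Cashiers Checks
    </payment_types>
    <auction_info>
      <current_bid>$620.00 </current_bid>
      <high_bidder>
        <bidder_name> example_bidder </bidder_name>
        <bidder_rating>-2 </bidder_rating>
      </high_bidder>
      <bid_increment> </bid_increment>
    </auction_info>
  </listing>
  <listing>
    <seller_info>
      <seller_name> another_seller</seller_name>
      <seller_rating> 12</seller_rating>
    </seller_info>
    <auction_info>
      <current_bid>$5.00 </current_bid>
    </auction_info>
  </listing>
</root>
"""


def listing_to_dict(listing):
    """Map each leaf tag under one <listing> to its stripped text."""
    row = {}
    for el in listing.iter():
        # Skip comments (non-string tags in lxml) and container
        # elements that only hold other elements.
        if not isinstance(el.tag, str) or len(el) > 0:
            continue
        row[el.tag] = (el.text or "").strip()
    return row


root = etree.fromstring(SAMPLE)  # for a file: etree.parse('ebay.xml').getroot()

# One dictionary per listing, then one DataFrame row per dictionary.
rows = [listing_to_dict(listing) for listing in root.iter("listing")]
df = pd.DataFrame(rows)
print(df)
```

Fields that are missing from a listing show up as NaN in the DataFrame. Because iter("listing") also matches the root element itself, the same code should work when the file contains just one <listing> at the top level.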