parse many xml objects and write them into a text file (csv, tab-del, ...) using python

Question

I have many XML objects of the following format:

<GetSingleItemResponse xmlns="urn:ebay:apis:eBLBaseComponents">
  <Timestamp>2012-10-25T03:09:50.817Z</Timestamp>
  <Ack>Success</Ack>
  <Build>E795_CORE_BUNDLED_15430047_R1</Build>
  <Version>795</Version>
  <Item>
     <Description>...</Description>
     <ItemID>330810813385</ItemID>
     <EndTime>2012-10-25T04:32:37.000Z</EndTime>
     <Location>Paypal Prefered</Location>
     <GalleryURL>...</GalleryURL>
     <PictureURL>...</PictureURL>
     <PictureURL>...</PictureURL>
     <PrimaryCategoryID>177</PrimaryCategoryID>
     <PrimaryCategoryName>
     Computers/Tablets & Networking:Laptops & Netbooks:PC Laptops & Netbooks
     </PrimaryCategoryName>
     <BidCount>2</BidCount>
     <ConvertedCurrentPrice currencyID="USD">294.99</ConvertedCurrentPrice>
     <ListingStatus>Active</ListingStatus>
     <TimeLeft>PT1H22M47S</TimeLeft>
     <Title>
     HP Compaq ZD8000 3800Mhz Full Loaded Ready to go, nice unit & super fast Laptop
     </Title>
     <ShippingCostSummary>
     <ShippingServiceCost currencyID="USD">23.99</ShippingServiceCost>
     <ShippingType>Flat</ShippingType>
     <ListedShippingServiceCost currencyID="USD">23.99</ListedShippingServiceCost>
     </ShippingCostSummary>
     <ItemSpecifics>
        <NameValueList>
           <Name>Operating System</Name>
           <Value>Windows XP Professional</Value>
        </NameValueList>
        <NameValueList>
           <Name>Screen Size</Name>
           <Value>17.0</Value>
        </NameValueList>
        <NameValueList>
           <Name>Processor Type</Name>
           <Value>Intel Pentium 4 HT</Value>
        </NameValueList>
     </ItemSpecifics>
     <Country>US</Country>
     <AutoPay>false</AutoPay>
     <ConditionID>2500</ConditionID>
     <ConditionDisplayName>Seller refurbished</ConditionDisplayName>
   </Item>
</GetSingleItemResponse>

For each xml object I'd like to obtain all the item tag tags, such as itemid, endtime, etc.. and also all the item specifics tag tags, such as Operating System, Screen Size, etc. I'd like to save this into memory for each xml object into an appropriate data structure (object). Finally, I'd like to write all the information for all the xml objects into a csv file.

The difficulty is that a priori I do NOT know know what the columns (header) for the csv file will be. For the first xml object I'd create as many columns as the number of subtags the item and item specifics combined have.

Then I'd add more and more columns as new columns appear for new items, adding NAs for the columns that haven't appeared before.

I am looking for advice on how to process the xml objects, which data structures to convert (save) the xml objects, and also how to write all the finally processed xml data into a csv file.

Thanks.

Pykler · Accepted Answer · 2012-10-25 04:50:08Z

2

For each row in your csv you should create a dictionary. As you parse the xml, you should populate this dictionary for each <Item> from your snippet. While you do that you should keep a set of keys i.e. columns ... that way by the end of the file you will know how many columns you have and their titles.

Here is a little snippet of how this can be done (if the data fits in memory), I will be using BeautifulSoup since you mentioned it in your tags and its awesome:

import sys
import csv

from BeautifulSoup import BeautifulSoup as Soup

doc = Soup(xml_string)
data = []
cols = set()
for item in doc.findAll('item'):
    d = {}
    for sub in item:
        if hasattr(sub, 'name'):
            d[sub.name] = sub.text
    data.append(d)
    cols = cols.union(d.keys())

cw = csv.writer(sys.stdout)
cw.writerow(cols)
for row in data:
    cw.writerow([row.get(k, 'N/A') for k in cols])

Note, that this solution assumes that your keys are unique, but in your example the item has two picture urls, if you want both to show up it is possible (since nothing is impossible) just a little more complex.

If the data does not fit in memory, you will need to do a two pass, the first pass to gather the keys and the second to print the csv ... note, you should replace BeautifulSoup with another parser like sax in this case since the data does not fit in memory

answered Oct 25, 2012 at 4:50

Pykler

14.9k9 gold badges43 silver badges51 bronze badges

Sign up to request clarification or add additional context in comments.

3 Comments

Dnaiel Over a year ago

thanks! that is what I call an elegant and nice solution to my problem.

Dnaiel Over a year ago

@Pykler just tested it and it works great. Just as a complement, csv turns out to be a bad idea, tab delimited is a better format given the text within each tag may also have commas...

Pykler Over a year ago

@Dnaiel you can put commas in a csv ... the python library nicely handles that by using double quotes as necessary ... if you really want a tsv instead the csv writer can be configured to output tabs with dialect='excel-tab'

Collectives™ on Stack Overflow

parse many xml objects and write them into a text file (csv, tab-del, ...) using python

1 Answer 1

3 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

3 Comments

Your Answer

Sign up or log in

Post as a guest

Related