Iterating an XML file and extracting data from it

Question

I have this XML file:

<movie id = 0> 
  <Movie_name>The Shawshank Redemption   </Movie_name> 
  <Address>http://www.imdb.com/title/tt0111161/
  </Address> 
  <year>1994  </year> 
  <stars>Tim Robbins  Morgan Freeman  Bob Gunton    </stars> 
  <plot> plot...
  </plot> 
  <keywords>Reviews, Showtimes</keywords>
</movie>

<movie id = 1> 
  <Movie_name>Inglourious Basterds   </Movie_name> 
  <Address>http://www.imdb.com/title/tt0361748/
  </Address> 
  <year>2009  </year> 
  <stars>Brad Pitt  M&#xE9;lanie Laurent  Christoph Waltz    </stars> 
  <plot>plot/... 
  </plot> 
  <keywords>Reviews, credits  </keywords>
</movie>

How can iterate the file extracting for each movie its speciffic data? I mean for movie 0: its name, address, year and so on.

The input file structure is mandatory, so data extraction can be done while looping.

Much thanks.

daedalus · Accepted Answer · 2012-04-28 12:20:24Z

EDIT -- taking on board the improved XML input

I would strongly recommend trying to validate your input as in the comment by @Lattyware. I find that with invalid XML and HTML, BeautifulSoup does a good job of recovering something usable. Here is what it does with a quick try:

from BeautifulSoup import BeautifulSoup

# Note: I have added the <movielist> root element
xml = """<movielist>
<movie id = 0> 
  <Movie_name>The Shawshank Redemption   </Movie_name> 
  <Address>http://www.imdb.com/title/tt0111161/
  </Address> 
  <year>1994  </year> 
  <stars>Tim Robbins  Morgan Freeman  Bob Gunton    </stars> 
  <plot> plot...
  </plot> 
  <keywords>Reviews, Showtimes</keywords>
</movieNum>

<movie id = 1> 
  <Movie_name>Inglourious Basterds   </Movie_name> 
  <Address>http://www.imdb.com/title/tt0361748/
  </Address> 
  <year>2009  </year> 
  <stars>Brad Pitt  M&#xE9;lanie Laurent  Christoph Waltz    </stars> 
  <plot>plot/... 
  </plot> 
  <keywords>Reviews, credits  </keywords>
</movieNum>

</movielist>"""

soup = BeautifulSoup(xml)
movies = soup.findAll('movie')

for movie in movies:
    id_tag = movie['id']
    name = movie.find("movie_name").text
    url = movie.find("address").text
    year = movie.find("year").text
    stars = movie.find("stars").text
    plot = movie.find("plot").text
    keywords = movie.find("keywords").text
    for item in (id_tag, name, url, year, stars, plot, keywords):
        print item
    print '=' * 50

This will output the following (the ID tag is accessible now):

0
The Shawshank Redemption
http://www.imdb.com/title/tt0111161/
1994
Tim Robbins  Morgan Freeman  Bob Gunton
plot...
Reviews, Showtimes
==================================================
1
Inglourious Basterds
http://www.imdb.com/title/tt0361748/
2009
Brad Pitt  M&#xE9;lanie Laurent  Christoph Waltz
plot/...
Reviews, credits
==================================================

It hopefully gives you a start... It can only get better from here.

Gareth Latty · Accepted Answer · 2012-04-28 11:38:34Z

You will want to check out xml.etree.ElementTree.

I'd also note that what you have there is not valid XML, so you might run into issues. Valid XML would probably look more like this:

<movie id="0"> 
  <name>The Shawshank Redemption</name> 
  <url>http://www.imdb.com/title/tt0111161/</url> 
  <year>1994</year> 
  <stars>
    <star>Tim Robbins</star>
    <star>Morgan Freeman</star>
    <star>Bob Gunton</star>
  </stars> 
  <plot>plot...</plot> 
  <keywords>
    <keyword>Reviews</keyword>
    <keyword>Showtimes</keyword>
  </keywords>
</movie>

Note the lowercase tag names and attributes (<movieNum = 0> doesn't make sense). You will also want an XML declaration (like <?xml version="1.0" encoding="UTF-8" ?>) at the top. You can validate your XML at XML Validation, or using xmllint, for example.

Once you have valid XML, you can parse it and iterate over it using iterparse(), or parse it and then iterate over the constructed element tree.

pepr · Accepted Answer · 2012-04-28 12:32:58Z

The BeutifulSoup is more forgiving and it can be also used for HTML (where some enclosing tags are optional). The ElementTree can be used only if the XML is valid. You can make it partly valid by wrapping the fragment to a single element. The attribute values must be enclosed in quotes. Try the following approach where the Movie class was created to capture the information from one movie element. The class is derived from dict to be as flexible as dict; however, you can create your own methods to return processed values from the collected information:

# -*- coding: utf-8 -*-
import xml.etree.ElementTree as ET

class Movie(dict):

    def __init__(self, movie_element):
        assert movie_element.tag == 'movie'  # we are able to process only that
        self['id'] = movie_element.attrib['id']  
        for e in movie_element:
            self[e.tag] = e.text.strip()

    def name(self):
        return self['Movie_name']

    def url(self):
        return self['Address']

    def year(self):
        return self['year']     

    def stars(self):
        return self['stars']

    def plot(self):
        return self['plot']

    def keywords(self):
        return self['keywords']

    def __str__(self):
        lst = []
        lst.append(self.name() + ' (' + self.year() + ')')
        lst.append(self.stars())
        lst.append(self.url())
        return '\n'.join(lst)


fragment = '''\
<movie id = "0"> 
  <Movie_name>The Shawshank Redemption   </Movie_name> 
  <Address>http://www.imdb.com/title/tt0111161/
  </Address> 
  <year>1994  </year> 
  <stars>Tim Robbins  Morgan Freeman  Bob Gunton    </stars> 
  <plot> plot...
  </plot> 
  <keywords>Reviews, Showtimes</keywords>
</movie>

<movie id = "1"> 
  <Movie_name>Inglourious Basterds   </Movie_name> 
  <Address>http://www.imdb.com/title/tt0361748/
  </Address> 
  <year>2009  </year> 
  <stars>Brad Pitt  Melanie Laurent  Christoph Waltz    </stars> 
  <plot>plot/... 
  </plot> 
  <keywords>Reviews, credits  </keywords>
</movie>
'''

fixed_fragment = '<root>\n' + fragment + '</root>'
##print fixed_fragment

tree = ET.fromstring(fixed_fragment)
movies = []
for m in tree:
    movies.append(Movie(m))

for movie in movies:
    print '\n------------------'
    print movie

It prints on my console:

------------------
The Shawshank Redemption (1994)
Tim Robbins  Morgan Freeman  Bob Gunton
http://www.imdb.com/title/tt0111161/

------------------
Inglourious Basterds (2009)
Brad Pitt  Melanie Laurent  Christoph Waltz
http://www.imdb.com/title/tt0361748/

Notice that I have replaced the non-ASCII characters -- the problem with encoding is to be solved separately.

Collectives™ on Stack Overflow

Iterating an XML file and extracting data from it

3 Answers 3

Comments

Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

Comments

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related