Python: extract text from tag inside tag in XML Tree

Question

I am currently parsing a Wikipedia dump, trying to extract some useful information. The dump is in XML, and I want to extract only the text content of each page. Now I'm wondering how to find all the text inside a tag that is itself inside another tag. I searched for similar questions, but only found ones dealing with a single tag. Here is an example of what I want to achieve:

  <revision>
    <timestamp>2001-01-15T13:15:00Z</timestamp>
    <contributor>
      <username>Foobar</username>
      <id>65536</id>
    </contributor>
    <comment>I have just one thing to say!</comment>
    <text>A bunch of [[text]] here.</text>
    <minor />
  </revision>

  <example_tag>
    <timestamp>2001-01-15T13:15:00Z</timestamp>
    <contributor>
      <username>Foobar</username>
      <id>65536</id>
    </contributor>
    <comment>I have just one thing to say!</comment>
    <text>A bunch of [[text]] here.</text>
    <minor />
  </example_tag>

How can I extract the text inside the text tag, but only when it is included in the revision tree?

willeM_ Van Onsem · Accepted Answer · 2017-03-17 10:48:49Z

You can use the xml.etree.ElementTree module for that, with an XPath query (ElementTree supports a subset of XPath):

import xml.etree.ElementTree as ET

root = ET.fromstring(the_xml_string)
for content in root.findall('.//revision/othertag'):
    # ... process content, for instance
    print(content.text)

(where the_xml_string is a string containing the XML code).
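As a minimal sketch of the loop above applied to the sample from the question: the fragment in the question has two top-level elements, so here it is wrapped in a hypothetical <page> root to make it a single well-formed document, and othertag is replaced with text. The wrapper element and the shortened sample data are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# Sample based on the question, wrapped in a hypothetical <page> root
# (an assumption) so that it parses as one well-formed document.
the_xml_string = """<page>
  <revision>
    <timestamp>2001-01-15T13:15:00Z</timestamp>
    <comment>I have just one thing to say!</comment>
    <text>A bunch of [[text]] here.</text>
    <minor />
  </revision>
  <example_tag>
    <comment>I have just one thing to say!</comment>
    <text>Text outside any revision.</text>
  </example_tag>
</page>"""

root = ET.fromstring(the_xml_string)

# Collect <text> elements that are direct children of a <revision>;
# the <text> inside <example_tag> is not matched.
revision_texts = [element.text for element in root.findall('.//revision/text')]
print(revision_texts)
```

Note that .//revision searches beneath root, so it would not match if root itself were the revision element.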

Or obtain a list of the text contents with a list comprehension:

import xml.etree.ElementTree as ET

root = ET.fromstring(the_xml_string)
texts = [content.text for content in root.findall('.//revision/othertag')]

The .text attribute of each matched element holds its inner text. Note that you will have to replace othertag with the actual tag name (for instance text). If that tag can be nested arbitrarily deep inside the revision tag, use .//revision//othertag as the XPath query instead.
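To illustrate the difference between the two queries, here is a small sketch on a made-up document (the <page> and <section> elements and all text values are hypothetical, chosen only to show one direct child and one deeper descendant):

```python
import xml.etree.ElementTree as ET

# Hypothetical document: one <text> is a direct child of <revision>,
# another is nested one level deeper, and a third is outside any <revision>.
nested_xml = """<page>
  <revision>
    <text>direct child</text>
    <section>
      <text>nested deeper</text>
    </section>
  </revision>
  <example_tag>
    <text>not in a revision</text>
  </example_tag>
</page>"""

root = ET.fromstring(nested_xml)

# Single slash: only <text> elements that are direct children of <revision>.
direct = [element.text for element in root.findall('.//revision/text')]

# Double slash: <text> elements at any depth beneath <revision>.
anywhere = [element.text for element in root.findall('.//revision//text')]

print(direct)    # ['direct child']
print(anywhere)  # ['direct child', 'nested deeper']
```

In both cases the text inside example_tag is left out, because the path requires a revision ancestor.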
