I am trying to extract some information from a TEI file using this code:

import xml.etree.ElementTree as ET

tree = ET.parse(path)
root = tree.getroot()
body = root.find("{http://www.tei-c.org/ns/1.0}text/{http://www.tei-c.org/ns/1.0}body")  
for s in body.iter("{http://www.tei-c.org/ns/1.0}s"):
    for w in s.iter("{http://www.tei-c.org/ns/1.0}w"):
        wordpart = w.find("{http://www.tei-c.org/ns/1.0}seg")
        word = ''.join(wordpart.itertext())
        type = w.get('type')
        xml = w.get('xml:id') 
        print(type)             
        print(xml)

The output for type is correct: it prints, for example, "noun". For xml:id, however, I only ever get None. This is an extract of the XML file I need to parse:

<w type="noun" xml:id="w.4940"><seg type="orth">sloterheighe</seg>...
  • Why are there two quotation marks at the end of xml:id="w.4940""? Commented Apr 30, 2019 at 10:33
  • Small mistake, I edited it, thank you Commented Apr 30, 2019 at 12:06

1 Answer

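ElementTree stores a namespaced attribute name as {namespace-URI}localname, and the xml prefix is always bound to http://www.w3.org/XML/1998/namespace. So to get the value of the xml:id attribute, you need to spell out the namespace URI like this (see this SO post for more details):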

xml = w.attrib['{http://www.w3.org/XML/1998/namespace}id']

or

xml = w.get('{http://www.w3.org/XML/1998/namespace}id')

Also, note that type is a built-in in Python. Using it as a variable name shadows the built-in for the rest of that scope, so choose a different name, for example word_type.
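Putting both points together, here is a minimal, self-contained sketch of the lookup. It assumes the standard-library xml.etree.ElementTree; the SAMPLE string is a hypothetical stand-in built around the extract from the question (the closing tags and the surrounding body and s elements are my own assumption), and TEI_NS, XML_NS and extract_words are illustrative names, not part of any library.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_NS = "http://www.w3.org/XML/1998/namespace"  # fixed URI of the xml: prefix

# Hypothetical minimal document; only the <w> element comes from the question.
SAMPLE = (
    '<body xmlns="http://www.tei-c.org/ns/1.0">'
    '<s><w type="noun" xml:id="w.4940">'
    '<seg type="orth">sloterheighe</seg></w></s>'
    '</body>'
)


def extract_words(body):
    """Return (text, type, xml:id) for every <w> element below body."""
    rows = []
    for w in body.iter(f"{{{TEI_NS}}}w"):
        seg = w.find(f"{{{TEI_NS}}}seg")
        # Guard against <w> elements that have no <seg> child.
        text = "".join(seg.itertext()) if seg is not None else ""
        word_type = w.get("type")          # un-namespaced attribute
        xml_id = w.get(f"{{{XML_NS}}}id")  # namespaced attribute
        rows.append((text, word_type, xml_id))
    return rows


body = ET.fromstring(SAMPLE)
words = extract_words(body)
print(words)  # [('sloterheighe', 'noun', 'w.4940')]

# The prefixed spelling does not match the stored attribute name.
print(body.find(f".//{{{TEI_NS}}}w").get("xml:id"))  # None
```

If you are unsure how an attribute name is stored, printing w.attrib shows the keys exactly as ElementTree holds them.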
