Extract all attributes of an element from XML in Python

Question

I have multiple XML files containing tweets in a format similar to the one below:

<tweet idtweet='xxxxxxx'> 
    <topic>#irony</topic> 
    <date>20171109T03:39</date> 
    <hashtag>#irony</hashtag> 
    <irony>1</irony> 
    <emoji>Laughing with tears</emoji> 
    <nbreponse>0</nbreponse> 
    <nbretweet>0</nbretweet> 
    <textbrut> Some text here <img class="Emoji Emoji--forText" src="source.png" draggable="false" alt="😁" title="Laughing with tears" aria-label="Emoji: Laughing with tears"></img> #irony </textbrut> 
    <text>Some text here #irony </text> 
</tweet>

There is a problem with the way the files were created (the closing tag for img is missing), so I chose to close it as in the example above. I know that in HTML you can close it as

<img **something here** />

but I don't know if this holds for XML, as I didn't see it anywhere.

I'm writing Python code that extracts the topic and the plain text, but I am also interested in all the attributes of the img element, and I can't seem to get them. Here is what I've tried so far:

top = []
txt = []
emj = []

for article in root:
    topic = article.find('.topic')
    textbrut = article.find('.textbrut')

    emoji = article.find('.img')
    everything = textbrut.attrib

    if topic is not None and textbrut is not None:
            top.append(topic.text)
            txt.append(textbrut.text)

            x = list(everything.items())
            emj.append(x)

Any help would be greatly appreciated.

Out of curiosity, which XML parser are you using? lxml or something else?
– Basj, Commented Oct 21, 2019 at 12:04
@Basj I'm using ElementTree. I'm quite new to this... Is there a problem with it?
– patri, Commented Oct 21, 2019 at 12:11

patri · Accepted Answer · 2019-10-21 12:16:27Z


Apparently, Element has some useful methods (such as Element.iter()) that iterate recursively over the whole sub-tree below it (its children, their children, and so on). Here is the solution that worked for me:

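new = []  # one list of (attribute, value) pairs per img element

# root is the root element of the parsed XML tree
for emoji in root.iter('img'):
    print(emoji.attrib)
    everything = emoji.attrib
    x = list(everything.items())
    new.append(x)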

For more details read here.
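
To keep each tweet's img attributes together with its own topic and text, iter() can also be called on each tweet element rather than on root. This is only a minimal sketch: the `<tweets>` wrapper element and the shortened sample data are assumptions made for illustration, not the real file layout.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened sample; the <tweets> wrapper is an assumption.
sample = """<tweets>
<tweet idtweet="1">
    <topic>#irony</topic>
    <textbrut> Some text here <img src="source.png" title="Laughing with tears"></img> #irony </textbrut>
</tweet>
<tweet idtweet="2">
    <topic>#sarcasm</topic>
    <textbrut>No emoji in this one</textbrut>
</tweet>
</tweets>"""

root = ET.fromstring(sample)

rows = []
for tweet in root.iter('tweet'):
    topic = tweet.findtext('topic')
    textbrut = tweet.find('textbrut')
    # itertext() yields the text of textbrut and of everything nested in it,
    # so the words before and after <img> are both included.
    raw_text = ''.join(textbrut.itertext()) if textbrut is not None else None
    # One dict of attributes per <img> found anywhere below this tweet.
    emojis = [dict(img.attrib) for img in tweet.iter('img')]
    rows.append((topic, raw_text, emojis))

for row in rows:
    print(row)
```

A plain tag name passed to find() only matches direct children, so an img nested inside textbrut has to be reached with iter('img') or a path such as './/img'. Tweets without an img simply get an empty list here.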

edited Oct 21, 2019 at 12:16

answered Oct 21, 2019 at 12:00

patri


balderman · Accepted Answer · 2019-10-21 13:55:23Z

The code below builds one dictionary per tweet with its topic, text, and img attributes:

import xml.etree.ElementTree as ET

xml = '''<t><tweet idtweet='xxxxxxx'> 
    <topic>#irony</topic> 
    <date>20171109T03:39</date> 
    <hashtag>#irony</hashtag> 
    <irony>1</irony> 
    <emoji>Laughing with tears</emoji> 
    <nbreponse>0</nbreponse> 
    <nbretweet>0</nbretweet> 
    <textbrut> Some text here <img class="Emoji Emoji--forText" src="source.png" draggable="false" alt="😁" title="Laughing with tears" aria-label="Emoji: Laughing with tears"></img> #irony </textbrut> 
    <text>Some text here #irony </text> 
</tweet></t>'''

root = ET.fromstring(xml)
data = []
for tweet in root.findall('.//tweet'):
    data.append({'topic': tweet.find('./topic').text, 'text': tweet.find('./text').text,
                 'img_attributes': tweet.find('.//img').attrib})
print(data)

output

[{'topic': '#irony', 'text': 'Some text here #irony ', 'img_attributes': {'class': 'Emoji Emoji--forText', 'src': 'source.png', 'draggable': 'false', 'alt': '😁', 'title': 'Laughing with tears', 'aria-label': 'Emoji: Laughing with tears'}}]
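
Two caveats, sketched here under stated assumptions. First, XML does accept the self-closing form `<img ... />` that the question asks about, and ElementTree parses it the same way as `<img ...></img>`. Second, `tweet.find('.//img')` returns None for a tweet that has no image, so reading `.attrib` from it would raise an AttributeError. The sample data and the `img_attributes` helper name are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: one self-closed <img/>, one tweet with no image at all.
sample_xml = '''<t>
<tweet idtweet="a">
    <topic>#irony</topic>
    <textbrut>Some text <img src="source.png" alt="grin"/> #irony</textbrut>
</tweet>
<tweet idtweet="b">
    <topic>#irony</topic>
    <textbrut>Plain text only</textbrut>
</tweet>
</t>'''


def img_attributes(tweet):
    """Return the attributes of the first <img> below tweet, or {} if none."""
    img = tweet.find('.//img')
    return dict(img.attrib) if img is not None else {}


root = ET.fromstring(sample_xml)
data = [
    {
        'id': tweet.get('idtweet'),
        'topic': tweet.findtext('topic'),
        'img_attributes': img_attributes(tweet),
    }
    for tweet in root.findall('.//tweet')
]
print(data)
```

Returning an empty dict for a missing img keeps every row the same shape. Whether that or None is the better default depends on how the data is used afterwards.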
