0

I have multiple XML files containing tweets in a format similar to the one below:

<tweet idtweet='xxxxxxx'> 
    <topic>#irony</topic> 
    <date>20171109T03:39</date> 
    <hashtag>#irony</hashtag> 
    <irony>1</irony> 
    <emoji>Laughing with tears</emoji> 
    <nbreponse>0</nbreponse> 
    <nbretweet>0</nbretweet> 
    <textbrut> Some text here <img class="Emoji Emoji--forText" src="source.png" draggable="false" alt="😁" title="Laughing with tears" aria-label="Emoji: Laughing with tears"></img> #irony </textbrut> 
    <text>Some text here #irony </text> 
</tweet>

There is a problem with the way the files were created (the closing tag for img is missing) so I made the choice of closing it as in the above example. I know that in HTML you can close it as

<img **something here** /> 

but I don't know if this holds for XML, as I didn't see it anywhere.

I'm writing a python code that extracts the topic and the plain text, but I am also interested in all the attributes contained by img and I don't seem able to do it. Here is what I've tried so far:

top = []
txt = []
emj = []

for article in root:
    topic = article.find('.topic')
    textbrut = article.find('.textbrut')

    emoji = article.find('.img')
    everything = textbrut.attrib

    if topic is not None and textbrut is not None:
            top.append(topic.text)
            txt.append(textbrut.text)

            x = list(everything.items())
            emj.append(x)

Any help would be greatly appreciated.

2
  • Out of curiosity, which XML parser are you using? lxml or something else? Commented Oct 21, 2019 at 12:04
  • @Basj I'm using ElementTree. I'm quite new to this... Is there a problem with it? Commented Oct 21, 2019 at 12:11

2 Answers 2

1

Apparently, Element has some useful methods (such as Element.iter()) that help iterate recursively over all the sub-tree below it (its children, their children,...). So here is the solution that worked for me:

for emoji in root.iter('img'):
    print(emoji.attrib)
    everything = emoji.attrib
    x = list(everything.items())
    new.append(x)

For more details read here.

Sign up to request clarification or add additional context in comments.

Comments

0

Below

import xml.etree.ElementTree as ET

xml = '''<t><tweet idtweet='xxxxxxx'> 
    <topic>#irony</topic> 
    <date>20171109T03:39</date> 
    <hashtag>#irony</hashtag> 
    <irony>1</irony> 
    <emoji>Laughing with tears</emoji> 
    <nbreponse>0</nbreponse> 
    <nbretweet>0</nbretweet> 
    <textbrut> Some text here <img class="Emoji Emoji--forText" src="source.png" draggable="false" alt="😁" title="Laughing with tears" aria-label="Emoji: Laughing with tears"></img> #irony </textbrut> 
    <text>Some text here #irony </text> 
</tweet></t>'''

root = ET.fromstring(xml)
data = []
for tweet in root.findall('.//tweet'):
    data.append({'topic': tweet.find('./topic').text, 'text': tweet.find('./text').text,
                 'img_attributes': tweet.find('.//img').attrib})
print(data)

output

[{'topic': '#irony', 'text': 'Some text here #irony ', 'img_attributes': {'class': 'Emoji Emoji--forText', 'src': 'source.png', 'draggable': 'false', 'alt': '😁', 'title': 'Laughing with tears', 'aria-label': 'Emoji: Laughing with tears'}}]

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.