I have multiple XML files containing tweets in a format similar to the one below:
<tweet idtweet='xxxxxxx'>
<topic>#irony</topic>
<date>20171109T03:39</date>
<hashtag>#irony</hashtag>
<irony>1</irony>
<emoji>Laughing with tears</emoji>
<nbreponse>0</nbreponse>
<nbretweet>0</nbretweet>
<textbrut> Some text here <img class="Emoji Emoji--forText" src="source.png" draggable="false" alt="😁" title="Laughing with tears" aria-label="Emoji: Laughing with tears"></img> #irony </textbrut>
<text>Some text here #irony </text>
</tweet>
The files were generated incorrectly (the closing tag for img is missing), so I chose to close it as in the example above. I know that in HTML you can also write it as
<img **something here** />
but I don't know whether this is also valid in XML, as I haven't seen it documented anywhere.
I'm writing Python code that extracts the topic and the plain text, but I also want all the attributes of the img element, and I can't seem to get them. Here is what I've tried so far:
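One quick way to check this is to parse both forms and compare the results. The sketch below assumes the standard-library xml.etree.ElementTree parser; the two sample strings are made up for illustration. XML does define an empty-element tag (`<img ... />`), so both forms should parse to the same element:

```python
import xml.etree.ElementTree as ET

# The same element written two ways: explicit closing tag vs. self-closing form.
# Both strings are illustrative samples, not taken from the real files.
explicit = '<img src="source.png" alt="x"></img>'
self_closing = '<img src="source.png" alt="x" />'

a = ET.fromstring(explicit)
b = ET.fromstring(self_closing)

# Both parse to an empty element with the same tag and the same attributes.
same = (a.tag == b.tag) and (a.attrib == b.attrib)
print(same)
```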
top = []
txt = []
emj = []
for article in root:
    topic = article.find('.topic')
    textbrut = article.find('.textbrut')
    emoji = article.find('.img')
    everything = textbrut.attrib
    if topic is not None and textbrut is not None:
        top.append(topic.text)
        txt.append(textbrut.text)
        x = list(everything.items())
        emj.append(x)
Any help would be greatly appreciated.
lxml or something else?
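For what it's worth, here is a minimal sketch of one way this might work, assuming the standard-library xml.etree.ElementTree rather than lxml. The `<tweets>` root wrapper and the inline `sample` string are assumptions made for illustration; real files would be loaded with ET.parse. The idea is that img is a child of textbrut, not of tweet, so it has to be searched for inside textbrut, and the attributes come from the img element itself rather than from textbrut:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: one tweet wrapped in an assumed <tweets> root element.
sample = """<tweets>
<tweet idtweet='1'>
<topic>#irony</topic>
<textbrut> Some text here <img class="Emoji Emoji--forText" src="source.png" draggable="false" alt="😁" title="Laughing with tears" aria-label="Emoji: Laughing with tears"></img> #irony </textbrut>
<text>Some text here #irony </text>
</tweet>
</tweets>"""

root = ET.fromstring(sample)

top, txt, emj = [], [], []
for article in root:
    topic = article.find('topic')
    textbrut = article.find('textbrut')
    # Skip tweets that lack either element before touching their contents.
    if topic is None or textbrut is None:
        continue
    top.append(topic.text)
    # itertext() yields the text before and after nested elements such as <img>,
    # whereas .text alone stops at the first child element.
    txt.append(''.join(textbrut.itertext()).strip())
    # Collect the attributes of every <img> nested inside textbrut.
    emj.append([dict(img.attrib) for img in textbrut.iter('img')])

print(top)
print(emj)
```

Each entry of emj is a list with one dict per img element, so tweets with several emoji, or none, are handled the same way.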