Parse xml in Python 3.x

Question

I have some XML I wish to parse. I want to use ElementTree rather than BeautifulSoup, because I am having issues with the way BeautifulSoup handles the XML.

I wish to extract the text from the following:

Abstract/AbstractText
ArticleId when IdType="pmc"
PublicationType, retrieving the 'UI' attribute value first and then the corresponding text

Which functions of ElementTree do I use to do the work?

I have tried .attrib, .attrib.get(), .iter() and .attrib[key], but none of them gave me the actual text.

<PubmedArticleSet>
   <PubmedArticle>
       <PMID Version="1">10890875</PMID>
       <Journal>
           <ISSN IssnType="Print">0143-005X</ISSN>
            <Title>Journal of epidemiology and community health</Title>
       </Journal>
       <ArticleTitle>Sources of influence on medical practice. 
       </ArticleTitle>
       <Abstract>
          <AbstractText Label="OBJECTIVES" NlmCategory="OBJECTIVE">
             To explore the opinion of general practitioners on the 
             importance and legitimacy of sources of influence on 
             medical practice.
          </AbstractText>
          <AbstractText Label="METHODS" NlmCategory="METHODS">
             General practitioners (n=723) assigned to Primary Care 
             Teams (PCTs) in two Spanish regions were randomly selected 
             to participate in this study. 
          </AbstractText>
          <AbstractText Label="RESULTS" NlmCategory="RESULTS">
The most important and legitimate sources of influence according to general practitioners were: training courses and scientific articles, designing self developed protocols and discussing with colleagues. 
          </AbstractText>
          <AbstractText Label="CONCLUSIONS" NlmCategory="CONCLUSIONS">
The development of medical practice is determined by many factors, grouped around three big areas: organisational setting, professional system and social setting. </AbstractText>
        </Abstract>
        <Language>eng</Language>
        <PublicationTypeList>
           <PublicationType UI="D016428">Journal Article 
           </PublicationType>
           <PublicationType UI="D013485">Research Support, Non-U.S.Gov't </PublicationType>
        </PublicationTypeList>
    <PubmedData>
         <PublicationStatus>ppublish</PublicationStatus>
         <ArticleIdList>
            <ArticleId IdType="pubmed">10890875</ArticleId>
            <ArticleId IdType="pmc">PMC1731730</ArticleId>
         </ArticleIdList>
     </PubmedData>
   </PubmedArticle>
</PubmedArticleSet>

What I am hoping to get as a result is every "Label" of AbstractText, together with the text for that label.

Can you please add an example of your desired output?

– Daniel Haley, Commented Apr 24, 2019 at 12:07

KunduK · Accepted Answer · 2019-04-24 08:42:31Z

Try the following code, which uses CSS selectors.

from bs4 import BeautifulSoup

html='''<PubmedArticleSet>
   <PubmedArticle>
       <PMID Version="1">10890875</PMID>
       <Journal>
           <ISSN IssnType="Print">0143-005X</ISSN>
            <Title>Journal of epidemiology and community health</Title>
       </Journal>
       <ArticleTitle>Sources of influence on medical practice. 
       </ArticleTitle>
       <Abstract>
          <AbstractText Label="OBJECTIVES" NlmCategory="OBJECTIVE">
             To explore the opinion of general practitioners on the 
             importance and legitimacy of sources of influence on 
             medical practice.
          </AbstractText>
          <AbstractText Label="METHODS" NlmCategory="METHODS">
             General practitioners (n=723) assigned to Primary Care 
             Teams (PCTs) in two Spanish regions were randomly selected 
             to participate in this study. 
          </AbstractText>
          <AbstractText Label="RESULTS" NlmCategory="RESULTS">
The most important and legitimate sources of influence according to general practitioners were: training courses and scientific articles, designing self developed protocols and discussing with colleagues. 
          </AbstractText>
          <AbstractText Label="CONCLUSIONS" NlmCategory="CONCLUSIONS">
The development of medical practice is determined by many factors, grouped around three big areas: organisational setting, professional system and social setting. </AbstractText>
        </Abstract>
        <Language>eng</Language>
        <PublicationTypeList>
           <PublicationType UI="D016428">Journal Article 
           </PublicationType>
           <PublicationType UI="D013485">Research Support, Non-U.S.Gov't </PublicationType>
        </PublicationTypeList>
    <PubmedData>
         <PublicationStatus>ppublish</PublicationStatus>
         <ArticleIdList>
            <ArticleId IdType="pubmed">10890875</ArticleId>
            <ArticleId IdType="pmc">PMC1731730</ArticleId>
         </ArticleIdList>
     </PubmedData>
   </PubmedArticle>
</PubmedArticleSet>'''

soup = BeautifulSoup(html, 'lxml')

maintag=soup.select_one('Abstract')
for childtag in maintag.select('AbstractText'):
    print(childtag.text.strip())

print(soup.select_one('ArticleId[IdType="pmc"]').text)

Output:

To explore the opinion of general practitioners on the 
             importance and legitimacy of sources of influence on 
             medical practice.
General practitioners (n=723) assigned to Primary Care 
             Teams (PCTs) in two Spanish regions were randomly selected 
             to participate in this study.
The most important and legitimate sources of influence according to general practitioners were: training courses and scientific articles, designing self developed protocols and discussing with colleagues.
The development of medical practice is determined by many factors, grouped around three big areas: organisational setting, professional system and social setting.
PMC1731730

eandklahn · Accepted Answer · 2019-05-05 05:58:06Z

In general, I have found the .find() method very useful for navigating XML that has been parsed with ElementTree. For whatever element you find, element.text, element.attrib and element.tag give you its text, a dictionary of its attributes and its tag name, respectively.

Combine that with a list comprehension, and it sounds like that's what you're looking for.

As an example, let's say you have the XML file saved as 'publications.xml':

import xml.etree.ElementTree as ET

filename = 'publications.xml'
content = ET.parse(filename)
root = content.getroot()

abstracts = [a.text for a in root.find('PubmedArticle/Abstract')]

will give you a list of the text of the 4 AbstractText elements (iterating over an element yields its direct children).
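To pair each text with its "Label", as the question asks, the same idea can be extended with element.get(). The following is only a minimal sketch: it parses a shortened stand-in for the question's document from a string (xml_text and abstract_by_label are names made up for this illustration), and it collapses the internal whitespace, which the question does not explicitly ask for.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the question's document (assumed to have the same structure).
xml_text = """<PubmedArticleSet>
  <PubmedArticle>
    <Abstract>
      <AbstractText Label="OBJECTIVES" NlmCategory="OBJECTIVE">
        To explore the opinion of general practitioners.
      </AbstractText>
      <AbstractText Label="METHODS" NlmCategory="METHODS">
        General practitioners (n=723) were randomly selected.
      </AbstractText>
    </Abstract>
  </PubmedArticle>
</PubmedArticleSet>"""

root = ET.fromstring(xml_text)

# Map each Label attribute to the element's text, with runs of whitespace collapsed.
abstract_by_label = {
    node.get("Label"): " ".join("".join(node.itertext()).split())
    for node in root.iterfind("PubmedArticle/Abstract/AbstractText")
}

for label, text in abstract_by_label.items():
    print(f"{label}: {text}")
```

With a file on disk, ET.parse(filename).getroot() would take the place of ET.fromstring(xml_text).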

Accessing all the IDs can be done in a similar way, adding a check for the correct IdType. Using the method above, you can get the list of all elements named 'ArticleId' and then access the IdType using

element.attrib['IdType']

for each element in the given list.
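As a rough sketch of that check, here are two ways to pick out the "pmc" identifier: filtering in Python, or using the attribute predicate that ElementTree's limited XPath support provides. The sample string and the names xml_text, pmc_ids and pmc_id are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the question's document (assumed to have the same structure).
xml_text = """<PubmedArticleSet>
  <PubmedArticle>
    <PubmedData>
      <ArticleIdList>
        <ArticleId IdType="pubmed">10890875</ArticleId>
        <ArticleId IdType="pmc">PMC1731730</ArticleId>
      </ArticleIdList>
    </PubmedData>
  </PubmedArticle>
</PubmedArticleSet>"""

root = ET.fromstring(xml_text)

# Option 1: walk every ArticleId and keep those whose IdType is "pmc".
pmc_ids = [el.text for el in root.iter("ArticleId") if el.get("IdType") == "pmc"]

# Option 2: let the path expression do the filtering; find() returns None if nothing matches.
pmc_node = root.find(".//ArticleId[@IdType='pmc']")
pmc_id = pmc_node.text if pmc_node is not None else None

print(pmc_ids, pmc_id)
```

Using el.get("IdType") instead of el.attrib['IdType'] avoids a KeyError if an element happens to lack the attribute.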

For the last request, I'm not entirely sure what you mean by retrieving the UI value first. If you just want to make sure that you retrieve both values, you can loop through all the elements in

root.find('PubmedArticle/PublicationTypeList')

and save both element.attrib['UI'] and element.text
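One possible reading of that request, sketched below, is to collect (UI, text) pairs so that the UI value comes first in each pair. The sample string and the names xml_text and publication_types are hypothetical, and stripping the surrounding whitespace is an added assumption.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the question's document (assumed to have the same structure).
xml_text = """<PubmedArticleSet>
  <PubmedArticle>
    <PublicationTypeList>
      <PublicationType UI="D016428">Journal Article
      </PublicationType>
      <PublicationType UI="D013485">Research Support, Non-U.S.Gov't </PublicationType>
    </PublicationTypeList>
  </PubmedArticle>
</PubmedArticleSet>"""

root = ET.fromstring(xml_text)

# Build (UI, text) pairs; "or ''" guards against elements with no text at all.
publication_types = [
    (el.get("UI"), (el.text or "").strip())
    for el in root.iterfind("PubmedArticle/PublicationTypeList/PublicationType")
]

for ui, name in publication_types:
    print(ui, name)
```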
