0

I have some XML that I want to parse. I want to use ElementTree rather than BeautifulSoup, because I am having issues with the way the latter handles the XML.

I wish to extract the text from the following:

  • Abstract/AbstractText
  • ArticleId when IdType="pmc"
  • PublicationType: the 'UI' attribute value first, then the corresponding text

Which functions of ElementTree do I use to do the work?

I have tried .attrib, .attrib.get(), .iter and .attrib[key], but none of them has given me the actual text.

<PubmedArticleSet>
   <PubmedArticle>
       <PMID Version="1">10890875</PMID>
       <Journal>
           <ISSN IssnType="Print">0143-005X</ISSN>
            <Title>Journal of epidemiology and community health</Title>
       </Journal>
       <ArticleTitle>Sources of influence on medical practice. 
       </ArticleTitle>
       <Abstract>
          <AbstractText Label="OBJECTIVES" NlmCategory="OBJECTIVE">
             To explore the opinion of general practitioners on the 
             importance and legitimacy of sources of influence on 
             medical practice.
          </AbstractText>
          <AbstractText Label="METHODS" NlmCategory="METHODS">
             General practitioners (n=723) assigned to Primary Care 
             Teams (PCTs) in two Spanish regions were randomly selected 
             to participate in this study. 
          </AbstractText>
          <AbstractText Label="RESULTS" NlmCategory="RESULTS">
The most important and legitimate sources of influence according to general practitioners were: training courses and scientific articles, designing self developed protocols and discussing with colleagues. 
          </AbstractText>
          <AbstractText Label="CONCLUSIONS" NlmCategory="CONCLUSIONS">
The development of medical practice is determined by many factors, grouped around three big areas: organisational setting, professional system and social setting. </AbstractText>
        </Abstract>
        <Language>eng</Language>
        <PublicationTypeList>
           <PublicationType UI="D016428">Journal Article 
           </PublicationType>
           <PublicationType UI="D013485">Research Support, Non-U.S.Gov't </PublicationType>
        </PublicationTypeList>
    <PubmedData>
         <PublicationStatus>ppublish</PublicationStatus>
         <ArticleIdList>
            <ArticleId IdType="pubmed">10890875</ArticleId>
            <ArticleId IdType="pmc">PMC1731730</ArticleId>
         </ArticleIdList>
     </PubmedData>
   </PubmedArticle>
</PubmedArticleSet>

The result I am hoping for is every "Label" of AbstractText together with the text for that "Label".

1
  • Can you please add an example of your desired output? Commented Apr 24, 2019 at 12:07

2 Answers

2

Try the following code, which uses BeautifulSoup with CSS selectors.

from bs4 import BeautifulSoup

html='''<PubmedArticleSet>
   <PubmedArticle>
       <PMID Version="1">10890875</PMID>
       <Journal>
           <ISSN IssnType="Print">0143-005X</ISSN>
            <Title>Journal of epidemiology and community health</Title>
       </Journal>
       <ArticleTitle>Sources of influence on medical practice. 
       </ArticleTitle>
       <Abstract>
          <AbstractText Label="OBJECTIVES" NlmCategory="OBJECTIVE">
             To explore the opinion of general practitioners on the 
             importance and legitimacy of sources of influence on 
             medical practice.
          </AbstractText>
          <AbstractText Label="METHODS" NlmCategory="METHODS">
             General practitioners (n=723) assigned to Primary Care 
             Teams (PCTs) in two Spanish regions were randomly selected 
             to participate in this study. 
          </AbstractText>
          <AbstractText Label="RESULTS" NlmCategory="RESULTS">
The most important and legitimate sources of influence according to general practitioners were: training courses and scientific articles, designing self developed protocols and discussing with colleagues. 
          </AbstractText>
          <AbstractText Label="CONCLUSIONS" NlmCategory="CONCLUSIONS">
The development of medical practice is determined by many factors, grouped around three big areas: organisational setting, professional system and social setting. </AbstractText>
        </Abstract>
        <Language>eng</Language>
        <PublicationTypeList>
           <PublicationType UI="D016428">Journal Article 
           </PublicationType>
           <PublicationType UI="D013485">Research Support, Non-U.S.Gov't </PublicationType>
        </PublicationTypeList>
    <PubmedData>
         <PublicationStatus>ppublish</PublicationStatus>
         <ArticleIdList>
            <ArticleId IdType="pubmed">10890875</ArticleId>
            <ArticleId IdType="pmc">PMC1731730</ArticleId>
         </ArticleIdList>
     </PubmedData>
   </PubmedArticle>
</PubmedArticleSet>'''

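# Note: 'lxml' is BeautifulSoup's HTML parser; the selectors below still match the XML tags, as the output shows
soup = BeautifulSoup(html, 'lxml')

# Print the text of every AbstractText inside the Abstract element
maintag = soup.select_one('Abstract')
for childtag in maintag.select('AbstractText'):
    print(childtag.text.strip())

# Print the text of the ArticleId whose IdType attribute is "pmc"
print(soup.select_one('ArticleId[IdType="pmc"]').text)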

Output:

To explore the opinion of general practitioners on the 
             importance and legitimacy of sources of influence on 
             medical practice.
General practitioners (n=723) assigned to Primary Care 
             Teams (PCTs) in two Spanish regions were randomly selected 
             to participate in this study.
The most important and legitimate sources of influence according to general practitioners were: training courses and scientific articles, designing self developed protocols and discussing with colleagues.
The development of medical practice is determined by many factors, grouped around three big areas: organisational setting, professional system and social setting.
PMC1731730

0

In general, I have found the .find() method very useful for searching XML files that have been parsed with ElementTree. For whatever you find, you can then use element.text, element.attrib and element.tag to get the text, a dictionary of attributes and the element name, respectively.

Combine that with a list comprehension and you should have what you're looking for.

As an example, let's say you have the XML file saved as 'publications.xml':

import xml.etree.ElementTree as ET

filename = 'publications.xml'
content = ET.parse(filename)
root = content.getroot()

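# findall() returns every matching AbstractText element (an empty list if none match)
abstracts = [a.text for a in root.findall('PubmedArticle/Abstract/AbstractText')]

will give you a list with the text of the four AbstractText elements (including the surrounding whitespace, so you may want to call .strip() on each item).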
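Since the question asks for each Label together with its text, here is a minimal sketch of that step. It assumes a trimmed and shortened copy of the question's XML inlined as a string so it can run on its own; the names xml_text and abstract_by_label are only illustrative.

```python
import xml.etree.ElementTree as ET

# Trimmed, shortened copy of the question's XML (an assumption for this sketch).
xml_text = """<PubmedArticleSet>
  <PubmedArticle>
    <Abstract>
      <AbstractText Label="OBJECTIVES" NlmCategory="OBJECTIVE">
         To explore the opinion of general practitioners.
      </AbstractText>
      <AbstractText Label="METHODS" NlmCategory="METHODS">
         General practitioners (n=723) were randomly selected.
      </AbstractText>
    </Abstract>
  </PubmedArticle>
</PubmedArticleSet>"""

root = ET.fromstring(xml_text)

abstract_by_label = {}
for node in root.findall("PubmedArticle/Abstract/AbstractText"):
    # .get() reads an attribute; it returns None if the attribute is missing.
    label = node.get("Label")
    # itertext() also picks up text inside nested tags; split/join collapses
    # the line breaks and indentation into single spaces.
    text = " ".join("".join(node.itertext()).split())
    abstract_by_label[label] = text

for label, text in abstract_by_label.items():
    print(f"{label}: {text}")
```

With a file on disk, ET.parse(filename).getroot() can take the place of ET.fromstring(xml_text).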

Accessing all the IDs can be done in a similar way, adding a check for the correct IdType. Using the method above, you can get the list of all elements named 'ArticleId' (in this file they sit under PubmedArticle/PubmedData/ArticleIdList) and then access the IdType using

element.attrib['IdType']

for each element in the given list.
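A minimal sketch of that check, again assuming a trimmed copy of the question's XML inlined as a string (xml_text, pmc_ids and pmc_node are illustrative names). It shows two ways to do it: filtering in Python, and using the attribute predicate that ElementTree's limited XPath support accepts.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the question's XML (an assumption for this sketch).
xml_text = """<PubmedArticleSet>
  <PubmedArticle>
    <PubmedData>
      <ArticleIdList>
        <ArticleId IdType="pubmed">10890875</ArticleId>
        <ArticleId IdType="pmc">PMC1731730</ArticleId>
      </ArticleIdList>
    </PubmedData>
  </PubmedArticle>
</PubmedArticleSet>"""

root = ET.fromstring(xml_text)

# Option 1: walk every ArticleId element and filter on the attribute in Python.
pmc_ids = [el.text for el in root.iter("ArticleId") if el.get("IdType") == "pmc"]

# Option 2: let the path expression do the filtering with an attribute predicate.
pmc_node = root.find(".//ArticleId[@IdType='pmc']")
pmc_id = pmc_node.text if pmc_node is not None else None  # find() returns None on no match

print(pmc_ids)
print(pmc_id)
```

Using el.get("IdType") rather than el.attrib["IdType"] avoids a KeyError if some ArticleId element lacks the attribute.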

For the last request, I'm not entirely sure what you mean by retrieving UI-value first. If you just want to make sure that you retrieve both values, you can loop through all the elements in

root.find('PubmedArticle/PublicationTypeList')

and save both element.attrib['UI'] and element.text
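That loop might look roughly like the sketch below, which reads the UI attribute first and then the text of each PublicationType. It assumes a trimmed copy of the question's XML inlined as a string; xml_text and publication_types are illustrative names.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the question's XML (an assumption for this sketch).
xml_text = """<PubmedArticleSet>
  <PubmedArticle>
    <PublicationTypeList>
      <PublicationType UI="D016428">Journal Article
      </PublicationType>
      <PublicationType UI="D013485">Research Support, Non-U.S.Gov't </PublicationType>
    </PublicationTypeList>
  </PubmedArticle>
</PubmedArticleSet>"""

root = ET.fromstring(xml_text)

publication_types = []
for el in root.findall("PubmedArticle/PublicationTypeList/PublicationType"):
    ui = el.get("UI")                # attribute value first
    name = (el.text or "").strip()   # then the text, without surrounding whitespace
    publication_types.append((ui, name))

for ui, name in publication_types:
    print(ui, name)
```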
