
Hi all! I'm working on my first web scraper ever, which grabs author names, URLs, and paper titles from PMC when given a "CitedBy" page like this

My program works fine for getting the author names and the URLs; however, I can only get some of the paper titles, which I suspect is due to subscripts and superscripts.

Here's what I've got so far:

    import requests
    from bs4 import BeautifulSoup
    import re

    url = 'http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2593677/citedby/?page=0'
    req = requests.get(url)
    plain_text = req.text
    soup = BeautifulSoup(plain_text, "lxml") #soup object

    titles_list = []

    for items in soup.findAll('div', {'class': 'title'}):
        title = items.string
        if title is None:
            title = "UHOH"  # problems with some titles
        titles_list.append(title)

When I run this part of my code, my scraper gives me these results:

  1. Finding and Comparing Syntenic Regions among Arabidopsis and the Outgroups Papaya, Poplar, and Grape: CoGe with Rosids
  2. UHOH
  3. Comprehensive Comparative Genomic and Transcriptomic Analyses of the Legume Genes Controlling the Nodulation Process
  4. UHOH
  5. Dosage Sensitivity of RPL9 and Concerted Evolution of Ribosomal Protein Genes in Plants

And so on for the whole page...

Some papers on this page that I get "UHOH" for are:

  • Comparative cell-specific transcriptomics reveals differentiation of C4 photosynthesis pathways in switchgrass and other C4 lineages
  • The genome sequence of the outbreeding globe artichoke constructed de novo incorporating a phase-aware low-pass sequencing strategy of F1 progeny
  • Cross-Family Translational Genomics of Abiotic Stress-Responsive Genes between Arabidopsis and Medicago truncatula

The first two I've listed here are, I believe, problematic because "C4" and "F1" are actually "C subscript 4" and "F subscript 1". For the third one, "Medicago truncatula" is inside an "em" HTML tag, which I suspect is why my scraper cannot scrape it.
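Here's a minimal reproduction of what I think is happening (the markup below is my own stand-in, modeled on the structure of the PMC page):

```python
from bs4 import BeautifulSoup

# Stand-in markup (assumption): a title whose text is interrupted by a
# <sub> tag, like the C4 / F1 papers above
html = '<div class="title"><a>Photosynthesis in C<sub>4</sub> lineages</a></div>'
soup = BeautifulSoup(html, "html.parser")

link = soup.find('div', {'class': 'title'}).a
# The <a> tag now has several child nodes (text, <sub>, text),
# so .string returns None instead of the title
print(link.string)  # None
```

So any title with a nested tag inside it ends up as None, which is exactly where my "UHOH" placeholder appears.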

The only alternative solution I've thought of is making my "soup.findAll" more specific, but that didn't end up helping. I tried:

    for items in soup.findAll('div', {'class': 'title'}):
        title = items.string
        if title is None:
            for other in soup.findAll('a', {'class': 'view'}):
                title = other.string

But sadly, this didn't work... So I'm not exactly sure how to approach this. Does anybody know how to handle special cases like these? Thank you so much!

  • Since you're using the lxml parser, shouldn't you be able to just use node.text_content() on the div.title > a nodes? That should behave the same for your simple cases, and handle complex cases with nested elements gracefully. Commented Mar 15, 2016 at 22:26
  • Looks like for BeautifulSoup that would be node.get_text() - but the principle is the same. Commented Mar 15, 2016 at 22:33
  • @LukasGraf You solved my problem! Thank you! Commented Mar 15, 2016 at 22:35
  • You're welcome - I'm on the run so I just dropped a couple of pointers, but if you want to expand that into a full answer and self-accept, I'll be happy to upvote ;-) Commented Mar 15, 2016 at 22:37

1 Answer


Thanks to @LukasGraf, I have the answer!

Since I'm using BeautifulSoup, I can use node.get_text(). It works differently from plain .string because it also returns all the text beneath a tag, which is exactly what was needed for the subscripts and the "em"-tagged text.
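With that change, the scraping loop becomes the following (sketched against a local HTML snippet of my own so it runs offline; the real code would keep the requests call from the question, and the stand-in titles are assumptions):

```python
from bs4 import BeautifulSoup

# Stand-in markup (assumption) mirroring the page structure: one plain
# title and one title containing <sub> and <em> tags
html = '''
<div class="title"><a>Dosage Sensitivity of RPL9</a></div>
<div class="title"><a>C<sub>4</sub> photosynthesis in <em>Medicago truncatula</em></a></div>
'''
soup = BeautifulSoup(html, "html.parser")

titles_list = []
for item in soup.findAll('div', {'class': 'title'}):
    # get_text() joins every text node under the tag, so <sub>, <sup>,
    # and <em> content is kept instead of the whole title becoming None
    titles_list.append(item.get_text())

print(titles_list)
# ['Dosage Sensitivity of RPL9', 'C4 photosynthesis in Medicago truncatula']
```

No more "UHOH" placeholders, since get_text() never returns None for a tag that contains any text.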
