
Hi all! I'm working on my first web scraper ever, which grabs author names, URLs, and paper titles from PMC when given a "CitedBy" page like this

My program works fine for getting the author names and the URLs; however, I can only get some of the paper titles, which I suspect is due to subscripts and superscripts.

Here's what I've got so far:

    import requests
    from bs4 import BeautifulSoup
    import re

    url = 'http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2593677/citedby/?page=0'
    req = requests.get(url)
    plain_text = req.text
    soup = BeautifulSoup(plain_text, "lxml") #soup object

    titles_list = []

    for items in soup.findAll('div', {'class': 'title'}):
        title = items.string
        if title is None:
            title = "UHOH"  # problems with some titles
        titles_list.append(title)

When I run this part of my code, my scraper gives me these results:

  1. Finding and Comparing Syntenic Regions among Arabidopsis and the Outgroups Papaya, Poplar, and Grape: CoGe with Rosids
  2. UHOH
  3. Comprehensive Comparative Genomic and Transcriptomic Analyses of the Legume Genes Controlling the Nodulation Process
  4. UHOH
  5. Dosage Sensitivity of RPL9 and Concerted Evolution of Ribosomal Protein Genes in Plants

And so on for the whole page...

Some papers on this page that I get "UHOH" for are:

  • Comparative cell-specific transcriptomics reveals differentiation of C4 photosynthesis pathways in switchgrass and other C4 lineages
  • The genome sequence of the outbreeding globe artichoke constructed de novo incorporating a phase-aware low-pass sequencing strategy of F1 progeny
  • Cross-Family Translational Genomics of Abiotic Stress-Responsive Genes between Arabidopsis and Medicago truncatula

The first two I've listed here are, I believe, problematic because "C4" and "F1" are actually "C subscript 4" and "F subscript 1". For the third one, "Medicago truncatula" is inside an "em" HTML tag, which I suspect is why my scraper cannot scrape it.
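Here's a minimal reproduction of what I think is happening (the markup below is my own stand-in, modeled on the structure of the PMC page):

```python
from bs4 import BeautifulSoup

# Stand-in markup (assumption): a title whose text is interrupted by a
# <sub> tag, like the C4 / F1 papers above
html = '<div class="title"><a>Photosynthesis in C<sub>4</sub> lineages</a></div>'
soup = BeautifulSoup(html, "html.parser")

link = soup.find('div', {'class': 'title'}).a
# The <a> tag now has several child nodes (text, <sub>, text),
# so .string returns None instead of the title
print(link.string)  # None
```

So any title with a nested tag inside it ends up as None, which is exactly where my "UHOH" placeholder appears.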

The only alternative solution I've thought of is making my "soup.findAll" more specific, but that didn't end up helping. I tried:

    for items in soup.findAll('div', {'class': 'title'}):
        title = items.string
        if title is None:
            for other in soup.findAll('a', {'class': 'view'}):
                title = other.string

But sadly, this didn't work... So I'm not exactly sure how to approach this. Does anybody know how to handle special cases like these? Thank you so much!

  • Since you're using the lxml parser, shouldn't you be able to just use node.text_content() on the div.title > a nodes? That should behave the same for your simple cases, and handle complex cases with nested elements gracefully. Commented Mar 15, 2016 at 22:26
  • Looks like for BeautifulSoup that would be node.get_text() - but the principle is the same. Commented Mar 15, 2016 at 22:33
  • @LukasGraf You solved my problem! Thank you! Commented Mar 15, 2016 at 22:35
  • You're welcome - I'm on the run so I just dropped a couple of pointers, but if you want to expand that into a full answer and self-accept, I'll be happy to upvote ;-) Commented Mar 15, 2016 at 22:37

1 Answer


Thanks to @LukasGraf, I have the answer!

Since I'm using BeautifulSoup, I can use node.get_text(). It works differently from plain .string because it also returns all the text beneath a tag, which is exactly what was needed for the subscripts and the "em"-tagged text.
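With that change, the scraping loop becomes the following (sketched against a local HTML snippet of my own so it runs offline; the real code would keep the requests call from the question, and the stand-in titles are assumptions):

```python
from bs4 import BeautifulSoup

# Stand-in markup (assumption) mirroring the page structure: one plain
# title and one title containing <sub> and <em> tags
html = '''
<div class="title"><a>Dosage Sensitivity of RPL9</a></div>
<div class="title"><a>C<sub>4</sub> photosynthesis in <em>Medicago truncatula</em></a></div>
'''
soup = BeautifulSoup(html, "html.parser")

titles_list = []
for item in soup.findAll('div', {'class': 'title'}):
    # get_text() joins every text node under the tag, so <sub>, <sup>,
    # and <em> content is kept instead of the whole title becoming None
    titles_list.append(item.get_text())

print(titles_list)
# ['Dosage Sensitivity of RPL9', 'C4 photosynthesis in Medicago truncatula']
```

No more "UHOH" placeholders, since get_text() never returns None for a tag that contains any text.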
