Hi all! I'm working on my first web scraper ever, which grabs author names, URLs, and paper names from PMC when given a "CitedBy" page like this one.
My program works fine for getting the author names and the URLs; however, I can only get some of the paper titles, which I suspect is due to subscripts and superscripts.
Here's what I've got so far:
import requests
from bs4 import BeautifulSoup
import re
url = 'http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2593677/citedby/?page=0'
req = requests.get(url)
plain_text = req.text
soup = BeautifulSoup(plain_text, "lxml") #soup object
titles_list = []
for items in soup.findAll('div', {'class': 'title'}):
    title = items.string
    if title is None:
        title = "UHOH"  # Problems with some titles
    #print(title)
    titles_list.append(title)
When I run this part of my code, my scraper gives me these results:
- Finding and Comparing Syntenic Regions among Arabidopsis and the Outgroups Papaya, Poplar, and Grape: CoGe with Rosids
- UHOH
- Comprehensive Comparative Genomic and Transcriptomic Analyses of the Legume Genes Controlling the Nodulation Process
- UHOH
- Dosage Sensitivity of RPL9 and Concerted Evolution of Ribosomal Protein Genes in Plants
And so on for the whole page...
Some papers on this page that I get "UHOH" for are:
- Comparative cell-specific transcriptomics reveals differentiation of C4 photosynthesis pathways in switchgrass and other C4 lineages
- The genome sequence of the outbreeding globe artichoke constructed de novo incorporating a phase-aware low-pass sequencing strategy of F1 progeny
- Cross-Family Translational Genomics of Abiotic Stress-Responsive Genes between Arabidopsis and Medicago truncatula
I believe the first two I've listed here are problematic because "C4" and "F1" are actually "C subscript 4" and "F subscript 1". For the third one, "Medicago truncatula" is inside an "em" HTML tag, so I suspect that's why my scraper cannot scrape it.
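If it helps, here's a minimal reproduction of what I think is going on. The markup below is made up for illustration (not copied from the real PMC page): as soon as a title div contains a nested tag like <em>, it has more than one child node, so .string comes back as None.

from bs4 import BeautifulSoup

# Hypothetical markup, for illustration only (not the actual PMC HTML):
snippet = '<div class="title">Genes between Arabidopsis and <em>Medicago truncatula</em></div>'
demo = BeautifulSoup(snippet, "lxml")
div = demo.find('div', {'class': 'title'})
print(div.string)  # None, because the <em> tag splits the text into several children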
The only alternative solution I've thought of is making my "soup.findAll" more specific, but that didn't end up helping me. I tried:
for items in soup.findAll('div', {'class': 'title'}):
    title = items.string
    if title is None:
        for other in soup.findAll('a', {'class': 'view'}):
            title = other.string
But sadly, this didn't work... So I'm not exactly sure how to approach this. Does anybody know how to handle special cases like these? Thank you so much!
With the lxml parser, shouldn't you be able to just use node.text_content() on the div.title > a nodes? That should behave the same for your simple cases, and handle complex cases with nested elements gracefully. In BeautifulSoup the equivalent is node.get_text(), but the principle is the same.
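A minimal sketch of that suggestion using BeautifulSoup's get_text(), keeping the selector from the question unchanged (I haven't re-verified it against the current PMC markup):

import requests
from bs4 import BeautifulSoup

url = 'http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2593677/citedby/?page=0'
soup = BeautifulSoup(requests.get(url).text, "lxml")

titles_list = []
for items in soup.findAll('div', {'class': 'title'}):
    # get_text() concatenates the text of all descendants, so titles with
    # nested <em>, <sub>, or <sup> tags no longer come back as None.
    titles_list.append(items.get_text(strip=True))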