I am trying to extract some information from this page: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM2437275
I am particularly interested in the Characteristics data, which looks like this:
group_id: xxx
medicore_id: xxxxxxx
date_of_visit_sample_drawn_date: xxxxxxx
rin: xxxxxx
donor_id: xxxxx
sle_visit_designation: xxxxxxx
bold_shipment_batch: xxxxxx
rna_concentrated: xxxxxx
subject_type: xxxxxxx
and so on. Upon inspecting the page, I realized that this information is deeply nested inside other, larger tables, and that there is no distinctive class or id I can use to target the Characteristics block directly. I have been trying, unsuccessfully, to search for tables within tables, but I find that sometimes not all tables are being read. This is what I have so far:
from bs4 import BeautifulSoup
import requests

# Fetch the page and parse it (the URL must be one unbroken string)
source = requests.get("https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM2437275").text
soup = BeautifulSoup(source, 'lxml')

# Level 1: every table in the document
table = soup.find_all('table')
for i in table:
    print(i.prettify())
print(len(table))  # 22 tables
print(table[6].prettify())  # narrow down on relevant table

# Level 2: tables nested inside table[6]
table = table[6]
table_subtables = table.find_all('table')
for i in table_subtables:
    print(i.prettify())
print(len(table_subtables))  # 14 tables

# Level 3: tables nested inside the second subtable
tbb = table_subtables[1]
tbb_subtable = tbb.find_all('table')
for i in tbb_subtable:
    print(i.prettify())
print(len(tbb_subtable))  # 12 tables

# Level 4: tables nested inside the sixth sub-subtable
tbbb = tbb_subtable[5]
tbbb_subtable = tbbb.find_all('table')
for i in tbbb_subtable:
    print(i.prettify())
print(len(tbbb_subtable))  # 6 tables
and so on. However, as I keep drilling down like this, I find that not all tables are being read. Can someone point me to a better solution?
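
For reference, here is a minimal sketch of the kind of approach I am hoping for, instead of indexing into nested tables by position. It assumes (not verified against the live page) that the label sits in a td whose text is exactly "Characteristics" and that the values are in the next sibling td, separated by br tags. The function name extract_characteristics and the use of the built-in "html.parser" are just illustrative choices.

```python
from bs4 import BeautifulSoup


def extract_characteristics(html, label="Characteristics"):
    """Return a {key: value} dict parsed from the cell next to the label cell.

    Assumption: the label is the entire text of one <td>, and the values
    are "key: value" lines in the following sibling <td>.
    """
    soup = BeautifulSoup(html, "html.parser")

    # Search the whole document for a <td> whose text is exactly the label.
    # Outer cells that merely contain the nested tables do not match,
    # because their text includes everything inside them.
    label_cell = soup.find(
        lambda tag: tag.name == "td" and tag.get_text(strip=True) == label
    )
    if label_cell is None:
        return {}

    value_cell = label_cell.find_next_sibling("td")
    if value_cell is None:
        return {}

    result = {}
    # get_text("\n") joins the text pieces with newlines, so the pieces
    # that were separated by <br> tags end up on separate lines.
    for line in value_cell.get_text("\n").splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that are not "key: value"
            result[key.strip()] = value.strip()
    return result
```

With the page fetched as in my code above, this would be called as extract_characteristics(source). If the real markup differs from what I assumed, the function simply returns an empty dict.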