
I am trying to get some information from this page : https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM2437275

where I am particularly interested in extracting the Characteristics data as follows:

group_id: xxx
medicore_id: xxxxxxx
date_of_visit_sample_drawn_date: xxxxxxx
rin: xxxxxx
donor_id: xxxxx
sle_visit_designation: xxxxxxx
bold_shipment_batch: xxxxxx
rna_concentrated: xxxxxx
subject_type: xxxxxxx

so on and so forth. Upon inspecting the page, I realize that this information is deeply nested within other larger tables and that there is no special class/id for me to effectively parse for the characteristics information. I have been unsuccessfully trying to look for tables within tables, but I find that sometimes not all tables are being read. This is what I have so far:

from bs4 import BeautifulSoup
import requests

source = requests.get("https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM2437275").text

soup = BeautifulSoup(source, 'lxml')
tables = soup.find_all('table')
for i in tables:
    print(i.prettify())
print(len(tables))  # 22 tables

print(tables[6].prettify())  # narrow down on the relevant table
table = tables[6]

table_subtables = table.find_all('table')
for i in table_subtables:
    print(i.prettify())
print(len(table_subtables))  # 14 tables

tbb = table_subtables[1]

tbb_subtable = tbb.find_all('table')
for i in tbb_subtable:
    print(i.prettify())
print(len(tbb_subtable))  # 12 tables

tbbb = tbb_subtable[5]

tbbb_subtable = tbbb.find_all('table')
for i in tbbb_subtable:
    print(i.prettify())
print(len(tbbb_subtable))  # 6 tables

so on and so forth. However, as I keep doing this, I find that not all tables are being read. Can someone point me to a better solution?

2 Answers


You can scrape the data with regular expressions and urllib to specifically scrape the keywords and their corresponding values:

import re
import urllib.request

data = urllib.request.urlopen('https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM2437275').read().decode('utf-8')
target_vals = ['group_id', 'medicore_id', 'date_of_visit_sample_drawn_date', 'rin', 'donor_id', 'sle_visit_designation', 'bold_shipment_batch', 'rna_concentrated', 'subject_type']
final_data = {i: re.findall(r'(?<={}:\s)\w+'.format(i), data)[0] for i in target_vals}

Output:

{
 'date_of_visit_sample_drawn_date': '2009', 
 'rna_concentrated': 'No', 
  'sle_visit_designation': 'Baseline', 
  'rin': '8', 
  'subject_type': 'Patient', 
  'donor_id': '19', 
  'bold_shipment_batch': '1', 
  'medicore_id': 'B0019V1', 
  'group_id': 'A'
}
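Note that `\w+` stops at the first non-word character, which is why the output above shows a truncated date (`2009` instead of `2009-09-14`) and RIN (`8` instead of `8.5`). A sketch of a broader character class that also admits dots and hyphens, demonstrated here on a small sample string standing in for the page text rather than the live GEO page:

```python
import re

# Illustrative snippet mimicking the page text (not the live GEO page)
sample = ("group_id: A medicore_id: B0019V1 "
          "date_of_visit_sample_drawn_date: 2009-09-14 rin: 8.5")

target_vals = ['group_id', 'medicore_id',
               'date_of_visit_sample_drawn_date', 'rin']

# [\w.-]+ also matches dots and hyphens, so full dates and
# decimal RIN values survive intact
parsed = {i: re.findall(r'(?<={}:\s)[\w.-]+'.format(i), sample)[0]
          for i in target_vals}
print(parsed)
```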

Edit: given multiple links, you can create a pandas dataframe out of the generated data for each:

import re
import urllib.request
import pandas as pd

def get_data_from_links(link, target_vals=['group_id', 'medicore_id', 'date_of_visit_sample_drawn_date', 'rin', 'donor_id', 'sle_visit_designation', 'bold_shipment_batch', 'rna_concentrated', 'subject_type']):
    data = urllib.request.urlopen(link).read().decode('utf-8')
    return {i: re.findall(r'(?<={}:\s)\w+'.format(i), data)[0] for i in target_vals}
returned_data = get_data_from_links('https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM2437275')
df = pd.DataFrame([returned_data])

Output:

   bold_shipment_batch date_of_visit_sample_drawn_date donor_id group_id  \
0                    1                            2009       19        A

  medicore_id rin rna_concentrated sle_visit_designation subject_type
0     B0019V1   8               No              Baseline      Patient

If you have a list of links you would like to retrieve your data from, you can construct a table by constructing a nested dictionary of the resulting data to pass to DataFrame.from_dict:

link_lists = ['link1', 'link2', 'link3']
final_data = {i:get_data_from_links(i) for i in link_lists}
new_table = pd.DataFrame.from_dict(final_data, orient='index')
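With `orient='index'`, each outer key of the nested dictionary becomes a row label and each inner key becomes a column. A self-contained sketch using hypothetical per-link results in place of `get_data_from_links()` output:

```python
import pandas as pd

# Hypothetical per-link results standing in for get_data_from_links()
final_data = {
    'link1': {'group_id': 'A', 'rin': '8.5'},
    'link2': {'group_id': 'B', 'rin': '7.2'},
}

# orient='index' turns outer keys into row labels, inner keys into columns
new_table = pd.DataFrame.from_dict(final_data, orient='index')
print(new_table)
```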

Output (assuming the first link is 'https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM2437275'):

      rin rna_concentrated date_of_visit_sample_drawn_date  \
link1   8               No                            2009

      sle_visit_designation bold_shipment_batch group_id subject_type  \
link1              Baseline                   1        A      Patient

      medicore_id donor_id
link1     B0019V1       19

2 Comments

Thank you so much, this was very helpful! Is there a way I can store the descriptions such as "date_of_visit_sample_drawn_date" as a column name and the corresponding value under that column? Also, if I were to provide a vector of several links to the variable data, would it store all the information under these column headings from all the given webpages?
@NithishaKh glad to help! Please see my recent edit.

The approach Ajax1234 has shown in his solution is definitely the best way to go. However, if a hardcoded index is not a barrier and you wish to avoid regex, this is another approach you may think of trying:

from bs4 import BeautifulSoup
import requests

res = requests.get("https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM2437275")
soup = BeautifulSoup(res.content, 'lxml')
for items in soup.select("td[style*='justify']")[2:3]:
    data = '\n'.join([item for item in items.strings][:9])
    print(data)

Output:

group_id: A
medicore_id: B0019V1
date_of_visit_sample_drawn_date: 2009-09-14
rin: 8.5
donor_id: 19
sle_visit_designation: Baseline
bold_shipment_batch: 1
rna_concentrated: No
subject_type: Patient
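If you want key/value pairs rather than raw text, the printed lines can be split into a dictionary. A minimal sketch, assuming each line has the `key: value` shape shown above:

```python
# Sample of the lines produced above, stood in for the scraped output
data = """group_id: A
medicore_id: B0019V1
rin: 8.5"""

# Split each "key: value" line on the first ": " only,
# so values containing colons would stay intact
characteristics = dict(line.split(': ', 1) for line in data.splitlines())
print(characteristics)
```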

1 Comment

Thank you so much, I have been reading up on using BeautifulSoup so having this alternative solution is very helpful!
