Web scraping a hidden table using Python

Question

I am trying to scrape the "Traits" table from this website https://www.ebi.ac.uk/gwas/genes/SAMD12 (actually, the URL can change according to my necessity, but the structure will be the same).

The problem is that my knowledge is quite limited in web scraping, and I can't get this table using the basic BeautifulSoup workflow I've seen up to here.

Here's my code:

import requests
from bs4 import BeautifulSoup

url = 'https://www.ebi.ac.uk/gwas/genes/SAMD12'
page = requests.get(url)

I'm looking for the "efotrait-table":

efotrait = soup.find('div', id='efotrait-table-loading')
print(efotrait.prettify())

<div class="row" id="efotrait-table-loading" style="margin-top:20px">
 <div class="panel panel-default" id="efotrait_panel">
  <div class="panel-heading background-color-primary-accent">
   <h3 class="panel-title">
    <span class="efotrait_label">
     Traits
    </span>
    <span class="efotrait_count badge available-data-btn-badge">
    </span>
   </h3>
   <span class="pull-right">
    <span class="clickable" onclick="toggleSidebar('#efotrait_panel span.clickable')" style="margin-left:25px">
     <span class="glyphicon glyphicon-chevron-up">
     </span>
    </span>
   </span>
  </div>
  <div class="panel-body">
   <table class="table table-striped borderless" data-export-types="['csv']" data-filter-control="true" data-flat="true" data-icons="icons" data-search="true" data-show-columns="true" data-show-export="true" data-show-multi-sort="false" data-sort-name="numberAssociations" data-sort-order="desc" id="efotrait-table">
   </table>
  </div>
 </div>
</div>

Specifically, this one:

soup.select('table#efotrait-table')[0]

<table class="table table-striped borderless" data-export-types="['csv']" data-filter-control="true" data-flat="true" data-icons="icons" data-search="true" data-show-columns="true" data-show-export="true" data-show-multi-sort="false" data-sort-name="numberAssociations" data-sort-order="desc" id="efotrait-table">
</table>

As you can see, the table's content doesn't show up. In the website, there's an option for saving the table as csv. It would be awesome if I get this downloadable link somehow. But when I click in the link in order to copy it, I get "javascript:void(0)" instead. I've not studied javascript, should I?

The table is hidden, and even if it's not, I would need to interactively select more rows per page to get the whole table (and the URL doesn't change, so I can't get the table either).

I would like to know a way to get access to this table programmatically (unstructured info), then the minors about organizing the table will be fine. Any clues for how doing that (or what I should study) will be greatly appreciated.

Thanks in advance

@jaibalaji, I'll definitely try this! Do you think, without an API, this is the only option? Or at least the must-to-go? — Cainã Max Couto da Silva
– Cainã Max Couto da Silva, Commented May 17, 2020 at 0:32

αԋɱҽԃ αмєяιcαη · Accepted Answer · 2020-05-16 12:29:25Z

Desired data is available within API call.

import requests

data = {
    "q": "ensemblMappedGenes: \"SAMD12\" OR association_ensemblMappedGenes: \"SAMD12\"",
    "max": "99999",
    "group.limit": "99999",
    "group.field": "resourcename",
    "facet.field": "resourcename",
    "hl.fl": "shortForm,efoLink",
    "hl.snippets": "100",
    "fl": "accessionId,ancestralGroups,ancestryLinks,associationCount,association_rsId,authorAscii_s,author_s,authorsList,betaDirection,betaNum,betaUnit,catalogPublishDate,chromLocation,chromosomeName,chromosomePosition,context,countriesOfRecruitment,currentSnp,efoLink,ensemblMappedGenes,fullPvalueSet,genotypingTechnologies,id,initialSampleDescription,label,labelda,mappedLabel,mappedUri,merged,multiSnpHaplotype,numberOfIndividuals,orPerCopyNum,orcid_s,pValueExponent,pValueMantissa,parent,positionLinks,publication,publicationDate,publicationLink,pubmedId,qualifier,range,region,replicateSampleDescription,reportedGene,resourcename,riskFrequency,rsId,shortForm,snpInteraction,strongestAllele,studyId,synonym,title,traitName,traitName_s,traitUri,platform",
    "raw": "fq:resourcename:association or resourcename:study"
}


def main(url):
    r = requests.post(url, data=data).json()
    print(r)


main("https://www.ebi.ac.uk/gwas/api/search/advancefilter")

You can follow the r.keys() and load your desired data by access the dict.

But here's a quick load (Lazy Code):

import requests
import re
import pandas as pd

data = {
    "q": "ensemblMappedGenes: \"SAMD12\" OR association_ensemblMappedGenes: \"SAMD12\"",
    "max": "99999",
    "group.limit": "99999",
    "group.field": "resourcename",
    "facet.field": "resourcename",
    "hl.fl": "shortForm,efoLink",
    "hl.snippets": "100",
    "fl": "accessionId,ancestralGroups,ancestryLinks,associationCount,association_rsId,authorAscii_s,author_s,authorsList,betaDirection,betaNum,betaUnit,catalogPublishDate,chromLocation,chromosomeName,chromosomePosition,context,countriesOfRecruitment,currentSnp,efoLink,ensemblMappedGenes,fullPvalueSet,genotypingTechnologies,id,initialSampleDescription,label,labelda,mappedLabel,mappedUri,merged,multiSnpHaplotype,numberOfIndividuals,orPerCopyNum,orcid_s,pValueExponent,pValueMantissa,parent,positionLinks,publication,publicationDate,publicationLink,pubmedId,qualifier,range,region,replicateSampleDescription,reportedGene,resourcename,riskFrequency,rsId,shortForm,snpInteraction,strongestAllele,studyId,synonym,title,traitName,traitName_s,traitUri,platform",
    "raw": "fq:resourcename:association or resourcename:study"
}


def main(url):
    r = requests.post(url, data=data)
    match = {item.group(2, 1) for item in re.finditer(
        r'traitName_s":\"(.*?)\".*?mappedLabel":\["(.*?)\"', r.text)}
    df = pd.DataFrame.from_dict(match)
    print(df)


main("https://www.ebi.ac.uk/gwas/api/search/advancefilter")

Output:

0              heel bone mineral density                          Heel bone mineral density
1              interleukin-8 measurement  Chronic obstructive pulmonary disease-related ...
2   self reported educational attainment        Educational attainment (years of education)
3                        waist-hip ratio                                    Waist-hip ratio
4             eye morphology measurement                                     Eye morphology
5                       CC16 measurement  Chronic obstructive pulmonary disease-related ...
6         age-related hearing impairment  Age-related hearing impairment (SNP x SNP inte...
7    eosinophil percentage of leukocytes               Eosinophil percentage of white cells
8          coronary artery calcification  Coronary artery calcified atherosclerotic plaq...
9                     multiple sclerosis                                 Multiple sclerosis
10                  mathematical ability                    Highest math class taken (MTAG)
11                 risk-taking behaviour                      General risk tolerance (MTAG)
12         coronary artery calcification  Coronary artery calcified atherosclerotic plaq...
13  self reported educational attainment                      Educational attainment (MTAG)
14                          pancreatitis                                       Pancreatitis
15               hair colour measurement                                         Hair color
16                      breast carcinoma  Breast cancer specific mortality in breast cancer
17                      eosinophil count                                  Eosinophil counts
18                     self rated health                                  Self-rated health
19                          bone density                               Bone mineral density

Hi, thanks very much! Really! I'm diving into that dictionary now... In fact, I've seen the GWAS API (ebi.ac.uk/gwas/docs/api), but I didn't find a way to use the gene name (or ID) as an input. So I wonder, how did you find out this?
@CainãMaxCouto-Silva check that and you will understand stackoverflow.com/a/61515665/7658985

Collectives™ on Stack Overflow

Web scraping a hidden table using Python

1 Answer 1

2 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

2 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related