I'm trying to scrape an NCBI page (https://www.ncbi.nlm.nih.gov/protein/29436380) to obtain information about a protein. I need to access the gene_synonyms and GeneID fields. I have tried to find the relevant nodes with the SelectorGadget add-on in Chrome and with the code inspector in Firefox. This is the code I have tried:

library(dplyr)
library(rvest)
library(stringr)
GIwebPage <- read_html("https://www.ncbi.nlm.nih.gov/protein/29436380")
TestHTML <- GIwebPage %>% html_node("div.grid , div#maincontent.col.nine_col , div.sequence , pre.genebank , .feature") %>% html_text(trim = TRUE)

Then I try to find the relevant text but it is simply not there.

str_extract_all(TestHTML, pattern = "(synonym).{30}")
 [[1]]
 character(0)

str_extract_all(TestHTML, pattern = "(GeneID:).{30}")
 [[1]]
 character(0)

All I seem to be accessing is some of the text content of the column on the right.

str_extract_all(TestHTML, pattern = "(protein).{30}")
 [[1]]
 [1] "protein codes including ambiguities a"
 [2] "protein sequence for myosin-9  (NP_00"
 [3] "protein should not be confused with t"
 [4] "protein, partial [Homo sapiens]gi|294"
 [5] "protein codes including ambiguities a"

I have tried so many combinations of node selections with html_node() that I no longer know what to try. Is this content buried in some structure I can't see, or am I just not skilled enough to work out which node to select?

Thanks a lot, José.

  • The page is dynamically loading the information. The information you are looking for is actually stored here: ncbi.nlm.nih.gov/sviewer/…. You can find this link by using the developer tools from your browser. Commented Apr 16, 2020 at 13:29
  • Awesome Dave2e!! Thanks a lot. In fact the page is "static" enough that I can use it in a function to scrape information for several genes. One question though: how did you find that address? I can't seem to find it! Commented Apr 16, 2020 at 13:40
  • Dave2e, I can't mark a comment as an answer, but for me it is one. Should I leave it as it is? Commented Apr 16, 2020 at 13:42

1 Answer

The page loads this information dynamically. The underlying data is stored at another location.
Using the developer tools in your browser, look for the link:

[Screenshot: browser developer tools]

The information you are looking for is stored at the "viewer.fcgi" entry; right-click it to copy the link.
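
Once the link is copied, the two fields can be pulled out with a couple of regular expressions. The following is only a minimal sketch in base R, not a tested scraper for this endpoint. It assumes the copied link returns plain text in GenBank flat-file layout, where the fields appear as `/gene_synonym="..."` and `/db_xref="GeneID:..."`. The function names `extract_gene_fields` and `fetch_gene_fields`, the `viewer_url` argument, and the sample record are all hypothetical and made up for illustration.

```r
# Hypothetical helper: extract gene synonyms and the GeneID from
# GenBank-style flat text (a character vector of lines or one string).
extract_gene_fields <- function(record_text) {
  # Qualifier values can wrap across lines, so join the lines and
  # collapse every run of whitespace into a single space first.
  flat <- gsub("\\s+", " ", paste(record_text, collapse = " "))

  # regmatches() returns character(0) when there is no match.
  synonym_match <- regmatches(flat, regexpr('/gene_synonym="[^"]*"', flat))
  gene_id_match <- regmatches(flat, regexpr("GeneID:[0-9]+", flat))

  synonyms <- if (length(synonym_match) == 1) {
    value <- sub('^/gene_synonym="([^"]*)"$', "\\1", synonym_match)
    # Synonyms are separated by semicolons inside the quoted value.
    trimws(strsplit(value, ";", fixed = TRUE)[[1]])
  } else {
    character(0)
  }

  gene_id <- if (length(gene_id_match) == 1) {
    sub("GeneID:", "", gene_id_match, fixed = TRUE)
  } else {
    NA_character_
  }

  list(gene_synonyms = synonyms, gene_id = gene_id)
}

# Hypothetical wrapper: viewer_url is the full link copied from the
# developer tools. Assumes the response is plain text.
fetch_gene_fields <- function(viewer_url) {
  extract_gene_fields(readLines(viewer_url, warn = FALSE))
}

# Made-up record used only to exercise the parser offline.
sample_record <- c(
  '     gene            1..>9',
  '                     /gene="EXAMPLE1"',
  '                     /gene_synonym="SYN-A; SYN-B;',
  '                     SYN-C"',
  '     CDS             1..>9',
  '                     /db_xref="GeneID:12345"'
)
fields <- extract_gene_fields(sample_record)
```

Only the first match of each field is returned, which is enough for a record describing a single gene. If the endpoint returns HTML rather than plain text, reading it with read_html() and html_text() before calling `extract_gene_fields` would be the likely adjustment.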

See a similar question and its answers: R not accepting xpath query
