Can't access specific content in html page with rvest and selectorGadget

Question

I'm trying to scrape an NCBI page (https://www.ncbi.nlm.nih.gov/protein/29436380) to obtain information about a protein. I need to access the gene_synonyms and GeneID fields. I have tried to find the relevant nodes with the SelectorGadget add-on in Chrome and with the code inspector in Firefox. This is the code I tried:

library(dplyr)
library(rvest)
library(stringr)

# Download the page, then take the text of the first node matching any of these selectors
GIwebPage <- read_html("https://www.ncbi.nlm.nih.gov/protein/29436380")
TestHTML <- GIwebPage %>%
  html_node("div.grid , div#maincontent.col.nine_col , div.sequence , pre.genebank , .feature") %>%
  html_text(trim = TRUE)

Then I try to find the relevant text but it is simply not there.

str_extract_all(TestHTML, pattern = "(synonym).{30}")
 [[1]]
 character(0)

str_extract_all(TestHTML, pattern = "(GeneID:).{30}")
 [[1]]
 character(0)

All I seem to be accessing is some of the text content of the column on the right.

str_extract_all(TestHTML, pattern = "(protein).{30}")
 [[1]]
 [1] "protein codes including ambiguities a"
 [2] "protein sequence for myosin-9  (NP_00"
 [3] "protein should not be confused with t"
 [4] "protein, partial [Homo sapiens]gi|294"
 [5] "protein codes including ambiguities a"

I have tried so many combinations of node selections with html_node() that I no longer know what to try. Is this content buried in some structure I can't see, or am I just not skilled enough to identify the node to select?

Thanks a lot, José.

The page loads this information dynamically. The information you are looking for is actually stored here: ncbi.nlm.nih.gov/sviewer/…. You can find this link by using your browser's developer tools.
– Dave2e, Commented Apr 16, 2020 at 13:29
Awesome Dave2e!! Thanks a lot. In fact the page is "static" enough that I can use it in a function to scrape information for several genes. One question though: how did you find that address? I can't seem to find it!!
– netlak, Commented Apr 16, 2020 at 13:40
Dave2e, I can't mark a comment as an answer, but for me it is one. Should I leave it as it is?
– netlak, Commented Apr 16, 2020 at 13:42

Dave2e · Accepted Answer · 2020-04-16 13:58:18Z


The page loads this information dynamically; the underlying data is stored at another location.
Using your browser's developer tools (the Network tab), reload the page and look for the request to "viewer.fcgi". That request returns the information you are looking for; right-click it to copy the link.

See similar question/answers: R not accepting xpath query
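
Once the link has been copied from the developer tools, the two fields from the question can probably be pulled out of the returned text with plain regular expressions. The following is only a minimal sketch, not a tested scraper for this site: it assumes the response is GenBank-style flat-file text containing qualifiers shaped like /gene_synonym="..." and /db_xref="GeneID:...". The function name extract_gene_fields, the variable flat_url, and the sample record are all made up for illustration.

```r
# Hypothetical helper: extract gene synonyms and the GeneID from
# GenBank-style flat-file text (assumed format, see the note above).
extract_gene_fields <- function(txt) {
  # Work on a single string so values that wrap across lines still match
  txt <- paste(txt, collapse = "\n")

  # Capture the quoted value after /gene_synonym= and the digits after GeneID:
  syn <- regmatches(txt, regexec('/gene_synonym="([^"]*)"', txt))[[1]]
  id  <- regmatches(txt, regexec("GeneID:([0-9]+)", txt))[[1]]

  synonyms <- if (length(syn) >= 2) {
    # Collapse internal whitespace, then split the "A; B; C" list
    trimws(strsplit(gsub("\\s+", " ", syn[2]), ";", fixed = TRUE)[[1]])
  } else {
    character(0)
  }

  list(
    gene_synonyms = synonyms,
    gene_id = if (length(id) >= 2) id[2] else NA_character_
  )
}

# Made-up sample record that only mimics the assumed layout
sample_txt <- c(
  "     gene            1..100",
  '                     /gene="XYZ"',
  '                     /gene_synonym="ABC1; ABC2"',
  '                     /db_xref="GeneID:12345"'
)
res <- extract_gene_fields(sample_txt)
print(res)

# With the real link (paste the one copied from the Network tab):
# flat_url <- "<link copied from the developer tools>"
# res <- extract_gene_fields(readLines(flat_url, warn = FALSE))
```

Keeping the parsing separate from the download makes it easy to check the patterns against a saved copy of the response before looping over several proteins.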

edited Apr 16, 2020 at 13:58

answered Apr 16, 2020 at 13:48

Dave2e

