I am trying to do web scraping with R, but I am having trouble pulling HTML content from the web.

Here is an exercise I'm doing on an example Amazon search results page.

library(XML)

#> Warning message:
#> XML package is in R 3.5.3 version 

my_url99 <- "https://www.amazon.com/s?k=Dell+laptop+windows+10&ref=nb_sb_noss_2"
html_page99 <- htmlTreeParse(my_url99, useInternalNode=TRUE)

#> Warning message:
#> XML content does not seem to be XML: 'https://www.amazon.com/s?k=Dell+laptop+windows+10&ref=nb_sb_noss_2' 

head(html_page99)

#> Error in `[.XMLInternalDocument`(x, seq_len(n)) : 
#>  No method for subsetting an XMLInternalDocument with integer

html_page99

#> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
#> <html><body><p>https://www.amazon.com/s?k=Dell+laptop+windows+10&amp;ref=nb_sb_noss_2</p></body></html>

But I need to scrape the full content of the page above. I mean the content with the $ sign on the left (maybe that's not the best description) and all the tags.

1 Answer

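Without a lot of experience in scraping and manipulating strings, it is difficult to get at the data you want. As @ThomasL points out, using the XML library is not the best way forward. Your output suggests that htmlTreeParse did not download the https page at all: it treated the URL as literal text and wrapped it in a <p> tag, which is why you only see the address itself. Here is how you could achieve the results you want using the rvest library: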

library(rvest)
#> Loading required package: xml2
library(tibble)

my_url99 <- "https://www.amazon.com/s?k=Dell+laptop+windows+10&ref=nb_sb_noss_2"

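# Read the page, then reduce the text of each result row to three fields
read_html(my_url99)                                                       %>%
# Select every div whose class attribute is exactly "sg-row"
html_nodes(xpath = "//div[@class = 'sg-row']")                            %>% 
html_text()                                                               %>% 
{gsub("\n", " ", .)}                                                      %>% 
# Keep rated, non-sponsored rows that contain a price
{grep("5 stars", ., value = TRUE)}                                        %>% 
{grep("Sponsored", ., invert = TRUE, value = TRUE)}                       %>% 
{gsub("^ +", "", .)}                                                      %>% 
{grep("[$]", ., value = TRUE)}                                            %>% 
# Where two prices are run together, keep only the last one
{gsub("[$][0123456789.]+[$]", "$", .)}                                    %>% 
# Split each row into fields on runs of two or more spaces
strsplit(" {2,50}")                                                       %>% 
lapply(function(x) x[x != ""])                                            %>% 
lapply(function(x) { grep("Buying Choices|Ships to|in stock|new offers", 
                          x, invert = TRUE, value = TRUE)              }) %>%
# Fields 1, 2 and 4 become the model, rating and price
lapply(function(x) if(length(x) < 4) NULL else x[c(1, 2, 4)])             %>%
{do.call(rbind, .)}                                                       %>% 
`colnames<-`(c("Model", "Rating","Price"))                                %>%
as_tibble()                                                                ->
result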

This gives you a three-column data frame (a tibble) with the model, star rating and price:

result
#> # A tibble: 15 x 3
#>    Model                                                   Rating         Price 
#>    <chr>                                                   <chr>          <chr> 
#>  1 "Dell Latitude E6430 Laptop WEBCAM - HDMI - Intel Core~ 4.0 out of 5 ~ $201.~
#>  2 "New ! Dell Inspiron i3583 15.6\" HD Touch-Screen Lapt~ 4.3 out of 5 ~ $349.~
#>  3 "2019_Dell Inspiron 15.6\" HD High Performance Laptop,~ 3.9 out of 5 ~ $300.~
#>  4 "Dell Inspiron 15.6” Touch Screen Intel Core i3 128GB ~ 4.1 out of 5 ~ $365.~
#>  5 "Dell Inspiron 15.6 Inch HD Touchscreen Flagship High ~ 4.1 out of 5 ~ $443.~
#>  6 "Dell Latitude E5450 14in Laptop, Intel Core i5-5300U ~ 3.7 out of 5 ~ $225.~
#>  7 "New ! Dell Inspiron i3583 15.6\" HD Touch-Screen Lapt~ 4.5 out of 5 ~ $445.~
#>  8 "Dell 14inch High Performance Latitude 3340 Notebook, ~ 4.1 out of 5 ~ $199.~
#>  9 "2018 Dell Business Flagship Laptop Notebook 15.6\" HD~ 3.4 out of 5 ~ $567.~
#> 10 "Dell Latitude E6420 Laptop - HDMI - i5 2.5ghz - 4GB D~ 3.3 out of 5 ~ $178.~
#> 11 "Newest_Dell Vostro Real Business(Better Design Than I~ 3.6 out of 5 ~ $689.~
#> 12 "2019 Dell Inspiron 15 6\" HD Touchscreen Flagship Pre~ 4.0 out of 5 ~ $348.~
#> 13 "2019 Dell Inspiron 14\" Laptop Computer| 10th Gen Int~ 4.1 out of 5 ~ $328.~
#> 14 "2019 Dell Inspiron 15 6\" HD Touchscreen Flagship Pre~ 4.4 out of 5 ~ $442.~
#> 15 "Fast Dell Latitude E5470 HD Business Laptop Notebook ~ 4.5 out of 5 ~ $288.~

Created on 2020-02-17 by the reprex package (v0.3.0)
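
To see what the string-cleaning steps in the pipeline do without requesting anything from Amazon, here is a minimal sketch in base R applied to one made-up row of text. The sample string `row_text` and its layout (model, rating, review count, duplicated price) are assumptions for illustration only; the text that `html_text()` actually returns depends on Amazon's markup at the time you scrape.

```r
# Hypothetical text of a single result row; NOT real Amazon output
row_text <- "\n    Dell Example Laptop 15.6\" HD\n      4.2 out of 5 stars\n      87\n      $349.99$349.99\n   "

cleaned <- gsub("\n", " ", row_text)                   # newlines become spaces
cleaned <- gsub("^ +", "", cleaned)                    # drop leading spaces
cleaned <- gsub("[$][0123456789.]+[$]", "$", cleaned)  # keep one of two run-together prices

fields <- strsplit(cleaned, " {2,50}")[[1]]            # split on runs of 2+ spaces
fields <- fields[fields != ""]                         # discard empty pieces

# In this sample, field 3 is the review count, so fields 1, 2 and 4 are kept
record <- setNames(fields[c(1, 2, 4)], c("Model", "Rating", "Price"))
print(record)
```

Because the fields are picked by position, any row whose layout differs from this pattern can end up with the wrong values or be dropped, so it is worth checking a few rows by hand.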

3 Comments

Thanks, the above led me close to a solution, but I have one more problem: the code doesn't see all the computers.
Why is the list of computers incomplete (some computers are missing), and why is it in a different order than on the original Amazon page?
@Marcin, it is likely that the page you see in your browser is tailored to your Amazon cookies. When you are web scraping, you don't have these cookies, so you get a different page. You can automate a web browser using RSelenium, but this is a fairly advanced topic if you are new to R.
