Webscraping in R using CSS selector

Question

I am trying to scrape some data from footlocker.com, a shoe retailer's website. With the code below, I am trying to extract the number of 'xyz' brand shoes on sale and the total number of those shoes.

library(rvest)
webpage <- read_html("https://www.footlocker.com/category/brands/adidas.html?query=adidas%3Arelevance%3AproductType%3A200005")
webpage

#Using CSS selectors to scrape the sale section
sale_count_html <- html_nodes(webpage, 'li:nth-child(1) .miscellaneous .count')
sale_count <- html_text(sale_count_html)
sale_count <- as.numeric(sale_count)
head(sale_count)


total_count_html <- html_nodes(webpage,'strong+ strong')
total_count <- html_text(total_count_html)
head(total_count)

This gives me character(0) for sale_count, whereas the website shows a three-digit number. For total_count, it gives me a completely different number from the one on the website.

The web page probably loads its data via JavaScript after it activates in the browser. Simple web scraping doesn't run JavaScript. Maybe you can use something like RSelenium to run that code for you.
– MrFlick, Commented Sep 18, 2018 at 16:07
What you're actually doing is violating the terms of service (footlocker.com/help/terms-of-use.html) and encouraging others to do so and potentially end up in legal trouble.
– hrbrmstr, Commented Sep 18, 2018 at 17:57

Emmanuel Hamel · Accepted Answer · 2021-12-13 19:11:43Z

I was able to extract the product names and prices with the following code:

library(RSelenium)
library(stringr)

# Start a Selenium Firefox container. shell() exists only on Windows; use system() elsewhere.
shell('docker run -d -p 4445:4444 selenium/standalone-firefox')
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4445L, browserName = "firefox")
remDr$open()
remDr$navigate('https://www.footlocker.com/category/brands/adidas.html?query=adidas%3Arelevance%3AproductType%3A200005')

# The four lines below dismiss the pop-up windows
webElem <- remDr$findElement("id", "bluecoreEmailCaptureSubmit")
webElem$submitElement()
webElem <- remDr$findElement("id", "touAgreeBtn")
webElem$clickElement()

page_Content <- remDr$getPageSource()[[1]]

# Here, we extract the information related to the shoes with regular expressions
text <- str_extract(page_Content, "<span class=\"ProductName\"(.*)(\\$\\d{1,5}\\.\\d{0,2})")
text_Split <- strsplit(text, split = "<span class=\"ProductName\">")[[1]]
text_Split <- text_Split[-1]

product_Name <- str_extract_all(string = text_Split, pattern = "<span class=\"ProductName-primary\">[^<]*</span>")

pattern_Product_Price <- c("(<span class=\"ProductPrice\"><span>\\$\\d{1,5}\\.\\d{0,2})",
                           "(<span class=\"ProductPrice-final\" aria-hidden=\"true\">\\$\\d{1,5}\\.\\d{0,2})",
                           "(<span class=\"ProductPrice-original\" aria-hidden=\"true\">\\$\\d{1,5}\\.\\d{0,2})")

regex_Product_Price <- paste0(pattern_Product_Price, collapse = "|")

product_Price <- str_extract_all(string = text_Split, pattern = regex_Product_Price)

From this information, you can count the number of pairs of shoes.
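
As a rough sketch of that counting step, the snippet below works on three made-up fragments that stand in for the elements of text_Split, so it runs without a Selenium session. It assumes that a product is on sale whenever its markup contains a "ProductPrice-final" span. That reading is inferred from the class names in the price patterns above and has not been checked against the live site, so treat the rule as hypothetical.

```r
# Hypothetical fragments standing in for the elements of text_Split
# (one element per product; the real ones come from the page source).
text_Split <- c(
  "<span class=\"ProductName-primary\">Shoe A</span><span class=\"ProductPrice\"><span>$90.00",
  "<span class=\"ProductName-primary\">Shoe B</span><span class=\"ProductPrice-final\" aria-hidden=\"true\">$59.99</span><span class=\"ProductPrice-original\" aria-hidden=\"true\">$80.00",
  "<span class=\"ProductName-primary\">Shoe C</span><span class=\"ProductPrice\"><span>$120.00"
)

# Total number of products found on the page
total_count <- length(text_Split)

# Assumption: a "ProductPrice-final" span marks a discounted product
on_sale <- grepl("ProductPrice-final", text_Split, fixed = TRUE)
sale_count <- sum(on_sale)

total_count
sale_count
```

With the real text_Split from the code above, the same two lines give the counts for the products rendered on the page at the time of scraping. If the listing is paginated, that may be fewer than the totals the site displays.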
