Web scraping in R with SelectorGadget

Question

I was running the simple code below to scrape the number of employees from this Fortune 500 page. I used the Chrome extension SelectorGadget to identify that the number I want matches the selector ".info__row--7f9lE:nth-child(13) .info__value--2AHH7".

library(rvest)
library(dplyr)
# download the Google Chrome extension SelectorGadget
link = "https://fortune.com/company/walmart/"
page = read_html(link)
Employees = page %>% html_nodes(".info__row--7f9lE:nth-child(13) .info__value--2AHH7") %>% html_text()
Employees

However, it returned "character(0)". Does anyone know what the cause is? I feel it must be a simple mistake somewhere. Thanks in advance!

Update

Here is the code I modified based on Jon's comments.

a <- c(
  "https://fortune.com/company/walmart/",
  "https://fortune.com/company/amazon-com/",
  "https://fortune.com/company/apple/",
  "https://fortune.com/company/cvs-health/",
  "https://fortune.com/company/unitedhealth-group/",
  "https://fortune.com/company/berkshire-hathaway/",
  "https://fortune.com/company/mckesson/",
  "https://fortune.com/company/amerisourcebergen/",
  "https://fortune.com/company/alphabet/",
  "https://fortune.com/company/exxon-mobil/",
  "https://fortune.com/company/att/",
  "https://fortune.com/company/costco/",
  "https://fortune.com/company/cigna/",
  "https://fortune.com/company/cardinal-health/",
  "https://fortune.com/company/microsoft/",
  "https://fortune.com/company/walgreens-boots-alliance/",
  "https://fortune.com/company/kroger/",
  "https://fortune.com/company/home-depot/",
  "https://fortune.com/company/jpmorgan-chase/",
  "https://fortune.com/company/verizon/",
  "https://fortune.com/company/ford-motor/",
  "https://fortune.com/company/general-motors/",
  "https://fortune.com/company/anthem/",
  "https://fortune.com/company/centene/",
  "https://fortune.com/company/fannie-mae/",
  "https://fortune.com/company/comcast/",
  "https://fortune.com/company/chevron/",
  "https://fortune.com/company/dell-technologies/",
  "https://fortune.com/company/bank-of-america-corp/",
  "https://fortune.com/company/target/"
)


find_by_name <- function(list_data, name, elem = NULL) {
  idx <- which(sapply(list_data, \(x) x$name) == name, arr.ind = TRUE)
  stopifnot(length(idx) > 0)
  if (length(idx) > 1) { idx <- idx[1] }
  dat <- list_data[[idx]]
  if (is.null(elem)) dat else dat[[elem]]
}

numEmp <- numeric()

for (i in 1:length(a)){
  json_data <- read_html(a[i]) |>
    html_element("script#preload") |> 
    html_text() |>
    sub("\\s*window\\.__PRELOADED_STATE__ = ", "", x = _, perl = TRUE) |>
    sub(";\\s*$", "", x = _, perl = TRUE) |>
    fromJSON(simplifyVector = FALSE)
  
  
  
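  # Strip the host to get the URL path (e.g. "/company/walmart/"), used as the key into the page data
  temp <- gsub(".*https://fortune.com", "", a[i])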
  page_data <- json_data$components$page[[temp]]
  
  info_data <- page_data |> 
    find_by_name("body", "children") |>
    find_by_name("company-about-wrapper", "children") |>
    find_by_name("company-information", "config")
  
  
  numEmp[i] <- info_data$employees # Results will be fed into this numEmp variable.
}
numEmp

An error says

Error in find_by_name(page_data, "body", "children") : length(idx) > 0 is not TRUE

Should I somehow change the code stopifnot(length(idx) > 0)?
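
One way to keep the loop running and see which pages lack the expected structure is a lookup that returns NULL instead of stopping. The following is a minimal sketch on made-up list data, assuming R 4.1 or later for the native pipe. `find_by_name_safe` and `get_employees` are hypothetical helpers that are not part of the original code, and the sketch does not explain why a particular page has no "body" element.

```r
# Hypothetical variant of find_by_name(): returns NULL instead of stopping
# when the input is NULL or no element with the requested name exists.
find_by_name_safe <- function(list_data, name, elem = NULL) {
  if (is.null(list_data)) return(NULL)
  # Collect each element's name; elements without a usable name become NA
  nms <- vapply(
    list_data,
    function(x) if (is.list(x) && is.character(x$name)) x$name[1] else NA_character_,
    character(1)
  )
  idx <- which(nms == name)
  if (length(idx) == 0) return(NULL)
  dat <- list_data[[idx[1]]]
  if (is.null(elem)) dat else dat[[elem]]
}

# Made-up stand-in for one page's data (not the real fortune.com structure)
toy_page <- list(
  list(name = "header", children = list()),
  list(name = "body", children = list(
    list(name = "company-information", config = list(employees = "2300000"))
  ))
)

# Returns NA when any step of the lookup comes back empty
get_employees <- function(page_data) {
  info <- page_data |>
    find_by_name_safe("body", "children") |>
    find_by_name_safe("company-information", "config")
  if (is.null(info$employees)) NA_character_ else info$employees
}

pages <- list(ok = toy_page, missing = list(list(name = "header")), absent = NULL)
num_emp <- vapply(pages, get_employees, character(1))
print(num_emp)
```

Entries that come back as NA point at the URLs whose JSON should be inspected by hand, which is usually more informative than relaxing the stopifnot() check.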

Are you sure this is supposed to work? It seems that the data on fortune.com is dynamically produced/rendered using JavaScript (and is therefore not present in the HTML source). I imagine you'd need something like RSelenium for this.
– Maurits Evers, Commented Jul 11, 2022 at 1:57

Jon Manese · Accepted Answer · 2022-07-12 03:40:20Z

3

Running document.querySelectorAll(".info__row--7f9lE:nth-child(13) .info__value--2AHH7") in the browser shows that you want to scrape the number of employees. Maurits is right: it looks like the data is delivered as inline JSON and rendered by JavaScript afterwards, so it is not in the HTML that read_html() parses. You can either use Selenium to save the rendered page and apply your CSS selector there, or extract the inline JSON and read the value from it.
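
To illustrate the difference offline, here is a small sketch on a made-up HTML string. The markup and the JSON content are assumptions for demonstration only; the real page is far larger and nests the value much deeper.

```r
library(rvest)
library(jsonlite)

# Made-up page: the visible markup has no employee row; the data sits in a script tag
html <- '<html><body>
<div id="root"></div>
<script id="preload">
  window.__PRELOADED_STATE__ = {"employees": "2300000"};
</script>
</body></html>'

page <- read_html(html)

# The CSS selector matches nothing, because the row would only be created by JavaScript
static_result <- page |>
  html_elements(".info__value--2AHH7") |>
  html_text()
print(static_result)

# The inline JSON, however, is present in the static source
state <- page |>
  html_element("script#preload") |>
  html_text() |>
  trimws()
state <- sub("^window\\.__PRELOADED_STATE__ = ", "", state)
state <- sub(";$", "", state)
employees <- fromJSON(state)$employees
print(employees)
```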

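After some manual work, you can do the second option as shown below. The code as written runs on R 4.1.x; the commented-out lines show the R 4.2.x equivalent using the _ placeholder.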

library(rvest)
library(jsonlite)

# R 4.1.x: wrapper that takes x first, so sub() works with |> without the _ placeholder
sub2 <- function(x, pattern, replacement) sub(pattern, replacement, x = x, perl = TRUE)

url <- "https://fortune.com/company/walmart/"
json_data <- read_html(url) |>
  html_element("script#preload") |> 
  html_text() |>
  ## sub("\\s*window\\.__PRELOADED_STATE__ = ", "", x = _, perl = TRUE) |> # R 4.2.x
  sub2("\\s*window\\.__PRELOADED_STATE__ = ", "") |>                       # R 4.1.x
  ## sub(";\\s*$", "", x = _, perl = TRUE) |>  # R 4.2.x
  sub2(";\\s*$", "") |>                        # R 4.1.x
  fromJSON(simplifyVector = FALSE)

page_data <- json_data$components$page[["/company/walmart/"]]

find_by_name <- function(list_data, name, elem = NULL) {
  idx <- which(sapply(list_data, \(x) x$name) == name, arr.ind = TRUE)
  stopifnot(length(idx) > 0)
  if (length(idx) > 1) { idx <- idx[1] }
  dat <- list_data[[idx]]
  if (is.null(elem)) dat else dat[[elem]]
}

info_data <- page_data |> 
  find_by_name("body", "children") |>
  find_by_name("company-about-wrapper", "children") |>
  find_by_name("company-information", "config")

info_data$employees
#> [1] "2300000"

# Extra code to scrape company-data-table segments
library(purrr)
data_tables <- page_data |>
  find_by_name("body", "children") |>
  find_by_name("company-about-wrapper", "children") |>
  find_by_name("company-table-wrapper", "children")

rows <- data_tables |>
  lapply(\(x) c(x$config$data, x$config$change)) |>
  purrr::flatten() |>
  discard(~ is.null(.$key))

df <- data.frame(
  key = rows |> map_chr(~ .$key),
  title = rows |> map_chr(~ .$fieldMeta$title),
  type = rows |> map_chr(~ .$fieldMeta$type),
  value = rows |> map_chr(~ .$value)
)
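
If some entries lack a title, type, or value, map_chr() stops with an error. Below is a hedged base-R sketch of the same flattening that fills missing fields with NA instead. The mock data only mirrors the shape that the code above assumes (config$data and config$change holding entries with key, fieldMeta, and value); the keys and values are invented, and `or_na` is a hypothetical helper.

```r
# Mock of the assumed shape; field contents are invented for illustration
mock_tables <- list(
  list(config = list(
    data = list(
      list(key = "revenues", fieldMeta = list(title = "Revenues", type = "money"), value = "100"),
      list(key = "profits", fieldMeta = list(title = "Profits", type = "money"), value = NULL)
    ),
    change = list(
      list(key = NULL, fieldMeta = list(title = "ignored"), value = "x")
    )
  ))
)

# Hypothetical helper: turn NULL or empty fields into NA so each row has one value
or_na <- function(x) {
  if (is.null(x) || length(x) == 0) NA_character_ else as.character(x[[1]])
}

# Concatenate the data and change entries of every table, then drop entries without a key
rows <- do.call(c, lapply(mock_tables, \(x) c(x$config$data, x$config$change)))
rows <- Filter(\(r) !is.null(r$key), rows)

df <- data.frame(
  key   = vapply(rows, \(r) or_na(r$key), character(1)),
  title = vapply(rows, \(r) or_na(r$fieldMeta$title), character(1)),
  type  = vapply(rows, \(r) or_na(r$fieldMeta$type), character(1)),
  value = vapply(rows, \(r) or_na(r$value), character(1))
)
print(df)
```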

edited Jul 12, 2022 at 3:40

answered Jul 11, 2022 at 4:34


7 Comments

Xian Zhao Over a year ago

Thanks, Jon, for your response and for pointing out two directions! I was trying to run your code above. However, the underscore symbol in "x = _," was marked as an error in my RStudio, and I got the following error message: "Error: unexpected input in: " html_text() |> sub("\\s*window\\.__PRELOADED_STATE__ = ", "", x = _"" Do you have any thoughts on what is causing the error?

Jon Manese Over a year ago

You're probably using R 4.1.x, where the _ placeholder does not work (it is new in R 4.2.0) even though |> and \(x) x$name do. I recommend upgrading to 4.2. If you can't, see sub2 in the revised code, which rearranges the arguments to make sub() compatible with pipes. If you don't like pipes, you can expand the chain one transformation at a time using intermediate variables.
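
As a minimal sketch of those alternatives, here are the two substitutions applied to a toy string that stands in for the script text (the string is an assumption, not the real page content). The first option uses intermediate variables; the second pipes into an anonymous function, which works in R 4.1.x without the _ placeholder.

```r
# Toy stand-in for the text of the script#preload element
raw_text <- '\n  window.__PRELOADED_STATE__ = {"employees": "2300000"};\n'

# Option 1: intermediate variables, no pipe needed
step1 <- sub("\\s*window\\.__PRELOADED_STATE__ = ", "", raw_text, perl = TRUE)
step2 <- sub(";\\s*$", "", step1, perl = TRUE)

# Option 2: pipe into anonymous functions (R >= 4.1), no `_` placeholder
step2_piped <- raw_text |>
  (\(x) sub("\\s*window\\.__PRELOADED_STATE__ = ", "", x, perl = TRUE))() |>
  (\(x) sub(";\\s*$", "", x, perl = TRUE))()

print(step2)
print(identical(step2, step2_piped))
```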

Xian Zhao Over a year ago

Jon, thank you so much for your reply! One more question, I'm really new to this: If I want to scrape other numbers on this same webpage (e.g., Measure Up Rank's "20"), how should I modify the code you drafted above? (I tried but couldn't figure this out 😭; I really appreciate your help!)

Xian Zhao Over a year ago

Jon, sorry, I have to scrape almost all of the table data. Here is something I tried that didn't work:

info_data <- page_data |>
  find_by_name("body", "children") |>
  find_by_name("company-about-wrapper", "children") |>
  find_by_name("company-data-table", "children") |>
  find_by_name("table", "children") |>
  find_by_name("rows", "value")

Jon Manese Over a year ago

The company-data-table objects have a different structure so find_by_name is not applicable. I added extra code to scrape them.
