
I was running the simple code below to scrape the number of employees from this Fortune 500 page. I used the Chrome extension SelectorGadget to identify that the number I want matches the selector ".info__row--7f9lE:nth-child(13) .info__value--2AHH7".

library(rvest)
library(dplyr)
# Selector found with the Google Chrome extension SelectorGadget
link = "https://fortune.com/company/walmart/"
page = read_html(link)
Employees = page %>% html_nodes(".info__row--7f9lE:nth-child(13) .info__value--2AHH7") %>% html_text()
Employees

However, it returned "character(0)". Does anyone know what the cause is? I feel it must be a simple mistake somewhere. Thanks in advance!

Update

Here is the code I modified based on Jon's comments.

a <- c(
  "https://fortune.com/company/walmart/",
  "https://fortune.com/company/amazon-com/",
  "https://fortune.com/company/apple/",
  "https://fortune.com/company/cvs-health/",
  "https://fortune.com/company/unitedhealth-group/",
  "https://fortune.com/company/berkshire-hathaway/",
  "https://fortune.com/company/mckesson/",
  "https://fortune.com/company/amerisourcebergen/",
  "https://fortune.com/company/alphabet/",
  "https://fortune.com/company/exxon-mobil/",
  "https://fortune.com/company/att/",
  "https://fortune.com/company/costco/",
  "https://fortune.com/company/cigna/",
  "https://fortune.com/company/cardinal-health/",
  "https://fortune.com/company/microsoft/",
  "https://fortune.com/company/walgreens-boots-alliance/",
  "https://fortune.com/company/kroger/",
  "https://fortune.com/company/home-depot/",
  "https://fortune.com/company/jpmorgan-chase/",
  "https://fortune.com/company/verizon/",
  "https://fortune.com/company/ford-motor/",
  "https://fortune.com/company/general-motors/",
  "https://fortune.com/company/anthem/",
  "https://fortune.com/company/centene/",
  "https://fortune.com/company/fannie-mae/",
  "https://fortune.com/company/comcast/",
  "https://fortune.com/company/chevron/",
  "https://fortune.com/company/dell-technologies/",
  "https://fortune.com/company/bank-of-america-corp/",
  "https://fortune.com/company/target/"
)


find_by_name <- function(list_data, name, elem = NULL) {
  idx <- which(sapply(list_data, \(x) x$name) == name, arr.ind = TRUE)
  stopifnot(length(idx) > 0)
  if (length(idx) > 1) { idx <- idx[1] }
  dat <- list_data[[idx]]
  if (is.null(elem)) dat else dat[[elem]]
}

numEmp <- numeric()

for (i in 1:length(a)){
  json_data <- read_html(a[i]) |>
    html_element("script#preload") |> 
    html_text() |>
    sub("\\s*window\\.__PRELOADED_STATE__ = ", "", x = _, perl = TRUE) |>
    sub(";\\s*$", "", x = _, perl = TRUE) |>
    fromJSON(simplifyVector = FALSE)
  
  
  
  # Strip the host so the remainder (e.g. "/company/walmart/") can be used as the page key
  temp <- gsub(".*https://fortune.com", "", a[i])
  page_data <- json_data$components$page[[temp]]
  
  info_data <- page_data |> 
    find_by_name("body", "children") |>
    find_by_name("company-about-wrapper", "children") |>
    find_by_name("company-information", "config")
  
  
  numEmp[i] <- info_data$employees # Results will be fed into this numEmp variable.
}
numEmp

Running this produces the following error:

Error in find_by_name(page_data, "body", "children") : length(idx) > 0 is not TRUE

Should I somehow change the code stopifnot(length(idx) > 0)?

    Are you sure this is supposed to work? It seems that data on fortune.com is dynamically produced/rendered using JavaScript (and therefore not present in the html source). I imagine you'd need something like RSelenium for this. Commented Jul 11, 2022 at 1:57

1 Answer


Running document.querySelectorAll(".info__row--7f9lE:nth-child(13) .info__value--2AHH7") in the browser console shows that you want to scrape the number of employees. Maurits is right: the data appears to be delivered as inline JSON and rendered afterwards by JavaScript, so it is not in the HTML that read_html() sees. You have two options: use Selenium to save the rendered page and apply your CSS selector there, or extract the inline JSON and read the value from it.
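A quick way to confirm this diagnosis is to check what the static HTML actually contains. The sketch below uses a small hand-written page (a hypothetical stand-in, not the real fortune.com markup) in which the value lives only inside a script tag. The CSS selector matches nothing, which is why html_text() gives character(0), while script#preload is present.

```r
library(rvest)

# Hypothetical stand-in for the static HTML that read_html() would receive:
# no rendered rows, only an empty container and an inline script.
static_html <- minimal_html('
  <div id="app"></div>
  <script id="preload">window.__PRELOADED_STATE__ = {"employees":"2300000"};</script>
')

# The rendered rows do not exist in the static source, so nothing matches.
rendered <- html_elements(
  static_html,
  ".info__row--7f9lE:nth-child(13) .info__value--2AHH7"
)
employees_from_css <- html_text(rendered)
length(employees_from_css)

# The inline script that carries the data is there.
preload_nodes <- html_elements(static_html, "script#preload")
preload_found <- length(preload_nodes) == 1
preload_text <- html_text(preload_nodes[[1]])
```

Against the real page, the same two checks on read_html(url) would tell you whether to go for the inline JSON or for a rendered page.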

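After some manual inspection of the JSON, the second option looks like the code below. As written it runs on R 4.1.x; the commented-out lines show the R 4.2.x equivalent using the _ placeholder.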

library(rvest)
library(jsonlite)

# R 4.1.x: helper that takes x as its first argument, so sub() can be piped without the _ placeholder
sub2 <- function(x, pattern, replacement) sub(pattern, replacement, x = x, perl = TRUE)

url <- "https://fortune.com/company/walmart/"
json_data <- read_html(url) |>
  html_element("script#preload") |> 
  html_text() |>
  ## sub("\\s*window\\.__PRELOADED_STATE__ = ", "", x = _, perl = TRUE) |> # R 4.2.x
  sub2("\\s*window\\.__PRELOADED_STATE__ = ", "") |>                       # R 4.1.x
  ## sub(";\\s*$", "", x = _, perl = TRUE) |>  # R 4.2.x
  sub2(";\\s*$", "") |>                        # R 4.1.x
  fromJSON(simplifyVector = FALSE)

page_data <- json_data$components$page[["/company/walmart/"]]

find_by_name <- function(list_data, name, elem = NULL) {
  idx <- which(sapply(list_data, \(x) x$name) == name, arr.ind = TRUE)
  stopifnot(length(idx) > 0)
  if (length(idx) > 1) { idx <- idx[1] }
  dat <- list_data[[idx]]
  if (is.null(elem)) dat else dat[[elem]]
}

info_data <- page_data |> 
  find_by_name("body", "children") |>
  find_by_name("company-about-wrapper", "children") |>
  find_by_name("company-information", "config")

info_data$employees
#> [1] "2300000"

# Extra code to scrape company-data-table segments
library(purrr)
data_tables <- page_data |>
  find_by_name("body", "children") |>
  find_by_name("company-about-wrapper", "children") |>
  find_by_name("company-table-wrapper", "children")

rows <- data_tables |>
  lapply(\(x) c(x$config$data, x$config$change)) |>
  purrr::flatten() |>
  discard(~ is.null(.$key))

df <- data.frame(
  key = rows |> map_chr(~ .$key),
  title = rows |> map_chr(~ .$fieldMeta$title),
  type = rows |> map_chr(~ .$fieldMeta$type),
  value = rows |> map_chr(~ .$value)
)
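When this approach is looped over several company pages, a page whose entry is missing from json_data$components$page makes find_by_name() stop with "length(idx) > 0 is not TRUE", because a NULL page_data has no child named "body". One possible way to keep the loop going is to wrap the lookup in tryCatch() and record NA for that page. This is a minimal sketch: get_employees() is a hypothetical helper, and pages is toy data shaped like the structure the code above walks, not real fortune.com content.

```r
# Same helper as above, repeated so this sketch runs on its own.
find_by_name <- function(list_data, name, elem = NULL) {
  idx <- which(sapply(list_data, \(x) x$name) == name, arr.ind = TRUE)
  stopifnot(length(idx) > 0)
  if (length(idx) > 1) { idx <- idx[1] }
  dat <- list_data[[idx]]
  if (is.null(elem)) dat else dat[[elem]]
}

# Hypothetical wrapper: returns NA instead of stopping when a lookup fails.
get_employees <- function(page_data) {
  tryCatch({
    info <- page_data |>
      find_by_name("body", "children") |>
      find_by_name("company-about-wrapper", "children") |>
      find_by_name("company-information", "config")
    if (is.null(info$employees)) NA_character_ else as.character(info$employees)
  }, error = function(e) NA_character_)
}

# Toy data with made-up values; only "/company/a/" exists.
pages <- list(
  "/company/a/" = list(
    list(name = "head", children = list()),
    list(name = "body", children = list(
      list(name = "company-about-wrapper", children = list(
        list(name = "company-information", config = list(employees = "100"))
      ))
    ))
  )
)

paths <- c("/company/a/", "/company/missing/")
num_emp <- vapply(paths, \(p) get_employees(pages[[p]]), character(1))
num_emp
```

The NA entries then show which pages need a closer look, for example by printing names(json_data$components$page) for that URL.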

7 Comments

Thanks, Jon, for your response and for pointing out the two directions! I tried to run your code above, but the underscore in "x = _" was marked as an error in my RStudio, and I got the following error message: "Error: unexpected input in: " html_text() |> sub("\\s*window\\.__PRELOADED_STATE__ = ", "", x = _"". Do you have any thoughts on what is causing the error?
You're probably using R 4.1.x, since the _ placeholder does not work there (it is new in R 4.2.0) while |> and \(x) x$name do. I recommend upgrading to 4.2. If you can't, see sub2 in the revised code, which rearranges the arguments to make sub() compatible with pipes. If you don't like pipes, you can expand the chain one transformation at a time using intermediate variables.
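For reference, the step-by-step form without pipes could look like the sketch below. The script_text value is a made-up stand-in for what html_text() returns; on the real page it would come from the script#preload element.

```r
library(jsonlite)

# Hypothetical stand-in for the text of the script#preload element.
script_text <- '  window.__PRELOADED_STATE__ = {"components":{"page":{"/company/walmart/":[]}}};  '

# Step 1: drop the JavaScript assignment in front of the JSON.
step1 <- sub("\\s*window\\.__PRELOADED_STATE__ = ", "", script_text, perl = TRUE)

# Step 2: drop the trailing semicolon and any whitespace after it.
step2 <- sub(";\\s*$", "", step1, perl = TRUE)

# Step 3: parse what is left as JSON, keeping nested lists as lists.
json_data <- fromJSON(step2, simplifyVector = FALSE)

page_keys <- names(json_data$components$page)
page_keys
```

Each intermediate variable can be printed on its own, which makes it easier to see at which step something goes wrong.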
Jon, thank you so much for your reply! One more question, I'm really new to this: If I want to scrape other numbers on this same webpage (e.g., Measure Up Rank's "20"), how should I modify the code you drafted above? (I tried but couldn't figure this out 😭; I really appreciate your help!)
Jon, sorry, I have to scrape almost all the table data. Something I tried that didn't work: info_data <- page_data |> find_by_name("body", "children") |> find_by_name("company-about-wrapper", "children") |> find_by_name("company-data-table", "children") |> find_by_name("table", "children") |> find_by_name("rows", "value")
The company-data-table objects have a different structure so find_by_name is not applicable. I added extra code to scrape them.
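To illustrate the shape that the extra code relies on, here is a base R sketch on toy data. It assumes, as the extra code implies, that each table child carries config$data and config$change lists whose items have key, value and fieldMeta fields. All keys, titles and values below are made-up placeholders, not data from the page.

```r
# Hypothetical stand-in for the children of "company-table-wrapper".
data_tables <- list(
  list(name = "company-data-table", config = list(
    data = list(
      list(key = "metric_a", value = "111",
           fieldMeta = list(title = "Metric A", type = "money"))
    ),
    change = list(
      list(key = "metric_a_change", value = "1.5",
           fieldMeta = list(title = "Metric A Change", type = "percent"))
    )
  )),
  list(name = "company-data-table", config = list(
    data = list(
      list(key = "metric_b", value = "222",
           fieldMeta = list(title = "Metric B", type = "money"))
    ),
    change = NULL
  ))
)

# Collect the data and change items of every table into one flat list.
rows <- do.call(c, lapply(data_tables, \(x) c(x$config$data, x$config$change)))

# Drop items without a key, mirroring discard() in the extra code.
rows <- Filter(\(r) !is.null(r$key), rows)

df <- data.frame(
  key   = vapply(rows, \(r) r$key, character(1)),
  title = vapply(rows, \(r) r$fieldMeta$title, character(1)),
  type  = vapply(rows, \(r) r$fieldMeta$type, character(1)),
  value = vapply(rows, \(r) r$value, character(1))
)
df
```

Several sibling objects share the same name here, and find_by_name() only ever returns the first match, so collecting all of them in one pass is the simpler route.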