I was running the simple code below to scrape the number of employees from this Fortune 500 company page. I used the Chrome extension SelectorGadget to identify the selector that matches the number I want: ".info__row--7f9lE:nth-child(13) .info__value--2AHH7"
library(rvest)
library(dplyr)
# Selector found with the Google Chrome extension SelectorGadget
link = "https://fortune.com/company/walmart/"
page = read_html(link)
Employees = page %>% html_nodes(".info__row--7f9lE:nth-child(13) .info__value--2AHH7") %>% html_text()
Employees
However, it returned "character(0)". Does anyone know what the cause is? I feel it must be a simple mistake somewhere. Thanks in advance!
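One quick way to narrow this down is to check whether the selector matches anything in the HTML that read_html() actually receives. The sketch below uses a made-up, minimal page (built with rvest's minimal_html(), not the real Fortune markup) to show that a selector with no match in the static source gives exactly this kind of empty result. Running the same length() check on the real page object should show whether the element is simply absent from the downloaded HTML.

```r
library(rvest)

# Hypothetical stand-in for the downloaded page: an empty container that a
# browser would fill in later, so the value is not in the static source.
static_page <- minimal_html('<div id="root"></div>')

sel <- ".info__row--7f9lE:nth-child(13) .info__value--2AHH7"

# html_elements() returns an empty node set when nothing matches the selector
matches <- html_elements(static_page, sel)
length(matches)

# html_text() on an empty node set returns an empty character vector
html_text(matches)
```

If length() is 0 for the real page as well, the selector itself may be fine and the content is probably just not part of the HTML that read_html() downloads.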
Update
Here is the code I modified based on Jon's comments.
a <- c(
  "https://fortune.com/company/walmart/",
  "https://fortune.com/company/amazon-com/",
  "https://fortune.com/company/apple/",
  "https://fortune.com/company/cvs-health/",
  "https://fortune.com/company/unitedhealth-group/",
  "https://fortune.com/company/berkshire-hathaway/",
  "https://fortune.com/company/mckesson/",
  "https://fortune.com/company/amerisourcebergen/",
  "https://fortune.com/company/alphabet/",
  "https://fortune.com/company/exxon-mobil/",
  "https://fortune.com/company/att/",
  "https://fortune.com/company/costco/",
  "https://fortune.com/company/cigna/",
  "https://fortune.com/company/cardinal-health/",
  "https://fortune.com/company/microsoft/",
  "https://fortune.com/company/walgreens-boots-alliance/",
  "https://fortune.com/company/kroger/",
  "https://fortune.com/company/home-depot/",
  "https://fortune.com/company/jpmorgan-chase/",
  "https://fortune.com/company/verizon/",
  "https://fortune.com/company/ford-motor/",
  "https://fortune.com/company/general-motors/",
  "https://fortune.com/company/anthem/",
  "https://fortune.com/company/centene/",
  "https://fortune.com/company/fannie-mae/",
  "https://fortune.com/company/comcast/",
  "https://fortune.com/company/chevron/",
  "https://fortune.com/company/dell-technologies/",
  "https://fortune.com/company/bank-of-america-corp/",
  "https://fortune.com/company/target/"
)
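# Return the first element of list_data whose $name equals `name`
# (or only its `elem` component when elem is supplied).
find_by_name <- function(list_data, name, elem = NULL) {
  idx <- which(sapply(list_data, \(x) x$name) == name, arr.ind = TRUE)
  stopifnot(length(idx) > 0)  # fails when no element has the requested name
  if (length(idx) > 1) { idx <- idx[1] }
  dat <- list_data[[idx]]
  if (is.null(elem)) dat else dat[[elem]]
}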
library(rvest)
library(jsonlite)  # provides fromJSON()

numEmp <- numeric()
for (i in seq_along(a)) {
  # Extract the JSON assigned to window.__PRELOADED_STATE__ in <script id="preload">
  json_data <- read_html(a[i]) |>
    html_element("script#preload") |>
    html_text() |>
    sub("\\s*window\\.__PRELOADED_STATE__ = ", "", x = _, perl = TRUE) |>
    sub(";\\s*$", "", x = _, perl = TRUE) |>
    fromJSON(simplifyVector = FALSE)
  # Use the URL path (e.g. "/company/walmart/") as the page key
  temp <- gsub(".*https://fortune.com", "", a[i])
  page_data <- json_data$components$page[[temp]]
  info_data <- page_data |>
    find_by_name("body", "children") |>
    find_by_name("company-about-wrapper", "children") |>
    find_by_name("company-information", "config")
  numEmp[i] <- info_data$employees  # Results are collected in numEmp.
}
numEmp
This fails with the following error:
Error in find_by_name(page_data, "body", "children") : length(idx) > 0 is not TRUE
Should I somehow change the code stopifnot(length(idx) > 0)?
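Removing the stopifnot() line would probably only move the failure further down, since the check fires whenever no element at that level has the requested name. That includes the case where the input itself is NULL, for example if the page key built from the URL does not exist in the JSON. A minimal sketch of a more talkative variant is shown below; toy_page, available_names() and find_by_name_verbose() are hypothetical names invented for this illustration, and the toy structure is made up rather than taken from the real site.

```r
# Hypothetical miniature of the nested structure the loop walks through.
toy_page <- list(
  list(name = "header", children = list()),
  list(name = "body", children = list(
    list(name = "company-information", config = list(employees = "2300000"))
  ))
)

# Collect the $name of every element at one level (NA where there is none).
available_names <- function(list_data) {
  vapply(list_data, function(x) {
    if (is.list(x) && is.character(x$name)) x$name[1] else NA_character_
  }, character(1))
}

# Like find_by_name(), but the error lists which names were actually present.
find_by_name_verbose <- function(list_data, name, elem = NULL) {
  nms <- available_names(list_data)
  idx <- which(nms == name)
  if (length(idx) == 0) {
    stop("No element named '", name, "'. Available: ",
         if (length(nms) > 0) paste(nms, collapse = ", ")
         else "<none: input is NULL or empty>",
         call. = FALSE)
  }
  dat <- list_data[[idx[1]]]
  if (is.null(elem)) dat else dat[[elem]]
}

# Successful lookup on the toy data
body_children <- find_by_name_verbose(toy_page, "body", "children")
info <- find_by_name_verbose(body_children, "company-information", "config")
info$employees

# A NULL input (e.g. a page key that is not in the JSON) now gives a clearer message
res <- tryCatch(find_by_name_verbose(NULL, "body"),
                error = function(e) conditionMessage(e))
res
```

With the real data, printing names(json_data$components$page) next to temp would be one way to see whether the key used to look up page_data exists at all.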
The value appears to be generated by JavaScript (and is therefore not present in the HTML source). I imagine you'd need something like RSelenium for this.
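A possible alternative to driving a browser, and the route the update in the question attempts, is to read the data from the JSON that the page embeds in a script tag. The sketch below runs against a made-up page built with minimal_html(); the JSON layout (company, employees) is an assumption for illustration only and is not the real structure of the site.

```r
library(rvest)
library(jsonlite)

# Hypothetical stand-in for a downloaded page: the visible value is missing,
# but the data is embedded as JSON inside <script id="preload">.
toy_html <- minimal_html('
  <div id="root"></div>
  <script id="preload">
    window.__PRELOADED_STATE__ = {"company": {"employees": "2300000"}};
  </script>')

# Pull out the raw JavaScript text of the script tag
raw_js <- html_text(html_element(toy_html, "script#preload"))

# Strip the assignment prefix and the trailing semicolon to leave plain JSON
json_txt <- sub("^\\s*window\\.__PRELOADED_STATE__\\s*=\\s*", "", raw_js, perl = TRUE)
json_txt <- sub(";\\s*$", "", json_txt, perl = TRUE)

# Parse into nested lists and read the value
state <- fromJSON(json_txt, simplifyVector = FALSE)
state$company$employees
```

Whether this works on the real pages depends on the script tag and its JSON keeping the shape the code expects, so inspecting the parsed structure with str(state, max.level = 2) first is probably worthwhile.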