
I am trying to scrape the data corresponding to Table 5 from the following link: https://www.fbi.gov/about-us/cjis/ucr/crime-in-the-u.s/2013/crime-in-the-u.s.-2013/tables/5tabledatadecpdf/table_5_crime_in_the_united_states_by_state_2013.xls

As suggested, I used SelectorGadget to find the relevant CSS selector, and the one I found that contained all the data (along with some extraneous information) was "#page_content".

I've tried the following code, which yields errors:

library(rvest)

fbi <- read_html("https://www.fbi.gov/about-us/cjis/ucr/crime-in-the-u.s/2013/crime-in-the-u.s.-2013/tables/5tabledatadecpdf/table_5_crime_in_the_united_states_by_state_2013.xls")

fbi %>%
  html_node("#page_content") %>%
  html_table()
Error: html_name(x) == "table" is not TRUE

# Try extracting only the first column:
fbi %>%
  html_nodes(".group0") %>%
  html_table()
Error: html_name(x) == "table" is not TRUE

# Directly feed fbi into html_table
data <- fbi %>% html_table(fill = TRUE)
# This returns a list of 3 elements; elements 1 and 3 contain many missing values.
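To show what I mean, this is roughly how I have been inspecting that list; picking element 2 is only a guess on my part, based on elements 1 and 3 being mostly missing values:

str(data, max.level = 1)  # check the size of each list element
tbl <- data[[2]]          # guess: the middle element holds the actual table
head(tbl)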

Any help would be greatly appreciated!

  • It's got a "Download Excel" button, which is simpler. Commented Mar 21, 2016 at 6:27
  • Otherwise you can get more or less the table with fbi %>% read_html() %>% html_node('table.data') %>% html_table(fill = TRUE), but it's not very pretty (a runnable sketch of this follows below). Commented Mar 21, 2016 at 6:46
  • @alistaire I do agree that downloading as Excel is simpler. However, I would like others to quickly replicate my work by simply sourcing my .R file, without needing to download the data. Commented Mar 21, 2016 at 7:39
  • So call download.file on the link to that file, then parse it with xlsx or XLConnect. Commented Mar 21, 2016 at 7:42
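Expanding the one-liner from the second comment into something runnable, here is a rough sketch; the 'table.data' selector comes from that comment, and, as the commenter notes, the result is not very pretty and still needs cleanup of the header and footnote rows:

library(rvest)

url <- "https://www.fbi.gov/about-us/cjis/ucr/crime-in-the-u.s/2013/crime-in-the-u.s.-2013/tables/5tabledatadecpdf/table_5_crime_in_the_united_states_by_state_2013.xls"

# Scrape the HTML table behind the page; fill = TRUE pads ragged rows.
raw <- url %>%
  read_html() %>%
  html_node("table.data") %>%
  html_table(fill = TRUE)

head(raw)  # header and footnote rows still mixed in with the data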

1 Answer


You can download the Excel file directly. After that, open the file, pull out the data you want into a CSV, and then work on that data in R. Below is the code for downloading the file.

library(rvest)
library(stringr)

page <- read_html("https://www.fbi.gov/about-us/cjis/ucr/crime-in-the-u.s/2013/crime-in-the-u.s.-2013/tables/5tabledatadecpdf/table_5_crime_in_the_united_states_by_state_2013.xls")

pageAdd <- page %>%
  html_nodes("a") %>%       # find all links on the page
  html_attr("href") %>%     # get each link's URL
  str_subset("\\.xls") %>%  # keep the links pointing at .xls files
  .[[1]]                    # take the first match

mydestfile <- "D:/Kumar/table5.xls"  # change the path and file name as per your system
download.file(pageAdd, mydestfile, mode = "wb")

The data is not laid out in a very clean format, so scraping it directly in R will be more confusing. To me this appears to be the best way to solve your problem.
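If, as the asker mentioned in the comments, the whole workflow should run from a single .R file, one possible follow-up is to read the downloaded file with the readxl package. This is only a sketch; the number of header rows to skip is an assumption and needs to be checked against the actual spreadsheet:

library(readxl)

# Read the downloaded workbook; the FBI table has title/header rows at
# the top, so skip = 3 is only a guess here. Adjust after opening the
# file once and checking where the data actually starts.
table5 <- read_excel(mydestfile, skip = 3)
head(table5)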
