
I am trying to scrape the data corresponding to Table 5 from the following link: https://www.fbi.gov/about-us/cjis/ucr/crime-in-the-u.s/2013/crime-in-the-u.s.-2013/tables/5tabledatadecpdf/table_5_crime_in_the_united_states_by_state_2013.xls

As suggested, I used SelectorGadget to find the relevant CSS selector, and the one I found that contained all the data (along with some extraneous information) was "#page_content".

I've tried the following code, which yields errors:

library(rvest)

fbi <- read_html("https://www.fbi.gov/about-us/cjis/ucr/crime-in-the-u.s/2013/crime-in-the-u.s.-2013/tables/5tabledatadecpdf/table_5_crime_in_the_united_states_by_state_2013.xls")

fbi %>%
  html_node("#page_content") %>%
  html_table()
Error: html_name(x) == "table" is not TRUE

# Try extracting only the first column:
fbi %>%
  html_nodes(".group0") %>%
  html_table()
Error: html_name(x) == "table" is not TRUE

# Directly feed fbi into html_table
data <- fbi %>% html_table(fill = TRUE)
# This returns a list of 3 elements; elements 1 and 3 contain many missing values.
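To show what I mean, this is roughly how I have been inspecting that list; picking element 2 is only a guess on my part, based on elements 1 and 3 being mostly missing values:

str(data, max.level = 1)  # check the size of each list element
tbl <- data[[2]]          # guess: the middle element holds the actual table
head(tbl)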

Any help would be greatly appreciated!

  • It's got a "Download Excel" button, which is simpler. Commented Mar 21, 2016 at 6:27
  • Otherwise you can get more or less the table with fbi %>% read_html() %>% html_node('table.data') %>% html_table(fill = TRUE), but it's not very pretty (a runnable sketch of this follows below). Commented Mar 21, 2016 at 6:46
  • @alistaire I do agree that downloading as Excel is simpler. However, I would like others to quickly replicate my work by simply sourcing my .R file, without needing to download the data. Commented Mar 21, 2016 at 7:39
  • So call download.file on the link to that file, then parse it with xlsx or XLConnect. Commented Mar 21, 2016 at 7:42
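Expanding the one-liner from the second comment into something runnable, here is a rough sketch; the 'table.data' selector comes from that comment, and, as the commenter notes, the result is not very pretty and still needs cleanup of the header and footnote rows:

library(rvest)

url <- "https://www.fbi.gov/about-us/cjis/ucr/crime-in-the-u.s/2013/crime-in-the-u.s.-2013/tables/5tabledatadecpdf/table_5_crime_in_the_united_states_by_state_2013.xls"

# Scrape the HTML table behind the page; fill = TRUE pads ragged rows.
raw <- url %>%
  read_html() %>%
  html_node("table.data") %>%
  html_table(fill = TRUE)

head(raw)  # header and footnote rows still mixed in with the data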

1 Answer


You can download the Excel file directly. After that, open the file, pull out the data you want into a CSV, and then work on that data in R. Below is the code for downloading the file.

library(rvest)
library(stringr)

page <- read_html("https://www.fbi.gov/about-us/cjis/ucr/crime-in-the-u.s/2013/crime-in-the-u.s.-2013/tables/5tabledatadecpdf/table_5_crime_in_the_united_states_by_state_2013.xls")

pageAdd <- page %>%
  html_nodes("a") %>%       # find all links on the page
  html_attr("href") %>%     # get each link's URL
  str_subset("\\.xls") %>%  # keep the links pointing at .xls files
  .[[1]]                    # take the first match

mydestfile <- "D:/Kumar/table5.xls"  # change the path and file name as per your system
download.file(pageAdd, mydestfile, mode = "wb")

The data is not laid out in a very clean format, so scraping it directly in R will be more confusing. To me this appears to be the best way to solve your problem.
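If, as the asker mentioned in the comments, the whole workflow should run from a single .R file, one possible follow-up is to read the downloaded file with the readxl package. This is only a sketch; the number of header rows to skip is an assumption and needs to be checked against the actual spreadsheet:

library(readxl)

# Read the downloaded workbook; the FBI table has title/header rows at
# the top, so skip = 3 is only a guess here. Adjust after opening the
# file once and checking where the data actually starts.
table5 <- read_excel(mydestfile, skip = 3)
head(table5)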
