I am downloading files from a set of URLs that I collect into a list. To process the list I use a loop, and during the loop I use rbind to append the results to a data.frame. The way I made it work does not seem like the best way, and I am wondering what other ways there are to accomplish this. To make the rbind work I had to take a copy of the file structure and use it as a blank template. There has to be a more R way to do this. So before I run the loop, I run it once to get the structure: final <- final[1,]

library(stringr)
library(gdata)
library(XML)

# get the files for department of revenue  OFM       

url = "http://dor.wa.gov/Content/AboutUs/StatisticsAndReports/stats_taxretail.aspx"

# use xml to get the names of the files that are xls and xlsx that have data
links = htmlParse(url)
src = xpathApply(links, "//a[@href]", xmlGetAttr, "href")
xls.src = src[grep(".xls", src, fixed=TRUE)]
# xls.src = xls.src[1:3] # testing size

base = "http://dor.wa.gov" 
for (i in seq_along(xls.src)){
  url = paste0(base, xls.src[[i]])
  download.file(url, destfile="file.xls")
  retail <- read.xls("file.xls", header=TRUE)
  names(retail) <- tolower(names(retail))
  retail <- retail[complete.cases(retail$location),c(1,2, 5, 6)]
  retail$year <- paste0(unlist(str_extract_all(url, "\\(?[0-9]")), collapse="")
  names(retail)[3:4] <- c("firms", "taxable sales")
  final = rbind(final, retail) # final starts here with 1 row of dummy data
}
# this removes the dummy first row used as a template
wa.retail <- final[-1, ]

1 Answer

Rather than using a for loop, use lapply to generate a list of data.frames. Then you can rbind them all at the end with do.call. Here's a sketch:

dfs <- lapply(xls.src, function(src) {
    # xls.src holds site-relative paths, so prepend base as in the question
    download.file(paste0(base, src), destfile="file.xls")
    read.xls("file.xls", header=TRUE)
})
final <- do.call(rbind, dfs)

Here dfs will be a list of data.frames, one for each call to read.xls. You can add back in all the data cleaning, but this is generally a better strategy: it needs no dummy template row, and it avoids copying the growing final data.frame on every iteration.
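To show where the cleaning steps would go, here is a minimal sketch that runs without any downloads. The fake_read helper and its toy data are hypothetical stand-ins for download.file plus read.xls, and the years vector stands in for the year extracted from each URL; only the lapply and do.call(rbind, ...) pattern comes from the answer.

```r
# Hypothetical stand-in for download.file() + read.xls(): one toy data.frame per "file".
fake_read <- function(year) {
  data.frame(Location = c("Seattle", NA, "Spokane"),
             Firms    = c(10, 5, 7),
             Sales    = c(100, 50, 70),
             stringsAsFactors = FALSE)
}

years <- c("2010", "2011")  # assumed inputs, in place of the scraped links

dfs <- lapply(years, function(yr) {
  retail <- fake_read(yr)
  names(retail) <- tolower(names(retail))              # same cleaning as the question
  retail <- retail[complete.cases(retail$location), ]  # drop rows with no location
  retail$year <- yr
  retail  # the last expression is what lapply stores for this element
})

# One rbind over the whole list, with no dummy template row needed.
final <- do.call(rbind, dfs)
```

With the real files, the body of the function would call download.file and read.xls instead of fake_read; everything else should carry over unchanged, assuming each file yields the same columns after cleaning.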

2 Comments

OK, I get the logic of lapply for what you show. However, when I add in the data cleaning steps, what is the file called when the data cleaning occurs, and how is dfs returned? Sorry for being slow. I keep getting errors that dfs is not found, and I must be missing something.
Just make sure the last line in the lapply function returns the data.frame you just cleaned. You can explicitly call return() if you like.
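A small sketch of that point, using made-up data.frames (raw, wrong, right and combined are illustrative names, not from the question). Inside the function the data.frame is whatever local name you give it (d here); dfs, or right in this sketch, only exists once lapply has finished. One likely pitfall, though I can't be sure it is the cause of the error above, is ending the function with an assignment, which returns the assigned value rather than the data.frame.

```r
raw <- list(data.frame(A = 1:2), data.frame(A = 3:4))  # toy inputs

# Pitfall: the last expression assigns to names(d), so each element of
# the result is the character vector of new names, not the data.frame.
wrong <- lapply(raw, function(d) {
  names(d) <- tolower(names(d))
})

# Fix: end the function with the cleaned data.frame itself.
right <- lapply(raw, function(d) {
  names(d) <- tolower(names(d))
  d  # equivalent to return(d)
})

combined <- do.call(rbind, right)
```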
