
I'm able to scrape data from basic HTML pages, but I'm having trouble scraping the site below. It looks like the data is rendered via JavaScript, and I'm not sure how to approach that. I'd prefer to use R to scrape, if possible, but could also use Python.

Any ideas/suggestions?

Edit: I need to grab the Year/Manufacturer/Model, the S/N, the Price, the Location, and the short description (starts with "Auction:") for each listing.

http://www.machinerytrader.com/list/list.aspx?bcatid=4&DidSearch=1&EID=1&LP=MAT&ETID=5&catid=1015&mdlx=Contains&Cond=All&SO=26&btnSearch=Search&units=imperial

2 Comments

  • Look into Selenium. There are a few examples of its use via R here on SO, but not many. Commented Mar 5, 2014 at 17:11
  • Use CasperJS: it lets you connect to the page and wait for elements to load. You can also inject JavaScript directly into the page context. Commented Mar 5, 2014 at 17:17

2 Answers

library(XML)
library(relenium)

## download the page source with a relenium-driven Firefox session
website <- firefoxClass$new()
website$get("http://www.machinerytrader.com/list/list.aspx?pg=1&bcatid=4&DidSearch=1&EID=1&LP=MAT&ETID=5&catid=1015&mdlx=Contains&Cond=All&SO=26&btnSearch=Search&units=imperial")
doc <- htmlParse(website$getPageSource())

## each listing spans two tables: an even-indexed header table and the
## following odd-indexed content table; bind the headers, then append
## the auction text from the content tables
tables <- readHTMLTable(doc, stringsAsFactors = FALSE)
data <- do.call(rbind, tables[seq(from = 8, to = 56, by = 2)])
data <- cbind(data, sapply(lapply(tables[seq(from = 9, to = 57, by = 2)], '[[', i = 2), '[', 1))
rownames(data) <- NULL
names(data) <- c("year.man.model", "s.n", "price", "location", "auction")

This will give you what you want for the first page (showing just the first two lines here):

head(data,2)
      year.man.model      s.n      price location                                               auction
1 1972 AMERICAN 5530 GS14745W US $50,100       MI                   Auction: 1/9/2013; 4,796 Hours;  ..
2 AUSTIN-WESTERN 307      307  US $3,400       MT Auction: 12/18/2013;  AUSTIN-WESTERN track excavator.

To get all pages, just loop over them, substituting pg=i into the address.
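For instance, the per-page URLs can be built up front with sprintf() and then fed to website$get() one at a time (a minimal sketch; the page count here is a placeholder you would read off the site's pager):

```r
## format string with a %d slot for the page number
base_url <- "http://www.machinerytrader.com/list/list.aspx?pg=%d&bcatid=4&DidSearch=1&EID=1&LP=MAT&ETID=5&catid=1015&mdlx=Contains&Cond=All&SO=26&btnSearch=Search&units=imperial"
n_pages  <- 5  # hypothetical page count; determine it from the site in practice
urls <- sprintf(base_url, seq_len(n_pages))
urls[2]  # the address for page 2, with pg=2 substituted in
```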


3 Comments

Thanks for the quick response. When I run this code, however, I get null results. The readHTMLTable command doesn't seem to actually read anything. It just produces a null list. Any idea?
Also, I'm using Windows 7, if that makes a difference.
Thanks for pointing that out; you are right, I was indeed using a different setup that allowed a direct download. I've updated the answer to first download the source with relenium and then use readHTMLTable. It should work now!

Using Relenium:

require(relenium) # More info: https://github.com/LluisRamon/relenium
require(XML)

firefox <- firefoxClass$new() # init browser
res <- NULL
pages <- 1:2
## listing headers live in tables whose ids end in "tblListingHeader";
## the auction text sits in the matching "tblContent" tables
hdr <- '//table[substring(@id, string-length(@id)-15) = "tblListingHeader"]/tbody/tr/td[%d]'
for (page in pages) {
  url <- sprintf("http://www.machinerytrader.com/list/list.aspx?pg=%d&bcatid=4&DidSearch=1&EID=1&LP=MAT&ETID=5&catid=1015&mdlx=Contains&Cond=All&SO=26&btnSearch=Search&units=imperial", page)
  firefox$get(url)
  doc <- htmlParse(firefox$getPageSource())
  res <- rbind(res,
               cbind(year_manu_model = xpathSApply(doc, sprintf(hdr, 1), xmlValue),
                     sn    = xpathSApply(doc, sprintf(hdr, 2), xmlValue),
                     price = xpathSApply(doc, sprintf(hdr, 3), xmlValue),
                     loc   = xpathSApply(doc, sprintf(hdr, 4), xmlValue),
                     auc   = xpathSApply(doc, '//table[substring(@id, string-length(@id)-9) = "tblContent"]/tbody/tr/td[2]', xmlValue)))
}
sapply(as.data.frame(res), substr, 1, 30)
#      year_manu_model                  sn               price         loc   auc                               
# [1,] " 1972 AMERICAN 5530"            "GS14745W"       "US $50,100"  "MI " "\n\t\t\t\t\tAuction: 1/9/2013; 4,796" 
# [2,] " AUSTIN-WESTERN 307"            "307"            "US $3,400"   "MT " "\n\t\t\t\t\tDetails & Photo(s)Video(" 
# ...

4 Comments

Installed relenium, but I get "Error: WebDriverException" when I run your exact code above. Any idea on what might be causing this?
@lukeA - the error is gone, but the "auc" field has two issues: 1) it's not pulling the full text, and 2) it alternately pulls the "Details & Photos" text for some reason (the 1st record pulls the auction data, the 2nd record pulls Details & Photos, the 3rd record pulls auction data, and so on). Any idea?
Figured out the first issue - just set the sapply argument from 30 to 300. Also seeing that the "auc" field is pulling in \n\t\t\t\t\t for some reason.
@user3384596 sapply shortened the output a bit, which is stored in res. You should be able to strip trailing control characters easily using e.g. stringr::str_trim() or tm::stripWhitespace() or just gsub. To the other issue: adapt the xpath to fit your needs.
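Stripping those control characters can be done in base R without extra packages; a small sketch on a hypothetical auc value:

```r
## hypothetical raw value, as it comes out of xpathSApply
auc <- "\n\t\t\t\t\tAuction: 1/9/2013; 4,796 Hours"

## trim leading and trailing whitespace (newlines, tabs, spaces)
clean <- gsub("^\\s+|\\s+$", "", auc)
clean
# "Auction: 1/9/2013; 4,796 Hours"
```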
