
I'm able to scrape data from basic HTML pages, but I'm having trouble scraping the site below. It looks like the data is rendered via JavaScript, and I'm not sure how to approach that. I'd prefer to use R to scrape, if possible, but could also use Python.

Any ideas/suggestions?

Edit: I need to grab the Year/Manufacturer/Model, the S/N, the Price, the Location, and the short description (starts with "Auction:") for each listing.

http://www.machinerytrader.com/list/list.aspx?bcatid=4&DidSearch=1&EID=1&LP=MAT&ETID=5&catid=1015&mdlx=Contains&Cond=All&SO=26&btnSearch=Search&units=imperial

2 Comments

  • Look into Selenium. There are a few examples of its use via R here on SO, but not many. Commented Mar 5, 2014 at 17:11
  • Use CasperJS: it lets you connect to the page and wait for elements to load. You can also inject JavaScript directly into the page context. Commented Mar 5, 2014 at 17:17

2 Answers

library(XML)
library(relenium)

## download the page source with a relenium-driven Firefox session
website <- firefoxClass$new()
website$get("http://www.machinerytrader.com/list/list.aspx?pg=1&bcatid=4&DidSearch=1&EID=1&LP=MAT&ETID=5&catid=1015&mdlx=Contains&Cond=All&SO=26&btnSearch=Search&units=imperial")
doc <- htmlParse(website$getPageSource())

## each listing spans two tables: an even-indexed header table and the
## following odd-indexed content table; bind the headers, then append
## the auction text from the content tables
tables <- readHTMLTable(doc, stringsAsFactors = FALSE)
data <- do.call(rbind, tables[seq(from = 8, to = 56, by = 2)])
data <- cbind(data, sapply(lapply(tables[seq(from = 9, to = 57, by = 2)], '[[', i = 2), '[', 1))
rownames(data) <- NULL
names(data) <- c("year.man.model", "s.n", "price", "location", "auction")

This will give you what you want for the first page (showing just the first two lines here):

head(data,2)
      year.man.model      s.n      price location                                               auction
1 1972 AMERICAN 5530 GS14745W US $50,100       MI                   Auction: 1/9/2013; 4,796 Hours;  ..
2 AUSTIN-WESTERN 307      307  US $3,400       MT Auction: 12/18/2013;  AUSTIN-WESTERN track excavator.

To get all pages, just loop over them, substituting pg=i into the address.
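For instance, the per-page URLs can be built up front with sprintf() and then fed to website$get() one at a time (a minimal sketch; the page count here is a placeholder you would read off the site's pager):

```r
## format string with a %d slot for the page number
base_url <- "http://www.machinerytrader.com/list/list.aspx?pg=%d&bcatid=4&DidSearch=1&EID=1&LP=MAT&ETID=5&catid=1015&mdlx=Contains&Cond=All&SO=26&btnSearch=Search&units=imperial"
n_pages  <- 5  # hypothetical page count; determine it from the site in practice
urls <- sprintf(base_url, seq_len(n_pages))
urls[2]  # the address for page 2, with pg=2 substituted in
```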


3 Comments

Thanks for the quick response. When I run this code, however, I get null results. The readHTMLTable command doesn't seem to actually read anything. It just produces a null list. Any idea?
Also, I'm using Windows 7, if that makes a difference.
Thanks for pointing that out; you are right, I was indeed using a different setup that allowed a direct download. I've updated the answer to first download the source with relenium and then use readHTMLTable. It should work now!

Using Relenium:

require(relenium) # More info: https://github.com/LluisRamon/relenium
require(XML)

firefox <- firefoxClass$new() # init browser
res <- NULL
pages <- 1:2
## listing headers live in tables whose ids end in "tblListingHeader";
## the auction text sits in the matching "tblContent" tables
hdr <- '//table[substring(@id, string-length(@id)-15) = "tblListingHeader"]/tbody/tr/td[%d]'
for (page in pages) {
  url <- sprintf("http://www.machinerytrader.com/list/list.aspx?pg=%d&bcatid=4&DidSearch=1&EID=1&LP=MAT&ETID=5&catid=1015&mdlx=Contains&Cond=All&SO=26&btnSearch=Search&units=imperial", page)
  firefox$get(url)
  doc <- htmlParse(firefox$getPageSource())
  res <- rbind(res,
               cbind(year_manu_model = xpathSApply(doc, sprintf(hdr, 1), xmlValue),
                     sn    = xpathSApply(doc, sprintf(hdr, 2), xmlValue),
                     price = xpathSApply(doc, sprintf(hdr, 3), xmlValue),
                     loc   = xpathSApply(doc, sprintf(hdr, 4), xmlValue),
                     auc   = xpathSApply(doc, '//table[substring(@id, string-length(@id)-9) = "tblContent"]/tbody/tr/td[2]', xmlValue)))
}
sapply(as.data.frame(res), substr, 1, 30)
#      year_manu_model                  sn               price         loc   auc                               
# [1,] " 1972 AMERICAN 5530"            "GS14745W"       "US $50,100"  "MI " "\n\t\t\t\t\tAuction: 1/9/2013; 4,796" 
# [2,] " AUSTIN-WESTERN 307"            "307"            "US $3,400"   "MT " "\n\t\t\t\t\tDetails & Photo(s)Video(" 
# ...

4 Comments

Installed relenium, but I get "Error: WebDriverException" when I run your exact code above. Any idea on what might be causing this?
@lukeA - the error is gone, but the "auc" field has two issues: 1) it's not pulling the full text, and 2) it alternately pulls the "Details & Photos" text for some reason (the 1st record pulls the auction data, the 2nd record pulls Details & Photos, the 3rd record pulls auction data, and so on). Any idea?
Figured out the first issue - just set the sapply argument from 30 to 300. Also seeing that the "auc" field is pulling in \n\t\t\t\t\t for some reason.
@user3384596 sapply shortened the output a bit, which is stored in res. You should be able to strip trailing control characters easily using e.g. stringr::str_trim() or tm::stripWhitespace() or just gsub. To the other issue: adapt the xpath to fit your needs.
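Stripping those control characters can be done in base R without extra packages; a small sketch on a hypothetical auc value:

```r
## hypothetical raw value, as it comes out of xpathSApply
auc <- "\n\t\t\t\t\tAuction: 1/9/2013; 4,796 Hours"

## trim leading and trailing whitespace (newlines, tabs, spaces)
clean <- gsub("^\\s+|\\s+$", "", auc)
clean
# "Auction: 1/9/2013; 4,796 Hours"
```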
