I am trying to scrape data from an API using the getURL function of the RCurl package in R. My problem is that I can't replicate in R the response I get when I open the URL in Chrome. When I open the API page (URL below) in Chrome it works fine, but if I request it using getURL in R (or open it in Chrome's incognito mode) I get a '500 Internal Server Error' response and not the pretty JSON that I'm looking for.

URL/API in question: http://www.bluenile.com/api/public/loose-diamond/diamond-details/panel?country=USA&currency=USD&language=en-us&productSet=BN&sku=LD04077082

Here is my (failed) request in [R].

test2 <- fromJSON(getURL("http://www.bluenile.com/api/public/loose-diamond/diamond-details/panel?country=USA&currency=USD&language=en-us&productSet=BN&sku=LD04077082", ssl.verifypeer = FALSE, useragent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.124 Safari/537.36"))

My Research so Far

First I looked at this prior question on Stack Overflow and added my user agent to the request (this did not solve the problem but may still be necessary): ViralHeat API issues with getURL() command in RCurl package

Next I looked at this helpful post which guides my rationale: R Disparity between browser and GET / getURL

My Ideas About the Solution

This is not my area of expertise, but my guess is that the request lacks a cookie needed to complete it (which would explain why it also fails in incognito mode in my browser). I compared the requests and responses from the successful and unsuccessful requests:

Successful request: (screenshot not preserved)

Unsuccessful request: (screenshot not preserved)

Does anyone have any ideas? Should I try the RSelenium package that MrFlick suggested in the second post linked above?

1 Answer

This is a courteous site. It would like to know where you come from, what currency you use, etc., to give you a better user experience. It does this by setting a multitude of cookies on the landing page. So we follow suit: navigate to the landing page first to get the cookies, then go to the page we want:

library(RCurl)
myURL <- "http://www.bluenile.com/api/public/loose-diamond/diamond-details/panel?country=USA&currency=USD&language=en-us&productSet=BN&sku=LD04077082"
agent <- "Mozilla/5.0 (Windows NT 6.3; WOW64; rv:32.0) Gecko/20100101 Firefox/32.0"

# Set RCurl options: cookiejar turns on cookie handling for this handle
curl <- getCurlHandle()
curlSetOpt(cookiejar = "cookies.txt", useragent = agent, followlocation = TRUE, curl = curl)
# Visit the landing page first so the site sets its cookies on the handle
firstPage <- getURL("http://www.bluenile.com", curl = curl)
# Request the API URL with the same handle, which now carries those cookies
myPage <- getURL(myURL, curl = curl)

library(RJSONIO)
> names(fromJSON(myPage))
[1] "diamondDetailsHeader" "diamondDetailsBodies" "pageMetadata"         "expandedUrl"         
[5] "newVersion"           "multiDiamond"  

and the cookies:

> getCurlInfo(curl)$cookielist
 [1] ".bluenile.com\tTRUE\t/\tFALSE\t2412270275\tGUID\tDA5C11F5_E468_46B5_B4E8_D551D4D6EA4D"                                                                    
 [2] ".bluenile.com\tTRUE\t/\tFALSE\t1475342275\tsplit\tver~3&presetFilters~TEST"                                                                               
 [3] ".bluenile.com\tTRUE\t/\tFALSE\t1727630275\tsitetrack\tver~2&jse~0"                                                                                        
 [4] ".bluenile.com\tTRUE\t/\tFALSE\t1425230275\tpop\tver~2&china~false&french~false&ie~false&internationalSelect~false&iphoneApp~false&survey~false&uae~false" 
 [5] ".bluenile.com\tTRUE\t/\tFALSE\t1475342275\tdsearch\tver~6&newUser~true"                                                                                   
 [6] ".bluenile.com\tTRUE\t/\tFALSE\t1443806275\tlocale\tver~1&country~IRL&currency~EUR&language~en-gb&productSet~BNUK"                                         
 [7] ".bluenile.com\tTRUE\t/\tFALSE\t0\tbnses\tver~1&ace~false&isbml~false&fbcs~false&ss~0&mbpop~false&sswpu~false&deo~false"                                   
 [8] ".bluenile.com\tTRUE\t/\tFALSE\t1727630275\tbnper\tver~5&NIB~0&DM~-&GUID~DA5C11F5_E468_46B5_B4E8_D551D4D6EA4D&SESS-CT~1&STC~32RPVK&FB_MINI~false&SUB~false"
 [9] "#HttpOnly_www.bluenile.com\tFALSE\t/\tFALSE\t0\tJSESSIONID\tB8475C3AEC08205E5AC6252C94E4B858"                                                             
[10] ".bluenile.com\tTRUE\t/\tFALSE\t1727630278\tmigrationstatus\tver~1&redirected~false"     
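
Each entry in that cookie list is one tab-separated line in the Netscape cookie-file layout that libcurl uses: domain, include-subdomains flag, path, secure flag, expiry (Unix time, 0 for a session cookie), name and value. If you want to inspect the cookies more comfortably, a minimal sketch along these lines may help. Note that parse_cookie_list() is a hypothetical helper written for this answer, not part of RCurl, and it assumes every entry has all seven fields (entries with an empty value would be skipped).

```r
# Two entries copied from the cookie list above (tab-separated fields).
cookie_lines <- c(
  ".bluenile.com\tTRUE\t/\tFALSE\t1443806275\tlocale\tver~1&country~IRL&currency~EUR&language~en-gb&productSet~BNUK",
  "#HttpOnly_www.bluenile.com\tFALSE\t/\tFALSE\t0\tJSESSIONID\tB8475C3AEC08205E5AC6252C94E4B858"
)

# Hypothetical helper (not an RCurl function): split cookie-list lines
# into a data frame with one row per cookie.
parse_cookie_list <- function(lines) {
  fields <- strsplit(lines, "\t", fixed = TRUE)
  # Keep only entries that have exactly seven fields.
  fields <- fields[vapply(fields, length, integer(1)) == 7L]
  mat <- matrix(as.character(unlist(fields)), ncol = 7, byrow = TRUE)
  data.frame(
    # HttpOnly cookies carry a "#HttpOnly_" prefix on the domain field.
    http_only = grepl("^#HttpOnly_", mat[, 1]),
    domain = sub("^#HttpOnly_", "", mat[, 1]),
    include_subdomains = mat[, 2] == "TRUE",
    path = mat[, 3],
    secure = mat[, 4] == "TRUE",
    expires = as.numeric(mat[, 5]),  # Unix time; 0 means session cookie
    name = mat[, 6],
    value = mat[, 7],
    stringsAsFactors = FALSE
  )
}

cookies <- parse_cookie_list(cookie_lines)
print(cookies[, c("domain", "name", "expires", "http_only")])
```

With a live handle you would presumably call parse_cookie_list(getCurlInfo(curl)$cookielist) instead of using the hard-coded lines.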

3 Comments

Awesome. I tried working with a cookiejar but that wasn't turning anything up. You had to visit their front page first. Clever. How did you know this was the case?
The fact that Chrome incognito was failing led me to look at the landing page and what was being set there.
Bravo, works perfectly! Impressive problem solving and understanding.
