
So I have a Nokogiri web scraper running perfectly on my local machine.

However, when I try to run the scraper in my production environment, I get a 403 error.

I believe this is down to the website blocking my server's IP address (probably because previous users of that IP got it blocked).

Is it possible to route the Nokogiri request from my web server through a proxy server? If so, how would I go about it?

This is the code I have at the moment.

doc = Nokogiri::HTML(open(URL HERE, 'User-Agent' => 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.854.0 Safari/535.2'))
  • Where are you getting the 403 from? From the websites you're trying to scrape? Commented Jun 21, 2016 at 9:09
  • Indeed I am. I'm under the impression that they've blocked the server's IP address; that's why I thought of a proxy. Commented Jun 21, 2016 at 9:33
  • Can you use Mechanize and a proxy for it? Look here or here Commented Jun 21, 2016 at 9:42
  • I had a very quick scan read. Isn't the Charles proxy thing a desktop client? Thanks Commented Jun 21, 2016 at 9:47
  • It's true for Charles, but it's just an example of a proxy, i.e. ("localhost", 8888) in the example, which might be anything for your purpose. Actually, you can simply pass a proxy to the open method (see the answer below); it's just that I was using Mechanize all the time as a wrapper on Nokogiri (a minimal sketch follows this list). Commented Jun 21, 2016 at 9:53
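
For reference, a minimal sketch of the Mechanize route mentioned above; the proxy host, port and target URL are placeholders, not real values:

require 'mechanize'

agent = Mechanize.new
# Placeholder proxy host and port; set_proxy also accepts optional user/password arguments.
agent.set_proxy('proxy.example.com', 8000)

page = agent.get('http://example.com/')  # placeholder URL
doc  = page.parser                       # the underlying Nokogiri::HTML::Document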

1 Answer


Actually, you can simply use the :proxy parameter of the OpenURI open method.

open(*rest, &block)
#open provides `open' for URI::HTTP and URI::FTP.

...

The hash may include other options, where keys are symbols:
:proxy

Synopsis:    
:proxy => "http://proxy.foo.com:8000/"
:proxy => URI.parse("http://proxy.foo.com:8000/")

If :proxy option is specified, the value should be String, URI, boolean or nil.
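
Plugged into the snippet from the question, that would look roughly like this (the target URL is a placeholder and the proxy address is taken from the synopsis above; substitute your own values):

require 'nokogiri'
require 'open-uri'

# Placeholder URL and proxy; replace both with your own values.
doc = Nokogiri::HTML(open('http://example.com/',
  'User-Agent' => 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.854.0 Safari/535.2',
  :proxy => 'http://proxy.foo.com:8000/'))

If the proxy needs credentials, OpenURI also accepts a :proxy_http_basic_authentication option (an array of proxy URL, user and password) in place of :proxy.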

Also, as a general consideration (being tedious now), you should search for alternatives to scraping content, especially if it's done on a regular basis, such as a supported API or alternative sources. If your current server IP got blocked, the same can happen to the proxy.


2 Comments

Probably you won't get good free proxies. Free proxies work sporadically, stop working occasionally, and so forth. You can work with them, but not for something that should be reliable. For reliable proxies you should look for paid services; there are many (hordes of them) and I can't judge which ones are good or bad.
Yeah, I would prefer an API, but the API the web provider uses is either out of date or not updated alongside the website.
