I need to make multiple (more than 100) HTTP requests to Google Scholar from Java code to collect data. However, the site blocks this after around 20 requests and presents a CAPTCHA. I have heard that Amazon Spot Instances let the IP address of the requesting system change periodically, which avoids the CAPTCHA by ensuring that the requests do not all come from a single IP. Can anyone help me with further details? (An alternative to Amazon EC2 Spot Instances is also fine.)

1 Answer

Changing IP addresses periodically isn't unique to Spot Instances within the Amazon environment; it is also available on On-Demand and Reserved Instances. The Amazon CLI will also let you assign, attach, detach, and release IP addresses.

Amazon's SDK lets you request a Spot Instance and attach an IP address to it. The API reference at http://docs.aws.amazon.com/AWSEC2/latest/APIReference/ApiReference-query-RunInstances.html is a good starting point. The SDK is well supported across a wide range of languages.

For Java, I would look at http://aws.amazon.com/sdkforjava/ and get your feet wet. It's a powerful API!

Depending on how much experience you have with the AWS environment, there are a few extra things to keep in mind, especially with Spot Instances. Spot Instances can terminate at any time (literally mid-query), so you should build your app to be stateless. A good solution is to send the results to an S3 bucket. This has the added benefit that you can deploy multiple instances at once and still have a single endpoint for data collection.
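To make the stateless idea concrete, here is a minimal sketch in plain Java. Everything in it is an assumption for illustration: `ResultStore`, `InMemoryStore`, `Fetcher`, and `process` are hypothetical names, not part of the AWS SDK. The in-memory store stands in for an S3 bucket so that the example runs without credentials; in a real deployment you would back `ResultStore` with S3 through the SDK. The point is that each result is written out as soon as it is fetched, and a restarted worker skips whatever is already stored.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class StatelessWorker {

    /** Hypothetical abstraction over durable storage such as an S3 bucket. */
    interface ResultStore {
        boolean contains(String key);
        void put(String key, String value);
    }

    /** In-memory stand-in for the durable store, used only so the sketch runs locally. */
    static class InMemoryStore implements ResultStore {
        final Map<String, String> data = new ConcurrentHashMap<>();

        @Override
        public boolean contains(String key) {
            return data.containsKey(key);
        }

        @Override
        public void put(String key, String value) {
            data.put(key, value);
        }
    }

    /** Hypothetical fetch step, e.g. one HTTP request per query. */
    interface Fetcher {
        String fetch(String query) throws Exception;
    }

    /**
     * Processes every query that has no stored result yet and returns how many
     * results were written in this run. No progress is kept in the worker itself,
     * so a replacement instance can call this with the same query list.
     */
    static int process(List<String> queries, ResultStore store, Fetcher fetcher) {
        int written = 0;
        for (String query : queries) {
            if (store.contains(query)) {
                continue; // already collected by this or another instance
            }
            try {
                store.put(query, fetcher.fetch(query)); // persist immediately
                written++;
            } catch (Exception e) {
                // Leave the query unprocessed; a later run will pick it up.
            }
        }
        return written;
    }

    public static void main(String[] args) {
        InMemoryStore store = new InMemoryStore();
        List<String> queries = Arrays.asList("query-1", "query-2", "query-3");
        int written = process(queries, store, q -> "result for " + q);
        System.out.println("Results written in this run: " + written);
    }
}
```

Because the only shared state is the store, several instances can work through the same query list, and an instance that is terminated mid-run loses at most the request it was making at that moment.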
