
I have written a script to do some research on HTTP Archive data. This script needs to make HTTP requests to sites scraped by HTTP Archive in order to classify them into groups (e.g., Drupal, WordPress). The script is working really well; however, the list of sites I am handling is 300,000 entries long.

I would like to complete the categorization of sites as fast as possible. I have experimented with running multiple instances of the script at the same time, and it is working well with appropriate locks in place to prevent race conditions.

How can I max this out and get all of these operations completed as fast as possible? For instance, I am looking at spinning up a VPS with 8 CPUs and 16 GB of RAM. How do I maximize these resources to make sure I'm using every bit of processing power possible? I may consider spinning up something more powerful, but I want to make sure I understand how to get the most out of it so I'm not wasting money.

Thanks!

  • Are your operations CPU-bound or I/O-bound? Commented Sep 7, 2014 at 6:34
  • I'm guessing more I/O-bound than CPU-bound, although networking is likely going to be the real bottleneck. Commented Sep 7, 2014 at 13:28

1 Answer


The multiprocessing module is the best option; it lets you harness the full power of your 8 CPUs: https://docs.python.org/3.3/library/multiprocessing.html
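For what it's worth, here is a minimal sketch of that approach using multiprocessing.Pool. The classify_site() function, the CMS fingerprints, the timeout, and the pool size are all illustrative assumptions, not the asker's actual script:

```python
# A minimal sketch, not the asker's actual script: classify_site(), the CMS
# fingerprints, the timeout, and the pool size are all assumptions.
from multiprocessing import Pool
import urllib.request

def classify_site(url):
    """Fetch one site and guess its CMS from simple page fingerprints."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            body = resp.read(65536).decode("utf-8", errors="replace")
    except Exception:
        return (url, "error")
    if "wp-content" in body:
        return (url, "WordPress")
    if "Drupal" in body:
        return (url, "Drupal")
    return (url, "unknown")

if __name__ == "__main__":
    sites = ["http://example.com"]  # stand-in for the 300,000-site list
    # The workers spend most of their time waiting on the network, so a pool
    # larger than the CPU count can pay off; 8 (one per CPU) is a starting point.
    with Pool(processes=8) as pool:
        # imap_unordered yields each result as soon as a worker finishes,
        # rather than waiting for the whole batch.
        for url, category in pool.imap_unordered(classify_site, sites):
            print(url, category)
```

Since the comments suggest the work is more I/O-bound than CPU-bound, treat the process count as a tuning knob: benchmark with more workers than CPUs and watch whether throughput keeps improving or the network saturates first.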
