
My project involves scraping a lot of data from sites that don't have an API, or calling the API when one exists. I want to use multiple threads to improve speed and work in real time. Which programming language would be better for this? I'm comfortable with Python, but threading is an issue there, so I'm considering JavaScript on Node.js. Which should I choose?

3 Answers


Threading is an issue in Python only if you want to compute multiple things in parallel. If you just want to make a lot of requests, the limitation of the interpreter (only one thread interpreting Python at any one point) won't be a problem.

In fact, to make a lot of requests simultaneously, you don't even need a lot of threads. You can use an async requests library, such as requests.async.
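To illustrate the idea, here is a minimal sketch using the standard library's asyncio (which fills this role in modern Python). The `fetch` function is hypothetical: it stands in for a real non-blocking HTTP call and simulates network latency with `asyncio.sleep`, so the example runs without touching the network. The point is that ten "requests" complete in roughly the time of one, without any extra threads:

```python
import asyncio
import time

async def fetch(url):
    # Stand-in for a real non-blocking HTTP request (e.g. via aiohttp);
    # asyncio.sleep simulates ~0.2 s of network latency.
    await asyncio.sleep(0.2)
    return "response from " + url

async def fetch_all(urls):
    # All requests are in flight at once on a single thread; total wall
    # time is ~0.2 s, not 0.2 s * len(urls).
    return await asyncio.gather(*(fetch(u) for u in urls))

urls = ["https://example.com/page/%d" % i for i in range(10)]
start = time.perf_counter()
results = asyncio.run(fetch_all(urls))
elapsed = time.perf_counter() - start
```

A real scraper would replace the body of `fetch` with an actual async HTTP call; the concurrency structure stays the same.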

If you have some heavy computation to do on the results of the requests, you can always parallelize it in Python using multiprocessing, which lets you bypass the thread limitation I mentioned earlier.


2 Comments

Thank you! I was shuddering thinking about moving away from Python.
No problem. In most cases, you can forget everything you've heard about the GIL and its limitations. It's only a problem when you need a lot of raw computation power, and in that case you'd write a C extension or some Cython code to speed things up. If you just want some free multi-core speedup, multiprocessing is more than enough.

In Python you are able to multi-thread (or multi-process) your scrapers. I've used Beautiful Soup in the past, but there are alternatives.

Since I have experience with Beautiful Soup, a very simple example of multi-processing a scraper with it is below.

import urllib2

from BeautifulSoup import BeautifulSoup
from multiprocessing import Process, JoinableQueue, cpu_count

jobs = []
queue = JoinableQueue()


class scraperClass(Process):
    def __init__(self, queue):
        Process.__init__(self)
        self.queue = queue      # keep a reference so run() can pull URLs
        # Other init things

    def run(self):
        # Your scraping code here; perhaps save results to a DB?
        fullUrl = self.queue.get()       # fullUrl can be passed in via the queue
        page = urllib2.urlopen(fullUrl)
        soup = BeautifulSoup(page)
        # Read the Beautiful Soup docs for how to parse further
        self.queue.task_done()


def main():
    numProcesses = 2
    for i in xrange(numProcesses):
        p = scraperClass(queue)
        jobs.append(p)
        p.start()           # This will call the scraperClass.run() method
    for p in jobs:
        p.join()            # Wait for the workers to finish

if __name__ == "__main__":
    main()



I did a quick search and found a scraping framework for Python called Scrapy. It looks cool, but I haven't tried it: http://scrapy.org/

Here's a quote from their tutorial:

"So you need to extract some information from a website, but the website doesn’t provide any API or mechanism to access that info programmatically. Scrapy can help you extract that information."

It says it can handle API calls too.

