
My project involves scraping a lot of data from sites that don't have an API, or calling the API when one exists. I want to use multiple threads to improve speed and work in real time. Which programming language would be better for this? I'm comfortable with Python, but threading is an issue there, so I'm considering JavaScript on Node.js. Which should I choose?

3 Answers


Threading is an issue in Python only if you want to compute multiple things in parallel. If you just want to make a lot of requests, the limitation of the interpreter (only one thread interpreting Python at any one point) won't be a problem.

In fact, to make a lot of requests simultaneously, you don't even need a lot of threads. You can use an async requests library, such as requests.async.
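To illustrate the idea, here is a minimal sketch using the standard library's asyncio (which fills this role in modern Python). The `fetch` function is hypothetical: it stands in for a real non-blocking HTTP call and simulates network latency with `asyncio.sleep`, so the example runs without touching the network. The point is that ten "requests" complete in roughly the time of one, without any extra threads:

```python
import asyncio
import time

async def fetch(url):
    # Stand-in for a real non-blocking HTTP request (e.g. via aiohttp);
    # asyncio.sleep simulates ~0.2 s of network latency.
    await asyncio.sleep(0.2)
    return "response from " + url

async def fetch_all(urls):
    # All requests are in flight at once on a single thread; total wall
    # time is ~0.2 s, not 0.2 s * len(urls).
    return await asyncio.gather(*(fetch(u) for u in urls))

urls = ["https://example.com/page/%d" % i for i in range(10)]
start = time.perf_counter()
results = asyncio.run(fetch_all(urls))
elapsed = time.perf_counter() - start
```

A real scraper would replace the body of `fetch` with an actual async HTTP call; the concurrency structure stays the same.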

If you have some heavy computation to do on the results of the requests, you can always parallelize it in Python using multiprocessing, which lets you bypass the thread limitation I mentioned earlier.


2 Comments

Thank you! I was shuddering thinking about moving away from Python.
No problem. In most cases, you can forget everything you've heard about the GIL and its limitations. It's only a problem when you need a lot of raw computation power, and in that case you'd write a C extension or some Cython code to speed things up. If you just want some free multi-core speedup, multiprocessing is more than enough.

In Python you are able to multi-thread (or multi-process) your scrapers. I've used Beautiful Soup in the past, but there are alternatives.

Since I have experience with Beautiful Soup, a very simple example of multi-processing a scraper with it is below.

import urllib2

from BeautifulSoup import BeautifulSoup
from multiprocessing import Process, JoinableQueue, cpu_count

jobs = []
queue = JoinableQueue()


class scraperClass(Process):
    def __init__(self, queue):
        Process.__init__(self)
        self.queue = queue      # keep a reference so run() can pull URLs
        # Other init things

    def run(self):
        # Your scraping code here; perhaps save results to a DB?
        fullUrl = self.queue.get()       # fullUrl can be passed in via the queue
        page = urllib2.urlopen(fullUrl)
        soup = BeautifulSoup(page)
        # Read the Beautiful Soup docs for how to parse further
        self.queue.task_done()


def main():
    numProcesses = 2
    for i in xrange(numProcesses):
        p = scraperClass(queue)
        jobs.append(p)
        p.start()           # This will call the scraperClass.run() method
    for p in jobs:
        p.join()            # Wait for the workers to finish

if __name__ == "__main__":
    main()



I did a quick search and found a scraping framework for Python called Scrapy. It looks cool, but I haven't tried it: http://scrapy.org/

Here's a quote from their tutorial:

"So you need to extract some information from a website, but the website doesn’t provide any API or mechanism to access that info programmatically. Scrapy can help you extract that information."

It says it can handle API calls too.

