
I have been trying to make my first attempt at a threaded script. It's going to eventually be a web scraper that will hopefully work a little faster than the original linear scraping script I previously made.

After hours of reading and playing with some example code, I'm still not sure what is considered correct as far as implementation goes.

Currently I have the following code that I have been playing with:

from Queue import Queue
import threading

def scrape(queue):
    global workers
    print worker.getName()
    print queue.get()
    queue.task_done()
    workers -= 1

queue = Queue(maxsize=0)
threads = 10
workers = 0


with open('test.txt') as in_file:       
    for line in in_file:
        queue.put(line)

while not (queue.empty()):
    if (threads != workers):
        worker = threading.Thread(target=scrape, args=(queue,))
        worker.setDaemon(True)
        worker.start()
        workers += 1

The idea is that I have a list of URLs in the test.txt file. I open the file and put all of the URLs in the queue. From there I get 10 threads running that pull from the queue and scrape a webpage, or in this example simply print out the line that was pulled.

Once the function is done I remove a 'worker thread' and then a new one replaces it until the queue is empty.
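Most of the examples I have read seem to start a fixed number of daemon threads that each loop on queue.get(), and then block on queue.join() in the main thread instead of counting workers like I do. This is my rough understanding of that pattern (a sketch based on my code above, not something I have tested much):

from Queue import Queue
import threading

def scrape(queue):
    # each worker loops forever, pulling URLs until the queue is drained
    while True:
        url = queue.get()
        print threading.current_thread().getName(), url.strip()
        queue.task_done()

queue = Queue()

with open('test.txt') as in_file:
    for line in in_file:
        queue.put(line)

for _ in range(10):
    worker = threading.Thread(target=scrape, args=(queue,))
    worker.setDaemon(True)   # daemon threads exit when the main thread exits
    worker.start()

queue.join()  # blocks until every queued item has been marked task_done()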

In my real-world implementation I will at some point have to take the data my function scrapes and write it to a .csv file. But right now I'm just trying to understand how to implement the threads correctly.

I have seen examples like the above that use 'Thread'... and I have also seen 'threading' examples that utilize an inherited class, roughly like the sketch below. I'd just like to know what I should be using and the proper way to manage it.
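For reference, the inherited-class examples I mentioned look roughly like this (my own rough reconstruction; the Scraper name is just something I made up):

import threading
from Queue import Queue

class Scraper(threading.Thread):
    def __init__(self, queue):
        threading.Thread.__init__(self)
        self.queue = queue
        self.daemon = True

    def run(self):
        # same worker loop as above, just wrapped in a Thread subclass
        while True:
            url = self.queue.get()
            print self.getName(), url.strip()
            self.queue.task_done()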

Go easy on me here, I'm just a beginner trying to understand threads... and yes, I know it can get very complicated. However, I think this should be easy enough for a first try...

  • Threading in general is fairly complex, and most of the time it is easier to use an abstraction library like multiprocessing(.dummy) or concurrent.futures.

1 Answer


On Python 2.x, multiprocessing.dummy (which uses threads under the hood) is a good choice because it is easy to use (it is also available on Python 3.x).

If you find out that scraping is CPU-bound and you have multiple CPU cores, you can quite simply switch to real multiprocessing this way, possibly gaining a big speedup.

(Because of the Global Interpreter Lock, Python often does not profit from multiple CPUs with threading as much as it does with multiple processes - you have to find out for yourself which is faster in your case.)

With multiprocessing.dummy you could do:

from multiprocessing.dummy import Pool
# from multiprocessing import Pool  # if you want to use more cpus

def scrape(url):
    data = {"sorted": sorted(url)}  # normally you would do something more interesting
    return (url, data)

urls = []
threads = 10

if __name__ == "__main__":
    with open('test.txt') as in_file:
        urls.extend(in_file)  # one URL per line

    p = Pool(threads)
    results = list(p.imap_unordered(scrape, urls))
    p.close()
    print results  # normally you would process your results here

On Python 3.x, concurrent.futures might be a better choice.
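A rough equivalent of the pool example above using concurrent.futures.ThreadPoolExecutor would look something like this (a sketch, not tested):

from concurrent.futures import ThreadPoolExecutor
# from concurrent.futures import ProcessPoolExecutor  # if you want to use more cpus

def scrape(url):
    data = {"sorted": sorted(url)}  # normally you would do something more interesting
    return (url, data)

if __name__ == "__main__":
    with open('test.txt') as in_file:
        urls = list(in_file)  # one URL per line

    with ThreadPoolExecutor(max_workers=10) as executor:
        results = list(executor.map(scrape, urls))
    print(results)  # normally you would process your results here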


4 Comments

That seems to be much easier to me at first glance. In addition, after playing around with it a bit, it is faster. One problem I have been running into with the actual scraping part of the script is that I am sharing variables that I don't want to share, like the scraped info I am trying to write to the .csv. If I run 5 threads to start scraping, I get 5 duplicates written to my csv even though the scraper correctly visits the pages I want. When I Lock() and Release() my threads, the script just runs linearly again. How can I prevent my threads from overwriting each other's data?
If you don't want to share info between threads/processes that you want to put together afterwards, put it in the return value of your scrape function - do not write to variables outside your function if you don't have to (see the sketch below these comments).
I must say I learned a lot about threads and multiprocessing from your example code. Just wanted to say thanks. In addition, as it turned out, the multiprocessing was much faster and cleaner to implement IMO. Just curious if that's the module of choice for professional developers? Or is there a benefit to low-level thread modules?
There are several benefits to lower-level (normal) threading: more control/flexibility (e.g. you would usually implement an asynchronous UI thread with normal threading), faster execution if done right, and finally the multiprocessing module is relatively new in Python while normal threading exists in a very similar form in other languages, so if you are used to that or want example code you might prefer it. Also, mixing normal threading and these higher-level modules can make things complicated.
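To make the point about return values concrete, here is a small sketch of how you could collect everything in the main process and only write the CSV there (the fieldnames and out.csv are just placeholders I made up):

import csv
from multiprocessing.dummy import Pool

def scrape(url):
    # do the scraping here and only return the data - no shared variables, no locks
    return {"url": url.strip(), "length": len(url)}

if __name__ == "__main__":
    with open('test.txt') as in_file:
        urls = list(in_file)

    p = Pool(10)
    rows = list(p.imap_unordered(scrape, urls))
    p.close()

    # only the main thread writes the file, so nothing can interleave or duplicate
    with open('out.csv', 'wb') as out_file:
        writer = csv.DictWriter(out_file, fieldnames=["url", "length"])
        writer.writeheader()
        writer.writerows(rows)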
