I have been working on my first attempt at a threaded script. It's eventually going to be a web scraper that hopefully works a little faster than the original linear scraping script I previously made.
After hours of reading and playing with some example code, I'm still not sure what is considered a correct implementation.
Currently I have the following code that I have been playing with:
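from Queue import Queue
import threading


def scrape(queue):
    global workers
    # Ask for the current thread's name; the global `worker` may already
    # refer to a newer thread by the time this line runs.
    print threading.current_thread().getName()
    print queue.get()
    queue.task_done()
    workers -= 1


queue = Queue(maxsize=0)  # maxsize=0 means the queue is unbounded
threads = 10              # maximum number of concurrent workers
workers = 0               # number of workers currently running

# Load every URL (one per line) into the queue.
with open('test.txt') as in_file:
    for line in in_file:
        queue.put(line)

# Keep starting workers, up to the limit, until the queue is empty.
while not queue.empty():
    if threads != workers:
        worker = threading.Thread(target=scrape, args=(queue,))
        worker.setDaemon(True)
        worker.start()
        workers += 1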
The idea is that I have a list of URLs in the test.txt file. I open the file and put all of the URLs in the queue. From there I get 10 threads running that pull from the queue and scrape a webpage, or in this example simply print out the line that was pulled.
Once the function is done I remove a 'worker thread' and then a new one replaces it until the queue is empty.
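For comparison, here is a minimal sketch of the same idea using a fixed pool of threads that each loop over the queue, instead of starting a replacement thread for every item. It is written for Python 3 (where the module is `queue` rather than `Queue`), and `scrape`, `worker`, `run`, and `NUM_WORKERS` are placeholder names for illustration, not part of the code above. This is one common pattern under those assumptions, not necessarily the only correct one.

```python
import queue
import threading

NUM_WORKERS = 10  # assumed pool size, matching `threads = 10` above


def scrape(url):
    # Hypothetical stand-in for the real scraping work.
    return "scraped " + url


def worker(url_queue, results, lock):
    # Each worker keeps pulling URLs until it receives the None sentinel.
    while True:
        url = url_queue.get()
        if url is None:
            url_queue.task_done()
            break
        try:
            result = scrape(url)
            with lock:  # protect the shared results list
                results.append(result)
        finally:
            url_queue.task_done()


def run(urls, num_workers=NUM_WORKERS):
    url_queue = queue.Queue()
    results = []
    lock = threading.Lock()
    pool = [threading.Thread(target=worker, args=(url_queue, results, lock))
            for _ in range(num_workers)]
    for t in pool:
        t.start()
    for url in urls:
        url_queue.put(url)
    for _ in pool:
        url_queue.put(None)  # one sentinel per worker so every thread exits
    url_queue.join()         # wait until every item has been marked done
    for t in pool:
        t.join()
    return results
```

With this shape there is no shared `workers` counter to keep in sync and no busy loop in the main thread; something like `run(line.strip() for line in open('test.txt'))` would feed it the URL file.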
In my real-world implementation I will at some point have to take the data that my function scrapes and write it to a .csv file. But right now I'm just trying to understand how to implement the threads correctly.
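One way the .csv step might look is sketched below, assuming Python 3: the worker threads put rows on a second queue, and only the main thread writes to the file, so the `csv` writer is never shared between threads. The `scrape` function, the `url`/`length` columns, and the name `scrape_to_csv` are made up for illustration.

```python
import csv
import queue
import threading


def scrape(url):
    # Hypothetical stand-in: returns one CSV row per URL.
    return [url, len(url)]


def worker(url_queue, result_queue):
    while True:
        url = url_queue.get()
        if url is None:  # sentinel: no more work
            break
        result_queue.put(scrape(url))


def scrape_to_csv(urls, out_file, num_workers=10):
    url_queue = queue.Queue()
    result_queue = queue.Queue()
    pool = [threading.Thread(target=worker, args=(url_queue, result_queue))
            for _ in range(num_workers)]
    for t in pool:
        t.start()
    for url in urls:
        url_queue.put(url)
    for _ in pool:
        url_queue.put(None)
    for t in pool:
        t.join()

    # All workers have finished, so only this thread touches the file.
    writer = csv.writer(out_file)
    writer.writerow(["url", "length"])
    count = 0
    while not result_queue.empty():
        writer.writerow(result_queue.get())
        count += 1
    return count
```

`out_file` can be any open text file, for example one opened with `open('out.csv', 'w', newline='')`. Rows arrive in whatever order the threads finish, not in the order of the input.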
I have seen examples similar to the above that use 'Thread', and I have also seen 'threading' examples that use an inherited class. I'd just like to know which one I should be using and the proper way to manage it.
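As far as I can tell, the two styles do the same job. The sketch below (Python 3, with made-up names `greet`, `Greeter`, and `demo`) runs the same function once through `threading.Thread(target=...)` and once through a subclass that overrides `run()`.

```python
import threading


def greet(name, out):
    out.append("hello " + name)


class Greeter(threading.Thread):
    """Subclass style: the work goes in run()."""

    def __init__(self, name_to_greet, out):
        super().__init__()
        self.name_to_greet = name_to_greet
        self.out = out

    def run(self):
        greet(self.name_to_greet, self.out)


def demo():
    out = []
    t1 = threading.Thread(target=greet, args=("target", out))  # function style
    t2 = Greeter("subclass", out)                              # subclass style
    for t in (t1, t2):
        t.start()
    for t in (t1, t2):
        t.join()
    return sorted(out)
```

The function style is shorter when the thread only needs to run one function; the subclass style may be handier when each thread carries its own state.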
Go easy on me here, I'm just a beginner trying to understand threads, and yes, I know it can get very complicated. However, I think this should be easy enough for a first try...