
I'm building a web scraper of sorts. Basically, the software would do the following:

  1. User (me) inputs some data (IDs) - IDs are complex, so not just numbers
  2. Based on those IDs, the script visits http://localhost/ID

What is the best way to accomplish this? I'm looking at upwards of 20-30 concurrent connections.
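As a point of comparison (not part of the original question), the fetch-many-URLs pattern can be sketched with the standard library's `concurrent.futures` thread pool, available in Python 3. The `fetch_page` function and the example IDs below are hypothetical stand-ins; a real version would perform an HTTP GET against `http://localhost/<ID>`.

```python
# Sketch only: fetch_page and the example IDs are hypothetical stand-ins
# for an HTTP GET of http://localhost/<ID>.
from concurrent.futures import ThreadPoolExecutor

def fetch_page(page_id):
    # A real implementation would use urllib.request or requests here.
    return "http://localhost/%s" % page_id

ids = ["a1-x", "b2-y", "c3-z"]

# max_workers caps concurrency, matching the 20-30 connection target.
with ThreadPoolExecutor(max_workers=20) as pool:
    pages = list(pool.map(fetch_page, ids))
```

`pool.map` hands each ID to exactly one worker thread and returns results in input order, which sidesteps the "which IDs have been used" bookkeeping entirely.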

I was thinking, would a simple loop be the solution? This loop would start QThreads (it's a Qt app), so they would run concurrently.

The problem I see with the loop, however, is how to make it use only those IDs that have not been used before, i.e. not already consumed by a previously executed iteration/thread. Would I need some sort of "delegator" function that keeps track of which IDs have been used and hands the unused ones to the QThreads?
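A minimal sketch of such a "delegator", assuming the standard library's thread-safe `queue.Queue` (named `Queue` in Python 2): the queue itself guarantees each ID is handed out exactly once, so no separate bookkeeping is needed. The IDs and the `results` list here are illustrative only.

```python
import queue      # "Queue" in Python 2
import threading

def worker(q, results, lock):
    # get_nowait() is thread-safe: each ID is handed to exactly one thread.
    while True:
        try:
            page_id = q.get_nowait()
        except queue.Empty:
            return
        with lock:
            results.append(page_id)   # stand-in for visiting the URL

ids = ["AB-17", "ZK-04", "QQ-99", "MN-23"]   # made-up complex IDs
q = queue.Queue()
for page_id in ids:
    q.put(page_id)

results, lock = [], threading.Lock()
threads = [threading.Thread(target=worker, args=(q, results, lock))
           for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

After the joins, every ID has been processed once and only once, regardless of how the three threads interleaved.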

Now I've written some code but I am not sure if it is correct:

class GUI(QObject):

    def __init__(self):
        print "GUI CLASS INITIALIZED!!!"
        self.worker = Worker()

        for i in xrange(300):
            QThreadPool().globalInstance().start(self.worker)

class Worker(QRunnable):

    def run(self):
        print "Hello world from thread", QThread.currentThread()

Now I'm not sure if this really achieves what I want. Is it actually running in separate threads? I'm asking because currentThread() returns the same thread every time this is executed, so it doesn't look that way.

Basically, my question comes down to: how do I execute several identical QThreads concurrently?

Thanks in advance for the answer!

  • You should probably separate the logic from the GUI and use Qt only for the GUI. The crawler logic should be written in pure Python, or reuse an existing crawler like Scrapy. Commented Mar 12, 2012 at 16:03

1 Answer


As Dikei says, Qt is a red herring here. Focus on just using Python threads, as it will keep your code much simpler.

In the code below we have a set, job_queue, containing the jobs to be executed. We also have a function, worker_thread, which takes a job from the passed-in queue and executes it; here it just sleeps for a random period of time. The key thing is that set.pop is thread-safe in CPython, so no two threads will ever pop the same job.

We create a list of thread objects, workers, and call start on each one as we create it. Per the Python documentation, threading.Thread.start runs the given callable in a separate thread of control. Lastly we go through each worker thread and block until it has exited.

import threading
import random
import time

pool_size = 5

job_queue = set(range(100))

def worker_thread(queue):
    while True:
        try:
            job = queue.pop()
        except KeyError:
            break

        print "Processing %i..." % (job, )
        time.sleep(random.random())

    print "Thread exiting."

workers = []
for thread in range(pool_size):
    workers.append(threading.Thread(target=worker_thread, args=(job_queue, )))
    workers[-1].start()

for worker in workers:
    worker.join()

print "All threads exited"
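For what it's worth, the same pattern is often written with the standard library's Queue class (`queue` in Python 3), which is explicitly documented as thread-safe and is the more conventional choice for a job queue. Below is an equivalent sketch in Python 3 syntax; the `processed` list is added only to demonstrate that every job is handled exactly once, and the shortened sleep is just to keep the example fast.

```python
import threading
import random
import time
import queue  # "Queue" in Python 2

pool_size = 5

job_queue = queue.Queue()
for job in range(100):
    job_queue.put(job)

processed = []
processed_lock = threading.Lock()

def worker_thread(q):
    while True:
        try:
            job = q.get_nowait()   # thread-safe; raises Empty when done
        except queue.Empty:
            break
        time.sleep(random.random() / 100)  # simulate variable work
        with processed_lock:
            processed.append(job)

workers = [threading.Thread(target=worker_thread, args=(job_queue,))
           for _ in range(pool_size)]
for w in workers:
    w.start()
for w in workers:
    w.join()
```

The structure mirrors the set-based version above, but `queue.Queue` also gives you blocking `get`/`put` with timeouts if the producer and consumers need to run at the same time.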

4 Comments

It depends actually. If the threads need to communicate with the GUI, QThreads will be better and simpler.
Like @Avaris said, the thread needs to communicate with the GUI thread, and the existing code is largely written with QThread, so I need to use it instead of Python's built-in threading module. I +1'd you anyway, for the detailed response.
What is the benefit of using time.sleep(random.random())?
It helps to make the example more realistic: each thread won't necessarily take the same amount of time to do its work.
