
I'm building a web scraper of sorts. Basically, the software would do the following:

  1. User (me) inputs some data (IDs) - IDs are complex, so not just numbers
  2. Based on those IDs, the script visits http://localhost/ID

What is the best way to accomplish this? I'm looking at upwards of 20-30 concurrent connections.
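As a point of comparison (not part of the original question), the fetch-many-URLs pattern can be sketched with the standard library's `concurrent.futures` thread pool, available in Python 3. The `fetch_page` function and the example IDs below are hypothetical stand-ins; a real version would perform an HTTP GET against `http://localhost/<ID>`.

```python
# Sketch only: fetch_page and the example IDs are hypothetical stand-ins
# for an HTTP GET of http://localhost/<ID>.
from concurrent.futures import ThreadPoolExecutor

def fetch_page(page_id):
    # A real implementation would use urllib.request or requests here.
    return "http://localhost/%s" % page_id

ids = ["a1-x", "b2-y", "c3-z"]

# max_workers caps concurrency, matching the 20-30 connection target.
with ThreadPoolExecutor(max_workers=20) as pool:
    pages = list(pool.map(fetch_page, ids))
```

`pool.map` hands each ID to exactly one worker thread and returns results in input order, which sidesteps the "which IDs have been used" bookkeeping entirely.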

I was thinking, would a simple loop be the solution? This loop would start QThreads (it's a Qt app), so they would run concurrently.

The problem I see with the loop, however, is how to make it use only those IDs that have not been used before, i.e. not already consumed by a previously executed iteration/thread. Would I need some sort of "delegator" function that keeps track of which IDs have been used and hands the unused ones to the QThreads?
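A minimal sketch of such a "delegator", assuming the standard library's thread-safe `queue.Queue` (named `Queue` in Python 2): the queue itself guarantees each ID is handed out exactly once, so no separate bookkeeping is needed. The IDs and the `results` list here are illustrative only.

```python
import queue      # "Queue" in Python 2
import threading

def worker(q, results, lock):
    # get_nowait() is thread-safe: each ID is handed to exactly one thread.
    while True:
        try:
            page_id = q.get_nowait()
        except queue.Empty:
            return
        with lock:
            results.append(page_id)   # stand-in for visiting the URL

ids = ["AB-17", "ZK-04", "QQ-99", "MN-23"]   # made-up complex IDs
q = queue.Queue()
for page_id in ids:
    q.put(page_id)

results, lock = [], threading.Lock()
threads = [threading.Thread(target=worker, args=(q, results, lock))
           for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

After the joins, every ID has been processed once and only once, regardless of how the three threads interleaved.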

Now I've written some code but I am not sure if it is correct:

class GUI(QObject):

    def __init__(self):
        print "GUI CLASS INITIALIZED!!!"
        self.worker = Worker()

        for i in xrange(300):
            QThreadPool().globalInstance().start(self.worker)

class Worker(QRunnable):

    def run(self):
        print "Hello world from thread", QThread.currentThread()

Now I'm not sure if this really achieves what I want. Is it actually running in separate threads? I'm asking because currentThread() returns the same thread every time this is executed, so it doesn't look that way.

Basically, my question comes down to: how do I execute several identical QThreads concurrently?

Thanks in advance for the answer!

  • You should probably separate the logic from the GUI and use Qt only for the GUI. The crawler logic should be written in pure Python, or reuse an existing crawler like Scrapy. Commented Mar 12, 2012 at 16:03

1 Answer


As Dikei says, Qt is a red herring here. Focus on just using Python threads, as it will keep your code much simpler.

In the code below we have a set, job_queue, containing the jobs to be executed. We also have a function, worker_thread, which takes a job from the passed-in queue and executes it; here it just sleeps for a random period of time. The key thing is that set.pop is thread-safe in CPython, so no two threads will ever pop the same job.

We create a list of thread objects, workers, and call start on each one as we create it. Per the Python documentation, threading.Thread.start runs the given callable in a separate thread of control. Lastly we go through each worker thread and block until it has exited.

import threading
import random
import time

pool_size = 5

job_queue = set(range(100))

def worker_thread(queue):
    while True:
        try:
            job = queue.pop()
        except KeyError:
            break

        print "Processing %i..." % (job, )
        time.sleep(random.random())

    print "Thread exiting."

workers = []
for thread in range(pool_size):
    workers.append(threading.Thread(target=worker_thread, args=(job_queue, )))
    workers[-1].start()

for worker in workers:
    worker.join()

print "All threads exited"
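For what it's worth, the same pattern is often written with the standard library's Queue class (`queue` in Python 3), which is explicitly documented as thread-safe and is the more conventional choice for a job queue. Below is an equivalent sketch in Python 3 syntax; the `processed` list is added only to demonstrate that every job is handled exactly once, and the shortened sleep is just to keep the example fast.

```python
import threading
import random
import time
import queue  # "Queue" in Python 2

pool_size = 5

job_queue = queue.Queue()
for job in range(100):
    job_queue.put(job)

processed = []
processed_lock = threading.Lock()

def worker_thread(q):
    while True:
        try:
            job = q.get_nowait()   # thread-safe; raises Empty when done
        except queue.Empty:
            break
        time.sleep(random.random() / 100)  # simulate variable work
        with processed_lock:
            processed.append(job)

workers = [threading.Thread(target=worker_thread, args=(job_queue,))
           for _ in range(pool_size)]
for w in workers:
    w.start()
for w in workers:
    w.join()
```

The structure mirrors the set-based version above, but `queue.Queue` also gives you blocking `get`/`put` with timeouts if the producer and consumers need to run at the same time.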

4 Comments

It depends actually. If the threads need to communicate with the GUI, QThreads will be better and simpler.
Like @Avaris said, the thread needs to communicate with the GUI thread, and the existing code is largely written with QThread, so I need to use it instead of Python's built-in threading module. I +1'd you anyway, for the detailed response.
What is the benefit of using time.sleep(random.random())?
It helps to make the example more realistic: each thread won't necessarily take the same amount of time to do its work.
