
I have 10 web crawlers that share a LinkedBlockingQueue.

From my debug view in Eclipse I was able to find out that once several URLs have been fetched (about 1000), the queue.take() call takes very long.

This is how it works:

private synchronized URL getNextPage() throws CrawlerException {
    URL url;
    try {
        System.out.println(queue.size());
        url = queue.take();
    } catch (InterruptedException e) {
        throw new CrawlerException();
    }
    return url;
}

I only added synchronized and queue.size() for debugging purposes, to see whether the queue is really filled when take() gets called. Yes, it is (1350 elements in this run).

queue.put(), on the other hand, only gets called when a URL is really new:

private void appendLinksToQueue(List<URL> links) throws CrawlerException {
    for (URL url : links) {
        try {
            if (!visited.contains(url) && !queue.contains(url)) {
                queue.put(url);
            }
        } catch (InterruptedException e) {
            throw new CrawlerException();
        }
    }
}

However, the other crawlers do not seem to produce too many new URLs either, so the queue should not really block. This is how many URLs we have in the queue (at 5-second intervals):

Currently we have sites: 1354
Currently we have sites: 1354
Currently we have sites: 1354
Currently we have sites: 1354
Currently we have sites: 1355
Currently we have sites: 1355
Currently we have sites: 1355

According to the Javadoc, contains() is inherited from AbstractCollection, so I guess it has nothing to do with multithreading and thus cannot be the reason for the blocking either.

The point is, from my debugging I can also see that the other threads seem to be blocked in queue.take() as well. However, it's not a permanent block. Sometimes one of the crawlers can go on, but they are stuck for more than a minute. Right now, I cannot see any of them making progress.
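To confirm where the threads are actually parked, the usual tool is jstack <pid> from a terminal. If attaching jstack is awkward, here is a minimal programmatic sketch (a hypothetical ThreadDumper helper, not part of the crawler) that prints each thread's state and stack, e.g. from a watchdog thread:

```java
import java.util.Map;

// Hypothetical helper: dump every live thread's state and stack frames
// from inside the JVM, to see whether crawlers are parked in queue.take().
class ThreadDumper {
    static String dump() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> e
                : Thread.getAllStackTraces().entrySet()) {
            sb.append(e.getKey().getName())
              .append(" state=").append(e.getKey().getState()).append('\n');
            for (StackTraceElement frame : e.getValue()) {
                sb.append("    at ").append(frame).append('\n');
            }
        }
        return sb.toString();
    }
}
```

Threads stuck in take() should show up as WAITING inside LockSupport.park.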

Do you know how this could happen?

Comments:

  • Can you make a thread dump when threads are blocked in take() and post it here? Commented Jul 14, 2012 at 16:40
  • That contains call on the LinkedBlockingQueue is going to be slow - it has to search the whole queue - and will prevent any access in the meantime. Can you instead put urls into the visited set before putting them into the queue? (This also avoids a race condition where another thread adds the url to the queue after you call contains on it.) Commented Jul 14, 2012 at 16:44
  • Why not use a thread pool? It seems like you're doing a producer-consumer pattern. Commented Jul 14, 2012 at 16:46
  • @Thomasz How can I do this? I tried right-clicking on the thread + Copy Stack, but all that gets copied is the title of the thread, and I cannot paste it anywhere. Commented Jul 14, 2012 at 16:52
  • It's also a race condition. Two threads could both see that the URL isn't there, and then both put it in. A checks, B checks, A puts, B puts. Commented Jul 14, 2012 at 17:03
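Following the suggestion in the comments, here is a minimal sketch of the fix (a hypothetical Frontier class, not the original code): record each URL in a concurrent "seen" set before putting it on the queue. Set.add() is an atomic check-and-insert that returns false for duplicates, so the O(n) queue.contains() scan - which in practice locks both ends of the LinkedBlockingQueue and stalls take() - disappears, and so does the check-then-act race.

```java
import java.net.URL;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

class Frontier {
    // Every URL ever enqueued; add() is atomic, so only the first
    // thread to see a URL gets to enqueue it.
    private final Set<URL> seen = ConcurrentHashMap.newKeySet();
    private final LinkedBlockingQueue<URL> queue = new LinkedBlockingQueue<>();

    void appendLinksToQueue(List<URL> links) throws InterruptedException {
        for (URL url : links) {
            if (seen.add(url)) {   // false for duplicates: skip, no contains() scan
                queue.put(url);
            }
        }
    }

    URL getNextPage() throws InterruptedException {
        return queue.take();       // take() is already thread-safe; no synchronized needed
    }
}
```

One caveat with this sketch: URL.equals() and hashCode() can trigger DNS lookups, so in practice keying the seen set on String or java.net.URI is usually preferable.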
