
I am developing a web crawler that uses a priority queue. I start with a seed URL and parse all of its links, scoring each one with my algorithm. I then put every parsed URL into the priority queue according to its score. The crawler selects the next seed URL from the priority queue based on the link scores; when a link is selected as the new seed, it is dequeued from the priority queue, and the process repeats. The program runs without any problems, but I have two issues:

  • Since the number of enqueue operations grows faster than the number of dequeue operations, the priority queue keeps getting bigger over time. How can I control this, and will the size of the priority queue affect my crawler's performance?

  • When I measure the number of crawled URLs per minute, the rate drops the longer the program runs. For example, after running the program for 1 hour, the average number of crawled pages per minute is lower than the average I get after running it for only 15 minutes. Is this caused by the size of the priority queue, and how can I fix it?

I have these two issues and would appreciate your help and ideas for solving them in my crawling algorithm.

  • How big was the priority queue after running for that hour? If it was big enough to cause disk swapping, that could be your problem. Commented May 29, 2015 at 23:23
  • @ooga It was more than 450,000 records. How can I overcome this problem? Any idea how to crawl more URLs per minute? Commented May 29, 2015 at 23:27
  • 450,000 entries in a priority queue is quite reasonable. If you think the problem is in the queue, then posting the code would be the next step. But I would suggest that you profile the code to see where the bottleneck really is. Commented May 29, 2015 at 23:52
  • @user3386109 I think the problem is the enqueue/dequeue imbalance: I enqueue n links per page but poll only 1 link in the same time. Over time, the number of crawled URLs keeps decreasing. I need to increase the number of crawled URLs per minute. Commented May 30, 2015 at 0:14

2 Answers


As rafalio noted, the size of your queue will increase continually, because the average web page contains more than one outbound link and you will never reach closure over the whole internet :-)

But rather than keeping only the top N links from each page, as rafalio suggests, I would instead recommend setting a global cap on the size of the priority queue (say 30,000, for example). As the queue grows toward that cap, each new link you traverse (always the highest-priority link in the queue) will have accumulated greater and greater priority relative to everything else encountered so far. By the time the queue reaches the cap, you can be almost certain that its highest-priority item is more important than the top N links on the current page (or on any arbitrary page you could visit next).

Keep in mind that if your priority queue is backed by a plain array, either insertion or removal will cost O(n), depending on whether you keep the array sorted. A sorted linked list likewise has O(n) insertion. A binary-heap-backed priority queue gives the best overall performance for large n: no worse than O(log n) for both insert and remove, and O(1) to look up the min/max item.
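
To make both suggestions concrete, here is a minimal sketch of a capped frontier in Java; the class and names are my own illustrations, not code from the question. One wrinkle: a binary heap removes its best element in O(log n) but cannot cheaply find and evict its worst, which the cap requires, so the sketch uses a TreeSet (a red-black tree), which does both in O(log n):

    import java.util.Comparator;
    import java.util.TreeSet;

    /** A bounded crawl frontier holding at most MAX_SIZE scored URLs. */
    class BoundedFrontier {
        static final int MAX_SIZE = 30_000;  // the global cap

        static final class ScoredUrl {
            final String url;
            final double score;
            ScoredUrl(String url, double score) { this.url = url; this.score = score; }
        }

        // Ordered by score descending; ties are broken by URL so that
        // distinct URLs with equal scores are both kept.
        private final TreeSet<ScoredUrl> queue = new TreeSet<>(
                Comparator.comparingDouble((ScoredUrl u) -> u.score).reversed()
                          .thenComparing((ScoredUrl u) -> u.url));

        void enqueue(String url, double score) {
            queue.add(new ScoredUrl(url, score));
            if (queue.size() > MAX_SIZE) {
                queue.pollLast();            // evict the lowest-scoring URL
            }
        }

        /** Highest-scoring URL, or null if the frontier is empty. */
        ScoredUrl dequeue() {
            return queue.pollFirst();
        }
    }

With the cap in place, both memory use and per-operation cost stay bounded no matter how long the crawl runs.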




The size of the priority queue will obviously grow large, since pages usually contain more than one outgoing link. Your crawl is essentially a search over the whole graph induced by the pages and the URLs they contain, so with your current algorithm you will end up traversing the entire graph. Consider extracting only the top N links from each page and following just those.

Regarding speed: visiting websites is parallelizable, so start there by, say, always crawling k links at a time (see the sketch below). Beyond that, are you using a good implementation of a priority queue? And are you properly closing all the network connections that your program opens?
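
As a rough illustration of the "k links at a time" idea, here is a sketch built on a fixed thread pool and java.util.concurrent's PriorityBlockingQueue (a thread-safe, heap-backed priority queue). fetchAndScoreLinks() is a placeholder for your own fetch/parse/score step, not a real library call:

    import java.util.Collections;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.PriorityBlockingQueue;
    import java.util.concurrent.TimeUnit;

    class ParallelCrawler {
        static final int K = 8;  // number of concurrent fetch threads

        /** URL plus score; higher scores are dequeued first. */
        static final class ScoredUrl implements Comparable<ScoredUrl> {
            final String url;
            final double score;
            ScoredUrl(String url, double score) { this.url = url; this.score = score; }
            @Override public int compareTo(ScoredUrl o) {
                return Double.compare(o.score, this.score);  // descending by score
            }
        }

        // Thread-safe, heap-backed priority queue from java.util.concurrent.
        private final PriorityBlockingQueue<ScoredUrl> frontier =
                new PriorityBlockingQueue<>();

        void crawl(String seedUrl) throws InterruptedException {
            frontier.offer(new ScoredUrl(seedUrl, 1.0));
            ExecutorService pool = Executors.newFixedThreadPool(K);
            for (int i = 0; i < K; i++) {
                pool.execute(() -> {
                    try {
                        ScoredUrl next;
                        // Wait briefly for work; a worker quits once the queue
                        // stays empty (a crude idle test, fine for a sketch).
                        while ((next = frontier.poll(5, TimeUnit.SECONDS)) != null) {
                            for (ScoredUrl link : fetchAndScoreLinks(next)) {
                                frontier.offer(link);  // enqueue newly found links
                            }
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
        }

        /** Placeholder for the fetch/parse/score step; close connections here. */
        private List<ScoredUrl> fetchAndScoreLinks(ScoredUrl page) {
            return Collections.emptyList();
        }
    }

Even a handful of worker threads usually raises pages-per-minute substantially, since most crawl time is spent waiting on the network rather than on the queue.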

1 Comment

So you are saying that the size of the queue and traversing it will affect the speed of my algorithm? Could the low performance also come from the fact that I calculate the score of every parsed URL for the current seed URL only after I have parsed all of its URLs, and then enqueue them into the priority queue? As for the second part of your answer, I am not using multi-threaded programming for visiting pages, and yes, I am closing the connections.
