I am developing a web crawler that uses a priority queue. Starting from a seed URL, I parse all of its links and score each one with my algorithm, then enqueue the scored URLs into the priority queue. The crawler selects the highest-scoring link from the queue as the next seed URL; when a link is selected, it is dequeued, and the process repeats. The program runs without any errors, but I have two issues:
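Roughly, my crawl loop looks like the sketch below (a minimal Python version using `heapq`; `score()` and `parse_links()` are placeholders for my actual scoring and parsing logic, and the limits are illustrative):

```python
import heapq

# Placeholder scoring and parsing functions; my real crawler fetches each
# page and scores its links with my own algorithm.
def score(url):
    return len(url)

def parse_links(url):
    return [url + "/a", url + "/b"]

def crawl(seed_url, max_pages=10):
    # heapq is a min-heap, so scores are negated to pop the highest score first.
    frontier = [(-score(seed_url), seed_url)]
    seen = {seed_url}
    order = []
    while frontier and len(order) < max_pages:
        _, url = heapq.heappop(frontier)       # dequeue the best link as the new seed
        order.append(url)
        for link in parse_links(url):          # enqueue its scored, unseen links
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score(link), link))
    return order

pages = crawl("http://example.com", max_pages=5)
```

Since each crawled page enqueues several links but only one link is dequeued per iteration, the frontier grows with every step.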
1. Since the number of enqueue operations grows faster than the number of dequeue operations, the priority queue keeps getting bigger over time. How can I control this? And will the size of the priority queue affect my crawler's performance?
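One approach I have been considering is to cap the frontier size: when the queue exceeds a fixed limit, discard the lowest-scoring entries. A sketch (the limit value and helper name are my own, not from any library):

```python
import heapq

MAX_FRONTIER = 100_000  # arbitrary cap; would be tuned to available memory

def push_bounded(frontier, neg_score, url, limit=MAX_FRONTIER):
    # frontier is a min-heap of (-score, url) tuples, so the best URL is at index 0.
    heapq.heappush(frontier, (neg_score, url))
    if len(frontier) > limit:
        # Keep only the `limit` smallest tuples, i.e. the highest-scoring URLs.
        # nsmallest returns them sorted, which is already a valid heap layout.
        frontier[:] = heapq.nsmallest(limit, frontier)
```

I am not sure whether pruning like this is the right way to keep the queue bounded, or whether the queue size alone explains the slowdown.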
2. When I measure the number of URLs crawled per minute, the rate drops over time. For example, after running the program for 1 hour, the average number of crawled pages per minute is lower than the average I get after running it for only 15 minutes. Is this caused by the size of the priority queue? How can I fix it?
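For reference, this is roughly how I compute the average (a sketch; the class name is mine). Note that it is a cumulative average since start, so a slow later phase drags down the whole-run number:

```python
import time

class RateMeter:
    """Tracks pages crawled per minute since the crawler started."""
    def __init__(self):
        self.start = time.monotonic()
        self.count = 0

    def record(self):
        # Called once per crawled page.
        self.count += 1

    def pages_per_minute(self):
        elapsed = time.monotonic() - self.start
        return self.count / (elapsed / 60) if elapsed > 0 else 0.0
```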
I would appreciate any ideas on how to solve these two problems in my crawling algorithm.