I'm building a crawler in Java using the jsoup library.
The code is structured as follows:
public static BoneCP connectionPool = null;
public static Document doc = null;
public static Elements questions = null;
static
{
// Connection Pool Created here
}
In the main method, I call getSeed() from 10 different threads.
getSeed() selects one random URL from the database and passes it to processPage().
processPage() connects to that URL using jsoup, extracts all the URLs on the page, and adds them to the database.
This process runs 24x7.
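For reference, the threading setup described above looks roughly like the sketch below. It is simplified and rests on assumptions: the class name CrawlerSketch, the use of ExecutorService, and the stubbed bodies of getSeed() and processPage() are illustrative only. The database and jsoup calls are replaced by placeholders, and each worker runs a fixed number of rounds instead of 24x7 so that the sketch terminates.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class CrawlerSketch {
    // Counts processed pages so the sketch has an observable result.
    static final AtomicInteger pagesProcessed = new AtomicInteger();

    // Placeholder for "select one random URL from the database".
    static void getSeed() {
        String url = "http://example.com/seed"; // hypothetical URL
        processPage(url);
    }

    // Placeholder for "connect with jsoup, extract links, store them".
    static void processPage(String url) {
        pagesProcessed.incrementAndGet();
    }

    // Starts `threads` workers; each one calls getSeed() `rounds` times.
    static int runWorkers(int threads, int rounds) {
        pagesProcessed.set(0);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            pool.execute(() -> {
                for (int r = 0; r < rounds; r++) {
                    getSeed();
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return pagesProcessed.get();
    }

    public static void main(String[] args) {
        System.out.println(runWorkers(10, 5)); // prints 50
    }
}
```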
The problem is this: processPage() first connects to the URL passed in from getSeed() using:
doc = Jsoup.connect(URL).get();
Then, for each URL found on that page, jsoup opens a new connection:
questions = doc.select("a[href]");
for(Element link: questions)
{
    doc_child = Jsoup.connect(link.attr("abs:href")).get();
}
If I declare doc and questions as global (static) variables and set them to null at the end of processPage(), the memory leak goes away, but the other threads stop working because doc and questions are nulled while those threads are still using them. What should I do next?
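One possible direction, sketched here under assumptions rather than as a confirmed fix, is to keep doc and questions as local variables inside processPage() instead of static fields. Each thread would then hold its own references, which become unreachable when the method returns, so nothing has to be nulled by hand. The sketch below uses only the JDK: fetchLinks() is a hypothetical stand-in for Jsoup.connect(url).get() followed by doc.select("a[href]"), and the stored set is a stand-in for the database.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class LocalStateSketch {
    // Shared, thread-safe sink standing in for the database.
    static final Set<String> stored = ConcurrentHashMap.newKeySet();

    // Hypothetical stand-in for fetching a page and selecting its links.
    static List<String> fetchLinks(String url) {
        List<String> links = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            links.add(url + "/child" + i);
        }
        return links;
    }

    static void processPage(String url) {
        // Local variable: every thread has its own reference, and the
        // list becomes unreachable once this method returns.
        List<String> questions = fetchLinks(url);
        for (String link : questions) {
            stored.add(link);
        }
    }

    // Runs one processPage() call per thread and returns the number of stored links.
    static int crawl(int threads) {
        stored.clear();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            final String seed = "http://example.com/page" + i; // hypothetical URL
            workers[i] = new Thread(() -> processPage(seed));
            workers[i].start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return stored.size();
    }

    public static void main(String[] args) {
        System.out.println(crawl(10)); // prints 30
    }
}
```

Only the shared sink needs to be thread-safe in this arrangement; the per-page state never leaves the thread that created it.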