I'm building a crawler in Java using the jsoup library.

The code structure is as follows:

public static BoneCP connectionPool = null;
public static Document doc = null;
public static Elements questions = null;

static
{
    // Connection pool created here
}

In the main method, I call getSeed() from 10 different threads.

getSeed() selects one random URL from the database and forwards it to processPage().

processPage() connects to that URL using the jsoup library, extracts all the URLs from the page, and adds them to the database.

This process runs 24x7.

The problem is this: processPage() first connects to the URL passed from getSeed() using:

doc = Jsoup.connect(URL).get();

Then, for each link found on that page, jsoup opens another connection:

questions = doc.select("a[href]");
for (Element link : questions)
{
    doc_child = Jsoup.connect(link.attr("abs:href")).get();
}

Now, if I declare doc and questions as global variables and set them to null once processPage() has finished, the memory leak goes away, but the other threads stop working because doc and questions get nulled while they are still in use. What should I do next?

1 Answer

Static fields for this kind of state are a sign of a design problem, and from your description the code is not thread-safe: every thread reads and writes the same doc and questions fields. I don't know why you think you have a memory leak, but whatever the cause is, it will be easier to diagnose once the code is in order.

I would start by getting something like this working:

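import java.net.URI;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.Callable;

class YieldLinks implements Callable<Set<URI>>{
    final URI seed;
    YieldLinks(URI seed){
        this.seed = seed;
    }

    @Override
    public Set<URI> call() throws Exception {
        Set<URI> links = new HashSet<>();
        // fetch `seed` with jsoup here and add each a[href] link to `links`
        return links;
    }
}


public static void main(String[] args) throws Exception {
    // `seeds` stands for the seed URIs, e.g. loaded from the database
    Set<URI> links = new HashSet<>();
    for(URI uri : seeds){
        YieldLinks yieldLinks = new YieldLinks(uri);
        links.addAll(yieldLinks.call());
    }
}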

Once this single-threaded version works, you can look at adding threads.
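As a rough sketch of that later step, the same per-seed task could be handed to an ExecutorService so that each thread works only on its own task and local variables. The names ParallelCrawl, LinkTask, crawl and the extractor parameter are made up for illustration and are not from the code above. The extractor stands in for the jsoup fetch (Jsoup.connect(...).get() followed by select("a[href]")), so the example runs without network access.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper class, not part of the original answer.
class ParallelCrawl {

    // One task per seed. Its state lives in final fields and locals,
    // so no thread can null or overwrite another thread's data.
    static final class LinkTask implements Callable<Set<URI>> {
        private final URI seed;
        private final Function<URI, Set<URI>> extractor;

        LinkTask(URI seed, Function<URI, Set<URI>> extractor) {
            this.seed = seed;
            this.extractor = extractor;
        }

        @Override
        public Set<URI> call() {
            // Assumption: in a real crawler this would fetch the page with jsoup.
            return extractor.apply(seed);
        }
    }

    static Set<URI> crawl(Collection<URI> seeds,
                          Function<URI, Set<URI>> extractor,
                          int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Set<URI>>> futures = new ArrayList<>();
            for (URI seed : seeds) {
                futures.add(pool.submit(new LinkTask(seed, extractor)));
            }
            // Results are merged on the calling thread only,
            // so a plain HashSet is enough here.
            Set<URI> links = new HashSet<>();
            for (Future<Set<URI>> future : futures) {
                links.addAll(future.get());
            }
            return links;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Fake link graph standing in for real pages.
        Map<URI, Set<URI>> pages = Map.of(
            URI.create("http://a.example/"),
            Set.of(URI.create("http://b.example/"), URI.create("http://c.example/")),
            URI.create("http://b.example/"),
            Set.of(URI.create("http://c.example/")));
        Set<URI> links = crawl(pages.keySet(),
                               seed -> pages.getOrDefault(seed, Set.of()), 10);
        System.out.println(links.size() + " distinct links found");
    }
}
```

The point of the design is that nothing is static: each task keeps its seed in a final field, and the only shared collection is filled by the thread that called crawl() after the workers have returned their results.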
