Java Jsoup Error Handling

Question

I have the following method:

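import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public Article buildArticle(SNSpecific specific, String urlToScrape) throws IOException {
    // Fetch and parse the page, giving up after 10 seconds.
    Document page = Jsoup.connect(urlToScrape).timeout(10 * 1000).get();

    Article a = new Article();
    a.setWebsite("http://www.svensktnaringsliv.se/");
    a.setUrl(urlToScrape);
    a.setTitle(page.select(specific.getTitleSelector()).text());
    a.setDiscoveryTime(page.select(specific.getDateAndTimeSelector()).text());

    if (isPdfPage(urlToScrape)) {
        // For PDF pages, use the summary plus a link to the full article.
        Elements e = page.select("div.indepth-content > div.content > ul.indepth-list a");
        // first() returns null when nothing matches, so guard against that.
        String link = e.isEmpty() ? "" : e.first().attr("href");
        a.setText(page.select("div.readmoreSummary").text() + "For full article: " + link);
    } else {
        a.setText(page.select(specific.getContentSelector()).text());
    }
    return a;
}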

The problem is that sometimes it cannot connect to urlToScrape, even after I changed the timeout. I don't want to wait too long for a page, so I am looking for an alternative to the timeout() method. What other approach could handle this problem? (I have about 200 pages to scrape.)

Stephan · Accepted Answer · 2016-04-25 09:24:14Z


"what could be another approach to handle this problem? (I have about 200 pages to scrape)."

I can see two options:

1. Give the server some rest between two requests.
   Between two fetches, make a random pause of 2000 ms to 5000 ms.
2. Use a proxy.
   This applies if you don't want to pause between two fetches.
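
The first option can be sketched roughly as follows. This is only an illustration, not code from the answer: the class PoliteFetcher, the Fetcher interface, and the method names are hypothetical, and the fetch itself is passed in as a lambda so the sketch compiles without Jsoup on the classpath. With Jsoup, the lambda could be something like url -> Jsoup.connect(url).timeout(10 * 1000).get().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper: fetches URLs one by one with a random pause in between.
class PoliteFetcher {

    // Stand-in for the real page fetch (assumption: with Jsoup this would
    // wrap Jsoup.connect(url).timeout(...).get()).
    @FunctionalInterface
    interface Fetcher<T> {
        T fetch(String url) throws Exception;
    }

    // Returns a random pause in the inclusive range [minMillis, maxMillis].
    static long randomPauseMillis(long minMillis, long maxMillis) {
        return ThreadLocalRandom.current().nextLong(minMillis, maxMillis + 1);
    }

    // Fetches every URL in order, sleeping a random time between two fetches.
    static <T> List<T> fetchAll(List<String> urls, Fetcher<T> fetcher,
                                long minMillis, long maxMillis) throws Exception {
        List<T> results = new ArrayList<>();
        for (int i = 0; i < urls.size(); i++) {
            if (i > 0) {
                // No pause before the very first request.
                Thread.sleep(randomPauseMillis(minMillis, maxMillis));
            }
            results.add(fetcher.fetch(urls.get(i)));
        }
        return results;
    }
}
```

For the pause suggested above, the call would pass 2000 and 5000 as minMillis and maxMillis. With about 200 pages, that adds roughly 7 to 17 minutes of waiting in total.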

edited Apr 25, 2016 at 9:24

answered Apr 24, 2016 at 19:22


3 Comments

Tano Over a year ago

I have used the 1st option and it works perfectly! :)

Stephan Over a year ago

@imoteb It's not reflected in my answer but you can make the pause random between 2000 ms and 5000 ms.

Tano Over a year ago

I actually added a delay as you mentioned, and if a fetch throws an exception again, I put the link in a queue to try again later.
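
The retry queue described in this comment might look something like the sketch below. It is an assumption about how such a queue could work, not Tano's actual code: RetryQueueScraper, Fetcher, maxAttempts, and pauseMillis are hypothetical names, and the fetch is passed in as a lambda (with Jsoup it could wrap Jsoup.connect(url).timeout(10 * 1000).get()).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: failed URLs go to the back of the queue for a later retry.
class RetryQueueScraper {

    // Stand-in for the real page fetch (assumption, see the note above).
    @FunctionalInterface
    interface Fetcher<T> {
        T fetch(String url) throws Exception;
    }

    // Returns the successful results; URLs that still fail after
    // maxAttempts tries are added to the caller-supplied 'failed' list.
    static <T> List<T> scrape(List<String> urls, Fetcher<T> fetcher, int maxAttempts,
                              long pauseMillis, List<String> failed)
            throws InterruptedException {
        List<T> results = new ArrayList<>();
        Deque<String> queue = new ArrayDeque<>(urls);
        Map<String, Integer> attempts = new HashMap<>();

        while (!queue.isEmpty()) {
            String url = queue.pollFirst();
            // Count this try; duplicate URLs in the input share one counter.
            int attempt = attempts.merge(url, 1, Integer::sum);
            try {
                results.add(fetcher.fetch(url));
            } catch (Exception e) {
                if (attempt < maxAttempts) {
                    queue.addLast(url);   // try again later, after the other URLs
                } else {
                    failed.add(url);      // give up on this URL
                }
            }
            if (!queue.isEmpty()) {
                Thread.sleep(pauseMillis); // rest between two requests
            }
        }
        return results;
    }
}
```

Re-queued links go to the back, so the server gets time to recover while the other pages are fetched. The fixed pauseMillis could be swapped for a random pause between 2000 ms and 5000 ms, as suggested in the answer.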
