
I have the following method:

public Article buildArticle(SNSpecific specific, String urlToScrape) throws IOException {
    Document page = Jsoup.connect(urlToScrape).timeout(10 * 1000).get();

    Article a = new Article();
    a.setWebsite("http://www.svensktnaringsliv.se/");
    a.setUrl(urlToScrape);
    a.setTitle(page.select(specific.getTitleSelector()).text());
    a.setDiscoveryTime(page.select(specific.getDateAndTimeSelector()).text());

    if (isPdfPage(urlToScrape)) {
        Elements e = page.select("div.indepth-content > div.content > ul.indepth-list a");

        a.setText(page.select("div.readmoreSummary").text() + "For full article: "
                + e.first().attr("href"));
    } else {
        a.setText(page.select(specific.getContentSelector()).text());
    }
    return a;
}

The problem is that sometimes it cannot connect to urlToScrape, even after I changed the timeout. I don't want to wait too long for a page, so I am looking for an alternative to the timeout() method. What could be another approach to handle this problem? (I have about 200 pages to scrape.)

1 Answer


What could be another approach to handle this problem? (I have about 200 pages to scrape.)

I can see two options:

  • Give the server some rest between requests.
    Pause for a random interval between 2000 ms and 5000 ms after each fetch, so the server is less likely to throttle or drop your connections.

  • Use a proxy.
    This is an option if you don't want to pause between fetches.
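
The first option can be sketched as below. This is a minimal illustration, not code from the original answer: the class name `PoliteDelay` and its methods are hypothetical, and the 2000 ms and 5000 ms bounds are simply the values suggested above. It uses only the standard library, so it can sit next to the existing Jsoup call.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper: sleeps for a random interval between two fetches.
final class PoliteDelay {
    static final long MIN_PAUSE_MS = 2000;
    static final long MAX_PAUSE_MS = 5000;

    // Picks a pause length in [MIN_PAUSE_MS, MAX_PAUSE_MS].
    static long nextPauseMillis() {
        // nextLong(origin, bound) excludes the bound, so add 1 to include 5000.
        return ThreadLocalRandom.current().nextLong(MIN_PAUSE_MS, MAX_PAUSE_MS + 1);
    }

    // Blocks the calling thread for one random pause.
    static void pause() throws InterruptedException {
        Thread.sleep(nextPauseMillis());
    }
}
```

Calling `PoliteDelay.pause()` after each `buildArticle(...)` call in the scraping loop would space the roughly 200 requests out; with an average pause of 3.5 s that adds about 12 minutes in total.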


3 Comments

I used the first option and it works perfectly! :)
@imoteb It's not reflected in my answer but you can make the pause random between 2000 ms and 5000 ms.
I added a delay as you suggested, and if a request still throws an exception, I put the link in a queue to try again later.
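
The retry-queue idea from the last comment might look roughly like the sketch below. Everything here is an assumption made for illustration: `RetryingScraper`, the `Fetcher` interface (a stand-in for a method such as `buildArticle`), and the `maxAttempts` and `pauseMs` parameters are hypothetical names, not part of the commenter's actual code.

```java
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch: URLs that fail are queued and retried in a later pass.
final class RetryingScraper {
    // Stand-in for the real scraping call; returns page content or throws.
    interface Fetcher {
        String fetch(String url) throws IOException;
    }

    // Appends successful fetches to results and returns the URLs that
    // still failed after maxAttempts passes over the queue.
    static List<String> scrapeAll(List<String> urls, Fetcher fetcher,
                                  int maxAttempts, long pauseMs,
                                  List<String> results) {
        Deque<String> pending = new ArrayDeque<>(urls);
        for (int attempt = 1; attempt <= maxAttempts && !pending.isEmpty(); attempt++) {
            Deque<String> failed = new ArrayDeque<>();
            while (!pending.isEmpty()) {
                String url = pending.poll();
                try {
                    results.add(fetcher.fetch(url));
                } catch (IOException e) {
                    failed.add(url); // try this URL again in the next pass
                }
                try {
                    Thread.sleep(pauseMs); // give the server a rest
                } catch (InterruptedException e) {
                    // Stop early, but report everything not yet fetched.
                    Thread.currentThread().interrupt();
                    failed.addAll(pending);
                    return new ArrayList<>(failed);
                }
            }
            pending = failed;
        }
        return new ArrayList<>(pending);
    }
}
```

Retrying in a separate pass, rather than immediately, gives a struggling server more time to recover before the same URL is requested again. The fixed `pauseMs` could be swapped for a random pause if preferred.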
