
I'm trying to create a web scraper that queries a lot of URLs in parallel and waits for their responses using Task.WhenAll(). However, if one of the Tasks is unsuccessful, WhenAll fails. I am expecting many of the Tasks to return a 404 and wish to handle or ignore those. For example:

var urls = Enumerable.Range(1, 1000).Select(i => "https://somewebsite.com/" + i);
List<Task<string>> tasks = new List<Task<string>>();
foreach (string url in urls)
{
    tasks.Add(Task.Run(() => {
        try
        {
            return (new HttpClient()).GetStringAsync(url);
        }
        catch (HttpRequestException)
        {
            return Task.FromResult<string>("");
        }
    }));
}
var responseStrings = await Task.WhenAll(tasks);

This never hits the catch statement, and WhenAll fails at the first 404. How can I get WhenAll to ignore exceptions and just return the Tasks that completed successfully? Better yet, could it be done somewhere in the code below?

var tasks = Enumerable.Range(1, 1000).Select(i => (new HttpClient()).GetStringAsync("https://somewebsite.com/" + i));
var responseStrings = await Task.WhenAll(tasks);

Thanks for your help.

1 Answer


Your catch block never fires because GetStringAsync returns a task immediately; the 404 exception is stored inside that task rather than thrown synchronously. You need to use await to observe the exception:

var tasks = Enumerable.Range(1, 1000).Select(i => TryGetStringAsync("https://somewebsite.com/" + i));
var responseStrings = await Task.WhenAll(tasks);
var validResponses = responseStrings.Where(x => x != null);

private static readonly HttpClient httpClient = new HttpClient();

private async Task<string> TryGetStringAsync(string url)
{
  try
  {
    return await httpClient.GetStringAsync(url);
  }
  catch (HttpRequestException)
  {
    // Signal failure with null; the caller filters these out.
    return null;
  }
}
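
As a side note beyond the answer above: if you would rather leave the calls unwrapped, another common approach is to catch the failure from WhenAll and then read back only the tasks that ran to completion. A minimal sketch of that alternative, reusing the same shared httpClient:

var tasks = Enumerable.Range(1, 1000)
    .Select(i => httpClient.GetStringAsync("https://somewebsite.com/" + i))
    .ToList(); // materialize, so the same Task objects can be inspected afterwards

try
{
    await Task.WhenAll(tasks);
}
catch (HttpRequestException)
{
    // At least one request failed; the successful tasks still hold their results.
}

var responseStrings = tasks
    .Where(t => t.Status == TaskStatus.RanToCompletion)
    .Select(t => t.Result);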

5 Comments

Hi Stephen, thanks for your response. In my previous code I was awaiting my GetAsync call, and it seemed to take far too long to execute and return all my Tasks. It seemed like it was awaiting each Task before creating the next one. Would that not be an issue here?
No. This code will execute concurrently. For more information, see my async intro.
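
To illustrate the distinction in the comment above, a minimal sketch, assuming a shared HttpClient named client and a urls sequence:

// Sequential: awaiting inside the loop means each request starts
// only after the previous one has finished.
var sequential = new List<string>();
foreach (var url in urls)
    sequential.Add(await client.GetStringAsync(url));

// Concurrent: all requests are started first, then awaited together.
var tasks = urls.Select(url => client.GetStringAsync(url)).ToList();
string[] concurrent = await Task.WhenAll(tasks);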
Thanks! I tested your code and it's taking about 1 min to complete 200 tasks, which is roughly the performance I was getting before. If these were truly async and parallel, would that mean the last task to finish took a whole minute to retrieve a webpage?
@Zaataro: There are other considerations: there's client-side throttling and also possibly server-side throttling.
Thank you very much Stephen, I adjusted ServicePointManager.DefaultConnectionLimit and it worked like a charm. Your help is greatly appreciated!
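
For reference, a sketch of the adjustment mentioned in the last comment. This applies to the classic .NET Framework HTTP stack, and the value 100 here is only an illustrative choice:

using System.Net;

// Raise the per-host outbound connection limit before issuing requests
// (the .NET Framework default for non-ASP.NET apps is 2).
ServicePointManager.DefaultConnectionLimit = 100;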
