I've been working with .NET since its inception and doing parallel programming long before... still, I'm at a loss to explain this phenomenon. This code runs in a production system and has been performing its job for the most part; I'm just seeking a better understanding.

I'm passing 10 URLs into the following for concurrent processing:

    public static void ProcessInParellel(IEnumerable<ArchivedStatus> statuses, 
                                         StatusRepository statusRepository, 
                                         WaitCallback callback, 
                                         TimeSpan timeout)
    {
        List<ManualResetEventSlim> manualEvents = new List<ManualResetEventSlim>(statuses.Count());

        try
        {
            foreach (ArchivedStatus status in statuses)
            {
                manualEvents.Add(new ManualResetEventSlim(false));
                ThreadPool.QueueUserWorkItem(callback,
                                             new State(status, manualEvents[manualEvents.Count - 1], statusRepository));
            }

            if (!(WaitHandle.WaitAll((from m in manualEvents select m.WaitHandle).ToArray(), timeout, false))) 
            throw new ThreadPoolTimeoutException(timeout);
        }
        finally
        {
            Dispose(manualEvents);
        }
    }

The callback is something like:

    public static void ProcessEntry(object state)
    {
        State stateInfo = state as State;

        try
        {
            using (new LogTimer(new TimeSpan(0, 0, 6)))
            {
               GetFinalDestinationForUrl(<someUrl>);
            }
        }
        catch (System.IO.IOException) { }
        catch (Exception ex)
        {

        }
        finally
        {
            if (stateInfo.ManualEvent != null)
                stateInfo.ManualEvent.Set();
        }
    }

Each of the callbacks looks at a URL and follows a series of redirects (AllowAutoRedirect is intentionally set to false to handle cookies):

    public static string GetFinalDestinationForUrl(string url, string cookie)
    {
        if (!urlsToIgnore.IsMatch(url))
        {
            HttpWebRequest request = (HttpWebRequest)HttpWebRequest.Create(url);
            request.AllowAutoRedirect = false;
            request.AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip;
            request.Method = "GET";
            request.KeepAlive = false;
            request.Pipelined = false;
            request.Timeout = 5000;

            if (!string.IsNullOrEmpty(cookie))
                request.Headers.Add("cookie", cookie);

            try
            {
                string html = null, location = null, setCookie = null;

                using (WebResponse response = request.GetResponse())
                using (Stream stream = response.GetResponseStream())
                using (StreamReader reader = new StreamReader(stream))
                {
                    html = reader.ReadToEnd();
                    location = response.Headers["Location"];
                    setCookie = response.Headers[System.Net.HttpResponseHeader.SetCookie];
                }

                if (null != location)
                    return GetFinalDestinationForUrl(GetAbsoluteUrlFromLocationHeader(url, location),
                                                    (!string.IsNullOrEmpty(cookie) ? cookie + ";" : string.Empty) + setCookie);



                return CleanUrl(url);
            }
            catch (Exception ex)
            {
                if (AttemptRetry(ex, url))
                    throw;
            }
        }

        return ProcessedEntryFlag;
    }

I have a high-precision StopWatch around the recursive GetFinalDestinationForUrl call with a threshold of 6 seconds, and generally the callbacks that complete do so within this time.
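
(For context, LogTimer is our own helper; roughly, it is a disposable Stopwatch wrapper that logs a warning when the wrapped block exceeds the given threshold. A sketch of the idea, with the real class differing in detail:)

    // Rough sketch of what LogTimer does (the real class differs in detail):
    // a disposable Stopwatch wrapper that warns when a block exceeds a threshold.
    public sealed class LogTimer : IDisposable
    {
        private readonly Stopwatch sw = Stopwatch.StartNew();
        private readonly TimeSpan threshold;

        public LogTimer(TimeSpan threshold) { this.threshold = threshold; }

        public void Dispose()
        {
            sw.Stop();
            if (sw.Elapsed > threshold)
                Trace.TraceWarning("Block took {0} ms (threshold {1} ms)",
                                   sw.ElapsedMilliseconds, threshold.TotalMilliseconds);
        }
    }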

However, the WaitAll, with a generous timeout of (0, 0, 60) for the 10 threads, still times out regularly.

The exception prints out something like:

System.Exception: Not all threads returned in 60 seconds: Max Worker:32767, Max I/O:1000, Available Worker:32764, Available I/O:1000 at Work.Threading.ProcessInParellel(IEnumerable`1 statuses, StatusRepository statusRepository, WaitCallback callback, TimeSpan timeout) at Work.UrlExpanderWorker.SyncAllUsers()

This is running on .NET 4 with maxConnections set to 100 for all urls.
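
(For reference, that limit is normally set via the connectionManagement/maxconnection element in the config file or programmatically; the programmatic form, shown purely as a sketch, is a single line:)

    // Programmatic equivalent of maxconnection="100" in connectionManagement;
    // affects service points created after it is set.
    System.Net.ServicePointManager.DefaultConnectionLimit = 100;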

My only theory is that a synchronous HttpWebRequest call can block for longer than its specified timeout; that's the only reasonable explanation I can come up with. The question is why, and how best to force a real timeout on that operation.

Yes, I'm aware that the recursive call specifies a timeout of 5s on each call but that it could take several calls to process a given URL. But I hardly ever see the StopWatch warnings. For every 20-30 WaitAll timeout errors that I see, I might see one message indicating that a given thread took longer than 6 seconds. If the problem really is that the 10 threads cumulatively need more than 60 seconds, then I should see at least a 1:1 correlation (if not higher) between the messages.
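
One detail worth checking: HttpWebRequest.Timeout only covers GetResponse() and GetRequestStream(); the body read in ReadToEnd() is governed by the separate ReadWriteTimeout property, which defaults to five minutes. Tightening both would look roughly like this (the values are illustrative):

    // Sketch: Timeout applies to GetResponse()/GetRequestStream(); ReadWriteTimeout
    // applies to reads/writes on the response stream (its default is 300,000 ms).
    request.Timeout = 5000;
    request.ReadWriteTimeout = 5000;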

UPDATE (March 30, 2012):

I'm able to confirm that the network calls alone don't respect the timeouts in certain circumstances:

            Uri uri = new Uri(url);
            HttpWebRequest request = (HttpWebRequest)HttpWebRequest.Create(uri);
            request.AllowAutoRedirect = false;
            request.AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip;
            request.Method = "GET";
            request.KeepAlive = false;
            request.Pipelined = false;
            request.Timeout = 7000;
            request.CookieContainer = cookies;

            try
            {
                string html = null, location = null;

                using (new LogTimer("GetFinalDestinationForUrl", url, new TimeSpan(0, 0, 10)))
                    using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
                    using (Stream stream = response.GetResponseStream())
                    using (StreamReader reader = new StreamReader(stream))
                    {
                        html = reader.ReadToEnd();
                        location = response.Headers["Location"];
                        cookies = Combine(cookies, response.Cookies);

                        if (response.ContentLength > 150000 && !response.ContentType.ContainsIgnoreCase("text/html"))
                            log.Warn(string.Format("Large request ({0} bytes, {1}) detected at {2} on level {3}.", response.ContentLength, response.ContentType, url, level));
                    }
            }
            catch (Exception)
            {
                // exception handling elided from this excerpt
            }

That code routinely logs entries that took 5-6 minutes to complete AND were not larger than 150,000 bytes. And I'm not talking about an isolated server here or there; these are random (high-profile) media sites.

What exactly is going on here and how do we ensure that the code exits in a reasonable time?
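
One way to guarantee an upper bound (a sketch, assuming an Abort-based watchdog is acceptable; the method name and budget are illustrative) is to abort the request once a wall-clock budget expires, which makes a blocked GetResponse() or stream read fail promptly, typically with a WebException whose Status is RequestCanceled:

    // Sketch only: cap the entire request + read at a hard wall-clock budget by
    // aborting the request from a timer. Names and values are illustrative.
    public static string GetWithHardTimeout(string url, TimeSpan budget)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.AllowAutoRedirect = false;
        request.Timeout = 5000;          // GetResponse()/GetRequestStream() only
        request.ReadWriteTimeout = 5000; // reads/writes on the response stream

        // The watchdog fires once after 'budget'; Abort() makes a blocked call fail
        // promptly (usually a WebException with Status == RequestCanceled) instead of
        // hanging for minutes.
        using (var watchdog = new System.Threading.Timer(_ => request.Abort(), null,
                                                         (int)budget.TotalMilliseconds,
                                                         System.Threading.Timeout.Infinite))
        using (WebResponse response = request.GetResponse())
        using (Stream stream = response.GetResponseStream())
        using (StreamReader reader = new StreamReader(stream))
        {
            return reader.ReadToEnd();
        }
    }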

  • In the first code sample (where you set up your jobs) you use m.WaitHandle. In ProcessEntry you use stateInfo.ManualEvent. Is that a typo? Commented Mar 21, 2012 at 16:16
  • If you are on .NET 4.0, you should look at using the TPL. I think that would help clean up this code a bit and make it easier to see what's going on. Example: stackoverflow.com/questions/5067412/… There's a timeout overload (see the sketch after these comments). Commented Mar 21, 2012 at 16:25
  • Are you using maxConnection for all URLs? msdn.microsoft.com/en-us/library/fb6y0fyc.aspx Commented Mar 21, 2012 at 16:32
  • chris: not a typo. aliostad: 100 for all urls, yes. ryan: tempted to move to TPL, but not sure there are any dramatic gains to be had and no reason why this code shouldn't work. Commented Mar 21, 2012 at 17:03
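
For completeness, a minimal sketch of the TPL approach suggested in the comments (not a drop-in replacement; DoWork(status) stands in for whatever the callback does per URL, and no ManualResetEventSlim is needed with tasks):

    // Sketch only: Task.WaitAll has an overload that takes a timeout and returns
    // false when it expires, rather than blocking indefinitely.
    Task[] tasks = statuses
        .Select(status => Task.Factory.StartNew(() => DoWork(status)))
        .ToArray();

    if (!Task.WaitAll(tasks, timeout))
        throw new ThreadPoolTimeoutException(timeout);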

2 Answers

I concur with Aliostad. I do not see any flagrant issues with the code. Do you have any kind of locking in place that is causing these work items to serialize? I do not see any on the surface, but it is worth double checking in case your code is more complex than what you posted. You will need to add logging code to capture the times when these HTTP requests are getting kicked off. Hopefully that will give you some more clues.
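
For example, something along these lines (purely illustrative; log stands in for whatever logger you already use) would show whether the delay is in the thread pool queue or in the HTTP call itself:

    // Illustrative only: record when each item is queued and when its callback
    // actually starts, to separate thread-pool queuing delay from HTTP time.
    DateTime queuedAtUtc = DateTime.UtcNow;
    ThreadPool.QueueUserWorkItem(s =>
    {
        double queueDelayMs = (DateTime.UtcNow - queuedAtUtc).TotalMilliseconds;
        log.Info(string.Format("Work item started after {0:F0} ms in the queue", queueDelayMs));
        callback(s);
    }, new State(status, manualEvents[manualEvents.Count - 1], statusRepository));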

On an unrelated note, I usually avoid WaitHandle.WaitAll. It has some limitations, such as allowing at most 64 handles and not working on an STA thread. For what it is worth, I use this pattern instead:

    using (var finished = new CountdownEvent(1))   // start at 1 so the count cannot reach zero while items are still being queued
    {
        foreach (var item in workitems)
        {
            var capture = item;                    // capture a copy for the closure
            finished.AddCount();
            ThreadPool.QueueUserWorkItem(
                _ =>
                {
                    try
                    {
                        ProcessWorkItem(capture);
                    }
                    finally
                    {
                        finished.Signal();
                    }
                });
        }
        finished.Signal();                         // release the initial count now that everything is queued
        if (!finished.Wait(timeout))
        {
            throw new ThreadPoolTimeoutException(timeout);
        }
    }

I have fully reviewed your code and, as far as I can see, there is no problem with it.

So there seems to be some other issue, but in order to proceed I suggest:

Write a trace, debug, or console output at the start and end of GetFinalDestinationForUrl, and include the URL in the trace.

This should help you pinpoint the problem, and tell you whether HttpWebRequest is not respecting your 5-second timeout or .NET is not respecting your 100 concurrent connections.
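
A rough sketch of what I mean (Trace is just one option; adapt it to whatever logging you already have):

    // Rough sketch: log entry and exit of GetFinalDestinationForUrl with the URL
    // and the elapsed time, wrapped around the existing body.
    var sw = System.Diagnostics.Stopwatch.StartNew();
    System.Diagnostics.Trace.WriteLine("GetFinalDestinationForUrl START " + url);
    try
    {
        // ... existing body of GetFinalDestinationForUrl ...
    }
    finally
    {
        System.Diagnostics.Trace.WriteLine(
            "GetFinalDestinationForUrl END " + url + " (" + sw.ElapsedMilliseconds + " ms)");
    }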

Update your question with the result and I will review again.


UPDATE

I have reviewed your new enhancements. Well done for isolating the problem: now it is confirmed that WaitAll does not respect your timeout.

This now seems to be a Microsoft support issue worth raising with them - unless others can spot the problem from this detail. (It is worth asking Eric Lippert and Jon Skeet to read this question.)

In my personal experience, even when I sent them code to reproduce an issue and they reproduced it, I did not get a response. That was BizTalk, though; this is the .NET Framework, so I imagine you will probably get a better response.


My Crude Theory

I also have a crude theory: at times of high load and maximum context switching, the thread responsible for the wait is not scheduled for much longer than expected, so it does not get a chance to time out and abort all those threads. The other theory is that the threads busy with their IO operations take longer to abort and do not respond to being aborted. As I said, this is crude, and proving or solving it is well beyond my remit.

2 Comments

@Nariman OK, please have a look. As I mentioned, it would be better to include masters such as Skeet and Lippert here.
Thanks @aliostad. I meant to highlight that it is HttpWebRequest that's not respecting the timeout - that it takes 5-6 minutes to complete for requests that are no larger than 150000 bytes. This is Azure -> high-profile media sites.
