How to deal with errors in a distributed system

Question

In my experience, most errors in an application are bugs:

Syntax Error
Range Error (number out of bounds)
Reference Error (undefined variable used)
Type Error
Custom Assertions

These should be caught before an application is in production. If they aren't caught, then typically the application will crash.

Other types of errors can be "recovered from". Examples include:

Image Load Error
- Solution: Retry a few times, and/or display a default image.
Network Error (more general)
- Solution: Try again. Otherwise, no solution.
Hardware Error
- Solution: Perhaps keep backup servers idling on standby.
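
As a rough illustration of the "try again, otherwise fall back" idea in the list above, here is a minimal Python sketch. The names `with_retries`, `flaky_load`, and `always_down` are hypothetical, and the sketch assumes transient failures surface as `OSError` (which the built-in `ConnectionError` subclasses).

```python
import time


def with_retries(operation, attempts=3, base_delay=0.1, fallback=None, sleep=time.sleep):
    """Call operation(); retry on OSError with exponential backoff, then fall back."""
    for attempt in range(attempts):
        try:
            return operation()
        except OSError:
            # Assumed transient: wait longer before each new attempt.
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    # Every attempt failed, so return the default (e.g. a placeholder image).
    return fallback


# Hypothetical loader that fails twice before succeeding.
calls = {"count": 0}


def flaky_load():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated network failure")
    return "image-bytes"


# Hypothetical loader that never succeeds.
def always_down():
    raise ConnectionError("simulated outage")


# The sleep function is replaced with a no-op so the demo runs instantly.
recovered = with_retries(flaky_load, sleep=lambda seconds: None)
defaulted = with_retries(always_down, fallback="default.png", sleep=lambda seconds: None)
```

Here `recovered` ends up as the loaded value after two failed attempts, and `defaulted` ends up as the fallback. Only `OSError` is retried on purpose: programming errors such as `TypeError` still propagate immediately.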

Other types are sort of in the middle:

Security Error: Trying to access a resource without proper credentials or protocol.
Out of Memory Error
Distributed System Errors?

The question is what you should do with errors in general in a distributed system, perhaps focusing just on the "middle level" ones, instead of simply crashing the app.

Perhaps there is a standard approach to this. I have always just crashed the app and logged the error, which would then be documented as a bug. Then we would add some code to prevent that crash from occurring again, either by resolving the error in some way (e.g. if the parent directory didn't exist when creating a directory with mkdir, maybe switch to using mkdir -p), or by sending a notification back to the end user asking them to try again or try something different, or letting them know we logged the error.
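
For illustration, the Python counterpart of switching from mkdir to mkdir -p might look like the sketch below. The directory names are made up; `os.makedirs(..., exist_ok=True)` creates missing parent directories and does not fail if the target already exists.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Hypothetical nested path whose parent does not exist yet.
    target = os.path.join(root, "missing_parent", "child")

    # Like plain `mkdir`: fails because the parent directory is missing.
    try:
        os.mkdir(target)
        plain_mkdir_failed = False
    except FileNotFoundError:
        plain_mkdir_failed = True

    # Like `mkdir -p`: creates parents as needed and tolerates repeats.
    os.makedirs(target, exist_ok=True)
    os.makedirs(target, exist_ok=True)  # second call is a no-op
    created = os.path.isdir(target)
```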

But in a purely programmatic system, there is no human to report the error back to for aiding the continuation of the process (even though we may log the error for later debugging). Instead, the system needs to either crash or deal with the error.

So that is the main question:

How to deal with errors in a distributed system.

If you just crash the app, you may lose a lot of in-memory data. So perhaps just before crashing, you do a dump (specified by the application developer) that can be used to "restore" the app (like how the Chrome browser restores your tabs after a crash). Perhaps there are sophisticated ways of resuming from the state just before the error without repeating what caused it; I don't know how that would really work or whether it is possible. Or maybe the system just acts "cautiously" on that code path the next time around (until the error is fixed): it skips evaluating it and instead returns some standard response that the code base accounts for.
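
To make the "dump, then restore" idea a bit more concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the function names, the JSON file format, and the toy job that sums a list of strings. It writes the state in a `finally` block and resumes from that state on the next run.

```python
import json
import os
import tempfile


def save_checkpoint(state, path):
    """Write state to a temp file, then atomically swap it into place."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path, default):
    """Return the saved state, or default if none exists or it is unreadable."""
    try:
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return default


def run(items, path):
    """Sum the items, resuming from the last checkpoint if there is one."""
    state = load_checkpoint(path, {"next_index": 0, "total": 0})
    try:
        for i in range(state["next_index"], len(items)):
            state["total"] += int(items[i])  # raises ValueError on bad input
            state["next_index"] = i + 1
    finally:
        # Dump the state whether the loop finished or an error is propagating.
        save_checkpoint(state, path)
    return state


with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "checkpoint.json")
    try:
        run(["1", "2", "oops", "4"], path)  # fails on the third item
    except ValueError:
        pass
    saved = load_checkpoint(path, None)
    resumed = run(["1", "2", "3", "4"], path)  # corrected input, same checkpoint
```

After the failed run, `saved` is `{"next_index": 2, "total": 3}`, and the second run picks up from there, finishing with `{"next_index": 4, "total": 10}`. A `finally` block does not run if the process is killed outright or the machine loses power, so writing checkpoints periodically is usually the more robust variant.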

I am just wondering what the general techniques are for dealing with errors in a distributed system: how to either recover from them or move on (continue processing, try/catch style), so that the system never really crashes and is always in a known clean state. I'm not sure it is possible, but I thought I'd ask. Obviously you don't want to try/catch every error and just ignore it; that would defeat the purpose of errors.

How do you define the term "distributed system" in this context?
– Dan Pichelman, Commented Jul 3, 2018 at 19:08
You should throw exceptions like you would in a local application. They should bubble up back to the calling client; your middleware will facilitate that.
– Martin Maat, Commented Jul 4, 2018 at 5:33

user53019 · Accepted Answer · 2018-07-04 13:40:47Z

I have a bias towards writing software for businesses, or what could be called lower-case 'e' enterprise software. There are a number of incorrect assumptions behind the premise of your question. Let's start with:

most errors in an application are bugs

Unless we're quibbling over the definition of "most", there's a very poor assumption here, and another in your definition of bugs. All of those examples are failures to appropriately sanitize & check the inputs to your methods or services. We don't need to be Marvin the Paranoid Android, but our methods need to make sure they weren't given garbage before we put that code into production.

The next major assumption is here, and while I cherry-picked this comment, there's a recurring theme in your question.

If they aren't caught, then typically the application will crash.

For the love of all that is good and sacred, no, just plain no. Do not crash the application just because it received some bad input or failed a logical condition or ... anything. Log the error, throw an exception, keep servicing the bajillion other calls that the application needs to support.

To paraphrase wiser people, the difference between a bug and a malicious attack is intent. Coincidentally, you can protect against both at the same time: recognize (guard check), report (log), run away (throw an exception).
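
A minimal Python sketch of "recognize, report, run away" might look like this. The `place_order` function, the `InvalidOrderError` class, and the shape of the order are all hypothetical, chosen only to show the pattern.

```python
import logging

logger = logging.getLogger("orders")


class InvalidOrderError(ValueError):
    """Raised when an order fails its guard checks."""


def place_order(order):
    # Recognize: guard check on the input before doing any work.
    quantity = order.get("quantity") if isinstance(order, dict) else None
    if not isinstance(quantity, int) or isinstance(quantity, bool) or quantity <= 0:
        # Report: log enough detail to investigate later.
        logger.warning("rejected order: %r", order)
        # Run away: refuse the call without taking the process down.
        raise InvalidOrderError("quantity must be a positive integer")
    return {"status": "accepted", "quantity": quantity}
```

The same guard rejects an honest mistake and a deliberately malformed request alike; the caller decides what to do with the exception.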

I used to think much like you do until I learned that the purpose of (enterprise) software is to fulfill the business' rules. There is rarely sufficient time to find and fulfill 100% of those rules, so we build systems that we hope are robust enough to handle the vast majority of the paths we might follow.

We then use guard checks, logging, and exceptions to protect against the cases we or the business didn't anticipate. Centralized logging makes it a lot easier to aggregate issues. Automated monitoring makes this even more effective. Unhandled exception handlers are another technique for keeping the application alive.
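
As a sketch of what such a last-resort handler can look like, the Python below wraps each request in a catch-all so that one failure is logged and reported while the remaining requests are still served. The `handle` and `serve` functions and the request shape are hypothetical.

```python
import logging

logger = logging.getLogger("service")


def handle(request):
    # Hypothetical handler: fails on a zero or missing divisor.
    return 100 // request["divisor"]


def serve(requests):
    """Process every request; a failure affects only the request that caused it."""
    responses = []
    for request in requests:
        try:
            responses.append({"ok": True, "result": handle(request)})
        except Exception:
            # Last-resort handler: log with traceback, answer with an error,
            # and move on to the next request.
            logger.exception("request failed: %r", request)
            responses.append({"ok": False, "error": "internal error"})
    return responses


responses = serve([{"divisor": 5}, {"divisor": 0}, {}, {"divisor": 4}])
```

The second and third requests fail and are logged, while the first and fourth succeed. This fits handlers that share no mutable state; if a failure could leave shared state in doubt, restarting that part of the system may be the safer response.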

From there, we monitor, evaluate, and iterate as needed and based upon the business' priorities.

Avoiding a crash due to bad user input is perfectly sensible. Avoiding a crash in a subsystem due to bad input from another system (not user input) is bad advice. Software is built on preconditions, invariants, and proofs; once the invariants or preconditions are invalid, the proofs are invalid, and the entire state of the subsystem is suspect. In such a situation, a "crash" (restart) of that portion of the subsystem is often the only option for recovery.
– Frank Hileman, Commented Jul 4, 2018 at 16:38
@FrankHileman I think you are agreeing with Glen on that point. The overlap is clearest if we consider a stateless service that handles requests, for example a web server: if an exception is encountered, the current request should fail/crash/whatever, but the service as a whole should log the problem and continue with the next request. It then doesn't matter if the cause of the failure was external to the organization, external to the service, or internal to the service.
– amon, Commented Jul 4, 2018 at 17:23
