In my experience, most errors in an application are bugs:
- Syntax Error
- Range Error (number out of bounds)
- Reference Error (undefined variable used)
- Type Error
- Custom Assertions
These should be caught before an application is in production. If they aren't caught, then typically the application will crash.
Other types of errors can be "recovered from". Examples include:
- Image Load Error
  - Solution: Try loading again a few times, and/or display a default image.
- Network Error (more general)
  - Solution: Try again. Otherwise, no solution.
- Hardware Error
  - Solution: Perhaps have backup servers idling, waiting to be used.
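The "try again a few times, then fall back to a default" idea could look something like the sketch below. The names `load_with_retry`, `load`, and `default` are made up for illustration, and treating every `OSError` as retryable is an assumption; a real system would probably be pickier about which errors it retries.

```python
import time

def load_with_retry(load, attempts=3, base_delay=0.1, default=None, sleep=time.sleep):
    """Call load() up to `attempts` times; return `default` if every attempt fails."""
    for attempt in range(attempts):
        try:
            return load()
        except OSError:
            # Assumption: OSError (which includes ConnectionError and
            # TimeoutError) marks a transient failure that is worth retrying.
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)  # exponential backoff
    # Every attempt failed: fall back, e.g. to a placeholder image.
    return default
```

Anything that is not an `OSError` is not caught here, so a genuine bug inside `load` would still surface instead of being retried.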
Other types are sort of in the middle:
- Security Error: Trying to access a resource without proper credentials or protocol.
- Out of Memory Error
- Distributed System Errors?
The question is what you should do with errors in general in a distributed system, perhaps focusing on the "middle" ones, instead of just crashing the app.
Perhaps there is a standard approach to this. I have always just crashed the app and logged the error, which would then be documented as a bug. Then we would add code to prevent that crash from recurring, either by resolving the error in some way (e.g. if the parent directory didn't exist when creating a directory with mkdir, switch to mkdir -p), or by sending a notification back to the end user asking them to try again or try something different, or letting them know we logged the error.
But in a purely programmatic system, there is no human to report the error to who could help the process continue (even though we may still log the error for later debugging). Instead, the system needs to either crash or deal with the error itself.
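To make the mkdir example concrete: in Python, `os.mkdir` raises `FileNotFoundError` when the parent directory is missing, while `os.makedirs(..., exist_ok=True)` behaves roughly like `mkdir -p`. The helper name `ensure_dir` is just an illustration.

```python
import os

def ensure_dir(path):
    """Create `path` and any missing parents, similar to `mkdir -p`."""
    # exist_ok=True makes the call idempotent: no error if the directory
    # already exists (it still fails if `path` exists and is not a directory).
    os.makedirs(path, exist_ok=True)
    return path
```

This is the "resolve the error in code" kind of fix: the failure mode is removed rather than caught.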
So that is the main question:
How do you deal with errors in a distributed system?
If you just crash the app, you may lose a lot of in-memory data. So perhaps just before crashing, you do a dump (specified by the application developer), which can be used to "restore" the app (like how the Chrome browser restores your tabs after a crash). Perhaps there are sophisticated ways of resuming from the state just before the error without repeating what caused it, though I don't know how that would really work or whether it is possible. Or maybe the system just acts "cautiously" on that code path the next time around (until the error is fixed): it doesn't evaluate it, and instead returns some standard response that the code base accounts for.
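A rough sketch of the "dump before crashing, restore on restart" idea, assuming the state is JSON-serializable. The function names and the file layout are hypothetical; this is not how Chrome or any particular system does it.

```python
import json
import os
import tempfile

def save_checkpoint(state, path):
    """Write state to a temp file, then rename it over `path` in one step."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    # os.replace is atomic on the same filesystem, so a reader never
    # sees a half-written checkpoint.
    os.replace(tmp_path, path)

def load_checkpoint(path, default):
    """Return the last saved state, or `default` if no checkpoint exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default

def run_with_checkpoint(step, state, path):
    """Run step(state); on any error, dump the state and re-raise."""
    try:
        return step(state)
    except Exception:
        save_checkpoint(state, path)
        raise  # still crash, but with the state preserved on disk
```

One caveat: state dumped at the moment of failure may already be half-modified by the failing step, so checkpointing periodically at known-good points may be the safer variant.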
I am just wondering what the general techniques are for dealing with errors in a distributed system: how to either recover from them or move on (continue processing, try/catch style), so that the system never really crashes and is always in a known clean state. I'm not sure if that is possible, but I thought I'd ask. Obviously you don't want to try/catch every error and ignore it; that would defeat the purpose of errors.
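To illustrate the difference between "move on" and "ignore everything", here is a minimal sketch that catches only errors explicitly marked as recoverable and lets everything else propagate as a bug. `RecoverableError`, `handle_request`, and the `fallback` and `log` parameters are all hypothetical names.

```python
class RecoverableError(Exception):
    """Hypothetical base class for errors the system knows how to handle."""

def handle_request(handler, request, fallback, log):
    """Run handler(request); recover only from errors marked as recoverable."""
    try:
        return handler(request)
    except RecoverableError as exc:
        # Known failure mode: record it and return a standard response
        # that callers are written to expect.
        log(f"recovered: {exc}")
        return fallback
    # Anything else (TypeError, AssertionError, ...) is not caught here,
    # so it propagates and shows up as a bug rather than being swallowed.
```

The design choice is that recoverability is opt-in: an error has to be deliberately classified before the system will continue past it.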