Error Handling
everyone does it wrong
March 29, 2020

Some thoughts about error handling have been running around my head for a while, sometimes leaking out in code reviews or internet discussions, so maybe it's about time I got them down in semi-coherent form. I don't think error handling is something that most programmers learn early enough in their careers. After all, it really doesn't matter if the code you write for a class assignment handles all possible error conditions properly. In fact, littering the code with error handling kind of gets in the way of whatever the purpose of that particular exercise was. Sure, some graders will demand a kind of rote, performative error handling, but very few will actually demand that code run in an environment where errors (not counting malformed input) are actively injected to prove that the error handling is sufficient. Far more expect assignments to be written in a style (e.g. functional) that leaves out error handling entirely. So people come out of school with bad habits.
Out in the real world, by contrast, proper error handling is absolutely essential and often has more code devoted to it than to the "happy path" without errors. Sure, sometimes you can get away with not checking some errors. If an uncaught error causes an interactive script to fail, but it's a rare case and the script can easily/safely be re-run, maybe it's not worth jumping through any hoops. At hyperscale, it might even be OK if a rare error takes out a task occasionally, so long as it really is rare and restarting the task is not an issue. In some ways it might be better to have the task fail, if that makes the error more visible than a log entry would. But in general, it's better to handle errors than let them stop your program.
That idea can be extended. My opinion about different error-handling strategies is that it's not acceptable to just drop an error on the floor silently. That includes the case where errors are made invisible, as we'll get to in the discussion of particular techniques. If it's not handled, an error should be explicitly cleared or passed on. (In fact, I think it would be valid to say that handling an error must include explicit clearance, so it's not a separate case at all.) Think of an error as a little "poison pill" that cannot pass a function/method boundary without being "detoxified" first. A similar approach is the "signaling NaN" in IEEE floating-point math: an explicitly invalid value representing things like uninitialized data, overflow/underflow, or division by zero. Any attempt to use it in a subsequent calculation will generate an exception, unless it's explicitly converted to a "quiet NaN" first. This is a good model, except that it takes effect at run time instead of compile time.
With all that as background, let's look at some of the common approaches to error handling.
- The most basic kind of error handling, supported by just about every language for just about forever, is plain error codes. Some or all functions return error codes, and it's up to the caller to do something with them. This is the idiom everyone loves to hate - even those who think alternatives are worse. It makes writing and changing code very tedious. Nobody enjoys having to modify several layers of code to propagate errors in code where none had previously been possible (e.g. because they were pure calculation). Even worse, this approach makes it too easy to drop errors silently. Just don't look at the value, or don't even capture it in a local variable. Some compilers will (finally!) warn about this, but neither support nor usage is anywhere near universal and it's all too easy to quiet the compiler without actually handling the error (e.g. an "if" with an empty body).
P.S. Multiple return values clean up some of the mess, but are still sufficiently similar to plain error codes that there's little to discuss about them separately.

- The second most common kind of error handling - or maybe it's first by now - is exceptions. There's great appeal here for those who are tired of the problems with plain error codes. I felt that appeal myself, when exceptions first entered mainstream programming. (Yes, I'm that old.) I still feel it sometimes, when I'm in a hackier mood. It sure is nice to be able to throw an exception at the bottom of a deep call stack and catch it at the top, without having to change every layer in between to propagate it. It's like a secret back channel between the two, and it's super convenient. Like many powerful techniques, I still think exceptions are cool when used very sparingly.
The problems with exceptions become more apparent when they're not used sparingly, and particularly at project (vs. transient or individual) scale. "Uncaught exception" might or might not be better than silent failure, especially if it might end up being shown to an end user, but in no case is it actually the right thing to have happen. Unfortunately, this is an all too common outcome in exception-heavy languages like Java or Python, and it's common because it can be hard to avoid. When you have a large team modifying different parts of a large codebase, it's very easy for one person to throw an exception that gets caught in the path that person knew about but remains uncaught in some other path - then boom! There goes your whole program. It's also easy for the exception to "fly right over the head" of code between the thrower and the catcher, effectively but invisibly changing its behavior in ways that its own tests never seem to catch. Super-paranoid use of finally/scope-exit/defer constructs can avoid the worst ills, but in practice the result always seems to be that once-clean and well-tested code keeps becoming less clean and well-tested as exceptions are added.
Lastly, exceptions tend to be very RPC-unfriendly (where "RPC" is used loosely to include things like futures). Every attempt I've seen to reconcile exceptions with RPC has been a total nightmare of error-prone ad hoc wrapper types. If you're going to do that, you might as well go with a more thought-out version of wrappers (read on).

- One thing that's not actually error handling, but is often confused with it, is assertions. I only mention these because I've worked on too many codebases (including my current one) where people made that mistake. An assertion is not the same as a proper error check. It escalates what might have been a harmless error into a fatal one. An assertion is no different from throwing an AssertionFailure exception that nobody ever catches, and it shares all of the same drawbacks. While assertions do have their value in checking the assumptions with which code was written, they're best used as an adjunct to real error checking. Using them instead of proper error handling is the mark of a lazy programmer.
- The "new kid on the block" when it comes to error handling is optional/either/result types, including (non-)nullable types, all a subset of union types. Proponents of this approach claim that it avoids the tedium inherent in plain old error codes, but I don't find that's really true. Sure, when you're first writing code, passing an error on can be expressed more succinctly, and some syntactic sugar can be added to do likewise for handling one (e.g. "match" in Rust). OTOH, you can still get stuck converting N layers of code to wrap their return values in option/either/result types. Compared to exceptions, optional types avoid both the uncaught-exception and "flying over their heads" problems, plus they provide a clean wrapper abstraction when RPC is involved, so they're more purely beneficial.
It would seem that optional types are the clear winner here, so why am I still writing? Because they're still ... well, optional. Even if the language supports them, programmers don't have to use them. I'll illustrate this point with a picture.
The easiest way to get into that last/best corner is to start with optional types and make them the default. In other words, a function by default returns a {normal_value, error} union unless explicitly tagged as pure (in which case it's a compile error for it to construct and return an error value, or pass one on without unwrapping it). That alone is not sufficient, though. Consider the following function.
void do_a_thing() {
    int? x = do_first_part();   // "int?" means an int/error union
    if (x.is_value()) {
        do_second_part(x.unwrap());
    }
}
So yes, we did check for an error before calling do_second_part, but if do_first_part did return an error we're dropping it on the floor. Remember what I said about poison pills? We can't let them go out of scope without "detoxifying" them. By contrast, this would be OK.
void do_a_thing() {
    int? x = do_first_part();
    if (x.is_error()) {
        return x;   // yes, returning an error from a "void" function is OK
    }
    // More concisely, with a bit of syntactic sugar:
    //     int x = do_first_part() catch |err| return err;
    // Or, even more concisely:
    //     try int x = do_first_part();
    do_second_part(x.unwrap());
    // With the alternatives, we could do:
    //     do_second_part(x);
}
In this case, we are handling the error if there is one. With the extra syntactic sugar, we can also make the non-error case very concise. But what if our error response is something other than to pass the error on? For example, we might try a different method, or a different server if what we're writing is some kind of network client. So try this on for size.
void do_a_thing() {
    int? x = do_first_part();
    if (x.is_value()) {
        try do_second_part(x.unwrap());
    } else {
        clear x;
        try do_the_other_thing_instead();
    }
}
Here, clear is a new keyword to do that detoxification so that the compiler doesn't yell at us when x goes out of scope. If we really wanted to, we could make this more concise with some extra syntax.
void do_a_thing() {
    int x = do_first_part() catch |err| {
        // It would almost be OK to make clearing implicit in this case.
        // That would save one line, but better IMO to keep it explicit.
        clear err;
        try do_the_other_thing_instead();
        return;   // skip do_second_part, as in the previous version
    };
    try do_second_part(x);
}
Alternatively, we could discriminate among error types and return some immediately instead of clearing and trying an alternative method. I'll leave that as an exercise for the reader. The important thing is that we've followed our rule of never silently dropping an error on the floor, while making it as painless as possible to Do The Right Thing instead. Combine this with a defer/finally kind of construct to do cleanup when errors are returned (either explicitly or semi-explicitly via try) and I think it's about as good as real-world programming can get.
The existing language that comes closest to this ideal is Zig, which is why I've borrowed some of their catch/try syntax. However, I think their use of "!" in function definitions is the inverse of what I would have chosen. They use it to mean the function does return an error union instead of a plain type, whereas I would assume functions return error unions and use "!" to mean they do not. (NB "?" still makes sense for types IMO because they're still assumed to be plain.) Also, they don't treat "silent dropping" as a compile error and thus there's no explicit clear to indicate that the dropping is deliberate.
If I ever do implement my own programming language (most likely to facilitate easier distributed programming) this is the error model I'd start with. Maybe the syntax would end up being very different but, after 30+ years of cleaning up other people's messes because they lack the self-discipline to do the right thing when languages make that harder than necessary, I'd like to try writing code in a language that treats error handling as a first-class concern and encourages correct programming.