Memoirs: Wrap Up (technical)
October 29, 2020

Now that the chronological sequence is done, it's time to distill some of the lessons I've learned from all of those jobs and projects. I've seen a lot of things that work and a lot that don't, so here's some free advice based on those patterns.
Time Spent Up Front Is Worth It
I think of this as something like compound interest applied to human time and effort. Efforts made at the beginning of a project pay dividends throughout its whole lifetime. Even the most heroic efforts at the end of a project can't shorten a timeline that is already past. Serious designs and design reviews can avoid mistakes that would be costly to undo. Tests and formal methods can avoid bugs. Detailed plans can avoid time wasted on conflicts and throwaway code when isolated developers do things in an order other than the one dependency analysis would suggest. I know the "move fast" crowd hates to hear it. They make all sorts of excuses about how these early artifacts end up being out of date, but in all my years I have never seen that outweigh the benefit of having them in the first place. People who repeat that argument just need to grow up.
Have a Clear Memory Management Strategy
Better yet, use a language that takes care of memory management for you, if at all possible. Memory errors - leaks, use after free, double free, etc. - are well known to be the most common cause of both crashes and security bugs in software. My own experience is consistent with this. Make that all go away if you can. Only use "memory unsafe" languages when you really have to, i.e. when your performance isn't already bound by disk or network limits. Usually your performance problems have more to do with other things (e.g. context switches) than with GC overhead. Even when you've ruled out other solutions and decided to take on the cost of manual memory management, you should have clear rules for memory lifecycles. Things can get really complicated when you have requests and associated resources that are enqueued and dequeued and have to wait for network replies (which you should never assume will happen at all) across many threads and continuations. Relying on reference counts doesn't lead to clarity, especially when you try to optimize away some of the reference counting with borrowed references and such. I've seen hundreds of serious bugs in code written by first-class C and C++ experts relying on these approaches. They don't work. Nobody is immune.
The clearest rules are those that depend not on invisible scope/refcounting magic but on visible lifecycle events such as a request being completed. Don't try to optimize allocation behavior for small bits of data associated with transient requests; preallocate as many of those things as you can. Don't invisibly attach random bits of information to a callback; attach them to the object the callback operates on, so that they're visible and obviously deallocated when the parent object is. The fewer parts of the code that have to be involved in managing an object's lifecycle, the more likely you are to get it right.
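To make that concrete, here's a minimal sketch in C of what I mean, with entirely hypothetical names: small transient data comes out of space preallocated in the request itself, larger allocations are tracked on the request, and one visible lifecycle event (request completion) frees all of it.

    /* Sketch only: hypothetical request structure, not from any real codebase. */
    #include <stdlib.h>

    #define REQ_SCRATCH_SIZE 512   /* preallocated space for small transient data */

    struct request {
        unsigned long   id;
        char            scratch[REQ_SCRATCH_SIZE]; /* preallocated, no malloc needed */
        size_t          scratch_used;
        void          **extra;        /* rare large allocations, freed with the request */
        size_t          extra_count;  /* assumes the request was calloc()ed, so these start at zero */
    };

    /* Small transient data comes out of the preallocated scratch area. */
    void *req_alloc(struct request *req, size_t size)
    {
        if (req->scratch_used + size <= REQ_SCRATCH_SIZE) {
            void *p = req->scratch + req->scratch_used;
            req->scratch_used += size;
            return p;
        }
        /* Fallback: track the allocation on the request so it can't outlive it. */
        void *p = malloc(size);
        if (p != NULL) {
            void **bigger = realloc(req->extra,
                                    (req->extra_count + 1) * sizeof(*req->extra));
            if (bigger == NULL) {
                free(p);
                return NULL;
            }
            req->extra = bigger;
            req->extra[req->extra_count++] = p;
        }
        return p;
    }

    /* The one visible lifecycle event: everything dies with the request. */
    void req_complete(struct request *req)
    {
        for (size_t i = 0; i < req->extra_count; ++i)
            free(req->extra[i]);
        free(req->extra);
        free(req);
    }

Callbacks then take the request as their argument; anything they need rides along visibly and dies at req_complete(), so only one place in the code is responsible for the lifecycle.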
Have Clear Error Codes
One of the most frustrating things when debugging is having an error code in your hand and still no idea what it actually means. Sometimes this is because the same error code is used to represent many different underlying errors. When you make a filesystem call and get EIO back, it could mean just about anything. An even worse problem is when error codes are mapped between different sets, often automatically as they're passed across module or network boundaries. Then, when a client gets EBOLA, you can't even look on the server to find where it came from, because it was originally generated as GF_BAD_CHICKEN or WS_OOPSIE and shows up that way in all of the server's logs and event streams. As you might guess from the names, I saw this cause untold wastes of time on both Gluster and Warm Storage ... and in TOPIX, and in REACT, and in Medley, and in MPFS. It's a very common problem, which BTW exists regardless of whether you're using plain error codes or error unions or exceptions. The naming is the problem, not the delivery mechanism.
I'm a fan of error objects which represent an actual instance of an error on a particular call at a particular time, rather than a type of error. One field in that object should be an identifier that pins down, as uniquely as possible, where the error was first recognized. It's often not worth making that totally unique (C/C++ macros using __FILE__ and __LINE__ can get you darned close), but you should try to minimize reuse of the same error ID at different points in the code. If you want to place an error into a category (e.g. EIO vs. ECONNREFUSED) you can add that in addition to the original immutable code. If you want to "ask questions" about an error code (e.g. retryable vs. fatal, failure vs. denial), make sure you use one clear set of helper functions implemented in only one place, not slightly different variants everywhere.
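Here's a rough sketch of that shape in C. The struct and macro names are invented, but they show the original code and location being captured immutably, the category riding alongside rather than replacing it, and the "is this retryable?" question answered in exactly one place.

    #include <errno.h>
    #include <stdbool.h>
    #include <time.h>

    /* Hypothetical error object: one instance per failure, not per failure type. */
    struct err {
        int         orig_code;   /* the code as first recognized, never remapped */
        int         category;    /* e.g. a coarse EIO/ECONNREFUSED-style bucket  */
        const char *file;        /* where the error was first recognized         */
        int         line;
        time_t      when;
    };

    /* Capture the point of first recognition; ID reuse is minimized for free. */
    #define ERR_HERE(e, code, cat) do {          \
            (e)->orig_code = (code);             \
            (e)->category  = (cat);              \
            (e)->file      = __FILE__;           \
            (e)->line      = __LINE__;           \
            (e)->when      = time(NULL);         \
        } while (0)

    /* One place, and only one place, that answers "should we retry this?" */
    bool err_is_retryable(const struct err *e)
    {
        switch (e->category) {
        case EAGAIN:
        case ETIMEDOUT:
        case ECONNREFUSED:
            return true;
        default:
            return false;
        }
    }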
If you're using an RPC mechanism that makes it hard to pass an actual error object instead of a simple numeric code, then (a) you screwed up and (b) at least do yourself a favor and maintain the original error code until the last possible moment instead of mapping it into something less informative as soon as it hits the network. For example, encode errors across all domains in your system as one byte for the domain plus three bytes for the value within that domain. You should be able to squeeze that through even the most deficient RPC system without losing information, and then convert only when necessary.
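A sketch of that packing, with made-up domain names:

    #include <stdint.h>

    /* Hypothetical domains; a real system would enumerate its own subsystems. */
    enum err_domain { ED_POSIX = 1, ED_STORE = 2, ED_NET = 3 };

    /* Pack a domain and a value into one 32-bit code that survives any RPC layer. */
    static inline uint32_t err_pack(uint8_t domain, uint32_t value)
    {
        return ((uint32_t)domain << 24) | (value & 0x00FFFFFFu);
    }

    static inline uint8_t err_domain_of(uint32_t packed)
    {
        return (uint8_t)(packed >> 24);
    }

    static inline uint32_t err_value_of(uint32_t packed)
    {
        return packed & 0x00FFFFFFu;
    }

The receiving side converts to a local code only at the last possible moment, if at all, so the same number shows up in the logs on both ends.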
Document Stuff in the Code
Memory-lifecycle events and error-code translations are two examples of things that should be clearly explained (not just "what" but "why") in the code that handles them. Locking assumptions or requirements are another. Other examples include warnings about pitfalls to avoid, and assumptions about inputs, ordering, or counts/sizes. Again, some will argue that these comments become out of date. Don't let them. Make it part of your culture to keep comments up to date with the code they describe. Make that part of your code review checklist. You do have a code review checklist, don't you? Again, it's possible for stale comments to cause a problem, but in my actual lived experience missing comments have caused more.
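As a tiny made-up illustration, this is the kind of comment I mean: it records the "why" and the locking assumption rather than restating what the code already says.

    struct connection;   /* details don't matter for this illustration */

    /*
     * Why the lock matters: the retry timer can fire on another thread and
     * free conn->pending, so callers must hold conn->lock for the whole call.
     * Reading the pending list unlocked looks harmless but is a use-after-free
     * waiting to happen.
     */
    void flush_pending(struct connection *conn);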
Design For Testability
Some code structures and protocols are simply more testable than others, and this should be one of your top criteria in designing them. Concurrency, parallelism, and reentrancy are all enemies of testability, so try to use them in structured ways. If you need to queue up a bunch of things, make sure it's actually a queue, not some vaguely queue-like thing that forces you to test arbitrary reorderings. If you need to do X and Y in parallel before moving on to Z, by all means do so, but consider processing their results in strict "X then Y" order. Use state machines and check for invalid states. And most of all, don't rely on wall clock time to ensure ordering of events. It never works, and one of the reasons it tends to fail in production is that you can never test it to show whether it works or not. Just don't.
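Here's a small sketch of both ideas, with hypothetical states: the transition table makes "X result first, then Y" explicit, and a single checked transition function gives your tests one choke point instead of scattered ad-hoc checks.

    #include <assert.h>
    #include <stdbool.h>

    /* Hypothetical request states; the exact set doesn't matter, the checking does. */
    enum req_state { RS_NEW, RS_SENT, RS_WAIT_X, RS_WAIT_Y, RS_DONE, RS_FAILED, RS_COUNT };

    /* Explicit transition table: anything not listed here is a bug, not a shrug. */
    static const bool valid_transition[RS_COUNT][RS_COUNT] = {
        [RS_NEW]    = { [RS_SENT]   = true },
        [RS_SENT]   = { [RS_WAIT_X] = true, [RS_FAILED] = true },
        [RS_WAIT_X] = { [RS_WAIT_Y] = true, [RS_FAILED] = true }, /* X's result first... */
        [RS_WAIT_Y] = { [RS_DONE]   = true, [RS_FAILED] = true }, /* ...then Y's, always */
        /* RS_DONE and RS_FAILED are terminal. */
    };

    enum req_state req_transition(enum req_state cur, enum req_state next)
    {
        /* One place to check, and one place to test. */
        assert(valid_transition[cur][next]);
        return next;
    }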
Another thing that kills testability in a distributed system is having everyone check for "liveness" separately. One rule I've found useful is that there should be one part of a system that's responsible for determining whether things are alive or dead. Everything else should just assume they're alive until told otherwise. This avoids the need to test many different liveness checks in different parts of the system, and also eliminates a whole space of "Schrödinger states" where a single part of the system is deemed alive by some peers and dead by others.
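As an interface sketch (names invented, no particular implementation implied), the shape I mean looks something like this: one membership module decides, and everyone else subscribes and otherwise assumes peers are alive.

    /* Hypothetical membership interface: the *only* module that decides liveness. */
    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t node_id_t;

    /* Called by the membership module, and nobody else, once it has decided. */
    typedef void (*death_cb_t)(node_id_t node, void *ctx);

    /* Everyone else registers for notifications... */
    void membership_subscribe(death_cb_t cb, void *ctx);

    /* ...and may ask for the current verdict, but never probes peers directly. */
    bool membership_is_alive(node_id_t node);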
Degrade Gracefully Under Load
Many of the projects I've worked on, especially but not exclusively the distributed ones, eventually had to deal with the problem of being overwhelmed. The clearest example was the Lustre MDS problem at SiCortex. Some kind of admission control or back-pressure really would have helped, and I hear that they implemented something like that after I stopped being involved with it. Having a thousand clients time out and retry independently only makes the problem worse. Recognizing overload early and degrading gracefully ends up being better for everyone, including the clients whose requests are delayed.
Explicit congestion/overload signals work, but only if callers are prepared to recognize and respect them. Another approach that I've used to good effect (on Gluster at Facebook) was for a server to just pause reading new requests for increasing periods when there's too much work already in progress, and let the back-pressure occur in the form of network buffers filling up. That meant consuming the clients' memory instead of the already-overloaded server's, and it also prevented retry storms from aggravating the situation.
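Here's roughly what that looks like with an epoll-style event loop. This is a simplified sketch of the idea rather than the actual Gluster change, and it uses invented high/low water marks instead of increasing pause intervals.

    #include <stdbool.h>
    #include <sys/epoll.h>

    #define HIGH_WATER 1000   /* invented thresholds: tune for your workload */
    #define LOW_WATER   800

    static int  in_progress;  /* requests accepted but not yet completed */
    static bool paused;

    /* Stop polling a client socket for readability when we're already swamped.
     * Unread requests then pile up in kernel socket buffers - eventually the
     * clients' - instead of in our heap, and no retry storm gets started. */
    static void maybe_toggle_reads(int epfd, int client_fd)
    {
        struct epoll_event ev = { .data.fd = client_fd };

        if (!paused && in_progress >= HIGH_WATER) {
            ev.events = 0;                      /* keep the fd, ignore its data */
            epoll_ctl(epfd, EPOLL_CTL_MOD, client_fd, &ev);
            paused = true;
        } else if (paused && in_progress <= LOW_WATER) {
            ev.events = EPOLLIN;                /* resume reading new requests */
            epoll_ctl(epfd, EPOLL_CTL_MOD, client_fd, &ev);
            paused = false;
        }
    }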
There are plenty of other things I could add. Maybe I'll do a Part Two some time, but not right away.