Beyond Gluster
July 22, 2018

I've been thinking a lot lately about what to do after Gluster. Don't worry (or celebrate), Gluster folks; my departure is not imminent. It's just a confluence of several factors.
- I've been doing this for a while. The general computing landscape has changed during that time. Even if Gluster were the best project ever, I'd still like to try something different.
- I'm tired of working in a codebase that's not only C, but old-fashioned even by the standards of most C programmers. I'm tired of not having modern language facilities, and debugging the characteristic C problems associated with their absence. I'm tired of trying to improve the situation and meeting resistance every time.
- There's a particular set of problems from the CloudFS/HekaFS days that are still crying out for solutions. My experience at Facebook, both in the form of operational experience with Gluster and exposure to other projects, has only heightened my awareness of these.
- I'm running out of time. I can see retirement from here. While I'll probably continue hacking even after that, it will be on things where I'm not already an expert, and my chances of "making a dent" without the advantages of working within a larger organization will be much smaller. I sort of have only one more shot to do something that's more than a hobby.
So what problems, within data storage, do I want to solve? They all come down to operational efficiency. A system isn't really scalable unless operations scale as well as capacity, performance, etc. One of Gluster's strengths has always been the "day one" experience of setting up or modifying a system with very few commands, but day two and day three and all the days after often aren't as easy. This is not really a knock against Gluster, BTW. If you think setting up or debugging Gluster is bad, try Ceph; if you think that's bad, try Lustre; and so on. This is more about the general (bad) state of the art. Here are some questions that designers of such systems don't seem to ask themselves often enough, based on my experience in the past year.
- How predictable is performance - in normal operation, under heavy load, in the presence of faults? Peak performance is meaningless. Almost by definition, it's something you'll never see in real life. The performance levels that matter are the one you can guarantee to users, and the one that indicates the system is unhealthy. There's always going to be some margin of error, but a system with tighter margins is preferable to one where the variability is so great that all predictions become meaningless.
- When problems occur - and they will - how do you even know? Most storage systems have no useful built-in health indicators. A large part of the effort Facebook has expended on Gluster has been to add basic metrics - latencies, error rates, frequency of various sub-error events - that can be exported to a monitoring system and used by operators. (There's a small sketch of what I mean right after this list.)
- What is the "blast radius" when a component fails? The whole point of having a distributed system is often to survive single component failures. Unfortunately, a system is only as strong as its weakest link. Even in systems that are generally well designed to handle failure, there are often failure modes that lead to some data becoming unavailable or some operations becoming impossible. Too often, certain real-life error modes trigger some sort of "contagion" that takes out the entire system in only two or three steps.
- What is the blast duration after a failure? Recovery or rebuild speed is an often overlooked aspect of storage-system design. It's all very well to say that the system can keep running despite a failure, but if it's still in a degraded state for too long after a component has been repaired or replaced, that's a problem. In a large enough system, or a large enough set of systems, that will result in a second failure occurring when you're already degraded, and that's bad. Relying on slower, less robust, or less tested secondary code paths for long periods of time can also trigger or exacerbate other problems. (Some back-of-envelope math on why rebuild time matters follows the list.)
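To make the metrics point concrete, here's a rough sketch of the kind of per-operation instrumentation I'm talking about. This is illustrative Python, not Gluster's or Facebook's actual code; the class, operation names, and export format are all made up.

```python
# Minimal sketch of per-operation health metrics. Illustrative only;
# class and metric names here are hypothetical, not any real system's API.
import time
from collections import defaultdict

class OpMetrics:
    """Tracks latencies and error counts per operation type."""
    def __init__(self):
        self.latencies = defaultdict(list)   # op name -> list of seconds
        self.errors = defaultdict(int)       # op name -> error count
        self.calls = defaultdict(int)        # op name -> total calls

    def record(self, op, start, ok):
        self.calls[op] += 1
        self.latencies[op].append(time.monotonic() - start)
        if not ok:
            self.errors[op] += 1

    def snapshot(self):
        """Return percentile latencies and error rates, suitable for export
        to whatever monitoring system the operators already run."""
        out = {}
        for op, samples in self.latencies.items():
            ordered = sorted(samples)
            out[op] = {
                "p50_s": ordered[len(ordered) // 2],
                "p99_s": ordered[int(0.99 * (len(ordered) - 1))],
                "error_rate": self.errors[op] / self.calls[op],
            }
        return out

metrics = OpMetrics()
start = time.monotonic()
metrics.record("lookup", start, ok=True)
print(metrics.snapshot())
```

The point isn't the mechanism - any metrics library will do - it's that percentile latencies and error rates exist at all, and can be scraped by the monitoring system operators already use.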
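And to put rough numbers on blast duration, here's a back-of-envelope calculation of how the chance of a second failure in the same replica set grows with rebuild time. The MTBF and rebuild times are invented for illustration, not measurements from any real deployment.

```python
# Back-of-envelope sketch of why rebuild speed matters; the numbers are
# illustrative assumptions, not measurements from any real system.
import math

def p_overlapping_failure(peer_drives, rebuild_hours, mtbf_hours=1_000_000):
    """Probability that at least one surviving drive in a replica set fails
    before a rebuild finishes, assuming independent exponential failures."""
    rate = peer_drives / mtbf_hours            # combined failure rate per hour
    return 1 - math.exp(-rate * rebuild_hours)

# One failure in a 3-way replica set leaves 2 surviving drives at risk.
for hours in (6, 24, 72):
    p = p_overlapping_failure(peer_drives=2, rebuild_hours=hours)
    print(f"{hours:3d}h rebuild -> P(second failure) = {p:.2e}")

# Because p is roughly linear in rebuild time, a 72-hour rebuild is about
# 12x riskier than a 6-hour one, and the expected number of double failures
# across a fleet of thousands of replica sets scales the same way.
```

Any single event has a tiny probability, but it grows linearly with rebuild time and with fleet size, which is exactly why rebuild speed deserves as much design attention as steady-state performance.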
So, "table stakes" for a post-Gluster system include predictable performance, meaningful metrics, and carefully minimized blast radius/duration. These are all hard to "bolt on" after the fact, no matter how modular the base system is. Like security (another table-stakes feature in its own right BTW) it has to be built in from the start. But wait; it gets better. This is where we get to that point about problems left over from the CloudFS/HekaFS days.
The other key thing that's absent from most distributed filesystems, but often present in other types of distributed data stores, is multi-tenancy. Building separate storage systems for each user, especially if each is built with redundancy to handle failures, is awkward and inefficient. You're going to have multiple users per storage system, and they're going to get in each other's way. Therefore, you need multi-tenancy.
- Users must be isolated from each other in a security sense, so that they can't see or manipulate each other's data.
- Users must be isolated from each other in terms of space used (i.e. quotas must exist and be enforced).
- Users must be isolated from each other in terms of performance. In other words, the performance predictability I mentioned earlier has to hold for each user, regardless not only of faults but also of what other users are doing. This is the hardest part, because it means that usage of many resources - disk or flash I/O, memory, network - has to be allocated among competing users without starving any or "stranding" resources (i.e. preventing users from borrowing resources from others and thus using the system to its full capability when those others are idle). (There's a toy example of this kind of sharing right after this list.)
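To show what I mean by sharing without starving or stranding, here's a toy work-conserving allocator. It's a sketch of the idea only - real systems do this continuously per I/O, not as a one-shot division - and the tenant names, weights, and numbers are hypothetical.

```python
# Toy sketch of work-conserving fair sharing among tenants; the function,
# tenant names, and numbers are hypothetical, not any real scheduler.
def allocate(capacity, weights, demands):
    """Split `capacity` (e.g. IOPS) among tenants in proportion to their
    weights, letting tenants that need less than their share give the
    remainder back so idle tenants don't strand capacity."""
    alloc = {t: 0.0 for t in weights}
    active = {t for t in weights if demands[t] > 0}
    remaining = capacity
    while active and remaining > 1e-9:
        total_w = sum(weights[t] for t in active)
        satisfied = set()
        for t in active:
            share = remaining * weights[t] / total_w
            need = demands[t] - alloc[t]
            if need <= share:
                alloc[t] += need          # tenant is fully satisfied
                satisfied.add(t)
            else:
                alloc[t] += share         # tenant takes its full share
        remaining = capacity - sum(alloc.values())
        active -= satisfied
        if not satisfied:                 # everyone still wants more; done
            break
    return alloc

# Tenant B is idle, so A and C can borrow its share of the 30,000 IOPS.
print(allocate(30000, {"A": 1, "B": 1, "C": 2},
                      {"A": 20000, "B": 0, "C": 25000}))
```

With all three tenants busy, A and C would be capped at 7,500 and 15,000; because B is idle, they get 10,000 and 20,000 instead of leaving B's share stranded.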
One beneficial side effect of designing for multi-tenancy is that keeping resources and data structures separate for multiple users also tends to reduce the blast radius for failures. Even if one user manages to trigger a horrible bug, there's at least a chance that its effect on other users will be minimal.
So my goal for the next project is not to build the fastest or fanciest or even most scalable storage system, or one that spans geographic regions (though that's a separate goal I'd still like to address some day). What I really want is a storage system that both users and operators can rely on to keep doing the same simple job day in and day out, without introducing periodic problems and panics and performance glitches. None of the open-source systems I've seen have those properties. Some proprietary ones claim to, but I've worked at enough proprietary storage companies to take those claims with a huge grain of salt. Besides, even if those claims are true, the system's proprietary nature introduces its own failure modes that run contrary to the overall goal of storage that's reliable in the long term. Wouldn't it be nice if everybody could have a storage system that wasn't a constant source of pain?