Micro Services, Macro Problems
April 08, 2020

I've been thinking and talking about issues of centralized vs. decentralized systems for a long time. Some of my earliest writings on the subject are still around, as archived Usenet posts from 1989 and 1990. One of the recurring themes through all of those thirty-plus years is that - contrary to what seems to be popular belief - it's entirely possible for a system to be too decentralized. For every system, there is a scale at which coordination costs or internal contention start to outweigh the benefits of having made it distributed or decentralized (not getting into that debate today) in the first place. Usually I've focused on cost in terms of performance. Today I'll mostly focus on cost in terms of correctness and reliability instead.
Background: Gluster
For most of a decade, I worked on Gluster. For those who don't already know, that's a distributed filesystem. There are many decisions to be made regarding the basic design of any such system, and Gluster is to a large degree characterized by two of them.
- Complexity is pushed toward the clients as much as possible. For example, clients do most of the work in finding objects, instead of querying central metadata servers (as e.g. in Lustre or HDFS). Replication is directly from clients to storage servers, not funneled through leaders or proxies. Those storage servers are called "bricks" because they're supposed to be dumb, just responding to simple requests from smarter clients.
- There are very few kinds of servers. Originally there were just the brick daemons. Then there were management daemons as well. For a long time that was it. In later years more kinds of services were added, but mostly those were associated with extra functionality. You didn't have to configure and connect multiple kinds of servers to have even the most basic kind of setup, as e.g. in Ceph.
I'm not going to say whether these decisions were right or wrong, especially since those answers change according to purpose and scale. What I will say is that they helped make it easier to reason about the system, and to test or debug it. To see why, let's create some contrast.
Background: That Other Thing
I work on a very different system now. I can't describe it in too much detail without getting in trouble, but I will call out a couple of differences from Gluster. For one, it works at a much greater scale. The largest Gluster systems out there, indeed the largest ones possible, are a tiny fraction of a single cluster that I work on now. Not unrelatedly, there is also a significant difference in philosophy. One of this system's most notable features is that it is composed of many kinds of services. There are multiple services devoted to detecting and correcting errors, multiple services devoted to various forms of garbage collection, multiple services devoted to rebalancing and transcoding and other kinds of system optimization. And that's where I stop talking about the system I'm not supposed to be talking about, because as far as I can tell it's merely an example of a design approach that's becoming common for all kinds of applications. In a way, I'm only mentioning it to show that I'm familiar with how systems designed in this way work.
The Problem: Correctness
So what's wrong with designing a system like this? It does work, doesn't it? It does scale, doesn't it? Well yes, but with caveats. As often happens, in solving some problems it creates others. One of those is the difficulty of validating that it does in fact work - in all cases, not just in the obvious "happy path" of requests flowing through the system without error or delay. When you have more than a dozen services potentially making changes to the state of a single object, there's a very large "state space" based on the order and timing of those changes. How do you explore that entire space to make sure there are no bugs lurking in its less likely (but still possible) corners? You basically have two options.
- Formal protocol analysis and verification. This requires a fairly specialized set of skills and tools. Also, verification is usually based on a description of the system separate from its actual implementation, so what you've verified is not the system as it really exists.
- Testing. This at least tests The Real Thing (more or less), but the same "process sprawl" that exploded the state space also means solving a very difficult orchestration problem to force every ordering and explore every corner of that space.
Either way, if you really want to prove that your system is correct you're going to need a large engineering team - negating any supposed "developer efficiency" advantage of having built the system this way in the first place.
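To put a number on that state space, here's a quick sketch in Python. The service names and step counts are invented for illustration; the point is just that the count of possible interleavings of a few background services touching one object grows multinomially, far past anything a hand-written test plan will cover.

```python
from math import factorial
from functools import reduce

def interleavings(step_counts):
    """Number of distinct orderings when each service applies its own
    sequence of steps to one object, and steps from different services
    may interleave arbitrarily (the multinomial coefficient)."""
    total = factorial(sum(step_counts))
    return total // reduce(lambda acc, n: acc * factorial(n), step_counts, 1)

# Hypothetical background services touching the same object,
# each with a handful of internal steps.
services = {
    "scrubber": 3,      # read, verify, mark
    "gc": 2,            # check refs, reclaim
    "rebalancer": 4,    # pick target, copy, verify, delete source
    "transcoder": 3,    # read, rewrite, swap
}

print(interleavings(list(services.values())))   # 277,200 distinct orderings
```

And that's before considering failures, retries, or timing effects within each step.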
Sidebar: "stateless services" don't solve this. Unless you're doing linear (non-iterative) pure calculation, you're going to be mutating state. Whether you do it by copying with modification or modification in place is an implementation detail. Whether you do it internally or by trying to fob it off on a separate subsystem and make it their problem is an implementation detail. If you mediate access to state, you're stateful in the way that matters. "Stateless" is about APIs and process life cycles, irrelevant to this discussion.
The Problem: Consistent Performance
Besides correctness, sprawly-microservice systems have another problem: delivering consistent performance to their users. One of the problems we often dealt with in Gluster was how to assign priority to background activities. Set it too high and users would complain about their new I/O being starved. Set it too low and users would complain about those activities never completing. This could get particularly dire when the activity in question was repair ("self-heal"), since any delay left data vulnerable to loss, but it also wasn't much fun when the activity was rebalancing and failure to complete it meant persistent bottlenecks or disk-full errors. Now, here's the kicker: Gluster had it easy. This problem becomes much worse when you have even more instances of even more background services competing with one another and with the ever-present stream of new user requests. Remember what I said right at the beginning about systems being too decentralized? This is how it plays out. Just as Gluster's scale was limited by poor solutions to a coordination problem (mostly at the management-daemon level), other systems at a different scale are almost inevitably going to be limited by all of these pieces contending and competing with one another. You can only let fly and hope for the best for so long before you have to get serious about bringing some order to the system.
My conclusion, after having worked on many more distributed systems than just the two I've already mentioned, is that ad hoc solutions to the internal-contention problem don't work. It does no good to implement careful rate control in each service. That just adds complexity, becoming part of the problem rather than part of the solution. Implementing rate control at a lower level (e.g. the storage servers), with back-pressure to the higher-level services issuing background requests, is slightly better but still imperfect.
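For what it's worth, here's a minimal sketch of that "rate control at a lower level, with back-pressure" option, assuming a hypothetical per-brick admission check for background I/O (all names and numbers are invented). It keeps any one server from being buried, but each background service is still left guessing about every other one, which is why I call it imperfect.

```python
import time

class TokenBucket:
    """Per-storage-server admission control for background I/O (a sketch).
    Foreground user I/O would bypass this entirely."""
    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = burst
        self.last = time.monotonic()

    def try_admit(self, cost=1):
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False    # back-pressure: the caller should requeue and retry later

# A background service wraps each request in an admission check.
bucket = TokenBucket(rate_per_sec=200, burst=50)
if not bucket.try_admit(cost=4):    # e.g. a four-block repair write
    pass                            # requeue the work and back off
```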
The Solution: Distributed Work, Centralized Scheduling
We have a conundrum here: scalability and separation of concerns drive us toward a proliferation of services (yes I do get that), but that poses problems for correctness and consistent performance. I believe the solution to this conundrum lies in distinguishing between work and management. Moving data, scanning data, transforming data - these are all expensive even at the single-operation level, consuming and depleting the same resource (IOPS or MB/s) that we provide to users. They should all be distributed as much as possible and as evenly as possible. To do that, we can't leave the decisions about which moving/scanning/transforming to do where or when up to every individual service or function. That part should be more centralized, to avoid hot spots or lower-priority actions interfering with higher-priority ones. Note that I say more centralized, not completely. As with anything in distributed systems, work needs to be divided and information shared to maintain availability in the face of failures. However, coordination among a small set of schedulers can be much better aligned with policy/goals for much lower cost than coordination among a much larger set of services and tasks doing many different things.
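To make "more centralized" concrete, here's a minimal sketch of what a background-work scheduler might look like. The class, service names, and priorities are all hypothetical; a real system would also replicate or shard the scheduler itself for availability, as noted above.

```python
import heapq
import itertools

# Lower number = more urgent.  These service names are purely illustrative.
PRIORITY = {"repair": 0, "rebalance": 1, "transcode": 2, "scrub": 3}

class BackgroundScheduler:
    """Central admission point for background work (a sketch).
    Services submit work items; the scheduler decides what runs, and when."""
    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight
        self.in_flight = 0
        self.queue = []                 # heap of (priority, seq, (service, item))
        self.seq = itertools.count()    # tie-breaker keeps FIFO order within a priority

    def submit(self, service, item):
        heapq.heappush(self.queue, (PRIORITY[service], next(self.seq), (service, item)))

    def next_grant(self):
        """Return the next (service, item) allowed to run, or None if at the limit."""
        if self.in_flight >= self.max_in_flight or not self.queue:
            return None
        self.in_flight += 1
        return heapq.heappop(self.queue)[2]

    def complete(self):
        self.in_flight -= 1

sched = BackgroundScheduler(max_in_flight=2)
sched.submit("rebalance", "move shard 17")
sched.submit("repair", "re-replicate shard 4")
print(sched.next_grant())   # repair wins even though it was submitted second
```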
What this ends up looking like is a lot like systems designed to deal with interoperability between different communication protocols or data formats. Translating directly between any two of N protocols or formats, where N is large, is a mess. It wastes time implementing O(N^2) converters, and creates its own kinds of incompatibility as each converter has its own quirks. Instead, people have learned the hard way to use a "fan in, fan out" approach. Convert originals into a single format, hopefully one expressive enough to capture the union of what the other formats can express, and then convert out to whatever final form is desired. Similarly, the scheduling layer sits between M services generating requests and N services satisfying them, needing to consider ordering or balance among only M+N entities instead of M*N. With, say, ten request-generating services and a hundred storage servers, that's 110 relationships to reason about instead of a thousand.
But doesn't this add a "hop" to every such request, and thus degrade performance? Well, yes and no. For one thing, keep in mind that I'm talking about scheduling of background/maintenance requests here. I wouldn't necessarily recommend running user I/O through the same scheduler. Users want what they want, and should as much as possible get it. We can't control that, and shouldn't try except to protect other users or the integrity of the system as a whole. Also, even if scheduling decisions are centralized, that doesn't mean the requests actually have to flow over the network twice. There are all sorts of feedback mechanisms and token schemes that can be used to implement global traffic control without routing requests through the schedulers themselves.
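One way to get that effect, sketched below with invented names and numbers, is a lease or token grant: the scheduler hands a service a budget of operations against a particular storage server, the service sends those requests directly, and only the small lease-renewal messages ever touch the scheduler.

```python
class LeaseGrant:
    """A grant of `ops` operations that a service may send directly to `server`
    before it has to go back to the scheduler for more (a sketch)."""
    def __init__(self, server, ops):
        self.server = server
        self.ops_left = ops

    def spend(self):
        if self.ops_left == 0:
            return False    # exhausted: renew with the scheduler
        self.ops_left -= 1
        return True

class TokenScheduler:
    """Hands out leases according to whatever policy it enforces (trivial here)."""
    def grant_lease(self, service, server, size=8):
        return LeaseGrant(server, size)

def run_background_batch(scheduler, service, server, work_items, send):
    """Data flows service -> server; only lease renewals involve the scheduler."""
    lease = scheduler.grant_lease(service, server)
    for item in work_items:
        while not lease.spend():
            lease = scheduler.grant_lease(service, server)
        send(server, item)      # direct request, no extra network hop
```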
Evaluation and Conclusion
So how does this work out in terms of the two problems - correctness/testing and consistent performance - mentioned above? In terms of correctness, the regular scheduler can be replaced for testing with one that generates pathological orderings of requests instead of optimal ones, with no change necessary to the individual services. That makes it much easier (and faster) to test edge cases. Admittedly, it doesn't help with the sheer number of such cases that might exist, but it's still better than hoping they'll be caught in a "free for all" environment.
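Here's a sketch of what that substitution could look like, assuming the services only ever talk to the scheduler through a narrow submit/grant interface like the one sketched earlier. The test double ignores priority and releases work in a seeded pseudo-random order, so any ordering that triggers a bug can be replayed exactly.

```python
import random

class AdversarialScheduler:
    """Drop-in test double for the scheduler: same submit/next_grant interface,
    but queued work is released in a seeded pseudo-random order."""
    def __init__(self, max_in_flight, seed):
        self.max_in_flight = max_in_flight
        self.in_flight = 0
        self.queue = []
        self.rng = random.Random(seed)

    def submit(self, service, item):
        self.queue.append((service, item))

    def next_grant(self):
        if self.in_flight >= self.max_in_flight or not self.queue:
            return None
        self.in_flight += 1
        return self.queue.pop(self.rng.randrange(len(self.queue)))

    def complete(self):
        self.in_flight -= 1

# Same seed, same interleaving: an ordering that fails in CI can be replayed locally.
sched = AdversarialScheduler(max_in_flight=4, seed=20200408)
```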
With respect to consistent performance, we can get even closer to ideal behavior. Is a particular service crowding out others? Stop accepting so many of its requests. Is a particular storage server getting overwhelmed? Stop issuing so many requests to it. Either way, there's a clear line between policy and effect, with the effect strictly enforced. The rebalancing service can't crowd out the repair service (very bad!) or keep on overloading some set of storage servers, because the scheduler won't let it. If you want the scheduler to be only a safety net, intervening only to avoid total meltdown, you can do that. If you want the scheduler to be more active, even sacrificing overall performance for the sake of predictability or multi-tenant isolation, you can do that too. Instead of being at the mercy of design decisions made years ago, often in haste, you can actively tune the system in useful ways.
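As a rough sketch of what those knobs might look like as configuration (every field name here is invented), the same scheduler can run as a light-touch safety net or as an active arbiter just by changing its policy:

```python
from dataclasses import dataclass

@dataclass
class SchedulerPolicy:
    """Illustrative policy knobs; any real system's will differ."""
    mode: str                   # "safety_net" or "active"
    max_cluster_bg_iops: int    # global ceiling on background I/O
    per_service_share: dict     # fraction of that ceiling per service
    per_server_ceiling: int     # protect individual storage servers

# Safety net: intervene only when the system is near overload.
safety_net = SchedulerPolicy(
    mode="safety_net",
    max_cluster_bg_iops=500_000,
    per_service_share={"repair": 1.0, "rebalance": 1.0, "scrub": 1.0},
    per_server_ceiling=2_000,
)

# Active arbiter: repair always outranks rebalancing and scrubbing,
# trading some raw throughput for predictability.
active = SchedulerPolicy(
    mode="active",
    max_cluster_bg_iops=200_000,
    per_service_share={"repair": 0.6, "rebalance": 0.3, "scrub": 0.1},
    per_server_ceiling=800,
)
```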
I'm not against microservices. Every architecture has its characteristic challenges. What I'm saying is just that those challenges have to be acknowledged and addressed. Microservice-based architectures don't have to be like traffic without stop signs or signals, plagued by gridlock and people cutting each other off and sometimes crashing. A tiny bit of centralized control in an otherwise distributed system can go a long way to make everything safer and smoother for everyone.