Distributed Functions
December 28, 2018

I've been thinking about better ways to write distributed-system code for a long time. I've talked to quite a few people about some of these ideas. The Christmas break seems like a good time to sit down and write them out in a bit more detail.
My basic motivation is that I feel I've already wasted too much time dealing with code that's splattered across many different functions with different scopes/environments so that it can run across many machines. (BTW, lambdas and continuations are absolutely no different in this regard; they amount to little more than syntactic sugar for the same thing.) Such code is hard to write, and even harder to read or reason about. If it's hard to reason about, it's hard for humans to modify safely and productively. It also makes testing, static analysis, and formal verification harder. It's not that these things can't be done. I've had a long and lucrative career doing them, because I'm fortunate to have those skills. My point is that, even for people who have the ability, these tasks are unnecessarily tedious and error-prone. I'd rather spend more of my time building new stuff instead of having to debug (my own or other people's) old stuff.
Let's look at how this tedium affects even the very simplest kind of distributed program. Imagine a basic chat system: you type something, and everyone else in the same room/channel/whatever gets to see it. So there are clients and at least one server. The clients can be further subdivided into originating clients and responding clients - you typing the message and others seeing it, respectively. Using standard RPC techniques, you'll have to create at least four things. First is the RPC definition.
sendNewMessage (string text);
forwardMessage (string user, string text);
Note that even in this example the RPC subsystem is doing a lot of grunt work for you - packing and unpacking arguments, managing connections, etc. There are even worse methods still in use. We've already come a long way, but I hope to show that we're still far from where we should be. Now you need to create a function for the originator to send a message.
userTypedSomething (string text) {
    // Assume 'server' is predefined somewhere, and server knows who we are.
    try {
        server->sendNewMessage(text);
    }
    except (originator RPC errors) {
        ...
    }
}
Then you need to create the server's handler when a new message comes in.
sendNewMessageHandler (string user, string text) {
    // Assume clientList is predefined somewhere.
    foreach (client in clientList) {
        try {
            client->forwardMessage(user, text);
        }
        except (responder RPC errors) {
            ...
        }
    }
}
Finally, we have to define the responder's handler.
forwardMessageHandler (string user, string text) {
    displayOnScreen(user, text);
}
Phew. That was tedious, wasn't it? A lot of text to describe something that was probably obvious to anybody who has used RPC before ... and that's the point. As with the text, so with all RPC code ever. It's a pain in the butt. Note that this is a deliberately trivial example, and even so many details have been elided. What if there's a non-RPC error (e.g. permission denied) somewhere in this sequence? What do the makefiles have to look like to build this? What do the tests have to look like?
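To make the splatter concrete, here's a rough single-process Python sketch of the classic-RPC structure above. All of the names here (Client, Server, the snake_case methods) are my own invention, and direct method calls stand in for a real RPC transport, but notice that the logic is still split into the same three handlers, one per role:

```python
class Client:
    def __init__(self, user, server):
        self.user = user
        self.server = server
        self.screen = []  # lines "displayed" on this client

    # Originator side: called when the local user types something.
    def user_typed_something(self, text):
        self.server.send_new_message(self.user, text)

    # Responder side: handler invoked by the server for each forwarded message.
    def forward_message_handler(self, user, text):
        self.screen.append(f"{user}: {text}")


class Server:
    def __init__(self):
        self.client_list = []

    # Server side: handler invoked when a new message arrives.
    def send_new_message(self, user, text):
        for client in self.client_list:
            client.forward_message_handler(user, text)


# Wiring it up: one server, two clients in the same "room".
server = Server()
alice = Client("alice", server)
bob = Client("bob", server)
server.client_list = [alice, bob]

alice.user_typed_something("hello")
```

Even with the transport faked out like this, changing anything about a message means touching the originator, the server handler, and the responder handler separately.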
Now imagine something more complicated, like a distributed lookup that has several stages and some of those stages have to do a fan-out/fan-in across multiple connections. There are several places in Gluster where that means thirty functions all representing different parts of what is basically one not-that-complicated algorithm. Let's take a look at something that might make this easier by reducing the "friction" each time the flow of control has to move from one machine to another.
userTypedSomething (string text) {
    on (server) with origUser = user {
        foreach (client in clientList) {
            on (client) {
                displayOnScreen(origUser, text);
            }
            except (responder RPC errors) {
                ...
            }
        }
    }
    except (origin RPC errors) {
        ...
    }
}
The key piece here is the "on" statement, which represents moving the flow of control from one machine (process) to another within the same function. Along the way, it automatically figures out which variables from the caller's scope need to be forwarded to the callee (e.g. 'text') and packs/unpacks them appropriately. The "with" modifier is necessary to resolve an ambiguity: by the time we get to displayOnScreen, 'user' would refer to the user on the responding client, but we need the user on the originating client. The "with" modifier is how we add variables to inner scopes beyond the ones that can be found automatically.
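For the curious, here's a hypothetical Python approximation of that capture behavior, using closures in a single process. The run_on helper is made up; it just calls the function locally, where a real implementation would serialize the captured variables and execute the body on the remote side:

```python
def run_on(target, fn, **forwarded):
    # Stand-in for the "on" statement: a real system would pack 'forwarded'
    # plus fn's free variables and execute fn on 'target'. Here we just call it.
    return fn(**forwarded)


screens = {}  # client name -> lines "displayed" there


def user_typed_something(user, text, server_clients):
    # on (server) with origUser = user { ... }
    def at_server(origUser):
        for client in server_clients:
            # on (client) { displayOnScreen(origUser, text); }
            def at_client():
                # 'text' is captured automatically from the enclosing scope,
                # just like the compiler would forward it. 'origUser' had to
                # be bound explicitly because 'user' means something
                # different on the responding side.
                screens.setdefault(client, []).append(f"{origUser}: {text}")
            run_on(client, at_client)
    run_on("server", at_server, origUser=user)


user_typed_something("alice", "hi there", ["alice", "bob"])
```

The point of the sketch is that the closure machinery already knows which outer variables the inner body uses; the "with" clause exists for the one case where a name has to be rebound before crossing the boundary.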
The other thing that's still missing is information about where things are. Now we're ready to look at the declarations that pull all of this together; consider this prepended to the previous snippet.
role (client) {
    string user;
    connectionType(role=server) server;
    func userTypedSomething (string text);
    func displayOnScreen (string user, string text);
}

role (server) {
    connectionType(role=client)[] clientList;
}
Don't worry too much about the exact syntax; I'm not tied to it. The important point is that these declarations let us compile with "role=client" to get all of the functions and variables needed in a client, or with "role=server" to get the server functions/variables instead, without needing separate source and RPC-definition files. Also, note how the code that runs on the server in our first snippet doesn't need to be identified as such; the compiler can figure that out from the fact that it's within an "on" statement whose controlling variable is a connection to a server. Lastly, this syntax combines pretty well with fork/join constructs to provide transparent parallelism, e.g. between the server and multiple responders.
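As a toy illustration of the role idea (not the proposed syntax), here's a hypothetical Python decorator that keeps a function only if it belongs to the active role, roughly like building one source file with role=client versus role=server:

```python
ROLE = "client"  # imagine this being set at build/deploy time


def role(name):
    # Keep the function only in builds for the matching role;
    # everything else is stripped out of this role's "binary".
    def wrap(fn):
        return fn if name == ROLE else None
    return wrap


@role("client")
def user_typed_something(text):
    return f"send to server: {text}"


@role("client")
def display_on_screen(user, text):
    return f"{user}: {text}"


@role("server")
def send_new_message_handler(user, text):
    return f"fan out '{text}' from {user}"


# In a role=client build, only the client functions survive;
# send_new_message_handler is None here because it belongs to the server role.
```

A real compiler could do this statically, of course, and could also check that an "on" statement only calls functions that actually exist in the target's role.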
So what are the benefits of this approach?
- It's easier to write and modify. For example, imagine if you wanted to add a timestamp to each message. With classic RPC you'd have to make multiple edits across multiple files to update the definition and each handler. This way you can just add an argument to userTypedSomething and it would just "flow through" to the server and responders. Or consider the looping across connections that creates such "callback hell" with classic RPC. Here it's no more complex than the loops we're already used to.
- It's easier to test. Even though the code is written to run across multiple machines, it maps in a trivial and obvious way to a sequential and/or single-process form for testing without losing too much real-world fidelity.
- That same easy mapping also makes it easier for various kinds of tools to operate on the code. Static analyzers can detect leaks and dead code. Formal methods can be used to check for deadlock/livelock and various correctness conditions. This stuff is all super hard when code is "atomized" across many functions to work with traditional RPC. With a single function and explicit information about what lives where, it becomes a whole lot easier.
Clearly, there are a lot of details still to be worked out here. I'll admit that the "with" business is something I came up with as I was writing this, because I realized I needed it. There are undoubtedly other issues around variables, scopes, and automatic forwarding between them that still need some thought. There needs to be a "transport provider" abstraction to handle not only connection management and variable packing/unpacking but naming/discovery and identity issues as well. In fact, the entire security model is kind of TBD. However, I think this should give people some idea of how distributed-system programmers' lives could be made better, and of one model we could consider to get there.