When I was forced to leave the POSIX team at Facebook, I faced a fork in the road. One option was to stay with the domain and the people I knew within the storage organization, which included four other projects that were more-or-less close to what I'd been doing on Gluster. The other option was to stay with a style of development that I knew, even if it meant switching domains a bit. For example, Facebook does a lot of interesting things with MySQL. It's an older and more traditional kind of codebase, much like Gluster. (In fact, I had done a coding task for MySQL while I was in bootcamp, so I even had a little familiarity.) It's open source, and the team was very accustomed to collaborating remotely. By contrast, the projects within storage were all written in an extremely Facebook-specific style, and some of those teams were known to have struggled with cross-site work (let alone individual remote workers). I do wonder whether things might have gone better if I'd chosen differently, but I don't really think so. As it turns out, the first two people who tried to "recruit" me were both in storage, and I ended up in the "block pod" for something called Warm Storage.
Warning: there are some presentations out there about something called Warm Storage at Facebook, from 2014 or so. Don't read them. While there is some relationship, the thing I worked on bore almost no resemblance to the thing described in those presentations.
Obviously I can't say as much as I'd like about what Warm Storage is or how it does what it does, because those are trade secrets. I think I can safely say that it was designed to replace HDFS, at least initially copying a lot of HDFS's semantics and design. Fortunately, it has evolved a bit since then. Semantically, it's a bit more than an S3-style object store and a bit less than a real filesystem.
The other thing about Warm Storage is that it's big. Each of our production clusters was bigger than I've ever heard of any open-source solution supporting, and we had quite a few. It's the base layer for multiple other storage systems that together hold most of the bytes at Facebook, making it (in aggregate) one of the three or four largest storage systems in the world. That's actually pretty exciting. Here are some things that seemed interesting, compared to other systems I'd worked on.
- High levels of automation are essential. Both planned and unplanned events affect too many components to deal with them manually. Even "manual intervention" usually means an ad hoc script that gets sent out to a list of hosts for execution.
- A cluster is never in a perfect state. There are always some hosts that are down, for known or unknown reasons. "Health" is a statistical concept: the percentage of hosts that are up, which is always less than 100%.
- Similarly, such a system is never idle. There are always hosts going in or out or being upgraded, various kinds of background scrubbing or optimization, etc. Even without any user I/O at all, there's still plenty going on.
- With a system this big, even a rack is a single unit in larger power and network fault domains. This matters a lot when you're trying to figure out how to place data for maximum protection against faults, and sometimes for performance as well.
- "Bit rot" is a real issue at this scale. One of the subsystems I worked on was the one that constantly scans for such errors. They weren't frequent, especially after you factored out the instances that were actually harbingers of a disk going bad, but they were frequent enough to be worth the effort.
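The scrubbing idea in that last point is simple at its core: record a checksum when data is written, then periodically recompute it and flag any mismatch. This is only an illustrative sketch (the `scrub` function, CRC32 as the checksum, and the toy chunk store are all my own assumptions, not Warm Storage's actual design):

```python
import zlib

def scrub(chunks, checksums):
    """Recompute each chunk's checksum and return the IDs whose stored
    bytes no longer match what was recorded at write time."""
    return [cid for cid, data in chunks.items()
            if zlib.crc32(data) != checksums[cid]]

# Toy example: record checksums at "write time", then corrupt one
# chunk in place by flipping a single bit.
chunks = {"a": b"hello", "b": b"world"}
checksums = {cid: zlib.crc32(data) for cid, data in chunks.items()}
chunks["b"] = bytes([chunks["b"][0] ^ 1]) + chunks["b"][1:]

bad = scrub(chunks, checksums)  # flags chunk "b"
```

In a real system the interesting parts are the ones this sketch omits: pacing the scan so it doesn't compete with user I/O, and deciding whether a mismatch is isolated corruption or the first sign of a dying disk.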
There were two areas that I specialized in during my time with Warm Storage. One was testing, which is kind of funny to me: I'd started out in QA back in 1989, and in 2019 there I was, doing the modern equivalent. Without going into too much detail - because it's boring even to me, honestly - I spent a lot of time keeping tests healthy, plus adding new test-related features such as fault injection and ways to test time-dependent code.
The other area I worked on was data placement, especially rebalancing. Even initial data placement in a system this big is an interesting problem, mostly because you're not dealing with a single location per item. Rather, blocks are split up and erasure-coded across a set of locations, which brings in constraints like fault domains and copysets. Rebalancing is more complicated still. Those same constraints are harder to meet when you're dealing with a smaller set of source and destination disks, or when you're only moving one erasure-coded chunk constrained by the immutable locations of its peers. Then there are new constraints, such as avoiding the I/O imbalance that results from filling larger disks with hotter data. This is exactly the kind of problem I've always enjoyed - complex, with important practical implications, but also expressible in a fairly abstract form so you can try out different solutions.
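To make the fault-domain constraint concrete, here's a minimal sketch of initial placement: pick one disk per erasure-coded chunk so that no two chunks of the same block land in the same fault domain (say, the same rack). Everything here - the function name, the `(disk, rack)` representation, random selection - is a hypothetical simplification of mine, not how Warm Storage actually does it:

```python
import random
from collections import defaultdict

def place_chunks(disks, num_chunks, rng=random.Random(42)):
    """Pick one disk per chunk such that no two chunks of the same
    block share a fault domain.

    disks: list of (disk_id, fault_domain) pairs.
    Returns a list of num_chunks disk IDs, all in distinct domains.
    """
    by_domain = defaultdict(list)
    for disk_id, domain in disks:
        by_domain[domain].append(disk_id)
    if len(by_domain) < num_chunks:
        raise ValueError("not enough fault domains for this placement")
    # Spread chunks across distinct domains, then pick a disk in each.
    domains = rng.sample(sorted(by_domain), num_chunks)
    return [rng.choice(by_domain[d]) for d in domains]

disks = [("d1", "rackA"), ("d2", "rackA"),
         ("d3", "rackB"), ("d4", "rackC"), ("d5", "rackD")]
placement = place_chunks(disks, num_chunks=3)
```

Rebalancing is where this gets hard: when moving a single chunk, the other chunks of its block stay put, so the set of legal destinations shrinks to disks whose domains don't collide with any of the fixed peers - and real placement also has to weigh disk fullness and heat, which this sketch ignores entirely.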
So, with problems that I enjoyed and people I liked and great pay, why did I leave? Because even the best work is still work. For every hour I got to spend exercising creativity to solve new problems, there was another hour spent fighting with build/deployment systems, dealing with random infrastructure breakage, or sitting through meetings, code reviews, on-call shifts, performance reviews, and so on. Most of these aren't even particularly Facebook-specific (except for the issue of everything being broken all the time, which I'll get to some other time). They're necessary, and part of being a professional software developer. In the end, the desire to get away from the bad parts outweighed the desire to stay close to the good parts, and once I decided that I didn't need to keep doing this, I stopped.
Coming next: now that all the stories are told, I'll do a couple of wrap-up posts - probably one more technical and one less so. There are some patterns and lessons here. I also feel like I've already written a (short) book's worth of material, so maybe I'll add things like a foreword and a table of contents and even an index. I don't delude myself into thinking there's much of an audience for such a thing, but it might feel more complete that way.