Memoirs: CLaM part 1
October 09, 2020

In 1992, I went to interview at a little company in the Lechmere section of Cambridge, with the rather unconventional name of CLaM Associates. The odd capitalization is because it stood for Comeau, Linscott, and Miller. Les Comeau was an IBM old-timer, having been one of the authors of VM/370. (Fun fact: shortly after I started working there, I found out that my future in-laws had known him.) I think George Linscott, the only one of the three I ever really interacted with, had also been there. I never met or even saw Miller.
My interview process sort of set the tone for much of what happened later. In those days, casual wear was still the rule even for interviews at most companies, but the more conservative ones might still look askance. Based on what I'd been told about Clam (I'm going to abandon the weird capitalization from here on out), I decided to err on the side of caution and wear my interview suit for what I think turned out to be the last time. I showed up at the appointed time ... and nobody seemed to be there. "Hello?" I shouted into the empty space. After a while, a vision appeared. It was my first time meeting Mike, with whom I went on to work at two other companies. He was dressed in shorts, flip-flops, a ripped Dead Kennedys shirt, and a big feathery earring. I felt overdressed. I'll let Mike decide whether to tell the story of exactly why he had taken so long to come out and meet me. ;) In any case, I don't really remember how the interview itself went, but I do remember talking to George while he had his brand-new cowboy boots propped up on the desk. And the eels, which were actually Mike's but were in George's office for some reason.
Clam's big claim to fame was HACMP/6000, which provided high availability clustering for IBM's RS/6000 computers running AIX. You have almost certainly done business with a company that was using it, since customers included many banks and retail stores. It was a pretty classic active/passive failover kind of model, with multiple kinds of heartbeats (including a serial line) and disk fencing. Everyone turns up their noses at that kind of thing nowadays, but consider the time. I am writing these memoirs in part to preserve a tiny piece of history, after all. Some of the fundamental algorithms that underpin modern forms of high-scale data/service sharding didn't exist yet. For example, the papers about consistent hashing as we currently know it weren't published until 1996 (Thaler and Ravishankar's rendezvous hashing) or 1997 (Karger et al). This highlights what I think is a very important observation about computing over the long term.
The best ideas from one generation are the most likely to be taken for granted and treated as "obvious" by the next.
No, it wasn't obvious in 1992. In fact, three of us at Clam very nearly invented consistent hashing ourselves in 1993 or maybe 1994 - not quite, I'll admit, but close. Somebody smarter than any of us had to invent it so that it could be "obvious" today.
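For anyone who has only ever taken it for granted, here's roughly the shape of the idea: a minimal rendezvous-hashing sketch in C. The node names, keys, and the FNV-1a hash are purely illustrative - this is not something we built, and nothing like it shipped in HACMP.

```c
/* Minimal rendezvous (highest-random-weight) hashing sketch.
 * Each key goes to whichever node scores highest for it, so removing a
 * node only remaps the keys that node had "won" rather than reshuffling
 * everything. FNV-1a is used here purely for illustration. */
#include <stdio.h>
#include <stdint.h>

static uint64_t fnv1a(const char *s, uint64_t h)
{
    while (*s) {
        h ^= (uint8_t)*s++;
        h *= 1099511628211ULL;
    }
    return h;
}

static const char *pick_node(const char *key, const char **nodes, int n)
{
    const char *best = NULL;
    uint64_t best_w = 0;

    for (int i = 0; i < n; i++) {
        /* weight = hash of the (node, key) pair */
        uint64_t w = fnv1a(key, fnv1a(nodes[i], 14695981039346656037ULL));
        if (best == NULL || w > best_w) {
            best = nodes[i];
            best_w = w;
        }
    }
    return best;
}

int main(void)
{
    const char *nodes[] = { "nodeA", "nodeB", "nodeC" };
    const char *keys[]  = { "lock-17", "lock-42", "lock-99" };

    for (int i = 0; i < 3; i++)
        printf("%s -> %s\n", keys[i], pick_node(keys[i], nodes, 3));
    return 0;
}
```

The appealing property is right there in the loop: every key is scored independently against every node, so membership changes disturb only a proportional slice of the keys.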
My first project was to take our distributed lock manager (based on the semantics of the one from VMS, which had become kind of a standard) and port it from user space into the AIX kernel, for performance reasons. There was a lot of skepticism about whether this could even be done, or whether it would actually yield any performance benefits - not least from other quarters within IBM. (I'll probably talk more about IBM politics in my next post.) In fact, one rival had sent an email listing twelve reasons why this was a total waste of time. I printed it out and kept it over my desk as motivation. In the end yes, it was possible, and yes, it did provide significant performance benefits. I had to do a lot of tricky things to make it work, some of which were met with significant disfavor by the AIX developers, but nobody could really argue with the results.
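For those who never crossed paths with the VMS lock manager: its semantics revolve around six lock modes and a compatibility matrix between them. Here's a rough sketch of that matrix in C, written from memory for illustration rather than lifted from anything we shipped.

```c
/* Sketch of the classic VMS-style DLM lock-mode compatibility matrix,
 * from memory and for illustration only (not the actual HACMP code).
 * Modes: NL = null, CR = concurrent read, CW = concurrent write,
 *        PR = protected read, PW = protected write, EX = exclusive. */
#include <stdio.h>
#include <stdbool.h>

enum lock_mode { NL, CR, CW, PR, PW, EX };

static const bool compat[6][6] = {
    /*           NL     CR     CW     PR     PW     EX   */
    /* NL */ {  true,  true,  true,  true,  true,  true  },
    /* CR */ {  true,  true,  true,  true,  true,  false },
    /* CW */ {  true,  true,  true,  false, false, false },
    /* PR */ {  true,  true,  false, true,  false, false },
    /* PW */ {  true,  true,  false, false, false, false },
    /* EX */ {  true,  false, false, false, false, false },
};

/* A new request is grantable only if its mode is compatible with the
 * mode of every lock already granted on the same resource. */
static bool grantable(enum lock_mode requested, enum lock_mode granted)
{
    return compat[requested][granted];
}

int main(void)
{
    printf("PR vs CR: %s\n", grantable(PR, CR) ? "compatible" : "conflict");
    printf("EX vs CR: %s\n", grantable(EX, CR) ? "compatible" : "conflict");
    return 0;
}
```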
The other story having to do with that lock manager involves Oracle again. As part of our work, another engineer and I had to work at Oracle HQ for a while. Back then, it was just three green towers next to a huge mud flat. We got stuck in a dingy little lab with no windows. I remember there was an HP machine next to us that was in a boot loop and would beep every few minutes. It was annoying. After a while, I figured out how to tell it where its boot disk was, and it booted successfully. Blessed silence. Several minutes later, a couple of Oracle engineers walked in. "I don't know, it's been broken for months. No idea why it suddenly started working." I don't know if they actually noticed me - I was right there after all - but I never even bothered to look up.
But that's not the real story. The real story was that we were doing performance testing and something kept perturbing our results. There seemed to be a lot of I/O going on that wasn't ours. Digging in, we eventually found the culprit: a "tar" process making a copy of our source code. Correction: IBM's proprietary source code, since we were on contract to them. The terms of our collaboration specifically forbade them from making a copy, much as Encore had been forbidden from having a copy of Oracle code a couple of years before. I don't know exactly where the order to make that copy came from, but it was at least one level above the engineer we were working with. We had to do the rest of our work on systems physically isolated from the rest of Oracle's network, which was a pain indeed, but they'd made it necessary. This was the second incident that led to my special loathing for Oracle.
My next two projects were also HACMP-related: a communication layer for the next generation of that same lock manager, and a "cluster manager" (membership + heartbeats + failover coordination) to take HACMP all the way up to eight-node clusters. Again: that seems laughable today, but it was a pretty big deal back then. Getting a membership protocol to work reliably even for eight nodes isn't easy. This was the project that really sharpened both my appreciation of unit testing and my skills at it. I remember having a kind of mock communication layer that would drop or reorder messages on demand, and a test harness that would drive the membership protocol through all permutations of such events. This was also my first exposure to model checking in the form of SPIN. I didn't actually use it, since I was already doing the same sort of state-space exploration "for real" with my test harness, but it put the idea in my head and I'll come back to that later. In any case, the net result was that I was able to develop very high confidence in each successive generation of the membership protocol. Being able to swap out an old version for a new one just days before a release was a pretty novel experience back then.
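That harness is long gone, but the core trick of the mock communication layer is easy to sketch: instead of a real network, the send function consults a per-test schedule that says whether each message should be delivered, dropped, or held back to arrive late. Everything below - names, message strings, the schedule - is invented for illustration; the real thing was considerably more elaborate.

```c
/* Sketch of a mock communication layer for driving a membership
 * protocol through adverse message orderings. Illustration only. */
#include <stdio.h>
#include <string.h>

enum fault { DELIVER, DROP, DELAY };   /* what to do with the Nth send */

#define MAX_PENDING 64

struct mock_net {
    const enum fault *schedule;        /* per-test fault schedule */
    int sched_len;
    int sent;                          /* messages seen so far */
    char pending[MAX_PENDING][64];     /* delayed messages, delivered later */
    int npending;
};

/* In the real code this would feed the membership protocol's receive
 * path; here we just print what arrived. */
static void deliver(const char *msg)
{
    printf("deliver: %s\n", msg);
}

static void mock_send(struct mock_net *net, const char *msg)
{
    enum fault f = DELIVER;
    if (net->sent < net->sched_len)
        f = net->schedule[net->sent];
    net->sent++;

    switch (f) {
    case DROP:
        printf("drop:    %s\n", msg);
        break;
    case DELAY:                        /* hold it back to force reordering */
        if (net->npending < MAX_PENDING)
            strncpy(net->pending[net->npending++], msg, 63);
        break;
    case DELIVER:
        deliver(msg);
        break;
    }
}

/* Flush delayed messages, simulating late, out-of-order arrival. */
static void mock_flush(struct mock_net *net)
{
    for (int i = 0; i < net->npending; i++)
        deliver(net->pending[i]);
    net->npending = 0;
}

int main(void)
{
    enum fault schedule[] = { DELIVER, DELAY, DROP, DELIVER };
    struct mock_net net = { .schedule = schedule, .sched_len = 4 };

    mock_send(&net, "JOIN node2");
    mock_send(&net, "HEARTBEAT node1");   /* delayed */
    mock_send(&net, "HEARTBEAT node3");   /* dropped */
    mock_send(&net, "LEAVE node4");
    mock_flush(&net);                     /* delayed heartbeat arrives late */
    return 0;
}
```

A test runner then enumerates schedules and checks that the protocol converges on a single consistent membership view in every case.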
My friends from Clam would probably consider me remiss if I didn't also mention DGSP. One of the things a cluster manager has to do even at eight-node scale is detect "split brain" partitions - where two node subsets that can communicate internally but not with each other both start allocating and using resources independently. This usually involves a quorum rule, requiring that a strict majority of nodes be present before you actually start doing anything, and we had that. However, it was still possible to have two or more subsets that had each started the process of building a quorum and elected their own leaders. When the partition is resolved, quorum can only be reached by somehow joining those clusters into one or by "dissolving" all but one and having the individual nodes join the surviving cluster. We chose the second approach, and the message we used to initiate this process was called DGSP - Die, Gravy Sucking Pig.
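The quorum rule itself is the easy part - a strict-majority check is a one-liner, and the arithmetic is exactly why two disjoint subsets can never both qualify. A tiny sketch, with the names invented here:

```c
/* Strict-majority quorum check: a subset of the cluster may only start
 * activating resources if it contains more than half of the configured
 * nodes, so two disjoint subsets can never both pass this test. */
#include <stdbool.h>
#include <stdio.h>

static bool has_quorum(int nodes_reachable, int nodes_configured)
{
    return nodes_reachable > nodes_configured / 2;
}

int main(void)
{
    printf("5 of 8: %s\n", has_quorum(5, 8) ? "quorum" : "no quorum");
    printf("4 of 8: %s\n", has_quorum(4, 8) ? "quorum" : "no quorum");
    return 0;
}
```

The interesting part, as described above, was resolving the situation once separate leaders had already emerged - hence DGSP.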
As it turns out, somebody left some logging enabled in the release that was only meant to be active during testing. A customer noticed these "DGSP" messages in the log, at exactly the same times that things seemed to be a bit flaky, and asked about them. The support engineer came to my office and I explained what it meant. "I can't tell the customer that!" To his credit, between my office and his, he managed to come up with another believable backronym - "Diagnostic Group Shutdown: Partition." Clunky, but it worked.
Next up: the beginning of my SCSI adventures.