Memoirs: Last Days at EMC, and Revivio
October 19, 2020

After MPFS shipped at EMC, I left the project and started working in more of a research role again. My main project was a hierarchically distributed block cache, which I called C3D for no reason I can remember. A surprising amount of detail, even down to packet formats and such translated into legalese, is in the main patent associated with this project. My original vision was to use this as a global (as in geographically global) block layer, along with something like MPFS to manage metadata, ultimately yielding a global filesystem - a long-term goal for me at the time. The main source of interest within EMC was people who wanted to use it as a way to extend/offload a disk array's cache. These goals were not incompatible, so I kept chugging along.
This is also the period when I was attending the UC Berkeley retreats (where we discussed projects like Recovery Oriented Computing and OceanStore), getting involved with the P2P community, and blogging a lot.
Then the political warfare between the Israeli and Irish factions within EMC engineering came to Cambridge. My boss and his boss, the latter the director of the Cambridge organization and both friends, were considered part of the Israeli faction, having been put in those roles by the previously mentioned Erez Ofer and his boss Moshe Yanai. When the Irish took over, both of my friends were very suddenly let go. On the same day, I was put on notice that my own work - previously approved at least up through Erez - had to go through a new project-review process. At best this would have been a huge pain in my ass, but my examination of both the process and the people involved in it convinced me that the process was designed to stamp out the kind of innovation I was attempting. I was also rather annoyed at the way my friends had been treated, so less than 24 hours later I had already accepted an offer to work with some of my Clam friends (again) at what would become Revivio.
When I joined Revivio, it wasn't Revivio. It was Mariko, but that was about its fifth or sixth name; they were still in semi-stealth, and the name changes were a tactic to throw would-be competitors off the scent. My paychecks said Terastor. Revivio's technology was Continuous Data Protection (a term they coined and popularized before others figured out how to profit from it). Instead of relying on backups, CDP would let you go back to any point in time instantly - in effect an infinite number of recovery points and a recovery time objective of zero.
To provide this functionality, our appliance had to keep a running log of every I/O hitting the system. There are multiple ways to do this, of course. One way would be to have a fundamentally log-structured or write-anywhere authoritative copy of all writes, with indices for the current state and snapshots. Another would be to have a write-in-place authoritative copy, with any overwrites doing copy-on-write into a separate historical store (a rollback log, in DB terms). The log-structured approach would almost certainly have been better for performance, but Revivio chose the WIP/COW approach because it was considered extremely important to have the current copy be directly usable without an index. "You could unplug our system and have a usable current volume" was the goal.
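To make the tradeoff concrete, here is a minimal sketch of the write-in-place/copy-on-write idea: every overwrite saves the old block and a timestamp to a rollback log before writing in place, so the current copy stays directly usable without an index while any earlier point in time can still be reconstructed by replaying the log backwards. This is only my illustration of the scheme; the names and structure (CdpVolume, a Python list as the "log") are made up, not Revivio's actual implementation.

```python
import time

BLOCK_SIZE = 4096  # assumed block size for this sketch

class CdpVolume:
    """Toy model of a write-in-place volume with a copy-on-write rollback log."""

    def __init__(self, num_blocks):
        self.current = [bytes(BLOCK_SIZE) for _ in range(num_blocks)]
        self.history = []  # rollback log: (timestamp, block_no, old_data)

    def write(self, block_no, data, now=None):
        now = time.time() if now is None else now
        # Copy-on-write: preserve the old contents before overwriting in place.
        self.history.append((now, block_no, self.current[block_no]))
        self.current[block_no] = data

    def read_as_of(self, block_no, when):
        # Start from the current (authoritative) copy and undo, newest first,
        # every logged write to this block that happened after 'when'.
        data = self.current[block_no]
        for ts, blk, old in reversed(self.history):
            if ts <= when:
                break
            if blk == block_no:
                data = old
        return data
```

The point of the design shows up in the code: `current` is always a directly usable volume ("you could unplug our system and have a usable current volume"), and the history log is only consulted when you want to go back in time.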
Because time was so important in this system, we needed to have very tight time synchronization between the I/O nodes, and a surprising amount of our engineering effort was related to this. It's part of why we used both RTLinux and InfiniBand, so that the latency between nodes specifically for clock synchronization could be as low as possible. Having worked on SCSI for so long, I had always felt that the Fibre Channel specs were ridiculously over-complicated ... until I saw the InfiniBand specs. What a total over-engineered nightmare. It didn't help at all that there were multiple vendors pushing multiple APIs and subnet-manager solutions at the time. Nowadays that's all pretty well standardized, but back then it was the wild west and we had to contend with it. I got another patent for the time-synchronization stuff, which I'd had to take over rather forcibly (I was product architect so I could do that) after the original engineer botched it.
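The post doesn't describe the synchronization protocol itself, but the reason link latency mattered so much is the same as in classic NTP-style offset estimation: the error in the measured clock offset is bounded by the round-trip delay, so the lower the latency between nodes, the tighter the synchronization. A minimal sketch of that textbook calculation (my illustration, not Revivio's actual algorithm):

```python
def estimate_offset(send_local, recv_remote, reply_remote, recv_local):
    """Classic request/response clock-offset estimate.

    send_local  : local clock when the probe was sent
    recv_remote : remote clock when the probe arrived
    reply_remote: remote clock when the reply was sent
    recv_local  : local clock when the reply arrived

    Returns (offset, round_trip_delay). The offset error is bounded by
    half the round trip, which is why a low-latency interconnect makes
    the synchronization so much tighter.
    """
    delay = (recv_local - send_local) - (reply_remote - recv_remote)
    offset = ((recv_remote - send_local) + (reply_remote - recv_local)) / 2.0
    return offset, delay
```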
Here are some of my favorite challenges from that time.
- We had a bug where timeouts would stop working after 24.85 days. Why such an odd period? Because the Fibre Channel target driver we used (from Breakthrough Systems) used a 32-bit millisecond timer. Signed 32-bit. Do the math (or see it spelled out after this list).
- We had InfiniBand cards that would work fine for a while, then start to get flakier. Let's put that another way: they worked fine when cold, but not when hot. Our VP of Engineering (hi Rich) didn't generally get involved in debugging efforts, but after a few weeks he got out a magnifying glass and found this one.
- Just getting our devices recognized properly by hosts was often a challenge. For one thing, when we popped a new snapshot into existence we had to issue a particular kind of reset (Fibre Channel defines four) so that hosts would rescan. They'd do this by issuing Inquiry commands, which would return things like vendor ID and serial number. Unfortunately, the companies that contributed to the FC standards couldn't agree on things like format or encoding for these values. Instead of choosing one, they chose to support six formats and four encodings with extra bytes to say which were present. Writing the routine to determine "equality" across all these permutations was as much fun as modern i18n with multiple ways to represent accented characters and other language-specific lexicographic conventions.
- We were a very early Coverity customer, and I might have been the first person to use their extension API to define new "checkers" - in our case for lock-hierarchy violations which had been a constant source of pain. Licenses were expensive, so every time I fixed a bug that Coverity had found I added a log message identifying it. When renewal time came, this made it much easier to come up with an estimate of how much engineering time/expense we had avoided by fixing these bugs up front.
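For the curious, the arithmetic behind the first item's 24.85 days is simple to show:

```python
# A signed 32-bit millisecond counter overflows at 2^31 - 1 milliseconds.
MAX_SIGNED_32 = 2**31 - 1          # 2,147,483,647 ms
MS_PER_DAY = 1000 * 60 * 60 * 24   # 86,400,000 ms

print(MAX_SIGNED_32 / MS_PER_DAY)  # ~24.855 days until the timer wraps negative
```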
My favorite Revivio bug hunt, perhaps my favorite of all time, was the kernel stack-smashing bug. This actually went in three stages.
- At first, we just had random hangs. Not crashes, so we couldn't get crash dumps to look at. A host would just go unresponsive. I was sitting around thinking of the days when I'd been able to use in-circuit emulators and such for problems like these. If only we at least had a way to inspect the memory of a system even when it was hung ... but we did. InfiniBand required pre-registration of remote memory so it could do remote DMA. So I figured out how to register the memory containing the kernel and loaded modules. Then I wrote a tool to reach out from another node and grab all of that when a node hung, massaging it into crash-dump form (some interesting ELF work there).
- The few dozen crashes I collected this way were still hard to understand. Stack traces would work for a few frames, then turn into garbage. Sometimes they'd look like two traces mangled together. I repeatedly questioned whether my memory capture was really working properly, but finally I figured it out. Linux made the abysmal decision to put each process's fixed-size stack right above its task structure. Above. So if the stack overflowed, it would normally smash the task structure ... but then nothing would have made sense. What was actually happening was that one task's stack was overflowing by so much at one time that it was jumping over its own task structure and right into another task's stack. Hence the two traces mangled together. When the "victim" task unwound its stack far enough, it would jump into limbo, but the "culprit" task might be long gone by then.
- It wasn't hard to identify the two or three functions that were allocating more stack space than they should (one was multiples of 64KB when the stacks were only 8KB) and fix them ... but were there more? For the final act of this little play, I wrote a script to disassemble all of our code (including third party) and look for the stack-allocation code - something like the sketch after this list. I think I found about thirty functions that were allocating enough memory to cause this problem. This is why you don't let user-space weenies write kernel code.
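The original script from that last item is long gone, but the idea can be sketched: disassemble each object file with objdump and flag any function that reserves a suspiciously large stack frame in a single instruction. The x86 focus, the regex, and the 1KB threshold below are my assumptions for illustration, not details from the actual script.

```python
import re
import subprocess
import sys

# Anything reserving this many bytes of stack in one shot is suspicious
# when kernel stacks are only 8KB (threshold chosen for the sketch).
THRESHOLD = 1024

# Matches instructions like "sub $0x2000,%esp" in objdump's AT&T syntax.
SUB_SP = re.compile(r'\bsub\s+\$0x([0-9a-f]+),\s*%[er]sp')

def scan(objfile):
    disasm = subprocess.run(['objdump', '-d', objfile],
                            capture_output=True, text=True, check=True).stdout
    current_func = None
    for line in disasm.splitlines():
        if line.endswith('>:'):              # function header, e.g. "c0123456 <foo>:"
            current_func = line.split('<')[-1].rstrip('>:')
        m = SUB_SP.search(line)
        if m and int(m.group(1), 16) >= THRESHOLD:
            print(f'{objfile}: {current_func}: reserves {int(m.group(1), 16)} bytes')

if __name__ == '__main__':
    for path in sys.argv[1:]:
        scan(path)
```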
Unfortunately, all of these tales of legendary bug hunts probably hint at a darker truth: we were spending so much time debugging that we didn't have much left for things like performance, and we kept slipping schedules too. This didn't make things any easier for the business side, which had its own problems. For one thing, our VP of marketing insisted that we target the high-end high-margin part of the market first, which is just never a good idea for a startup and even less so in storage. This is the same guy who came up with our impossible-to-pronounce company name (try spelling out "Revivio" without sounding like Old McDonald) and who got us into trademark trouble with Citrix by sending his personal Christmas greetings out over our allegedly-infringing logo. What a tool. In the end, we couldn't generate any actual bottom-line revenue selling the product we had to the customers we were pursuing. Demo units, sure, but not revenue.
Eventually the VCs pulled the plug, and Revivio was sold in a fire sale to Symantec. This was less than ideal from my point of view, so I turned to another friend - this time not from a company I'd worked at but from the Appalachian Mountain Club - for my next gig.