Memoirs: SiCortex (background)
October 22, 2020

For a long time, especially before our daughter was born, my wife and I were fairly avid hikers. Sometimes we would hike by ourselves, sometimes with the Appalachian Mountain Club. As it turns out, one of our favorite AMC group leaders was an early employee at SiCortex. During my period of discontent at Revivio, we got to talking about exactly what he did, and that led to me interviewing there. Obviously that led to me working there, or else I wouldn't be writing this post. ;)
Note: this is going to be kind of long, so it's divided into two parts. This part is mostly background on the system. The next part will be more of a timeline of things that happened while I was there.
The SiCortex machines were interesting, homegrown from the silicon on up for radical energy efficiency. "Most performance per watt, per BTU, per square foot" yadda yadda. Each "Ice9" node, which ran its own instance of Linux, was a six-core CPU based on the MIPS R5000 because that was what we could afford to license. (I asked about this, and was told that alternatives including ARM were too expensive.) This had some decent floating-point performance, so for highly parallel HPC code it did pretty well, but from an integer point of view we're looking at a single-issue processor with a low clock speed and fairly limited cache. It kind of sucked for the kinds of code that exist in the kernel or in generic utilities. Often the first thing a customer would do was try to compile their code, which hit all of the processor's weak spots. They'd be unimpressed, even if their HPC code, once compiled, ran well. The Ice9 had been designed by a team that had previously worked on the DEC Alpha, which was very fast for its time. One of the ways the Alpha had achieved that performance was an extremely weak memory-ordering model, and the Ice9 inherited the weakest memory ordering of anything since, which gave us the opportunity to trip over several kernel bugs that had been latent since the Alpha days. Good times.
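To make that concrete, here's the classic publish/consume pattern that bites on weakly ordered CPUs: without release ordering on the producer side (and acquire on the consumer side), a reader can see the published pointer before it sees the data behind it. This is just my own user-space sketch using C11 atomics to stay self-contained; the kernel bugs in question involved the kernel's own barrier macros, not this exact code.

```c
/* Publish/consume on a weakly ordered CPU: a user-space illustration
 * using C11 atomics (not the actual kernel code involved). */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

struct msg { int payload; };

static struct msg slot;
static _Atomic(struct msg *) published;   /* starts out NULL */

static void *producer(void *arg)
{
    (void)arg;
    slot.payload = 42;
    /* Release store: the payload write above must become visible before
     * the pointer does.  With a relaxed store here, an Alpha-class memory
     * model can let the consumer see the pointer but stale payload. */
    atomic_store_explicit(&published, &slot, memory_order_release);
    return NULL;
}

static void *consumer(void *arg)
{
    (void)arg;
    struct msg *m;
    /* Acquire load pairs with the release store above. */
    while ((m = atomic_load_explicit(&published, memory_order_acquire)) == NULL)
        ;   /* spin until the pointer is published */
    printf("payload = %d\n", m->payload);   /* now guaranteed to be 42 */
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```

Compile with -pthread. On a strongly ordered x86 box you would likely never see the broken variant fail, which is exactly how bugs like these stay latent until Alpha-class hardware runs the code.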
In addition to the CPU cores, each node had its own memory logic (no separate support chipset like you'd find in Intel machines) and a direct connection to a fast internal network. This network was the real core of the SiCortex technology. Nodes were interconnected in a topology called a Kautz graph, which is hard to describe but has the interesting property that in a graph of degree k (i.e. each node has k incoming and k outgoing links; I never want to work on a system with unidirectional links again) there are guaranteed to be k completely separate paths between any two nodes. That's great for redundancy, and also for spreading load among those paths. All communication was source-routed through the nodes themselves, so again there were no separate chips for routing and so on. A board consisted mostly of 27 Ice9 chips plus memory for each, and only a few other things. The backplane was entirely passive. In fact, one of the most interesting mathematical problems in designing the system was coming up with a single 27-node board that could be plugged into different backplanes to make the diameter 4/5/6 machines we actually sold (108/324/972 nodes respectively). These boards were huge - about two feet square, and 27 layers IIRC. Here's a picture of the 5832-core "Blizzard", which was the largest version.
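If you want to check the arithmetic on those machine sizes: a Kautz graph of degree k and diameter d has k^d + k^(d-1) nodes, so a degree-3 fabric gives exactly the 108/324/972 figures above. Here's my own back-of-the-envelope version, not anything out of SiCortex's design docs:

```c
/* Node count of a degree-k, diameter-d Kautz graph: k^d + k^(d-1),
 * equivalently (k+1) * k^(d-1).  Just arithmetic, nothing proprietary. */
#include <stdio.h>

static long kautz_nodes(long k, int diameter)
{
    long kd_minus_1 = 1;                  /* k^(d-1) */
    for (int i = 1; i < diameter; i++)
        kd_minus_1 *= k;
    return kd_minus_1 * k + kd_minus_1;   /* k^d + k^(d-1) */
}

int main(void)
{
    for (int d = 4; d <= 6; d++)
        printf("degree 3, diameter %d: %ld nodes\n", d, kautz_nodes(3, d));
    /* Prints 108, 324, and 972: the three machine sizes. */
    return 0;
}
```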
A fun story about the Blizzard is that we had one in a small demo room near the front of the office (in DEC's old "Mill" building, which was pretty fascinating in its own right BTW). We'd put a dozen people in there, show them the gull-wing doors, run some code, let them feel the barely-warm air coming out of the top, etc. Then we'd point out that the people in the room were putting out more heat than the machine was. Had it been any other kind of hardware, the room would have been quite toasty.
So, what was my role in all of this? I ended up being responsible for the low-level networking code (it looked like Ethernet to the rest of the system), for the storage subsystem, which was based on Lustre, and for some related pieces of boot code. Almost all of the nodes were completely "headless" - just the CPU, memory, and internal network connection. To bring them up, this is what had to happen:
- A maintenance processor (one per board, running its own different version of Linux) would use each node's JTAG interface to load a minimal boot image into cache (not memory).
- The node would "register" its memory, determining what timings actually worked. Most people don't even know this is a thing because it's handled in hardware, but not on our system.
- With memory now available, the node would configure its network and use that to mount its real boot image from a node that had a connection to storage. PCI logic existed in every node, but the traces to an actual physical PCI slot were only on a few. These few I/O nodes therefore became servers for the rest.
- The node would "pivot" to the real boot image, and start running normal kinds of code.
The original plan had been to mount boot images via Lustre, but that turned out to be awful. What we actually ended up using was read-only NBD (Network Block Device) mounts for the root, and then Lustre for application data. Getting maintenance processors and I/O nodes and regular nodes to come up in sequence and plumb everything together was quite a dance.
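For flavor, here's the rough shape of that "pivot" step in terms of ordinary Linux primitives. This is only a sketch under some assumptions of mine: the NBD export has already been attached as /dev/nbd0, the root image happens to be ext3, and it was built with an empty /oldroot directory (it has to be baked in, since the mount is read-only). The real boot code differed in plenty of details.

```c
/* Sketch of "pivot to the real (read-only, NBD-backed) root" -- generic
 * Linux calls, not the actual SiCortex boot code.  Paths and filesystem
 * type are assumptions for illustration. */
#include <stdio.h>
#include <sys/mount.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
    /* Earlier boot code is assumed to have attached the server's export
     * as /dev/nbd0 (e.g. via nbd-client). */
    if (mount("/dev/nbd0", "/newroot", "ext3", MS_RDONLY, NULL) != 0) {
        perror("mount new root");
        return 1;
    }
    if (chdir("/newroot") != 0) {
        perror("chdir");
        return 1;
    }
    /* pivot_root(2) has no glibc wrapper, so invoke it directly.  The
     * "oldroot" directory must already exist in the (read-only) image. */
    if (syscall(SYS_pivot_root, ".", "oldroot") != 0) {
        perror("pivot_root");
        return 1;
    }
    if (chroot(".") != 0) {
        perror("chroot");
        return 1;
    }
    /* Hand off to the real init on the new root. */
    execl("/sbin/init", "init", (char *)NULL);
    perror("exec init");
    return 1;
}
```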
The problem with Lustre was that (at the time) it was based on a single metadata server (MDS) and many object servers. I don't know how anyone, even in those days, could design a "high performance" system in such a way and not be deeply embarrassed. In a typical deployment, the MDS would be the single most powerful system in the complex to avoid it becoming a bottleneck. Unfortunately, on our system the MDS was physically identical to the thousand clients. It also didn't have the "poor man's flow control" of a slow network. Did I mention that the SiCortex network was fast? Like Dolphin, they were well ahead of their time. Even when we were planning the second generation, where the nodes were estimated to be about 6x as powerful, there were no plans to upgrade the network because it would still be fast enough.
So, we had a wimpy MDS very exposed to this fast network. Requests could come in much faster than the MDS could actually handle them, and would immediately get deep into the code instead of being "held up at the gate" by the network layer. Crashes were not unknown, but the most frequent failure scenario was that the first wave of requests would time out and be retried, adding yet more requests to the queue. After a few iterations of that the MDS would simply lock up, mostly from failure to allocate memory. One of the lessons I took from this, besides the need for proper admission control and "back-pressure" on senders, was that systems which rely on multiple inter-related timeouts at different points in the system can't be trusted under load.
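Here's a toy model of that retry spiral, just to show its shape. None of this is Lustre code and all the rates are invented: the server retires requests at a fixed rate, every still-waiting client resends on each timeout, and the duplicates land in the same queue behind everything else. Even with the generous assumption that every serviced request satisfies a distinct client, the queue balloons well past the original demand; in real life the duplicates also burned service time and memory, which is what actually tipped the MDS over.

```c
/* Discrete-time toy model of a timeout-driven retry storm.  All numbers
 * are invented for illustration; this is not Lustre. */
#include <stdio.h>

int main(void)
{
    const long clients         = 972;  /* one initial request per node    */
    const long served_per_tick = 20;   /* what the overloaded MDS retires */
    const int  timeout_ticks   = 10;   /* clients resend after this long  */

    long satisfied = 0;        /* clients whose request has been served   */
    long queue     = clients;  /* the initial burst                       */

    for (int tick = 1; tick <= 60; tick++) {
        long work = queue < served_per_tick ? queue : served_per_tick;
        queue -= work;
        /* Generous assumption: every serviced request satisfies a new
         * client.  In reality many are duplicates, so it's worse.        */
        satisfied += work;
        if (satisfied > clients)
            satisfied = clients;

        /* On each timeout boundary, everyone still waiting resends.      */
        if (tick % timeout_ticks == 0) {
            queue += clients - satisfied;
            printf("tick %2d: %ld clients waiting, queue depth %ld\n",
                   tick, clients - satisfied, queue);
        }
    }
    return 0;
}
```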
These kinds of MDS-overload problems were typically only seen at the very largest deployments, such as Livermore or Oak Ridge, and those were the only places they could be debugged. The Lustre team (which kept bouncing between different companies) didn't have enough equipment of their own to debug this sort of thing, so they had to borrow customers' setups. On the one hand, I kind of resented the fact that my taxpayer dollars were being used to fund their development. On the other hand, if that meant fewer bugs afflicting me in my day job, I might have been all for it. In any case, we saw these "large system" problems constantly on our smaller system due to the specific nature of our hardware, and fighting them filled most of my days.
Next: Colorado, the spooks, the oil patch, and disaster