Memoirs: SiCortex part 2
personal experiences
October 23, 2020

After I'd been at SiCortex a while, we actually started trying to sell systems to customers. Crazy idea, I know. One of the first systems (the first?) was installed at Argonne National Laboratory, in Illinois. While I wasn't involved in the actual install, I ended up going there later because they wanted to try using PVFS2 instead of Lustre. So it was that I flew into Chicago just before an ice storm and drove to my hotel near the site. Traffic was crawling, and sometimes stopped entirely, even on the freeway. That's also when I discovered that someone at Enterprise rental cars had filled the wiper-fluid reservoir with plain water, which froze as soon as it hit the windshield. The only way I could maintain visibility was to blow maximum air at maximum temperature onto the windshield, and keep the side windows open so I didn't roast to death. This is why I'll never rent from Enterprise again.
Conditions were even worse on the second day of my visit, a true blizzard, which is kind of ironic since that was also the name of the system I was working on. I was seriously worried that I wouldn't be able to get from Argonne to my hotel a couple of towns over, but I did - barely. In any case, I got to meet a lot of really cool people and learn about PVFS2, and then it was time to head home. First I stopped to visit family in Michigan - and got hit by the same storm a second time. Then I finally went home - and got hit by the same storm a third time. I was just barely ahead of it this time, and the only way a cab would go was if a bunch of us shared a ride. I ended up being the last drop-off out of four. The cabbie was crazy, zooming down unplowed side streets and so on. We were all terrified. Ah, the joys of business travel.
The other "fun" customer for me personally was the Laboratory for Atmospheric and Space Physics, at CU-Boulder. They were using one of our machines primarily to study climate-change stuff. We had a bunch of customers or potential customers doing cool stuff that I was glad to be associated with, like climate science or computational biology. We also had some who sparked a bit less joy, but I'll get to those. In any case, LASP had a couple of interesting problems. One was a particular piece of bad Fortran code, which had hundreds of processes come off an MPI barrier in the same microsecond and all try to create/truncate the same log file. This caused massive contention in the MDS, which - as I mentioned in my last post - was already prone to falling over. So yes, I got to do a tiny bit of work in Fortran to fix it.
The more interesting problem at LASP was the brownouts. The system kept failing with link errors on Friday afternoons, because that's when the scientists who had slacked off the rest of the week would load up the system trying to get something to put on their status reports. I got sent out there not because these failures seemed to be in my domain, but because I was available and considered minimally competent at getting information out of the system. It was the first time I actually got to see Boulder, by the way. I'd been out there a couple of times in the MPFS days, but all I really got to see on those occasions was the inside of a hotel conference room. The first day this time, we ran a bunch of tests to little avail, then decided to call it a day, so we set up a job to run overnight and left. Minutes later, as I was walking around looking for somewhere to have dinner, things started failing - but it was too late to do anything until the next morning. Next day, same thing. Grrr.
Our first theory, this being Boulder, was that it had something to do with memory errors. Cosmic rays are a real thing at that altitude, to the extent that (as rumor had it) one of the big installations at either NCAR or UCAR had been seriously delayed until they figured out how to deal with them. We had ECC memory, and spent some time fiddling with EDAC (Error Detection And Correction) driver settings to control how often we scrubbed memory, so that single-bit errors could be corrected before they turned into uncorrectable ones (there's a sketch of that kind of tuning after the list below). All to no avail. That wasn't it. The problem was altitude-related, but far more insidious:
- At altitude, the air is less dense. Therefore, it can carry less heat.
- Power supplies become less efficient when they're hotter.
- When the power supply gets less efficient, it supplies just slightly less current than it should to the other parts of the system.
- For whatever reason, the first "other part of the system" to be affected was the part that drove the network links between nodes.
- Links would start failing, and the failure of a node's outgoing link would cause backups on the same node's incoming links, eventually spreading throughout the whole system because of the way we routed through the nodes themselves.
- Nobody could talk to anyone, and the system was no longer functional.
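A side note on that first theory, for anyone who wants to try the same knob: on a present-day Linux kernel the scrub rate is exposed through the EDAC sysfs tree. This is only a hedged sketch - the path and the bytes-per-second unit are what current mainline documents, not necessarily what our hardware exposed back then, and plenty of memory-controller drivers don't support the attribute at all.

```c
/* Hedged sketch: raise the EDAC background scrub rate so single-bit
 * errors get corrected before a second flip makes them uncorrectable.
 * The sysfs path and units are assumptions about modern mainline Linux. */
#include <stdio.h>

int main(void)
{
    const char *attr = "/sys/devices/system/edac/mc/mc0/sdram_scrub_rate";
    FILE *f = fopen(attr, "w");

    if (f == NULL) {
        perror("open sdram_scrub_rate");  /* driver may not expose it */
        return 1;
    }

    /* Request roughly 1 MB/s of background scrubbing (bytes/sec). */
    fprintf(f, "%d\n", 1000000);
    fclose(f);
    return 0;
}
```

Anyway, as I said, that knob wasn't the answer here.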
I honestly had very little to do with debugging this problem, beyond taking some measurements at others' direction. Kudos to the home crew (hi Kem!) for that. The fix turned out to be a one-line change to a fan configuration file.
So, what about those other customers I mentioned? One was NSA. They were actually a really good customer in a couple of ways: they paid cash because they knew the machines were never coming back out (no lengthy demo/trial periods), and they rarely called for support. We were also working with the "defense" side, who were kind of the good guys within NSA, but I still wasn't totally thrilled. I also remember coming in at the tail end of another meeting where someone was saying "they're like NSA's shadier cousin", but I never found out who they were talking about. The other market we were trying to break into was the "oil patch" - some terrestrial oil/gas exploration, but mostly marine, for an interesting reason. They wanted to put one of our machines on the ships they sent out to do initial soundings, so they could do analyses in situ instead of having to move data home. That kind of energy and space efficiency appealed to some surprising groups for surprising reasons. We had another feeler from Western Australia, where they wanted compute with very low emissions because standard equipment would interfere with listening for signals from space. I don't think that one ever got sold, though.
The story of SiCortex's demise is one of the saddest of all the companies I worked at. We were planning a second generation, which was going to be awesome. Doing all that hardware in-house was going to be expensive, so we went to get more funding. In 2008. Those being rather bad times economically, standard sources of funding weren't available. We tried to get funding from non-traditional sources instead, including our customers. As I understand it, there was enough money out there to meet our goals, but no single entity would commit to leading a round - and that's the way these things have to work - so everything fell through. Then some previous debt came due, we literally ran out of money, and boom. It seemed very unfair to a lot of people who had set out to do something very ambitious, who had succeeded in solving some very hard technical problems, who had done everything they were supposed to, and who had merely fallen victim to bad timing. All this while bullshit no-new-technology, no-real-revenue internet or "big data" companies continued to attract ten times more money than we had ever asked for. Bitter? You bet. That was how I ended up actually unemployed for the first time since Michigan two decades earlier, and ultimately employed again at Red Hat.