Memoirs: CLaM part 2
October 12, 2020
In early 1994, George (the CEO at Clam) came to me with a problem. IBM was coming out with a new dual-controller disk array, which eventually hit the market as the "Raidiant" 7135/110. They needed someone to write an AIX multi-path driver for it, and had asked us if we could do it. In all honesty we couldn't, since it involved SCSI expertise that we didn't have, but George didn't get his "go for it" nickname by letting such obstacles get in his way. He asked me if I could become the in-house SCSI expert, and I said yes. There might be more stories around that driver (eventually called REACT) than about any other single project in my career.
- SCSI at that time came in multiple flavors. Vanilla SCSI-1 was 5MB/s, "fast" SCSI-2 was 10MB/s, and "wide" SCSI-2 was 20MB/s. We were using then-very-new wide SCSI, with thick 68-pin parallel cables. These cables were quite hard to find, and were so heavy that strain at the connector would cause them to wear out quickly. We went through many dozens of bad cables - enough that the expense was a real issue. I became quite good at recognizing cable problems in amongst all the other things we were testing, but not before they'd cost us a lot of development time.
- Then-current SCSI also required explicit ID assignment and proper cable termination, which created other classes of bugs to slow development.
- Since this was pre-release hardware being kept very secret, I had to work for many months alone in a windowless room to which only George and I had keys. With all of the equipment in there, it got hot. I was really tempted to work shirtless, because nobody would have noticed, but never actually did.
- Late in the project, I came out of my cave to get a soda or something, and ran into an employee I'd never met before. "You must be the new guy" they said. Considering that I'd been there longer than ~80% of employees at that point, I found that both amusing and disturbing.
- The project was run out of IBM in San Jose, CA. The IBM disk-drive people were in Rochester, MN. The RAID controller was an ADP-93 from NCR (later spun out as Symbios, then folded into LSI) in Wichita, KS. The whole thing was put together by IBM's "Adstar" division in Havant (near Portsmouth), England. IBM's workstation division (in Austin, TX) was also involved. All too often, tiny little Clam got stuck in the middle playing referee between all of these larger entities.
Then there's the whole "pen in the fan" thing. To convey the full horror of this I'll need to go pretty deep on SCSI minutiae, but I'll try to do it in such a way that you can skip to the next paragraph if that's too much. SCSI then, like USB today and probably always, was very asymmetric. You had "initiators" (computers) driving requests to "targets" (devices) and never the other way around. There was no good way for a target to signal a change in its condition, such as a hardware fault. AEN (Asynchronous Event Notification) was in the specs, but I never saw an adapter or driver that supported it. Instead, if a fault occurred, the target would set a contingent allegiance condition on each affected ITL nexus - Initiator/Target/LUN, where LUN is Logical Unit Number and was colloquially (if a bit incorrectly) used to mean what most people would call a volume. Did I mention that SCSI had a lot of crazy jargon? When any command was issued to an object with a contingent allegiance condition, it would get back a check condition status and would have to issue a request sense command to figure out what's wrong before continuing. Another thing I got really good at was interpreting those sense codes. Then, in our case, we had to issue some more commands to the previously passive controller to take over the LUNs that were no longer accessible on its partner - a mode sense, flip some bits, send it back as a mode select. This process could take up to a minute, during which time we had to keep upper layers from blowing their timeouts (typically 30s) and retry counts lest we commit the unspeakable sin of actually failing a user's I/O request. There were also details about restarting or aborting queues, queue tags, etc. As I'm sure you can tell, this all made for a pretty complex state machine.
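For anyone who thinks better in code, here is a minimal Python sketch of the happy-path recovery flow just described: check condition, request sense, then takeover on the partner controller. It is an illustration only, not REACT's actual code. `FakeController`, `submit`, and the command strings are hypothetical names invented for the sketch, and the timeout juggling and queue handling are left out entirely. The status and sense-key values are the standard SCSI ones.

```python
from dataclasses import dataclass, field

GOOD = 0x00             # SCSI status byte: command completed normally
CHECK_CONDITION = 0x02  # SCSI status byte: sense data is waiting
NO_SENSE = 0x0          # SCSI sense key: nothing to report
HARDWARE_ERROR = 0x4    # SCSI sense key: unrecoverable hardware fault


@dataclass
class FakeController:
    """Hypothetical stand-in for one of the array's two controllers."""
    name: str
    faulted: bool = False
    log: list = field(default_factory=list)

    def execute(self, lun, command):
        self.log.append(command)
        return CHECK_CONDITION if self.faulted else GOOD

    def request_sense(self, lun):
        self.log.append("REQUEST SENSE")
        return HARDWARE_ERROR if self.faulted else NO_SENSE

    def take_over(self, lun):
        # Read the mode page, flip the ownership bits, write it back.
        self.log += ["MODE SENSE", "MODE SELECT"]


def submit(lun, command, active, passive):
    """Run one command; on a hardware fault, fail over and reissue it."""
    if active.execute(lun, command) == GOOD:
        return "ok", active
    # A check condition has to be cleared with REQUEST SENSE before
    # anything else can happen on this initiator/target/LUN nexus.
    if active.request_sense(lun) != HARDWARE_ERROR:
        return "retry", active
    passive.take_over(lun)
    status = passive.execute(lun, command)
    return ("ok" if status == GOOD else "failed"), passive


primary = FakeController("A", faulted=True)
partner = FakeController("B")
result, now_active = submit(0, "READ", primary, partner)
```

In this toy run the read ends up on controller B after one REQUEST SENSE and one MODE SENSE/MODE SELECT pair. The real state machine also had to cope with faults arriving mid-takeover, which is where most of the complexity came from.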
The "pen in the fan" problem was the result of some miscommunication between the people who designed the hardware and the firmware on the ADP-93. System-wide faults (e.g. power or temperature) were reported on eight wires, then multiplexed down to three wires before the firmware saw them. But the firmware authors had read the eight-wire non-multiplexed version of the hardware spec, and treated each wire as a separate fault indicator. This meant that a fan fault with the code 0x7 appeared to them as three separate faults - none of them properly identified as the fan BTW. Then these same three faults were reported on every LUN, creating a huge storm of apparent faults from one actual event. That's why I spent month after month sitting in a hot windowless room, sticking pens in fans (sometimes pencils for variety) and waiting to see if all of the software involved would settle into a reasonable state.
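The arithmetic of that storm is easy to show in a few lines of Python. This is a sketch under stated assumptions: the function names are made up, the idea that code 0 means "no fault" is my guess, and the eight-LUN count is arbitrary. Only the 0x7 fan code and the three-wire encoding come from the story above.

```python
FAN_FAULT_CODE = 0x7  # from the story: the fan fault arrived as code 0x7


def decode_as_code(wires):
    """Intended reading: the three wires together carry one fault code."""
    code = wires & 0x7
    # Assumption: code 0 means "no fault".
    return [] if code == 0 else [f"fault code {code:#x}"]


def decode_as_flags(wires):
    """The firmware's reading: each wire is its own fault indicator."""
    return [f"fault on wire {bit}" for bit in range(3) if wires & (1 << bit)]


def reports(decoder, wires, lun_count):
    """Every decoded fault gets reported once on every LUN."""
    return [(lun, fault)
            for lun in range(lun_count)
            for fault in decoder(wires)]


LUNS = 8  # hypothetical LUN count, picked for illustration
intended = reports(decode_as_code, FAN_FAULT_CODE, LUNS)
observed = reports(decode_as_flags, FAN_FAULT_CODE, LUNS)
print(len(intended), len(observed))  # prints "8 24"
```

One pen, one fan, and the host sees three times as many fault reports as it should, none of which says "fan".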
That brings me to the Havant trip. It was my first significant international job-related travel, and I guess I made the most of it. Some more stories...
- Our favorite restaurant was called "Fist Full of Tacos" - yes, this was in England. Don't ask.
- On our first night at FFoT, for some reason, I established a reputation as a hard drinker. I remember a strong cider, then a chili-infused beer, then a jello shot. References to this were made for the rest of the trip.
- One night, we stayed out too late and got locked out of our rather old-fashioned hotel. I told the others to wait, while I went around the back and picked a lock so I could come and open the front door. They were aghast. My reply? "I wasn't always a cop." My reputation either rose or fell another notch, never could tell which.
- Near our work area was a testing robot, designed to move over the backplane of a disk drive and deliver small electric shocks. It was a pretty impressive level of testing TBH. The robot was surrounded by caution tape, because there was a bug that would sometimes cause its arm to shoot out to its full extent very suddenly. This had apparently caused at least one injury already.
After a couple of weeks spent exercising all of that fancy error-handling and failover code, I learned my most important lesson of the trip. It came from one of the IBM engineers, who I can remember had lived in Idaho and put himself through college making lava lamps, but unfortunately I can't remember his name. Anyway, we'd reached the point where we could reliably recover from the pen faults and everything else we could think of, but sometimes the system would flap around for five minutes or so to get there. I thought that was pretty impressive. He took a different view.
It's better for a system to fail quickly and cleanly than to persist in an unknown state for any length of time, even if it eventually recovers.
He was right. The decision was made to attempt one controller failover in response to a fault, and if that failed - as much as it galled me - allow I/O to fail. That got us to the product release, where we were congratulated on not only having the world's highest storage density (a whole 135GB in approximately a two-foot cube) but also having a solid availability story to go with it. It was a pretty big victory, and a harbinger of many things to come.
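The policy we ended up with fits in a few lines. Here is a minimal Python sketch, assuming each controller path can be modeled as a callable that either returns a result or raises. `PathFault`, `IOFailed`, and `run_io` are hypothetical names for illustration, not anything from the real driver.

```python
class PathFault(Exception):
    """Hypothetical: a controller path reported a fault."""


class IOFailed(Exception):
    """Hypothetical: the request is handed back to the caller as failed."""


def run_io(request, active, standby):
    """Try the active path, allow exactly one failover, then fail cleanly."""
    try:
        return active(request)
    except PathFault:
        pass  # fall through to the single failover attempt
    try:
        return standby(request)
    except PathFault as exc:
        # No second failover and no flapping back to the first path.
        raise IOFailed(f"{request}: failover attempt also faulted") from exc


def good_path(request):
    return f"done: {request}"


def bad_path(request):
    raise PathFault(request)


recovered = run_io("read block 7", bad_path, good_path)
```

The point of the design is what is missing: there is no loop. Two faulted paths produce an `IOFailed` right away instead of five minutes of flapping.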