Memoirs: MPFS at EMC
October 16, 2020

The MPFS project at EMC wasn't quite the most "impactful" of my career (that was probably REACT/ATF) and it didn't quite gain me the most recognition/notoriety (that was probably Gluster) but it might have been second in both regards. It was also one of the most unpleasant projects of my career, mostly because it was so deeply infected with EMC politics.
The Network Storage Group at EMC was actually the successor to the "Calaveras" group at DEC, with the latter name still quite visible in documentation and code. Their product was the Celerra, an NFS and SMB/CIFS server consisting of up to 14 "data movers" and up to two control nodes. The data movers ran their own semi-realtime operating system called DART. This all added up to a pretty capable server cluster, significantly more powerful than anything NetApp or others had at the time. In fact, the Celerra folks claimed that they shipped more capacity than NetApp, though in fewer separate sales. I'm not sure that was true, but that's what they claimed. This being EMC, the actual disks weren't in the data movers but in a separate disk array - Symmetrix for certain, and I think later Clariion (which EMC had acquired by then) as well.
The fundamental problem MPFS was supposed to solve, from NSG's point of view, was that they'd chosen a very simplistic model of which data mover would "own" which pieces of data. There was no provision for fine-grained ownership transfers according to usage, but only for whole-node failover. If a request came in to one data mover for data owned by another, the data had to be forwarded over a private back-end network. Even with the fastest back-end network they could support (made more difficult by the decision to build their own OS requiring its own drivers), this was a serious bottleneck. They needed to move that traffic somewhere else.
In those days, having "somewhere else" be the same SAN (Storage Area Network) that already connected the Celerra to one or more disk arrays was an appealing choice. Clients could also connect to that SAN, and Fibre Channel at the time was at least twice as fast as Ethernet. Clients could use that network for bulk data transfers while still relying on the Ethernet/IP network for metadata and coordination. Thus was MPFS born, and I still maintain that it was a good choice at the time even though it probably seems crazy now.
Originally we called it pNFS (p = parallel), which is funny because that's the name under which some of that work eventually survived as part of NFS standards. MPFS could have stood for Multi Path, Multi Platform, or Multi Protocol - we didn't care. With that concept in mind, we started work. I was working on a Solaris/NFS client (2.6, later 7 and then 8). Another person worked on a Windows/CIFS client (at the time people still yelled at you for saying "SMB" even though they've gone back to that terminology now). Several people were working on the DART server. One of my biggest frustrations was that NSG controlled the project throughout, with their own selfish "tech lead" constantly playing politics to push work from the over-resourced server group in Hopkinton to the under-resourced client group in Cambridge.
The way MPFS worked (at least on Solaris) was that the system would mostly use the built-in NFS client, except for reads and writes. For those, the client would get locations of file blocks from the server using a protocol we developed called FMP (File Mapping Protocol), and then use that information to transfer data directly to/from storage over the separate SAN. Along with locations, clients would also get leases on that data, so they could cache the mappings instead of having to re-fetch them with every request. As should be the case in any caching system, there were also provisions for invalidating cached information. These caches and leases, including various failure cases, were the focus of most of our design effort.
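To make the shape of that read path concrete, here's a rough sketch in Python rather than the Solaris kernel C it actually was. The class, message, and field names here are my own illustrative stand-ins, not the real FMP messages or the RFC 5663 layout:

```python
# Hypothetical sketch of an FMP-style client read path: cache block maps
# under leases, read the blocks directly over the SAN, and drop the cache
# when the server recalls a lease. Names and fields are illustrative only.
import time
from dataclasses import dataclass

@dataclass
class Extent:
    file_offset: int   # byte offset within the file
    lun: str           # which SAN device holds the blocks
    lba: int           # starting block address on that device
    length: int        # extent length in bytes

@dataclass
class Lease:
    extents: list      # list of Extent for (part of) the file
    expires: float     # must re-fetch the mapping after this time

class FmpClient:
    """Client side: metadata over IP, bulk data directly over the SAN."""

    def __init__(self, metadata_server, san):
        self.mds = metadata_server   # answers get_map(), may recall leases
        self.san = san               # block-level access over Fibre Channel
        self.cache = {}              # file handle -> Lease

    def read(self, fh, offset, count):
        lease = self.cache.get(fh)
        if lease is None or time.time() >= lease.expires:
            # Cache miss or expired lease: ask the server for a fresh map.
            lease = self.mds.get_map(fh)
            self.cache[fh] = lease
        for ext in lease.extents:
            if ext.file_offset <= offset < ext.file_offset + ext.length:
                # Bulk data moves over the SAN, not the NFS/IP network.
                skip = offset - ext.file_offset
                return self.san.read(ext.lun, ext.lba, skip, count)
        return b""

    def recall(self, fh):
        # Server-initiated invalidation: forget the cached map so the next
        # read has to go back to the metadata server for a new lease.
        self.cache.pop(fh, None)
```

The interesting (and hard) part isn't visible in a sketch like this: it's what happens when the lease expires or is recalled while I/O is in flight, or when the client, server, or SAN path fails partway through. That's where most of the design arguments went.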
In the end, we did manage to turn the idea into a product. It initially went out with the name HighRoad, which was totally new and not well liked by anyone who had actually done the work. They changed it back to MPFS after I'd left. The product didn't do well. Partly this was because few customers needed the extra performance enough to buy extra Fibre Channel HBAs for clients and deal with the extra complexity of connecting them to a SAN. They were OK with the Celerra as it already was. Also, the EMC sales team wanted nothing to do with it, because the amount they personally would make from selling an MPFS license made it less worthwhile than continuing to focus on Symmetrix and PowerPath and so on. The low commissions were, of course, a political decision just as they had been for Symmetrix vs. Clariion.
The part of MPFS that lived on was the paper trail. That includes my first patent, on which I was named fifth for an idea the first author had opposed until I beat him over the head with facts. It also includes the pNFS block layout (RFC 5663) and some other parts of pNFS, which were directly based on the FMP spec that I had compiled during discussions at EMC. I still have the earlier drafts, with commentary color-coded to represent each participant. Maybe I'll post them some day. In those, you can see a lot of the ideas and terminology around "holes" and leases and "grace periods" being developed. Amusingly, somewhere in the process of standardization they managed to reintroduce some bugs that we had worked very hard to avoid or solve in OG FMP, so the protocol described in the RFC is not even correct.