Anti Patterns: Parallel Arrays
December 11, 2018I've worked on a lot of bad code in my life. Yes, some of it was code I wrote. Some of it was code that was bad the day it was first written. More often, it was code that became bad as it evolved to fix bugs, improve performance, add features, or otherwise adapt to a changing world. As code changes, technical debt accumulates. I know other people have written about patterns and anti-patterns, but I'm going to write about some that have seemed particularly prevalent in my own experience.
The "parallel structure" pattern is very common, very easy to get into, very burdensome to future programmers, and very hard to undo. I first became consciously aware of it when working with the AFR module in Gluster, but that doesn't mean I hadn't seen it before. AFR's job is to handle replication of data, so I'll use that model to illustrate the pattern. Imagine that you're working on code that deals with bringing N replicas of a particular block/file/whatever into sync with each other. Inside this code, you have the following data structures.
- An N-element array of locations for the copies (e.g. server names/IDs/addresses)
- An N-element array of states for the copies (e.g. OK, missing, corrupt)
- An N-element array of data buffers for data fetched from each copy
- An N-element array of states for the buffers
Are you starting to see the pattern here? Are you starting to see the problem? How do you comprehend the state of a single replica? Only by looking at data in multiple arrays. This often happens as code evolves within a single function, with each iteration adding a new array to satisfy a new need instead of changing the type of an existing array. That's more convenient when it's all a single function, but then that single function gets way too big so it's refactored into a little cluster of functions that each have some different subset of these arrays (plus a count) as parameters, and it's a total mess. It's like if you were boarding N families onto a cruise ship but, instead of keeping the families together you put all of the mothers in one line and the fathers in another line, and boys and girls into more separate lines. Then somehow each split-up family has to reach the same cabin/stateroom. And yes, it gets even crazier when you have to deal with single parents and same-sex parents and children who don't fit neatly into your boy/girl dichotomy. It's a totally stupid way to do things, but I've seen the equivalent of all this in many codebases.
Getting back to code, the problems with this structure are two-fold. One issue is what I call the "difficulty of traversal" problem, which probably deserves its own post because it occurs in other ways as well. If you have a piece of information X, how do you get to related piece of information Y? What if you only have the array of locations, and you discover that you also need the array of buffers? Add a parameter. What if you only have one of those locations, and you need the corresponding buffer? Add two parameters - the buffer array and the original index. Whoever broke up the original monster function probably didn't pass every array to every smaller function, because nobody likes huge parameter lists. Or maybe a new function was added, and took only the parameters it originally needed. Either way, as the code continues to change you're back to repeating the original proliferation of separate arrays, only in N functions instead of just one and in the parameter lists as well. Then different people start using different names for each array parameter in each function. The code gets more and more confusing. Changing code gets harder and harder.
Sooner or later, you'll have to atone for that original sin by doing what you should have done the very first time you needed two pieces of information about the same thing: create one array with multiple pieces of information in each element, instead of multiple arrays with one piece of information in each element. In C:
struct replica_info {
string server_id;
struct data_server server; /* even better */
enum replica_state rstate;
struct iobuf *iob;
enum buffer_state bstate;
/* anything new can go here */
};
struct replica_info allmyinfo[max_count];
Now you can pass the array of these structures to a lower-level function and never have to worry about adding more parameters as more information is added for each replica. Or you can pass just one of these structures and never even need to know what index it was in the original array. Either way, every function has the same information identified by the same names and can be freely modified without having to muck about with parameter lists every time needs change (as they inevitably will). It's a painful process when the code has already been sick for a long time, but it practically always ends up being worthwhile.
In conclusion: any time you find yourself adding an array (vector, list, whatever) of information that's the same size as an existing array in the same function, stop and consider whether you're taking that first step on the path to parallel-array madness that will afflict you and your coworkers for years to come.