Never 'ssh' Into Production?
Yea or nay?
November 13, 2018

I've been involved in an interesting discussion about enabling (or not) ssh on production machines, starting here.
OK, yeah, I get it, it's an anti-pattern. Something to avoid. I'm 100% on board with that. On the other hand, whether or not you can/should make an absolute prohibition depends a lot on what kind of system you're working on. To be clear, Charity didn't seem to mean it that way, but at least one person following up made that mistake. If I'm flaming, I'm not flaming her.
The context I'm talking about has to do with the "app developers" part of the original tweet. Yeah, fine, app developers. Go make rules that work for you. But for the love of FSM stop imposing those rules on infra developers and especially storage developers. Your endlessly fluid no-data-inertia no-data-coupling world looks a lot like a vacation to me, and I'm frankly sick of the vacation crowd setting the rules for working folks.
(OK, yes, I'm definitely flaming.)
Let me give an example of when I think being able to ssh into a system is absolutely essential. There was a bug in Gluster (gasp!) that caused some internal pointers to be set incorrectly. The number of affected files was pretty small, but the effect was pretty large. We needed a quick fix. The "immutable infrastructure" answer of re-provisioning systems doesn't apply here. The problem was not in the infrastructure itself at this point; it was in the data. Re-provisioning wouldn't have done shit to fix it, and would have added extra disruption. Somehow, we had to get in to fix those bad pointers. The fix wasn't terribly complicated, but it wasn't trivial either.
What were our choices? In a "never ssh" world, we could have fooled our configuration-management system into running some arbitrary bit of one-off code to apply the fixes. There are a few problems with that. One is that such code running with no human feedback is extremely dangerous. I've seen many problems caused or made worse by just that approach, as the code snippet runs amok in ways the developer never intended and makes a much bigger mess before it can be stopped.

But here's the absolute killer: how would you even be able to write that code snippet without doing some interactive investigation (and perhaps experimentation) first? Answer: you wouldn't. Every single time I've developed such a fix, it was based on having poked at the system myself first. Writing and deploying a long series of snippets to fool a higher-level system into doing ssh's job is an intolerably slow and error-prone way of doing things.
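To make that concrete, here's a sketch of the shape such a one-off repair usually takes. It is not the actual Gluster fix; the xattr name (trusted.example.ptr) and the "expected" value are made-up stand-ins for whatever the real bad pointers were.

```python
#!/usr/bin/env python3
# Hypothetical one-off repair pass -- the xattr name and "expected" value
# below are illustrative stand-ins, not Gluster's actual on-disk metadata.
import os
import sys

XATTR = "trusted.example.ptr"    # made-up attribute holding the bad "pointer"
EXPECTED = b"\x00\x00\x00\x01"   # value settled on after poking at the live system

def repair(root, dry_run=True):
    fixed = 0
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                current = os.getxattr(path, XATTR)
            except OSError:
                continue              # attribute absent: not one of the affected files
            if current == EXPECTED:
                continue              # already correct, leave it alone
            print(f"{path}: {current!r} -> {EXPECTED!r}")
            if not dry_run:
                os.setxattr(path, XATTR, EXPECTED)
            fixed += 1
    return fixed

if __name__ == "__main__":
    # Run read-only first; add --fix only after eyeballing the output over ssh.
    if len(sys.argv) < 2:
        sys.exit("usage: repair.py <root> [--fix]")
    root = sys.argv[1]
    dry = "--fix" not in sys.argv[2:]
    n = repair(root, dry_run=dry)
    print(f"{n} file(s) {'need fixing' if dry else 'fixed'}")
```

The particular code isn't the point. The point is that the dry-run default, and the decision about what "expected" even means, both come from having looked at the live system first, which is exactly the step a blanket ssh ban takes away.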
Here's the kicker: that wasn't just one time. That's a pretty frequent need with Gluster or any other data-storage system deployed at scale. Almost by definition, bugs are things that slipped through the existing automation, and new automation can't be developed without hands-on preparation. Life must be so easy when your thread/process/function ending means the whole world ends with it, so there's never anything to clean up before you start again, but data storage isn't like that. "Somebody else's problem" is an even worse anti-pattern than ssh. The "somebodies" who tend data need to log in to production systems, even if the transients don't.