The Redundancy Paradox
DevOps Engineers often strive to keep as much redundancy as is feasible in the systems that they deploy. They use different means to get there: for production systems, maybe they use the "blue-green" deployment methodology to ensure that a system is always online and functional, or maybe they use continuous integration and continuous delivery on their development systems to make it easier to add multiple machines. Whatever the case, the various production instances have the same settings and access to the same data, so the user can't tell which server is handling a given request. Instead, they experience a unified product or service.
This is all well and good when the large majority of the product or service is developed and maintained by one company or organization, and when that product can handle exceptions that arise from consuming outside data. However, I've noticed that redundancy cannot be all-encompassing when a program is part of a larger whole that is not controlled by the same organization. My real-world example is blockchain technology, and specifically, in my experience, blockchain instances that run the EOSIO software variants. EOSIO node providers that have the luxury of maintaining multiple nodes for each instance (some for public API consumption, others for block production) generally have some form of redundant node setup which works well for them, but more often than not the configurations of the redundant nodes are almost identical. And this is as it should be... until a chain event (e.g. the upcoming EOS Mainnet 1.8 upgrade or an upgrade of an EOSIO sister chain) inadvertently crashes a misconfigured node. In such an event, every node with the same configuration crashes, bypassing the redundancy as if it had never been there at all.
Unfortunately, this phenomenon, the "Redundancy Paradox" (that is, the idea that "true redundancy requires similar configurations across multiple systems, and yet similar configurations can be a cause of universal redundant system failure"), is an issue that can only be mitigated and not solved. "Solving" the issue would imply that the configuration of one of the nodes could be changed without any sacrifice being made. However, a properly-motivated systems administrator would theoretically have made both configurations "perfect and up to code"--that is, a competent systems administrator would keep both configurations similar because he or she would want that configuration to make the software operate at its most efficient. Changing either configuration would therefore require some sacrifice, and thus any change would be a "mitigation."
I'm not suggesting that maintenance teams get rid of redundancy. In every other situation (e.g. upgrading, or limiting the damage from the failure of a single node), redundancy generally works exactly as it should. However, it should be viewed the way one might view biodiversity: when some external event causes one node to fail, similar nodes exposed to the same event will likely fail as well, because their implementations lack diversity.
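To make the biodiversity comparison a little more concrete, here is a minimal Python sketch that groups nodes by a hash of their settings, so that any group with more than one member marks a set of nodes that a single bad event could take down together. The node names, setting keys, and both function names are made up for illustration; real node configurations would have to be parsed into a dictionary first.

```python
import hashlib
from collections import defaultdict


def fingerprint(config: dict) -> str:
    """Hash a node's settings so identical configurations compare equal."""
    # Sort the keys so that ordering differences do not change the hash.
    canonical = "\n".join(f"{key}={config[key]}" for key in sorted(config))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def shared_failure_groups(nodes: dict) -> list:
    """Return sorted groups of node names that share one configuration."""
    groups = defaultdict(list)
    for name, config in nodes.items():
        groups[fingerprint(config)].append(name)
    # Only groups with two or more nodes represent a shared failure mode.
    return sorted(sorted(names) for names in groups.values() if len(names) > 1)


# Hypothetical fleet: names and settings are assumptions, not real defaults.
fleet = {
    "api-1": {"role": "api", "version": "A"},
    "api-2": {"role": "api", "version": "A"},
    "producer-1": {"role": "producer", "version": "A"},
}
print(shared_failure_groups(fleet))  # [['api-1', 'api-2']]
```

A report like this does not remove the paradox; it only shows where the redundancy is thinner than it looks.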
Potential mitigations of this issue can come in at least two forms: a) ensuring that DevOps Engineers are available to battle Murphy and his Laws at any hour of the night, and b) properly motivating DevOps Engineers to develop a failover system. In other words, the issue caused by the Redundancy Paradox is something that must be addressed from outside the system and its reference frame. External software that detects node failures, corrects them, and relaunches the node in an automated fashion is invaluable in situations like these.
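As a rough illustration of option b), the following Python sketch shows the shape such external software might take. It is not tied to EOSIO or any real tool: `supervise`, `is_healthy`, and `restart` are hypothetical names, and the two callables stand in for whatever health check and relaunch command a given node actually needs. It only covers detecting a failure and relaunching; correcting the underlying cause is left out.

```python
import time
from typing import Callable


def supervise(is_healthy: Callable[[], bool],
              restart: Callable[[], None],
              checks: int,
              max_restarts: int = 3,
              interval: float = 5.0,
              sleep: Callable[[float], None] = time.sleep) -> int:
    """Poll a node `checks` times and restart it whenever it is unhealthy.

    Returns the number of restarts performed. Stops early once
    `max_restarts` is reached, so a crash loop is handed to a human
    instead of being retried forever.
    """
    restarts = 0
    for _ in range(checks):
        if not is_healthy():
            if restarts >= max_restarts:
                break  # give up; this is where an engineer gets paged
            restart()
            restarts += 1
        sleep(interval)
    return restarts


# Demo with a fake node that starts out crashed (assumption for illustration).
state = {"up": False}


def fake_restart() -> None:
    state["up"] = True


# The injected no-op sleep keeps the demo from actually waiting.
print(supervise(lambda: state["up"], fake_restart, checks=3,
                sleep=lambda seconds: None))  # 1
```

The restart cap matters here: if the same chain event crashes the node again after every relaunch, automation alone will not fix it, which is where option a) comes back in.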
I suppose this "rambling" about the Redundancy Paradox is less of a call to change redundant practices and more of a warning to ensure that DevOps Engineers and business personnel don't become complacent when computing machinery is operating nominally. External events are not bound to follow the procedures of any organization that works with them, and "nominally operating machinery" should be inherently suspicious.