2018/07/28 - Incident Report - Witness + Failover Crash
I just wanted to take a few moments now that the smoke is clearing to highlight what happened and why my witness missed so many blocks today. It was a combination of events that caused both my primary witness to crash and the failover to not react accordingly.
TLDR - It was my fault and I'm taking steps to prevent this in the future.
My primary witness node hit a segfault while running 0.19.10 at around 7am this morning, stopping the server (exception available here). My failover script was pointed at http://wallet.steem.ws, a full node that had also failed recently and was in the process of a replay. Requests to wallet.steem.ws were being proxied out to a secondary set of servers while the primary replayed, to avoid any interruption in service.
So why didn't the failover trigger an account update and swap to one of the other 2 backup producer nodes I had running? Because my failover python script's configuration used http://wallet.steem.ws and not https - and the upstream providers I had fallen back on during my node's replay now required https. The script's first step is fetching the current witness status, and that request failed, so it never finished initializing. Since the alerting only kicks in after initialization, no alarms were raised at all. The script just kept restarting and trying to reinitialize, forever.
So - the takeaways, and the changes needed to prevent this from happening again, are:
- Rework the failover script to alert if the script can't even start and/or connect to any rpc node.
- Rework the failover script to run in multiple locations against multiple different rpc nodes.
- Rework the rpc upstream failover protocol/pool to force route http -> https traffic without errors.
- Ensure that all future monitoring scripts use https by default.
- Potentially force a 301 redirect on http to https traffic, provided that doesn't mess with the RPC connections.
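To make the first, second and fourth items a bit more concrete, here is a rough sketch of what the startup path of a reworked failover script could look like. This is not my actual script. The function names, the example node list and the `alert` callback are all made up for illustration, and the `get_witness_by_account` JSON-RPC method name is an assumption about what the node exposes. The idea is just: upgrade http endpoints to https, try every configured node in turn, and raise an alarm if none of them answer instead of silently restarting.

```python
import json
import urllib.request
from urllib.parse import urlsplit, urlunsplit


class NoNodeReachable(Exception):
    """Raised when every configured RPC node failed to answer."""


def force_https(url):
    """Rewrite an http:// endpoint to https://; other schemes are left alone."""
    parts = urlsplit(url)
    if parts.scheme == "http":
        parts = parts._replace(scheme="https")
    return urlunsplit(parts)


def http_fetch(url, payload, timeout=5):
    """POST a JSON-RPC payload and return the decoded "result" field."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    if "error" in body:
        raise RuntimeError(body["error"])
    return body["result"]


def get_witness_status(nodes, witness, fetch=http_fetch):
    """Ask each node in turn; return (url, status) from the first that answers."""
    payload = {
        "jsonrpc": "2.0",
        "id": 1,
        # Assumed method name; adjust to whatever the target node exposes.
        "method": "get_witness_by_account",
        "params": [witness],
    }
    errors = {}
    for node in nodes:
        url = force_https(node)
        try:
            return url, fetch(url, payload)
        except Exception as exc:  # any failure means "try the next node"
            errors[url] = repr(exc)
    raise NoNodeReachable(errors)


def initialize(nodes, witness, alert, fetch=http_fetch):
    """Startup check: alert loudly if no node can be reached, then re-raise."""
    try:
        return get_witness_status(nodes, witness, fetch)
    except NoNodeReachable as exc:
        alert("failover monitor could not reach any RPC node: %s" % exc)
        raise
```

The `fetch` argument is only there so the logic can be exercised without a live node. In practice `alert` would be whatever actually wakes me up (SMS, a phone call, etc.), and the same script would run from more than one location with a different node list in each.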
Worst part was that this occurred at about 0700 UTC - right around when I fell asleep for the night. With no alarms going off and no indication there was a problem (besides the Slack/Discord messages, which weren't alerts), the outage was able to go on for nearly 7 hours, all while 2 other block production nodes stood idle, but ready, to assume their responsibilities.
Hopefully others can learn from my mistakes! Either that or they're going to end up having a rough Saturday like I did.