Witness Update - Acknowledgement of error and changes to @curie witness operation structure (April 2nd, 2018)
This witness update is an open acknowledgement of a major error made on the @curie witness and an outline of the changes in our witness operation structure that have been put in place to ensure there is not a repeat.
- On April 1st, 2018 an error was made by @curie witness operator @locikll while upgrading to a new witness server. The @curie witness began double producing and causing block collisions.
- This event highlighted a failing in the witness monitoring protocols in place for the @curie witness. The initial error was a major mistake, and it was compounded by the fact it took some hours before the mistake was fixed and @curie witness recovered.
- We take this matter very seriously; operating a top 20 witness on the Steem blockchain is a serious responsibility and we openly acknowledge failing to live up to this responsibility. The following steps have been taken to ensure this does not happen again:
- @locikll has been replaced as primary witness operator. @markangeltrueman (UTC time zone) will take over as primary witness operator. Mark is Curie's lead developer for several ongoing development projects and in his full time job heads up a team of third line support engineers for a large payments company after spending 15 years as a software engineer.
- @locikll will continue as backup witness operator to provide additional monitoring and coverage as needed; the primary and backup witness operators will coordinate schedules so that at least one of them is providing coverage every day. @locikll will also continue to manage the RPC node and other non-witness infrastructure.
- A "four eyes" QA process will be added to any changes that are to be made to the witness node outside of emergency incidents. No changes to witness node config etc. will be made by either primary or backup witness operator without the other operator checking it first.
- Automated monitoring will be implemented to notify primary and backup witness operators via SMS when blocks are missed.
- A backup monitor position has been created with direct phone contact to the primary and backup witness operators. @carlgnash is assuming the duties of backup witness monitor; this gives Curie witness 24/7 human monitor coverage as @carlgnash is Pacific coast US time, with waking hours to overlap UTC night time. This is intended as a final stopgap to ensure the fastest possible response time at all hours.
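As a rough illustration of the automated monitoring described in the list above, the sketch below shows one way a missed-block alert could be structured. It is not the actual @curie tooling. It assumes the monitor can periodically read a cumulative missed-block counter for the witness, and the `notify` hook (standing in for an SMS gateway), the `MissedBlockMonitor` class and the recipient names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class MissedBlockMonitor:
    """Tracks a cumulative missed-block counter and alerts when it increases."""

    # Hypothetical hook: in practice this might wrap an SMS gateway call.
    notify: Callable[[str], None]
    # Hypothetical recipient labels, e.g. primary and backup operators.
    recipients: List[str] = field(default_factory=list)
    last_total_missed: Optional[int] = None

    def observe(self, total_missed: int) -> int:
        """Record a new counter reading and return the number of newly missed blocks."""
        if self.last_total_missed is None:
            # First reading only establishes the baseline; nothing to compare yet.
            self.last_total_missed = total_missed
            return 0

        newly_missed = max(0, total_missed - self.last_total_missed)
        self.last_total_missed = total_missed

        if newly_missed > 0:
            # Send one message per recipient so every operator is paged.
            for person in self.recipients:
                self.notify(f"{person}: witness missed {newly_missed} block(s)")
        return newly_missed
```

A polling loop would call `observe` with each fresh counter reading. Keeping the counter source and the notification channel outside the class means the alert logic can be tested without a live node or a phone.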
For those new to Curie, please follow @curie, and join us on Discord: https://discord.gg/jQtWbfj
For all witness related queries, there is a dedicated channel on the Discord server: #witness-enquiries. Please address any @curie witness related questions or concerns in this channel. To learn more about Curie operations, please read the Curie Whitepaper at curiesteem.com
Follow @curie's votes to support the authors. Please consider following our trail and voting for curated authors. If you are a SteemAuto user, @curie is an available trail to follow.
Things happen. It's unfortunate, but that's life. You're making changes moving forward instead of just ignoring the problem like some may have done.
We can all sit here and criticize what's done, but what's done is done and the important thing is what the community can expect from your team in the future, particularly as you still hold a vital technical role.
Not trying to brownnose, just my 2 cents having just seen this linked above my post in the chat.
Yes, agree with what you are saying here 100%. As part of the Curie community I can tell you that we all felt a timely response was important; and while it is unfortunate this happened in the first place, we have much more robust monitoring and QA protocols in place now.
The 4 eyes process is a really smart and simple idea that I think we'll borrow from you in the future as well.
One of the benefits of being a community witness is that you pretty much always have someone around to check any changes that you are making. A lot of the time, the changes don't need to be too technical anyway, and even if they are, in the face of having to explain them to someone non-technical, you will often pick up something that you missed.
Too bad these things happen.. Good that you have better monitoring in place. Do you also have a disaster recovery plan and monthly fail over tests?
It would have been better not to name people. Everyone can make a mistake, and of course you can put other people on the job, but why make the name public here? For sure he did not do it on purpose.
It was my mistake, and my decision to move on to a more suitable role (time-wise). Transparency has always been a goal which we strive for at Curie, and when a mistake is made, the more information about the cause, the better for everyone. We're not trying to sweep it under the rug and are providing not only our supporters and community the explicit reasons for the error, but also providing a record for us to look back on and to keep in mind as we continue into the future. 🙂
Sometimes it happens. Learn from past mistakes and move on.
I appreciate the transparency. I hope things go well for you and everyone involved from here on out. Mistakes happen, large and small, but rarely are the end of the world. I know that because I'm still here. :) Good luck in your new role and I wish you and the rest the best.
Mistakes happen. I once turned off a whole payment platform accidentally (through no real fault of my own, but because of a lack of process). That had wide-reaching repercussions that prompted a company-wide change in policy. As long as you react to a mistake in the correct way, it can only make you stronger.
What were the consequences of this mistake? Did it have any real life repercussions?
Yes, the Curie witness was (at the time) a top 20 witness and as such was regularly scheduled to produce blocks for the Steem blockchain. For the ~ 12 hours that it took for the Curie witness operator to respond, the Curie witness was still being automatically scheduled to produce blocks and was missing the blocks the entire time. Ultimately as blocks are scheduled ~ every 3 seconds, and the same witness is not scheduled to produce blocks consecutively, when a witness misses a block it just means a 3 second delay in transactions on the Steem blockchain. This is still a real life repercussion and a top witness should not be missing any blocks, let alone 12 hours straight.
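For a rough sense of scale, the figures in this reply can be turned into an estimate of how many blocks were affected. The sketch below uses the roughly 3 second block interval and the roughly 12 hour outage mentioned above. The round size of 21 producing slots is an assumption about Steem's scheduling that is not stated in this thread, and the function name is made up for illustration.

```python
BLOCK_INTERVAL_SECONDS = 3  # from the reply above: a block roughly every 3 seconds
WITNESSES_PER_ROUND = 21    # assumption: 21 producing slots per scheduling round


def estimated_missed_blocks(outage_hours: float) -> int:
    """Estimate blocks missed by a witness that holds one slot in every round."""
    total_slots = int(outage_hours * 3600 / BLOCK_INTERVAL_SECONDS)
    # One slot per round belongs to the witness, so divide by the round size.
    return total_slots // WITNESSES_PER_ROUND


if __name__ == "__main__":
    print(estimated_missed_blocks(12))  # 685
```

Under these assumptions a 12 hour outage works out to about 685 missed blocks, each adding around 3 seconds of delay at the point where it was scheduled.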
Sh*t happens~ ¯_(ツ)_/¯
Thanks for keeping us updated on it~
Thanks for the update..
Thank you for information. I will join the discord of @curie.
Conductor or a similar script will help automate the transition from Primary to Backup.
Glad to know it's all sorted out.
Cheers.
Thanks for commenting. We have scripts running to fail-over from primary to backup in the case of missed blocks or other issues. However, this was a small configuration mistake that cannot easily be accounted for with anything other than a human four eyes procedure (which we have implemented).
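To make the distinction concrete, here is a minimal sketch of the kind of decision a fail-over script might make. It is an illustration only, not the scripts referred to above; the node labels, the threshold and the function name are all hypothetical.

```python
def choose_signing_node(current: str, consecutive_missed: int, threshold: int = 3) -> str:
    """Return which node should be signing blocks after the latest check.

    `current` is a hypothetical label, either "primary" or "backup".
    The switch is one-way: moving back to primary is left to a human.
    """
    if current == "primary" and consecutive_missed >= threshold:
        return "backup"
    return current
```

A script like this only reacts once blocks have already been missed. It does not review a configuration change before it goes live, which is the gap the human four eyes check is meant to cover.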
Surely you can't mix human and data.
Nice human four eyes addon to your process.
Congratulations to @markangeltrueman, who is now the primary operator. This change should help improve the service. Thank you for the information.