We've been working on something simply amazing the past few weeks, but we wanted to post a witness update in the meantime.
The following is written by @garethnelsonuk, CTO of SteemPower.org
Hi everyone. As many of you are aware, I'm the developer on SteemPower.org and also handle sysadmin duties for the backend there, as well as for Charlie's witness and seed nodes.
Some of you may have noticed some growing pains on the @charlieshrem witness, so we both agreed it would be wise to go into detail about the mistakes we've made and how you can avoid them.
Hopefully this will serve as a lesson for other witnesses and as an explanation for others who want to know what happened.
First of all, let's look at the infrastructure we had in place and what we've been moving to.
Old infrastructure and the first incident
The old infrastructure consisted of the following:
- An SSH jumpbox into the Amazon EC2 VPC with properly paranoid firewall settings
- A primary seed node hosted at Amazon EC2 in the California region
- A secondary seed node hosted by EC2 in the EU (Ireland) region
- A witness node and miner hosted at EC2 without a public IP, connecting to the 2 seed nodes
- Amazon Route 53 to handle DNS
- The SteemPower.org website hosted at linode on a small 8GB instance
With the exception of some tweaks on the miner (fixing the thread number issue and switching the ECDSA implementation), all steemd instances were stock, compiled from GitHub and run on top of Ubuntu for maximum compatibility.
The EC2 instances had 16GB of RAM and 8GB of swap; given the expense of running on EC2, this was all the budget allowed.
In order to keep things stable I configured Linux cgroups to prioritise steemd in RAM, swapping out other processes first when RAM inevitably filled up. Sadly this was not enough, and the Linux OOM killer struck, causing the first downtime incident.
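Alongside cgroups, Linux also lets you bias the OOM killer per process through `/proc/<pid>/oom_score_adj` (range -1000 to 1000, lower means "kill me last"). Here is a minimal sketch of that idea; the helper name and the `proc_root` parameter (useful for testing outside a real `/proc`) are my own, and writing to a real `/proc` needs appropriate privileges:

```python
from pathlib import Path

def set_oom_score_adj(pid: int, score: int, proc_root: str = "/proc") -> None:
    """Bias the Linux OOM killer for one process.

    score ranges from -1000 (never kill) to 1000 (kill first).
    proc_root is parameterised so the function can be exercised
    against a fake /proc tree in tests.
    """
    if not -1000 <= score <= 1000:
        raise ValueError("score must be in [-1000, 1000]")
    Path(proc_root, str(pid), "oom_score_adj").write_text(f"{score}\n")
```

With something like `set_oom_score_adj(steemd_pid, -900)`, steemd becomes one of the very last processes the OOM killer will consider, which complements the cgroup swap priority described above.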
Lesson 1 here is to not put off failover configuration. I had been working so hard on features for SteemPower.org that failover testing kept being pushed back, and when the OOM killer struck, the failover node was not up to date on the blockchain. This required a painful process of manually switching keys until all nodes were caught up.
On top of this, some of the servers were under so much load swapping processes like sshd back into RAM that it was extremely difficult to perform administration tasks.
Thankfully the various monitoring systems worked fine and my phone alerted me to the issues very quickly. Since I had appropriate remote access set up, I was able to monitor the situation constantly - ConnectBot for Android is highly recommended here ;)
Another issue that was less obvious to most was the latency in our API on Linode. Part of the fix involved switching to heavy use of memcached and a websockets proxy that automatically routed requests to the most responsive steemd node. This helped keep the service up, but a more permanent solution was needed.
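The core of that routing idea is simple: probe each backend, then send the request to whichever node answered fastest. A minimal sketch, with hypothetical function names and the probe mechanism left abstract (the real proxy spoke websockets):

```python
import time
from typing import Callable, Dict, List

def probe_latency(endpoint: str, ping: Callable[[str], None]) -> float:
    """Return the round-trip time in seconds for one probe of an endpoint."""
    start = time.monotonic()
    ping(endpoint)
    return time.monotonic() - start

def pick_fastest(endpoints: List[str], ping: Callable[[str], None]) -> str:
    """Route to whichever steemd node answered its probe fastest."""
    latencies: Dict[str, float] = {ep: probe_latency(ep, ping) for ep in endpoints}
    return min(latencies, key=latencies.get)
```

In practice you would re-probe periodically rather than per request, so a slow node is quietly dropped from rotation until it recovers.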
New infrastructure and the second incident
In order to get a bit of breathing room and to control the rising costs at EC2, Charlie purchased a dedicated server at hetzner.de - and quite a beast of a machine too.
This new machine (with 64GB of RAM on a quad-core i7 with hyperthreading) is quite simply awesome and currently serves as the primary witness node as well as other light duties.
In order to make it even faster, steemd actually runs entirely inside a ramdisk that is synced to disk at regular intervals. On top of that, a few other services run inside a ramdisk, and the Linux zram module provides a compressed swap device inside RAM (basically trading spare CPU for more effective RAM and delaying touching the hard drives).
Even if something does touch the hard drives (note the plural - RAID0, mmm), they're SSDs in RAID0, which gives high read performance.
Put simply, we moved from struggling to keep things running at all on EC2 and overpaying for it, to paying less and getting a much much nicer machine.
Additionally, there's various processes running on both Linode and this new hetzner box which are used for R&D work for SteemPower.org (some cool features coming, watch this space).
Over time we've moved these processes to operate in a more asynchronous, non-blocking manner: handing off tasks to background threads and caching everything that won't change in the rather generous amount of RAM available. Among other changes, this involved rewriting the WSGI handler for the web.py codebase behind the website to use green threads, but it was worth it. Static content is also served entirely from RAM and is only read from disk at startup.
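The static-content part of that is the easy win: walk the document root once at startup and keep every asset in a dict. A minimal sketch (the function name is mine, not from the SteemPower.org codebase):

```python
import os
from typing import Dict

def load_static_cache(root: str) -> Dict[str, bytes]:
    """Read every file under root into memory once, keyed by relative path.

    After startup, requests for static assets are answered straight from
    this dict and never touch the disk again.
    """
    cache: Dict[str, bytes] = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root)
            with open(full, "rb") as f:
                cache[rel] = f.read()
    return cache
```

The trade-off is obvious but acceptable here: a change on disk is invisible until restart, which is exactly the behaviour described above.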
What caused issues however was trying to make steemd do the heavy lifting...
Lesson 2 - leave steemd alone to do its thing; don't mess around with it, and do your own processing in a separate process.
In order to implement a new feature on SteemPower.org there was a need to locate custom_json operations in the blockchain involving various configuration changes.
Lesson 3 - steemd does not behave like a normal UNIX daemon - do not send it HUP signals, it will die
Because I'm so used to sending a quick HUP to a daemon to reload its configuration, I thought this would be an easy way to load a new plugin into steemd without taking it down, preserving the nice low-latency access you get from having the external process that works with blockchain data on the same machine as the witness node.
Naturally, I quickly learnt that this was flawed. I realised my mistake and restarted steemd, only to discover to my horror the fourth lesson:
Lesson 4 - steemd LOVES to replay the blockchain slowly
Even after a clean shutdown, sometimes steemd will simply not start talking to the network and will not do so until you kill it and replay the blockchain.
This would not be an issue were it not for the fact that the failover node over at EC2 was running an older version and therefore could not produce blocks.
Lesson 5 - check your failover configuration
Although I make an effort to perform routine health checks on all the servers I'm responsible for, one thing I neglected was ensuring that a failover would actually work. I checked that the failover node was running, but I did not realise it was on an older version.
Moving forward - the new plan
To prevent these issues moving forward, the new plan is this:
- Perform a snapshot and copy of the primary witness node, boot it up with a different key (to avoid double producing blocks) in a different location
- Test the secondary witness node is able to failover properly if required - create a checklist for the health checks and automate as many checks as possible
- Perform manual checks of any automated systems - a configuration fault should never go unnoticed
- Use only the seed nodes for API access and R&D work
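As a taste of what "automate as many checks as possible" means, the exact failure from Lesson 5 - a backup running a different version than the primary - is trivial to detect once each node's version is collected. A hedged sketch (how versions are gathered is left out, and the function name is mine):

```python
from typing import Dict, List

def failover_issues(versions: Dict[str, str], primary: str) -> List[str]:
    """Compare every node's reported steemd version against the primary's.

    versions maps node name -> version string. Returns human-readable
    problems for the health-check report; an empty list means the
    failover nodes could actually take over block production.
    """
    issues: List[str] = []
    want = versions[primary]
    for node, ver in versions.items():
        if node == primary:
            continue
        if ver != want:
            issues.append(f"{node} runs {ver}, primary runs {want}")
    return issues
```

Run from cron and wired into the same phone alerts mentioned earlier, this turns "I did not realise it was an older version" into a page that fires within minutes of an upgrade drifting out of sync.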
Hopefully all of this will be worth it when we release the new features on SteemPower.org. As the software improves, future downtime incidents will become less and less likely.
Help keep SteemPower running! Voting for us as witness pays for the development of apps and tools for Steem.
Vote for us as a witness the following way:
https://steemit.com/~witnesses click the arrow next to "charlieshrem"