This article describes how the hardfork on April 30, 2016 changed the witness scheduling algorithm and why these changes were made.
As many of you may have been aware, there had been a few bugs in how the Steem blockchain counted votes for witnesses and how it selected witnesses. The main bug I will be discussing in this article is the bug in which the blockchain failed to select the correct timeshared witness as described in the Steem white paper.
I will first talk about how the timesharing witness (I also tend to refer to this as the runner witness) slot selection is supposed to work, and then briefly describe the technical details of the bug that was fixed in the April 30th hardfork.
I will then discuss the requirements on the witness selection algorithm imposed by the need to fix the bug, and how that led me to develop a new witness algorithm that is currently powering Steem today.
And finally, I will give an overview of how the new witness selection algorithm works today.
You can skip any of these sections if you wish (although a highly recommend not skipping the first section). But it would be best to just read the whole thing in order.
How witness scheduling was intended to work
From the very beginning, as you can tell by reading the white paper, it was intended for there to be three different paths to become a block producer (i.e. witness).
The first path is through mining. Users present the necessary proof-of-work to the blockchain which is their ticket to enter the mining queue. They wait their turn until they get to the top of the queue and then are allowed to produce a block. I call witnesses who became block producers via this path miner witnesses. After April 24, 2016, only one miner was selected to produce a block per round.
The second and third paths are through votes. Accounts vote using their Steem Power (SP), or the SP weight proxied to them, for other accounts campaigning to be witnesses. The blockchain then orders the accounts by votes (from highest to lowest). The distinction between the second and third path has to do with which voting rank the account has.
If the account is in the top 19 by vote, they become a block producer via the second path; I call witnesses who became block producers via this second path voted witnesses. The top 19 voted witnesses are included as block producers every round.
If the account is not in the top 19 by vote, they still have a chance to be selected as a block producer through a timesharing algorithm that is weighted by their votes. I call witnesses who became block producers via this third path runner witnesses ("runner" both because they are "runner-ups" but also because the timesharing algorithm can be visualized as witnesses running around a track at a speed proportional to their votes). It was intended to only select one runner witness per round which adds up to exactly 21 witnesses selected per each round.
The algorithm to select the runner witness is a variation of weighted fair queuing where the weight is the votes the witness has. The idea behind the algorithm is to select the runner witness as a block producer with a frequency (relative to other runner witness candidates) proportional to their votes (relative to other runner witness candidates). So if a witness is at rank 20 with twice the votes of the witness at rank 21, the rank 20 witness should be scheduled as a block producer twice as frequently than the rank 21 witness.
Personally, I like to think of the algorithm like the accounts participating in a virtual race in their own universe with time independent of the real universe. All accounts start at the start/finish line of a circular track at the same time, but some accounts are faster than others. Their speed is proportional to their votes.
At any instant in virtual time, we could predict precisely when (in virtual time) the next person will cross the start/finish line thus completing another lap in the race. You simply extrapolate out for each account based on their current position (this is called
virtual_position in the code) and current speed to find the time (this is called
virtual_scheduled_time in the code) at which they would complete a lap. Then you pick the account with the smallest
virtual_scheduled_time and that is the next account to complete a lap. Granted all of this can change if the votes (i.e. speed of the running witnesses) change. So the blockchain is responsible for keeping all of these variables (
virtual_scheduled_time, etc.) up to date as the votes for witnesses continuously changes.
So when it comes time to pick the next runner witness for block production, you can imagine that the blockchain effectively fast forwards time in this virtual universe until the next account completes the lap. That account is scheduled for block production for the next round, and their
virtual_position is reset to 0 so they can run another lap.
I won't go into too much technical detail on the bug(s) in this article. The main issue had to do with overflows when updating the virtual race variables and also how skipping over invalid runner witness candidates worked. These bugs led to interesting behavior. For one, the invariant that time only moves in the forward direction in the virtual universe was not maintained. Another interesting side effect was that when updating a runner witness that completed a lap, it was possible (if they had 0 votes) that an overflow would cause their
virtual_scheduled_time to decrease, which is something that should never happen (other than a global reset of everyone's time, i.e. restarting the virtual race). You may have noticed the symptoms of this which was an account getting stuck as the runner witness for several rounds in a row (e.g. hello and adm). In fact, the only reason they became unstuck on the live network was because someone voted for them!
The runner witness selection algorithm also did not handle miners properly. Things become complicated quickly when you need to consider three different paths (top 19, mining, runner) to becoming a block producer and all the ways they might interact with one another. As I described in the previous section, selecting the runner witness for the next round is "as simple" as picking the account with the lowest
virtual_scheduled_time. But what happens if that account happens to also be a top 19 witness? Or in the mining queue? The old algorithm effectively skipped over such accounts in two ways.
First, if the account was in the top 19 it simply changed its
virtual_scheduled_time to the largest possible value. However, this broke the consistency between
virtual_position and the votes (i.e. speed) of the account. Breaking this consistency could and did lead to strange behavior when compounded by other bugs, so it isn't the ideal way to handle things. Also, the
virtual_scheduled_time of a witness was only updated when votes for the witness changed or if the witness was scheduled for the next round. This meant that if a top 19 voted witness was pushed out of the top 19 (not by being voted out but rather someone else being voted higher), they would still be stuck with the old maximum value for
virtual_scheduled_time until their votes were updated. So such a witness might have never been scheduled as a runner witness if the votes for them hadn't changed.
Second, if the account was in the mining queue, the witness scheduling algorithm simply skipped over them (both for being considered as a top 19 voted witness and for being considered as a runner witness). This had two consequences.
The first consequence is one that some of you in the community may have learned the hard way. If you were in the top 19 and also used the same account to mine, you could end up financially hurting yourself. Once your account entered the mining queue, it would effectively be ignored as a top 19 voted witness and your spot would go to the witness in rank 20 instead. Since the mining queue is typically approximately 100 miners long, this meant you would lose out on 100 rounds worth of revenue (100 STEEM) in exchange for just 1 STEEM when you finally exited the mining queue 105 minutes later. Now this is not a bug in itself. After all, even today with this behavior changed the community recommends that top 19 witnesses don't also mine.
The second consequence however was a bug. Because the witness scheduling algorithm simply skipped over runner witness candidates that were in the mining queue without modifying their
virtual_scheduled_time, it violated the invariant that everyone's
virtual_scheduled_time should be no smaller than the current virtual time. It also meant that if the skipped-over witness did not produce a block as a mining witness on their turn, and after exiting the mining queue if their votes continued to not change until after the next time they were scheduled to produce a block as a runner witness, then it would be possible for the virtual time to go backwards as was mentioned earlier.
From the previous section, it should be clear that maintaining certain invariants (keep virtual race parameters consistent, make sure virtual time doesn't go backwards) is important to be able to sanely reason about the correctness of the witness scheduling algorithm. Also any time numbers needed to be added to the virtual race parameters, it is essential to be sure that no overflow occurs (or to check if it does and do something reasonable to deal with it without invalidating invariants).
With the help of myself and @abit, @dan went through the code and fixed many of these issues and changed the witness scheduling algorithm so that it did not violate the above invariants. Overflows were also checked for, and in the rare cases where they occur, the virtual race is just reset for everyone. However, the changed algorithm to fix the bug also meant a slight but significant change to the spirit of the timesharing algorithm for selecting a runner witness. First, the miner at the top of the queue was selected, then the witness with the lowest
virtual_scheduled_time time was selected (even if it also happened to be the same selected miner), and then finally 19 of the highest voted witnesses who were not either the selected miner or runner were selected to finish up the 21 active witnesses for the next round.
So besides the fact that it was possible for an account to potentially be scheduled for two slots in a single round, what did this change mean for the fairness of the timesharing algorithm? If the selected runner witness was also in the top 19 (something that would happen quite frequently because the top 19 witnesses have high votes and thus fast speed in the virtual race), it effectively meant the runner witness was the witness at rank 20. The white paper said that the runner witness slot each round is "timeshared
by every witness that didn’t make it into the top 19 proportional to their total votes" (emphasis mine). By not ignoring the top 19 witnesses, the changed algorithm would have disproportionately favored the witness at rank 20 as the runner witness.
So, I decided to just rewrite the witness scheduling algorithm from scratch to do it right. By doing it right I mean:
- First adding the top 19 voted witnesses before anyone else.
- Maintaining all the invariants properly.
- Not scheduling an account for more than one slot per round.
- Timesharing among the accounts not in the top 19 in a way proportional to their votes.
Continue reading on to the next section to learn how the new algorithm actually works.
The new witness scheduling algorithm
The new witness scheduling algorithm does the following each time it needs to select a new set of active witnesses for a round:
- First, select the top 19 voted witnesses as the voted witnesses and add them to the new active witnesses set for the round.
- Then, pop off as many miners from the top of the mining queue until enough (in Steem's case just 1) are found that are not one of the selected voted witnesses. Those selected miners are called the miner witnesses and they are added to the active witnesses set.
- Then, fast forward in virtual time in the virtual race universe until enough accounts have completed a lap who are not one of the selected voted witnesses nor the selected miner witnesses. Those lap completers who were not voted witnesses nor miner witnesses are called the runner witnesses and they are added to the active witnesses set. The word enough in this context means just the amount of runner witnesses needed to add to the active witnesses set so that the completed active witnesses set for the round includes exactly 21 witnesses.
The algorithm is general enough to work if any of the numbers above are changed. But for the Steem blockchain, the numbers are hardcoded to select the 19 highest voted witnesses, at most 1 miner witness, at least 1 runner witness, and always exactly 21 active witnesses per round total. Note that while this will never happen in reality, if everyone stopped mining for a long time, the mining queue would dwindle down until it is empty and then two runner witnesses would instead be selected per round to make up for the lack of miners. In the original algorithm, if mining stopped, the mining queue length would dwindle down to a floor of 21 at which point the lucky miner left at the top of queue would just remain at the top of the queue and be scheduled in every round.
This new algorithm also means that if you are in the mining queue you no longer need to worry about missing out on producing blocks as a runner witness. However, as unlikely as it is, you might still get skipped over as a miner if you end up at the top of the queue right before the exact same round in which you would be selected as a runner witness. So it is still recommended to mine with a separate account.
The new witness scheduling algorithm fixes the bugs we have faced with witness scheduling. It went live in the April 30th hardfork and seems to be working fine since. Hopefully there aren't any more bugs with the scheduling algorithm remaining.
The new algorithm stays true to the intent described in the original white paper even though there are subtle changes (improvements?) that go beyond just fixing the bugs of the original witness scheduling algorithm, such as: witnesses who are in the mining queue no longer need to worry about losing their runner witness slots while they wait in the queue; the algorithm wouldn't reward a single lucky miner if mining were to somehow stop (instead there would be two timeshared runner witnesses per round); and, the algorithm is robust enough to work even if the number of top N voted witnesses per round, number of miner witnesses per round, or number of runner witnesses per round were to be changed in a future hardfork.
Feel free to ask more questions about the new algorithm in the comments and I will try my best to answer them. If you like my work, please consider approving my witness if you haven't so already. Thanks.