Backseat HF20 Postmortems

in #steem, 6 years ago (edited)

In contrast to all the screaming going around, which I understand is a natural reaction to the way the change rolled out, I would like to add something constructive. Keep in mind that everything is easier to catch in hindsight. That does not excuse how poorly the change rolled out, but it is more constructive to identify some improvements that could help the process.

Some random initial thoughts, before I dig into the problems that were observed, the ways they were fixed, and where the process could be improved.

  • Don't announce success immediately on flip ;)
  • Especially for newer features, some state monitoring is a must. I feel that this could have caught issues before they materialized on the main net.
  • We do not know the full extent of the tests that were done on the test net, so I would hold off judgement until we know more. Besides, there are many reasons why the test net may not accurately reflect real conditions, as much as they tried to make it do so.
  • Steemit will likely do a full postmortem with their knowledge which should hopefully clear up many gaps in my knowledge.

Massively Negative Mana

What happened? The switch flipped, and nobody could do a thing! It was quickly revealed that nearly everyone had resource credits that went to an extreme negative.

They immediately identified what allowed the computations to end up this way, but the fix required a costly resync of the nodes to recompute the mana correctly.

This is one of the worst types of bugs to have: one that causes a standstill and requires a resync of nodes. Note that had the negative mana issue been caught, we would only have fast-forwarded to today's state, where we patiently wait for RC costs to normalize, which may have been more palatable.

As I understand it, the RC balances of each account could have been inspected on the main net before the cutover time, and that would have immediately pointed to a problem with the mana calculation. Of course monitoring all accounts is probably not feasible, but a random sampling would have sufficed. Please correct me if I am off base here.
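To make the sampling idea concrete, here is a minimal sketch. It assumes the pre-flip RC mana values have already been pulled from a node into a plain dictionary; the function name `sample_negative_rc`, the account names, and the numbers are all made up for illustration and are not part of any Steem API.

```python
import random

def sample_negative_rc(rc_by_account, sample_size, seed=None):
    """Check a random sample of accounts and return those with negative RC mana.

    rc_by_account is a hypothetical mapping of account name -> RC mana,
    assumed to have been fetched from a node beforehand.
    """
    rng = random.Random(seed)
    names = sorted(rc_by_account)
    k = min(sample_size, len(names))
    sampled = rng.sample(names, k)
    return {name: rc_by_account[name] for name in sampled if rc_by_account[name] < 0}

# Made-up balances; a healthy chain should produce an empty result.
balances = {"alice": 1_500_000, "bob": -9_000_000_000, "carol": 2_000, "dave": -42}
flagged = sample_negative_rc(balances, sample_size=4, seed=1)
print(flagged)
```

Any non-empty result before the cutover would have been a reason to stop and look closer, even without knowing the root cause.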

Small Note About Detection

There has been word that the negative mana issue was present in the test net, and people are all up in arms about it, but I would like to raise the counterpoint here: the team could not determine if it was an actual issue, or if it was due to the firehose of real data streaming into the testnet, which does not accurately factor into pricing. So cut them just a little bit of slack here. There are plenty of reasons to dismiss a weird state in the test net. Perhaps this should have been a warning to check the values in the main net though. Again, hindsight is 20/20.

Voting Power

There was another bug where the voting mana did not preserve the existing voting power at the cutover. I'm not entirely sure if this could have been caught, but if the voting mana had also been computed along the way before the flip, it might have been auditable.

There are many posts about the effects of this. In case you do not know already, after the one-time penalty, voting power will behave as usual. It sucks to lose that amount, but there's no going back.

Delegation Behavior, or Powering Up

See this reported bug by @yabapmatt. This is something I also encountered while trying to help someone boost their RC, and was disappointed that it didn't help immediately (except in the regeneration rate).

This, while not exactly minor, should have been a simple test case on the test net. That said, it's easy to spot now that it has been observed. Expected RC behaviors should be clearly outlined somewhere, though, and checked off as "things we could have tested on the test net".

The same thing happens when powering up. Yes, the power up cost me some mana, but it also didn't increase my existing mana. I saw a PR to address both issues.
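As a rough illustration of what such a checklist item could look like, here is a toy model. The `ManaBar` type and both functions are hypothetical, and the "expected" rule (current mana grows with the stake, minus the cost of the operation itself) is simply the behavior I was hoping for, not confirmed chain logic.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ManaBar:
    current: int   # RC mana available right now
    maximum: int   # cap, which also drives the regeneration rate

def stake_increase_observed(bar, delta):
    """What I saw after a delegation or power up: only the cap grows."""
    return ManaBar(bar.current, bar.maximum + delta)

def stake_increase_expected(bar, delta):
    """What I expected: the new stake is usable right away."""
    return ManaBar(bar.current + delta, bar.maximum + delta)

def passes_checklist(apply_increase, bar, delta, op_cost=0):
    """Checklist item: after adding stake, both cap and current mana go up."""
    after = apply_increase(bar, delta)
    cap_grew = after.maximum == bar.maximum + delta
    mana_grew = after.current >= bar.current + delta - op_cost
    return cap_grew and mana_grew

before = ManaBar(current=100, maximum=1_000)
print(passes_checklist(stake_increase_expected, before, 500))
print(passes_checklist(stake_increase_observed, before, 500))
```

A handful of checks like this, run against the test net, is the kind of "checked off" list I have in mind.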

RC Costs Being Too High

Yes, we knew that it would take time for the market to stabilize, but it seems that the equilibrium might not be working very well. We do expect this to be usable by accounts of all sizes, and indeed there's a change that adjusts parameters to account for this here. This seems to need a resync? Though I'm not sure.

Summary

All in all, the problems are being identified and addressed. I feel that there are a few sanity checks that could have caught the really bad stalling, but it is also a bit of speculation here. My post lists a few of the issues that I was tracking, but the main take-away is this:

  • Monitoring the state pre-flip for the new features would have caught the issue before it became a problem.

Of course the test net could also be better, and I'm sure the testing facilities will only improve over time. I cannot comment on how it can improve since I don't know the details of it, other than that I was able to make a test post in the test net just fine.

We will get it to a workable state. Just be patient, as I believe the friction now will pay off in the long run. Having a better system for bandwidth is a good system-level change in my opinion, as those of us that experienced the bandwidth hammer know (and promptly forgot about).

Oops this isn't really a summary section at all. Ah whatever. And also keep in mind that I'm a back seat driver, so I could be full of shit too. This is just what I have the visibility to observe after all.


I would like to raise the counterpoint here: the team could not determine if it was an actual issue, or if it was due to the firehose of real data streaming into the testnet, which does not accurately factor into pricing.

Valid, but if these reports are true, the team also could have used that data point as an inspiration for pre-emptively creating the patch that instead had to be deployed and replayed not only on witness nodes but on RPCs as well. In my view such an incident should have been a big red flag that pushing the changes as-is was simply a bad idea.

Which brings me to another point. For a good period of time, at least a few hours, the witness nodes were actually ready to allow us to transact (as they are capable of replaying faster), but transactions could not be broadcast through RPC nodes, because those validate transactions first before passing them along to witnesses. (This is not first-hand information, and I have not verified its integrity, so don't take it as gospel.) If this is true, I think it would be nice to make it possible (or more accurately, accessible) to broadcast a transaction you know is legit.

If we had that, the frontends including Steemit could have been updated quickly. API nodes might have been even more confused about resource mana, but maybe we could have gotten everything going again more quickly.



I like your post.
I can't vote it right now.
If I did that you'd get half of what it's worth,
and you're worth more than that.
Be back later.


Yes I don't disagree with that.

I'm not too certain about the configuration concerning RPC nodes. But I think someone set up such a node, though it actually allowed transactions to go through that made things very negative (see other comment about netuoso). That allowance did get patched so it wouldn't happen in such a configuration again.

It's been a painful learning process for me. I am still waiting for my voting power to be at 100%. I have been so careful in the past to not go below 75% so that my vote, no matter how small, counts. Now, like everyone else, I must wait DAYS for it to normalize. sigh

I do not really get the gaming (I hope I am using the right term) aspect of the voting system and the points given and received if the upvote is at the right time. I just like to interact with folks and hopefully have as real of an interaction as one can on-line.

Thanks for your info on this HF20 @eonwarped. I still miss the "eye" icon that at least let us know how many people looked at our posts.

I wouldn't really worry about timing, to be honest. The amount of curation you generate is not likely to matter much. But if you want as much as possible to go to the author, voting at 15 minutes or later is the deal.

Thanks @eonwarped! Good to know. :-)

Count on me for dumb questions! So where do I find this mana number, and what mana number do I want? And how much mana does an upvote cost? A comment? A post? Is it obvious yet that I am not understanding this mana thing?

We’ll save RC for once my brain understands mana 😜.

Here you go! http://steemd.com/@tamala

Doesn't estimate post cost though. As far as Mana goes, your RC Mana bar tells you how many RCs you have. And all actions you can do are charged a certain amount of RCs. (Resource Credits)

@eonwarped, in my opinion this situation is really unfortunate: everyone was excited about the hardfork, but many Steemians failed to access the blockchain, which left many disappointed, and now we can see an unbalanced state in the Steem economy. Let's hope that this situation gets resolved soon.

Wishing you a great day and stay blessed. 🙂

Seems like a reasonable proposal to me! We can definitely improve our processes. Some of that will definitely take time from my understanding, but they make sense to gravitate towards. Thanks for bringing it to my attention.

Is there a fix for Netuoso's witness? It's currently running and he won't be able to change settings or unsign until 2032, which seems... bad.

On the other hand, at least we'll be able to vote him into the top 20 for a guaranteed no vote on HF21.

I did see that behavior actually, I was going to mention this one but figured it was very specific. He did a funny thing in claiming lots of accounts, using a node that had the 'fail-safe' to revert to bandwidth... and there was a cost to those actions that put him in that state. I suspect they need to work out a special case for him, though I'm not sure exactly what :). Whatever it is, probably needs a resync.

(In case you're curious, it's here)

Yeah, it definitely looks like it was his fault, but that doesn't excuse having a witness locked in a cul-de-sac like that, where its only options are to continue using the settings before the RCs were used or power down and miss blocks because it's still signed.

At the very least unsigning witnesses needs to be patched to be RC-free, and probably all witness settings.

Agreed on that. I do not know what they plan to do for it though.

Oh boy:

Now that we know that RC basically works on mainnet and we're pretty sure we won't need to switch back to the old bandwidth algorithm in a hurry, we can remove some of the contingency options which allowed nodes to easily switch between the old and new bandwidth algorithms

And that's complete. The Steemit Inc. devs have already taken the time to kill the only functionality that can stop this madness.

Yes indeed. However, also note https://github.com/steemit/steem/pull/2981 which should address what you mentioned?

If resource use is actually deflationary this might help.

My impression is that it will continue to be inflationary and this is equivalent to a fiat currency solving its problem by printing bills with higher numbers on them.

That's not a fixed budget per block, in case you were thinking that. It's how much is added to the pool per block, so the pool is inflationary, with a per-block % decay on unused resource credits.

I guess you could also be thinking that per block activity will be growing over time, but I think setting the starting budgets correctly should still be able to correct for it. But this is without knowing more details.
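Here is a minimal sketch of how I picture that pool, assuming a simple model where each block removes what was used, decays what is left by a fixed fraction, and then adds the budget. The function name, the order of those steps, and all the numbers are my own assumptions and are not taken from the steemd code.

```python
def simulate_pool(start, budget_per_block, decay_rate, used_per_block, blocks):
    """Toy resource pool: spend, decay the unused remainder, then add the budget."""
    pool = float(start)
    for _ in range(blocks):
        pool -= min(used_per_block, pool)   # usage cannot exceed what is there
        pool -= pool * decay_rate           # per-block % decay on unused credits
        pool += budget_per_block            # per-block addition (the "budget")
    return pool

# Made-up parameters: 100 added per block, 1% decay per block.
idle = simulate_pool(start=0, budget_per_block=100, decay_rate=0.01,
                     used_per_block=0, blocks=5000)
busy = simulate_pool(start=0, budget_per_block=100, decay_rate=0.01,
                     used_per_block=50, blocks=5000)
print(idle, busy)
```

In this toy model an idle pool does not grow forever; it settles near budget_per_block / decay_rate, and steady usage settles it at a lower level. That is why I think the starting budgets matter more than the fact that something is added every block.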

Do they have a bug bounty program for freelance testing experts to find high-severity bugs and get substantially rewarded for it? After all, "your voice is worth something" is our motto!!

I think I heard this idea being tossed around, and it makes sense. Some rumblings of Utopian doing so perhaps? Forgot where I read it.

