Do forks with fallback - my HF21 wish

in #steem, 6 years ago (edited)

Firstly, I have to thank all the coders and witnesses who have been instrumental to HF20, previous hard-forks and non-consensus codebases. At times of stress we need to maintain an appreciation of the work everyone does and continues to do, even as confusion and disappointment are rampant.

We need a fallback if hardforks don't go to plan

^^^ That's the TL;DR of it ^^^

Here's what should have happened this time around.

The happy path

hf20-1.png

HF20 is adopted, everything goes well and we move onward, towards HF21 eventually (SMTs are still on the way...)

The core problem here is that it assumes that everything is going to work out. Developers call this the "happy path", the way that things go when all assumptions are met, code works and is bug free, and maybe most importantly - users behave in the predictable way you, well, predict.

What actually happened

hf20-2.png

It turned out that the Resource Credits system plunged everyone (or nearly everyone) into massive negative balances, as well as some other issues. The network was unusable. Patches were made on the live network, which means that code was written to fix things and then deployed to production servers. That's jargon for servers which are actively serving the public right now.

To all involved, I know that if you spot the problem and can think of a fix, it's very tempting to continue with what you have and just patch it. Sometimes it's the most pragmatic thing to do on balance. There is a rather large cost in every way to going back: infrastructure resources, time to users, PR for not having gone forward (can be a cardinal sin in some circles), etc. I've done it myself!

However here's what should have happened, in my humble, not completely informed, opinion:

A pre-prepared fallback patch is adopted

hf20-3.png

The problem with "going back" to the last code revision before HF20 is adopted (this might be best put as reverting the code) is that while the buggy HF20 code was in operation, blocks have been written to the chain. So if you go back those blocks will not make sense to the old code, causing another chain halt. Your options are to:

  1. Go back, not only to the pre-HF20 code but also to the last block created by pre-HF20 code, or
  2. Write a patch which will make as many of the HF20 blocks as possible "legal" (or at least not invalid) but with the same functionality as pre-HF20 code

Option 1 is almost certainly out as most people will dislike the idea of rewriting history - as they should! Important transactions could have been made there which will not now be made, or made based on information which is now known by all parties but which was not before, etc. etc.

Option 2 is the one presented in the diagram. And in order for this not to be a crazy scramble it needs to have been prepared in advance. That said it could be done now, but just with almost the same uncertainty as patching HF20 in general.
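
To make Option 2 a little more concrete, here is a minimal sketch in Python of what such a fallback rule could look like. It is not how steemd works: the `Block` type, the `hardfork` field, `HF20_START` and both validator functions are hypothetical names invented for illustration. The idea is only that blocks already written under HF20 stay valid, while everything produced after the fallback point must follow the old rules again.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    height: int
    hardfork: int  # hypothetical: which rule set the producer was running
    payload: str


HF20_START = 100  # hypothetical height at which HF20 went live


def validate_pre_hf20(block: Block) -> bool:
    """Plain reverted code: it only understands hardfork-19 blocks."""
    return block.hardfork == 19


def validate_with_fallback(block: Block, fallback_height: int) -> bool:
    """Fallback patch: tolerate the HF20 blocks already on chain,
    but require old-rule blocks from fallback_height onward."""
    if block.hardfork == 19:
        return True
    in_hf20_window = HF20_START <= block.height < fallback_height
    return block.hardfork == 20 and in_hf20_window


chain = [
    Block(98, 19, "a"),
    Block(99, 19, "b"),
    Block(100, 20, "c"),  # written while buggy HF20 was live
    Block(101, 20, "d"),
    Block(102, 19, "e"),  # first block after the fallback
]

# Simply reverting the code chokes on blocks 100 and 101.
reverted_ok = all(validate_pre_hf20(b) for b in chain)
# The prepared fallback accepts the whole history without rewriting it.
fallback_ok = all(validate_with_fallback(b, fallback_height=102) for b in chain)
```

The hard part in reality is everything this sketch skips: state changes made by HF20 operations would also need a defined meaning under the old rules, which is why the patch needs to be prepared and tested in advance.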

But where do we go from there? We need to keep staging things this way:

hf20-4.png

This is the contented path, with additional fallbacks. If the patches then work out, we can have HF20 adopted with minimal outages to many (or most!) users, and even rethink HF20 if necessary. Perhaps what we learned in the first round of errors brings up new information we didn't know before. We need time to react to that and be open to changing path if required to.

Then we can go forward to HF21 without thinking as we did at the start (see first image) that there is no uncertainty, and without the huge uncertainty we now face, but with a reduced and acknowledged level of uncertainty.

Backing savvy witnesses

As others have stated here, here and here, we need to support witnesses who get this kind of thinking, who default to "go back and fix" instead of "keep patching and fix" on the live network. We need to support witnesses who perhaps even sponsor the testing of hardforks beyond the level we're doing now.

But no amount of testing can completely prepare you. We could use a fallback process such as the one I describe here to be the plan B I really wish we had right now.

Final thought. I'm not tied to this particular solution as such, it was just an idea I had and point I wanted to make noise about. Whatever averts massive outages such as this, and keeps low SP users in the game without days of "balancing" outage, we need that.

You overlooked that we had to do exactly a "fallback" when we were still on 19.2 and some 20.2 boxes inadvertently started running code prematurely just a week before causing an unintended early fork.

Which in turn caused all of us who had successfully prepared for this fork to have to roll back our already updated, and properly prepared boxes to 19.2 again. (Those of us with some wisdom and experience had our backups in reserve on 19.2, so we only needed to swap and reset to the last correct headblock checkpoint, but it was still a clusterfuck of bugs and errors that led to a two day outage.)

Your post isn't wrong, but the problem remains that the very creators of the code that has failed us twice now, cannot even explain what to expect it to do, have not documented anything in any clear and accessible way, and often do not even comment the places they change in the code, that in turn they cannot explain or predict to others. But sure, tell the "witnesses" to "read the code" - easy for arm chair observers to say. Impossible in practice in this context.

Then we don't test. Clones of our chain and front ends are popping up all over, like weku, smoke io and others. But for some reason, even though those small-time outfits can clone our entire infrastructure, we can't seem to have a test net that doesn't apparently need "frequent restarts and wasn't running mixed version nodes" according to one insider developer on that team I've spoken to, but won't indict here since he is new to that team and I doubt it's his fault.

So yes, roll back is a thing. But WAY before that, lack of basic coding standards, peer review BEFORE dev commits, documentation, testing, experience and competence are bigger things.

So yes, roll back is a thing. But WAY before that, lack of basic coding standards, peer review BEFORE dev commits, documentation, testing, experience and competence are bigger things.

Yep. Rollback should 'not be seen as an option' in testing and prep but, should be available if absolutely necessary, so it doesn't become a crutch.

There is no way to prevent bugs in production systems. If there was, cars and planes would never crash. That said, we definitely don't have safeguards in place here to mitigate them to the levels we should have them in place.

Fundamental errors were made here. Fundamental checks were not done. Fundamental understanding of the changes being made en masse here was not achieved. Fundamental documentation was not provided PRIOR to the testing and release phases. Fundamental testing environments did NOT exist in a proper fashion for anyone to use, least of all the average witness.

Someone will definitely try to refute this in a follow-up comment here, I'd bet on it, and they will be taking advantage of double speak to try and throw shade and fool the public to discredit those who stand up with this claim because we don't have to worry about losing our income by falling out of the top 20 or our stinc employment, trust, fam. Cui Bono - who benefits?

Roll backs are an emergency exit. Before you leap through them, 100 other things should have been done before the plane left the ground.

"The witnesses are at fault, and should have read the code" is all at once, a truth, and a gross red herring at the same time. For reasons the average non-technical reader would even understand if we found the right simple metaphors to explain them.

Maybe this metaphor will work. The plane crashes. The engineers made a change to it that they did not create proper tests for, didn't document well, and did not publish about in advance. They had no one check their work before bolting it in the plane and when queried, said, well we can't really articulate it in english, but fly the plane awhile and we'll see how it works out.

After the crash they say, well, it's up to the fuel guys, the ground crew, the flight attendants and the pilot and co-pilot to make sure our random undocumented changes worked, right?

And the public doesn't know they are wrong, so they get away with such remarks.

And that's why there's a lot of FUD and pissed off people pointing fingers right now.

But it all comes back down to the fail of the engineers who set up the failure. And those who allowed it to become this way.

I broadly agree with you but I have to say no to this:

Roll backs are an emergency exit. Before you leap through them, 100 other things should have been done before the plane left the ground.

No way. It's not the first thing but patch after patch to a live system - no. We're talking vaguely here, your 100 things might be 5 of mine, but as it sounds no. You've got to know when to say, it's not actually ready, let users come first and let's take our time to get it right.

Ask yourself again, what's the rush for HF20? The sun will rise again.

This reply confused me; it's like you are stating you want to disagree but then you sort of go ahead and agree? No snark, you lost me here.

Haha! Okay I see that. What I'm saying is yes, for emergencies but it's theoretically always available, even now, but it is unthinkable to many witnesses. So no, not the 100th thing you try, the 5th thing.

The difference of attitude I'm talking about is perhaps more important. Repeat after me: We can go back. No one believes that.

Thanks. Like I'm saying elsewhere it's both, it's not either / or. What happened last week isn't what I'm talking about. In the diagrams you see a for real plan B as fallback. We haven't ever had that.

Yes to testing, one hundred times yes. I had an idea before about changes as a matter of course in a defined time period, say 1 month. After that the test coins (TESTS I believe is the convention) are worth something on the main chain, at least something. The idea was not taken seriously but this may be the time to advance it again, or at least start looking outside the box at such solutions.

Hear! hear!

Rollback will not be possible if there are changes to the blockchain (database schema); it will require a complete replay.

Not entirely. We did it just last week, and only had to go back to the last good headblock.

  1. I was not able to reply because of lack of "MANA"

Not entirely. We did it just last week, and only had to go back to the last good headblock.

I was not around, but from what I understand it was from a minor version to another minor version. It was a soft fork, not a hardfork; hardforks often include changes to the "consensus" logic and also to the format/schema in which blockchain snapshots are stored. Part of the blockchain state from many plugins is now stored in RocksDB. So a restart is possible without replay in cases where the schema and the consensus are not changed.
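
As a rough illustration of that rule of thumb (a sketch only, with made-up field names, not anything taken from the steemd codebase), the decision could be written like this in Python:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeVersion:
    consensus_version: int  # hypothetical: changes only with a hardfork
    schema_version: int     # hypothetical: changes when the stored state format changes


def needs_replay(old: NodeVersion, new: NodeVersion) -> bool:
    """A plain restart is enough only if neither the consensus rules
    nor the on-disk schema differ between the two versions."""
    return (
        old.consensus_version != new.consensus_version
        or old.schema_version != new.schema_version
    )


minor_to_minor = needs_replay(NodeVersion(19, 3), NodeVersion(19, 3))
hardfork_upgrade = needs_replay(NodeVersion(19, 3), NodeVersion(20, 4))
```

Here `minor_to_minor` is `False` (restart without replay) and `hardfork_upgrade` is `True` (full replay needed).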

I believe eventually STEEM is heading to a model where only the consensus related data will be on the immutable blockchain and rest will be in various databases.

Also, I need to elaborate on the "roll back" - generally roll back means going back to the earlier state. So what I meant to say is that a complete re-index will be needed if there are changes to the consensus state. Roll back will be against the "immutable nature" of the blockchains. TheDAO attack on the Ethereum chain is probably the best example where the immutability was not touched and forks were brought in to fix the issues (with the smart contract): https://ethereum.github.io/blog/2016/06/17/critical-update-re-dao-vulnerability/

go back to the last good headblock.

I am not sure how this was done - were blocks after the last-good-head-block ignored?

You aren't wrong, at all about any of your assessment, past or present.

Except, there was a fork from mixed node versions, leading to a split chain, aka an actual unintended fork. We DID roll back to a checkpoint block and restarted the chain, quickly, and lost transactions (reversed, as if they never happened), so if we do it fast enough (too late now, because there is way too much to undo), it is not entirely impossible.
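
For readers who want to picture what "roll back to a checkpoint block" costs, here is a toy Python sketch. The block layout (dicts with `height` and `txs`) and the function name are invented for illustration and do not reflect the real node software; the point is only that everything after the checkpoint is discarded, and the longer you wait, the more there is to undo.

```python
def roll_back_to_checkpoint(chain, checkpoint_height):
    """Keep blocks up to and including the checkpoint and report
    which transactions are reversed by dropping the rest."""
    kept = [b for b in chain if b["height"] <= checkpoint_height]
    dropped = [b for b in chain if b["height"] > checkpoint_height]
    reversed_txs = [tx for b in dropped for tx in b["txs"]]
    return kept, reversed_txs


chain = [
    {"height": 1, "txs": ["t1"]},
    {"height": 2, "txs": ["t2", "t3"]},  # last good head block
    {"height": 3, "txs": ["t4"]},        # produced on the split chain
    {"height": 4, "txs": []},
]

kept, lost = roll_back_to_checkpoint(chain, checkpoint_height=2)
kept_heights = [b["height"] for b in kept]
```

In this example `kept_heights` is `[1, 2]` and `lost` is `["t4"]`: one transaction disappears as if it never happened.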

oh, I was not aware of this - interesting scenario. Thank you for explaining.... This sounds like a classic Byzantine generals scenario.

Careful, we might sound smart and able to code and NEVER make the top 20...

Oh, everything I said above was bluffing ... There are infinite parallel blockchains and infinite number of top 20s ... as people from this chain and has done more hard freezes, i mean forks than every other blockchain in the known universe of blockchains put together, the immutable genesis blocks of all the chains will bless us with infinite amounts of mana ... the super intellectual state machines using probabilistic methods to maintain inter galactic consensus will help us with intelligence to even understand the meaning of 42 .... believe in Satoshi .. don't fear .. Amen! Aham Brahamasmi.

When I am not bluffing I speak like this. Will this help ?

PS: 42 is the "Answer to the Ultimate Question of Life, the Universe, and Everything" in The Hitchhiker's Guide to the Galaxy books.

I understand that the witnesses are auditing the code themselves but perhaps there should be some professional auditors that independently do that part of the job as well as professional testers on the testnet. In my limited experience testing for Nokia, there was a reason they outsourced it to us and it wasn't price.

There is a certain confidence blindness in coders as well as when it is a tight knit group, a certain amount of social consensus even though there might be misgivings from individuals. This also could highlight a need to spread the top 20 to the top 20 core and the next 20 with all responsible for audit.

It would be more expensive but less so than continually being forced to rollback because of oversights.

I would hope our witnesses would be professionals; after all, it is a paying job, so by that definition they are professionals or should conduct themselves as such.

There is a difference between a professional coder and a code auditor though, and often a difference in the way they look at the code. Having it independent also means that they aren't coloured by social dynamics or any particular outcomes. It is their job to find errors, not make sure it works, and that often takes a different set of eyes.

Interesting. Do you know if other blockchains use this "service"?

No idea but auditing code is common practice in most tech industries (it is a boring job) as it is like looking for spelling mistakes in a text. The testing I did was localisation for languages and it doubled as test service that ran specific test cases to look for errors.

Here, it might not even have to be as formal but I wonder how many witnesses fully audited it considering it had so many massive issues and then, in a couple weeks, how much of the testnet could be thoroughly tested. From my limited understanding, there is a fair bit of complexity and a lot of things that can easily be overlooked so, having fresh 'less' biased eyes limits risk a bit further.

Yeah, I hear you. I've heard many witnesses suggest they don't even try to audit the code. It is frightening. :)

Quantstamp audits blockchain code for tokens. Presumably they have some clients. They might not be able to do it for STEEM since we aren't an ERC20, but where there is one, there may be others who are more broadly focused.

Interesting and interesting further discussion. I like this idea. The problem is in payments I think. Witnesses are incentivized to witness, but who incentivizes the testers?

Maybe Utopian could step forward to lead that, in collaboration with another group, even Stinc

It was a clusterfork and while bugs happen, this also happened because the top 20 is slightly too stale, too settled.

There are many issues in this HF and several which should already have been covered before testnet even.

It is indeed easy to bash but the process pre-testnet has been lacking. Testnet should be considered an actual release candidate and also should have specific guidance about what to test/validate.

Meanwhile biggest exchange wallets have been down for “scheduled maintenance” since last freeze.

Let’s not beat around the bush here - and this comes from someone who generally thinks Steemit Inc has the right vision: in any other company heads would roll over this release. Even more so since Velocity was in the works for more than a year. It would be either bye Ned or bye Vanderberg.

Simple as.

Lessons need to be learned from this.

By everyone.

By Steemit Inc, by the governance, by the wider witness community, and by us voters.

Agreed, the stone throwing isn't helping and I see the witnesses as our fail-safe. But they weren't.

What about hostile code? Do we have any faith that anyone is checking?

I hope it doesn't look like I'm throwing stones. I'm trying to offer solutions to head off something like this next time.

No, I didn't think you were throwing stones.

Testnet should be considered an actual release candidate and also should have specific guidances about what to test/validate.

This is an important main step. I think @sircork is right in priorities, it comes before my idea here, but the idea here is relatively cheap so I see no reason to not integrate true safety.

You brought up exchanges. Can you fathom how many folks are hurting right now because of that? While I've said before it's unwise to rely on Steem for your bread and butter, it's fair to expect some degree of consistency and witnesses should do their best to maintain that. The exchanges are just reading the writing on the wall.

Here's another witness post from last week saying not to fork now, this one by drakos.

When I read it then, I had completely drunk the Kool-Aid and was wondering why he was such a worry-wart. Bwahahaha!

@personz... or should I say The Devil! 😈 jk jk

Other than that, you nailed it. Needs to be a coherent rollback plan. I hit on a test plan idea a little on my post but apparently the testnet operates differently so that is kind of wonky.

Posted using Partiko Android

Can you go into some details on the wonkiness? Very curious.

Yes, I learned the below from @inertia's post.

A testnet doesn't have the same number of tokens as mainnet, so we have to adjust the actual tokens for alice and bob, yet maintain proportionality.

In the initial version of Tinman, this was accomplished by creating accounts with an account creation fee above the recommended fee. The fee was then automatically applied to the account as STEEM Power, on the testnet. It's a "fee shortcut" that allows us to avoid extra steps.
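
To illustrate just the "maintain proportionality" part (this is a hedged sketch, not Tinman's actual code; the function name and the numbers are made up), scaling mainnet balances down to a smaller testnet supply could look like this in Python:

```python
from fractions import Fraction


def scale_balances(mainnet_balances, testnet_supply):
    """Scale each account's balance so everyone keeps the same share
    of the (smaller) testnet supply. Fractions avoid float rounding;
    any remainder is truncated."""
    total = sum(mainnet_balances.values())
    ratio = Fraction(testnet_supply, total)
    return {name: int(balance * ratio) for name, balance in mainnet_balances.items()}


mainnet = {"alice": 750_000, "bob": 250_000}
testnet = scale_balances(mainnet, testnet_supply=1_000)
```

With these example numbers alice still holds three quarters of the supply and bob one quarter, just of 1,000 tokens instead of 1,000,000.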

Yes, what we need is actually git revert

I agree. Good to have a backup.

As far as I know, there was a fallback solution.

Impact on User Experience
By measuring more of the critical resource types the blockchain will more accurately price operations in RCs, but that also means that as of right now, resources are not being accurately priced. So after the RC system goes live, the user experience will have to change and the new system will need time to reach a new equilibrium. Due to this uncertainty, we added a “fail safe” to the code that will enable witnesses to revert from the RC system back to the old bandwidth system if absolutely necessary.
source

It seems like it wasn't absolutely necessary.
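
The quoted announcement does not say how witnesses would trigger that fail-safe, so the following Python sketch is purely an assumption about one way such a switch could work: a flag that flips back to the old system once enough witnesses vote for it. The function name and the threshold of 17 out of 21 are illustrative choices, not taken from the source.

```python
def active_system(witness_votes, threshold=17):
    """witness_votes maps witness name -> True if that witness votes
    to revert. Returns which rate-limiting system should be active.
    The threshold is an assumed supermajority, not a documented value."""
    reverts = sum(1 for wants_revert in witness_votes.values() if wants_revert)
    return "bandwidth" if reverts >= threshold else "resource_credits"


# 21 hypothetical witnesses, 17 of them voting to revert.
enough = {f"witness{i}": i < 17 for i in range(21)}
# Same set, but only 16 voting to revert.
not_enough = {f"witness{i}": i < 16 for i in range(21)}

after_enough = active_system(enough)
after_not_enough = active_system(not_enough)
```

Under these assumptions `after_enough` is `"bandwidth"` and `after_not_enough` is `"resource_credits"`.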

I totally agree that we should support the people who know what they are doing.

And the sad part is, I'm not sure if we're still in the equilibrium(?) phase, but one of HF20's goals was to enable more people to sign up through dApps or high-level stakeholders. However, since it became freemium, there is no other option than powering up the newly created account. This is my alt account with 3 SP:

Screen Shot 2018-09-27 at 16.03.55.png

It was always freemium. You've always had a cost to operate. That is literally the root of a dPOS system. The revised math here simply attempts to enforce it more accurately.

In the old days, you just had somebody giving you a handout to get you started, and look at the noise and spam that created. Your remark betrays that you do not fully understand what proof of stake means. Not a snarky jab, just a simple fact. It's well documented, beginning with the white paper and 1000s of posts since then. Please investigate, you'll find it quite enlightening.

Thanks for the information, sir. Yes, I'm a bit ignorant about crypto and its terms like dPOS. As a regular social network user, I feel this freemium thing after HF20, now that my interaction capabilities are down to a fixed number. For example, we lose that great argument, "How much did you earn from Facebook so far?", or it becomes "3-second transactions (but only once per day if you are new)". Facebook wouldn't mind me spamming. I agree it may help to solve spam, but it shouldn't turn this place into one where only the rich are able to talk. I think we are still in the equilibration phase; otherwise a newly created account doesn't have any option but powering up, which kind of hurts the aim of HF20. AFAIK it is supposed to increase sign-ups via dApps using RCs. However, I don't know who would want to put money in on day one.

Dang, I shouldn't press enter so fast. Now I'm left with 13 comments with 1200 SP. I'm a little bit tired, so I forgot to check what I wrote in the first place and kind of repeated myself. Sorry about that. In the meantime I was looking for games to play because I can't afford to be active enough on the platform.

Yes, the equilibrium phase will help you, in a few days' time, but also, a person with less money is not entitled to have equal signage space in Times Square with a person who can afford to put a sign on every building, and that's probably okay, because that same town square DOES allow the poor person to at least stand there with a sign of his own making, and if his message is solid, someone will help him be seen.... If it's just spam, he will not be annoying everyone with giant blinking signs. And that's okay.

I don't think 'equal signage space' is the goal. If new users can't post, comment, and vote, they'll leave.

Whales can post dictionaries. New users don't need that, but they need to be able to engage effectively for Steem to survive.

Bandwidth - the ability to speak - shouldn't be a barrier to nominal engagement. If the result of HF20 is such a barrier, we're about to see the definition of 'death spiral'.

I don't think you understood my comment accurately. But no matter, neither of us can afford this comment thread till the patch and it will all be different then anyway. :D

Can I see my current resource token somewhere?

steemd.com, top left

[edit], well you could yesterday, looks like it's being updated and that feature is now gone again
