Steemit Retro: August & HF21/22

in steemit •  5 months ago 

Hello Steemians, it’s been a long couple of weeks, which is precisely why it was so important that we hold an engineering retrospective while important events were fresh in our heads.

Retro Recap

For those who aren’t already aware, we perform monthly retrospectives during which we systematically reflect on how we function as a team with the goal of continuously improving our processes. We want Steemians to have as much insight into what we are doing as possible, so today we’d like to share with you a summary of what we discussed in our most recent retrospective which covered the past month. If you would like to see last month’s retrospective, go here.

All retros use the same format in the same sequence, starting with “what went well,” so if you just want to read about what we think we did wrong, you can feel free to skip to that section ;)

What went well?

  • We continued to make good progress on SMTs, remaining ahead of schedule
  • Most of the backend work in Hivemind for Communities was completed
  • Preparation for the front end development for Communities began
  • HF 21 occurred (certainly more about this later)
  • We released video interviews with some of our engineers which were well received
  • Testing for HF21 was much better than HF20 (or any other previous hardfork) in that it unearthed a number of bugs that would have made hardforking even more difficult
  • Despite the difficulties associated with the hardfork, the community seemed less anxious about the temporary interruption of services. We believe this was because the changes were so heavily directed by the community, and because communications were so much more extensive leading up to the hardfork
  • The economic changes already appear to be having a positive impact on Steem
  • The proposal system seems to be inspiring users to come up with new ways to add value to Steem
  • Whether due to the changes included in the hardfork, or the intent behind those changes, it would appear that a non-trivial number of inactive users, including influential users, have become active once again
  • We feel that our relationship with the Witnesses has become more collaborative and improved generally. A consequence of this is that we are better able to work together to come up with solutions, form a consensus, and implement necessary changes. This enabled us all to respond to the delegation bug extremely rapidly by releasing HF22
  • Tests performed on our seed node (or “exchange node”) proved useful
  • MIRA in-memory replays actually work on our account history config (as opposed to a full node) and are surprisingly fast
  • Communications on Twitter and Steemit during the outages were better than they have been in the past

What could have gone better?

  • Communications can always be better, especially during a crisis
  • CI issues for steemd caused longer build times
  • SPS API calls could be easier to work with. It would have been great to have a separate service that could handle the data on release day. Another option might be to handle a lot of this in client libraries
  • Calculations that we thought were safe from overflow actually were not. This led to a chain halt and problems with certain operations on chain.
  • For the purposes of improved debugging, newer code could have been wrapped in FC_CAPTURE_AND_RETHROW
  • The growth of the chain means that reindexes now take a very long time
  • While in-memory MIRA replays were surprisingly fast, migrating state to disk took much longer than expected, effectively neutralizing the unexpected benefit that could accrue from in-memory MIRA replays
  • The challenges that have arisen out of hardforks have placed an abnormal, and unacceptable, burden on engineers. This is not only unfair to the engineers, but also leads to fear and anxiety about future hardforks. While Steem’s facility with respect to system upgrades is a feature we believe should be exploited, we must dedicate more effort to ensuring that this can be done in a way that sufficiently considers the psychological well-being of not just engineers, but community members, stakeholders, users, exchanges and Witnesses.
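To illustrate the debugging point above about wrapping newer code in FC_CAPTURE_AND_RETHROW: the idea is to catch an error, attach the inputs that were in play, and re-raise it so the log shows what the code was working on. The sketch below is only a Python analogy of that idea, not the fc macro itself; `OperationError`, `capture_and_rethrow` and `scale_shares` are hypothetical names made up for the example.

```python
from contextlib import contextmanager


class OperationError(Exception):
    """Hypothetical error type that carries the inputs of the failing call."""


@contextmanager
def capture_and_rethrow(**context):
    """Re-raise any exception from the wrapped block with its inputs attached."""
    try:
        yield
    except Exception as exc:
        details = ", ".join(f"{name}={value!r}" for name, value in context.items())
        # Chain the original exception so its traceback is not lost.
        raise OperationError(f"{exc} [{details}]") from exc


def scale_shares(vesting_shares, rate):
    # Hypothetical operation, used only to show the wrapper in action.
    with capture_and_rethrow(vesting_shares=vesting_shares, rate=rate):
        if rate <= 0:
            raise ValueError("rate must be positive")
        return vesting_shares // rate
```

With a wrapper like this, a failure inside `scale_shares` surfaces together with the values of `vesting_shares` and `rate`, rather than as a bare error message.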

Escalations

  • Tests should be instrumented to exercise integers with higher values that could possibly trigger overflow situations
  • Only saving state files dating back 5 days is insufficient in the lead-up to hardforks
  • We should consider setting up a system to archive historical state files for a very long time
  • @vandeberg and @gerbino need more fast local storage so that they can debug live nodes locally
  • Platform independent state files, which were already part of the SMT spec, would have dramatically reduced downtime
  • MIRA in-memory replays should be further optimized
  • We need to profile reindexes and consider optimizing the business logic
  • MIRA itself could benefit from further optimizations
  • We should explore how we can optimize reindexes or engineer future releases so that reindexes are not needed
  • We need better testnet infrastructure. Tinman should be copying values that are as close to 1:1 to the mainnet as possible. Delegations should also be copied to the testnet
  • We must review SMT vesting calculations via tests and code inspection to ensure there is no overflow
  • We should separate production deployment code from the steemd repo to prevent requiring a rebuild for config/deployment changes
  • We should investigate whether a debug build for a seed node is capable of keeping up with the live chain to a degree that will be usable
  • We should consider on-call rotations for coverage to relieve other team members
  • The blockchain team should take some time off as soon as they can, and consider planning on taking time off immediately prior to hardforks to be sufficiently rested in the event of a worst case scenario
  • We should explore ways to expose more of our engineers to steemd code, including those who do not work on the back end. One way to do this might be regular “brown-bags” led by @vandeberg
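As a rough illustration of the first escalation (exercising integers with higher values), here is a minimal sketch of a boundary-value check. It is not steemd code. Python integers never overflow, so the sketch makes the signed 64-bit limits explicit and treats an out-of-range intermediate product as a failure; the function names, the list of boundary values and the multiply-then-divide shape of the calculation are all assumptions made for the example.

```python
INT64_MAX = 2**63 - 1
INT64_MIN = -(2**63)


def fits_int64(value):
    """True if value is representable as a signed 64-bit integer."""
    return INT64_MIN <= value <= INT64_MAX


def checked_mul_div(amount, numerator, denominator):
    """Compute amount * numerator // denominator, failing if the
    intermediate product would not fit in a signed 64-bit integer."""
    product = amount * numerator  # exact in Python, may exceed 64 bits
    if not fits_int64(product):
        raise OverflowError(f"{amount} * {numerator} overflows int64")
    return product // denominator


# Hypothetical set of inputs: small values plus values near type limits.
BOUNDARY_VALUES = [0, 1, 2**31 - 1, 2**32, 2**62, INT64_MAX]


def find_overflowing_inputs(numerator, denominator):
    """Return the boundary amounts for which the calculation overflows."""
    failures = []
    for amount in BOUNDARY_VALUES:
        try:
            checked_mul_div(amount, numerator, denominator)
        except OverflowError:
            failures.append(amount)
    return failures
```

For example, `find_overflowing_inputs(1_000_000, 1_000_000)` flags the two largest amounts even though the final result would fit in 64 bits. That is the kind of case a test using only small values would miss.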

This was by far our longest, and most extensive retro yet, and for good reason. Few months have included such exciting developments, and such difficult circumstances. We remain extremely excited about how Communities and SMTs are progressing, and believe that the preparations for HF21 were better than ever. That is part of what makes the downtime as a result of HF21 so disappointing. That being said, we do feel that we’ve come out of this experience with priceless information that can help ensure that the SMT hardfork proceeds more smoothly.

Stay Tuned


This post is only intended to summarize the results of our recent retrospective. We will continue to think very deeply about HF21/HF22: what went wrong, and what we can do better next time. We look forward to communicating more about this soon, so be sure to follow @steemitblog for more information.

Thank you for keeping calm and Steeming On.

The Steemit Team


This is really well done.

I'd like to see a communications plan revamped in escalations. 24hrs between tweets when down seems less than ideal.

Also, a lot of the ecosystem is dead even if nodes are up if they aren't Steemit nodes. This place is too centralized on you guys and that needs to change too. It's great that you're reliable for a year and people trust the service, but it's bad that things aren't working if you're not up and running.

Thanks for your hard work on this. This was rough, but still could have been much worse, and the down time is the price we pay for being able to upgrade the chain.

I'm not a widely active Twitter user myself, but the downtime did make me realise how large of a following Steem/Steemit has on there. As well as more activity during downtime, it would be great to see more communication in general.

Also, completely agree on the centralisation issue. Everything shouldn't crumble because Steemit Inc's stuff goes down. The more true decentralisation, the better.

Decentralization is good. What pushes Steem/blockchain into having such an incredible technological leap forward is our ability to create platforms that can/will improve how humans communicate with each other. It's still very early days, though, and experimentation with these new systems is compulsory.

Can we have a noobs guide to setting up a mira node?

I'd love to see that as well!

me too :-D

  • Most of the backend work in Hivemind for Communities was completed

that's pretty cool

  • Preparation for the front end development for Communities began

that's huge!

i think so too :)

Thanks for continuing to put yourself out there and sharing the retrospective.

As I care about the future of Steem, I am pleased with what I read here. I'm being really judgy when I can't talk (with my poor English graces), but it could use a little wordsmithing. Things like the word "should" need to be more detailed: why is it just a "should"? Please assume I'm not very technical. I assume the "shoulds" mean it's desirable and a lower priority but you will do it, or do you mean you'll just consider it more? If so, when? I know you're a pro team and have a concept of these things, but if you don't share .... I feel a bit bad, as I can see such improvements coming through, but you need a bit more push from us users and lovers of Steem. Pls continue to ignore the crappy complaints and keep taking the reasonable ones on board, and you have my vote.
I really appreciate the commitment to continual improvement. We as a community also need to improve and help you... Pls help us to help you.
If you give me a UAT test, or even if you want me to create one, I'm happy to if it helps; just ask. Even better, post one with your announcement of the heightened chance of problems during the HF change window.

Off the top of my head, without enough techo background, my only other feedback/thoughts are:

If you have a HF, the backout plan should include another fast HF
...your change window should include UAT testers throughout the community and its ok to suggest something like 'during the first week of a HF, can the community pls report problems as we have heightened risk of outage and/or speedy HF fix'...something like this. It's our blockchain as well, let us be part of its future success and give you immediate and helpful feedback

You then simply need one person collating all the UAT feedback and engaging with the community during the one-week change window of heightened risk of outage/problems. (I think 1 week of people being more attentive, reporting issues and expecting an outage is reasonable for a global blockchain in the rare case of a HF. Let the entire ecosystem be your UAT testers, as we are all beneficiaries.) It's also a great way to keep the two-way contact up in a way where we feel more useful in our support.

Cheers and keep Steeming on!

Any info on deposits being changed?
I went to transfer some steem from my binance account and it's been rejected. Is there a new method that I am unaware of after the update?


Excellent retrospective, love the transparency specifically under escalations; It is essential that they do get prioritized to avoid similar issues in upcoming HFs.

You forgot something. Communication and support for exchange nodes. I assume it is non-existent as it has been in previous forks and chain interruptions leaving exchanges without the ability to transfer STEEM in or out of the exchanges.

We are in constant communication with all active exchanges whenever required updates are necessary and are here to support them with anything that they may need. In general, most of them have been very quick to respond. We also took time earlier this year to update our exchange node setup guide and associated deployment scripts to prevent common issues. Further, we provide real time support for exchanges that are even in opposite timezones from us.

#newsteem on

That is the first time I've heard of this. Thank you very much for informing us on this. I've seen in the past exchanges taking 2 weeks up to 6 months to get their node back in operation, which prevents their STEEM wallets from sending and receiving STEEM. One exchange has never recovered from February of 2018. Witnesses can get their nodes back in operation in a matter of hours. Exchanges should be able to as well, provided that coordination is maintained. In fact, it should be easier for exchanges to keep their nodes running, as they don't need to keep track of social media info (only STEEM transfers). They should have a stripped-down version of the STEEM node software that is very easy to replay. Clock is ticking. How many days has it been since HF21/22 and exchanges don't have nodes operational? I appreciate all you do and your thoughtful response. I just want to know if we are going to be waiting additional hours, days, weeks, months, or years for exchanges to get back online. I was reading a post a couple of days ago by someone who had just bought a bunch of STEEM. They were very excited to get in at this price and power up, but they then discovered that they cannot move it from the exchange, so they are in a waiting game. Eventually that person is going to get pissed off. You know the STEEM community can take advantage of a hard fork. People read about the news and want to join, but then realize that they can't transfer their newly purchased STEEM to their account and get discouraged. It is a real shame that this has happened regularly when there has been a disruption of the STEEM blockchain, and especially during a pre-planned hard fork.

@justinw do you know if Binance or Bittrex have planned or disclosed ETAs for when they will make the upgrades necessary on their end? Keep us posted if you are provided any details. Thank you!

so are we any better off in the long-run ??

👍
~Smartsteem Curation Team

Personally, I will never delegate to voting bots, or knowingly accept support from bots. I am just a pure content deliverer, who is (admittedly) now posting a lot less energy-intensive stuff because my rewards have been further slashed.

With the price continuing to lag after a short-lived bounce, what can you point to as a true positive, stat-wise?

TIA.

Also, can we get a really detailed explanation of what HF21 did with regard to serial downvoting? I've got one (@bloom) that downvotes EVERY SINGLE POST I make, and I'd like to know if there will ever be any relief.

TIA.

@steemitblog,
Whatever happened on that crisis day, the team did a great job. And it improved our trust in the chain as well!
$trdo

Cheers~

Thanks! We appreciate the support!

Congratulations @theguruasia, you have successfully trended the post shared by @steemitblog!
@steemitblog got 6 TRDO & @theguruasia got 4 TRDO!

"Call TRDO, Your Comment Worth Something!"

To view or trade TRDO go to steem-engine.com
Join TRDO Discord Channel or Join TRDO Web Site

All things considered, Steemit team did a great job on the hard fork.

We should consider on-call rotations for coverage to alleviate other team members

We highly recommend this. While blockchain never sleeps, humans unfortunately need to.

The blockchain team should take some time off as soon as they can..

Happy holidays, any chance we'll be seeing them lounging in Thailand come Steemfest? They really deserved it.

Great report, and on-call rotations are a very good thing to look into. Thanks, keep up the good work. Steem on!

Thanks for the great work!
There must have been a few smoking heads; I hope you make holidays the priority it has to be!

Any idea when the Binance exchange will allow Steem transfers?


I came here to ask the same thing. Binance, Bittrex and OpenLedger are all confirmed locked. I think this is holding the price down, as buyers who are unable to withdraw become impatient and dump. I wrote a bit about it on my blog. I will post ETAs if I am able to determine any; please let me know the same. Thanks.

https://steemit.com/steemit/@minerthreat/the-answer-to-the-steem-price-mystery-could-be-exchanges-with-wallets-in-maintenance-mode

That's quite a lot of tasks on Steemit Inc's plate. Glad to see what the team is working on to make things better.

I don't really recall a lot of Steemians on Twitter during the downtime. It seemed like very few people were talking about it on Twitter in the first 24 hours. It felt almost like a regular website, not a blockchain.

Having our funds inaccessible even after the blockchain was up is unacceptable. Three and a half years in, Steem is still dependent on a single entity. That's just proof Steem is still not ready to onboard the masses, and that's not the fault of the engineers. It's really the fault of the founder and CEO @ned

How long will it take approximately to fix the issues you uncovered in your retrospective?
Have you made a prioritization of these issues yet?
Will it take resources off the development of SMTs and Communities to fix the issues?

Please take 5 minutes every few hours to communicate in case of a crisis!!! The downtime was not a problem, but the communication was.

The search feature does not appear to be working. Anyone else having this problem?

Should be fixed shortly, thanks @jondoe

Good deal, thanks Justin. Do you guys have longer term plans to make steemit.com more sustainable going forward? Selling large chunks of steem every week to pay salaries is not going to last very long, though I am sure you guys are aware of that.

The economic changes already appear to be having a positive impact on Steem

can this be quantified?

I'm pretty new to Hivemind -- or participating in running a node for that matter.

I was looking at the git for it today, and it suggests the hardware requirements are:

Hardware:
* Focus on Postgres performance
* 2.5GB of memory for hive sync process
* 250GB storage for database

I see it says later 'good settings for a system w/ 16G memory'.

Is the storage requirement still ~250GB, and is 16G memory still adequate?


Side-question / clarification:

Setting up Hivemind (and sharing / making public) is considered adding/improving decentralization for applications (if the node is used)? Or does that require a witness/consensus-node?.... Or a combination of both?....

Or is that kinda sorta the same difference as Bitcoin's "nodes" and "miners" relationship?

I just came to read snarky comments.

But instead of doing that, you wrote one? ;)

nah..... just put a place holder to check back later for one.

Lol

I am available to go into the test-net next time and try those extreme variables.

Great Recap @steemitblog, it seems you are eager to improve workflows on all major fields, that's awesome.
Keep up the good work and communication!

Posted using Partiko Android

Yup, we’re always trying to improve. We never claimed to be perfect, just that we’re doing things no one has ever done before and that no one else is doing. That inevitably comes with unique challenges. We appreciate everyone who sticks with us as we become better at improving the world’s most advanced Web 3.0 protocol.

How to post better topics .....

amazing job for sure! Keep up with the good work!

Keep up good work!
