I believe that in difficult times, we show what we’re made of. How a person handles a crisis defines their character; how a company handles a crisis earns its credibility; and how a team handles a crisis determines its future.
On Thursday morning, one of our replica database clusters failed over from master to backup, most likely because of an unexpected hardware failure. Twenty-seven minutes later, the backup database cluster also had an issue and failed over again. Shortly after that, data inconsistency alarms were triggered. Upon investigation, we confirmed there was indeed data corruption, affecting 1.7% of users and growing. The data was corrupted in a way that none of our existing tools could fix, and any new data generated on top of the corrupted data would itself be corrupted. To limit the spread, we had to do a full resync of the database clusters as soon as possible. We estimated this would take several hours. It was a hard but necessary decision. I gave the go-ahead. We updated our announcements and pushed them out on social media as quickly as we could, then started the resync.
The resync was complex because of our large architecture, its heavy optimizations for trading performance, and the need to rebuild in-memory data structures. The full resync also involved a number of distinct phases, which made it hard to estimate the total time until the last phase had started. Only after we started the full resync did we discover that the sheer growth of data in the last month would prolong the time needed. Initial calculations put the estimate at 10 hours. One of our team members threw up, literally. But the estimate was still off. As time progressed, the sync slowed down, pushing the estimate to 60+ hours. That was not acceptable.
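To illustrate why an estimate like this can keep moving, here is a minimal sketch that recomputes the remaining time from the most recent throughput. It is not the tooling used during the incident: the function name `eta_hours`, the unit of work, and all the sample numbers are hypothetical, chosen only to show how a slowing sync inflates the estimate.

```python
def eta_hours(total_units, samples, window=2):
    """Estimate remaining hours from throughput over the last `window` samples.

    samples: list of (elapsed_hours, units_done) tuples, oldest first.
    All names and units here are illustrative assumptions.
    """
    if len(samples) < 2:
        raise ValueError("need at least two progress samples")
    recent = samples[-max(window, 2):]
    (t0, d0), (t1, d1) = recent[0], recent[-1]
    if t1 <= t0:
        raise ValueError("samples must be in increasing time order")
    rate = (d1 - d0) / (t1 - t0)  # units per hour over the recent window
    if rate <= 0:
        return float("inf")  # no progress in the window: no finite estimate
    return (total_units - d1) / rate


# Hypothetical progress log: 1000 units of work, slowing down over time.
samples = [(0, 0), (1, 100), (2, 150), (3, 165)]

early = eta_hours(1000, samples[:2])  # uses only the first hour of progress
late = eta_hours(1000, samples)       # uses the most recent hour of progress
print(f"early estimate: {early:.1f} h, later estimate: {late:.1f} h")
```

With these made-up numbers, the early estimate is far shorter than the later one even though more work has been completed, because the recent rate is much lower than the initial rate.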
While the first resync was still progressing, two new methods were coded, tested to the extent possible under the time constraints we had, and started in parallel. Both of them were indeed faster. One of them saved us a number of precious hours in the end. The process still took a long, long time. The short version is that the full team worked continuously for 34 hours, overcoming many difficulties, to bring the system back online as quickly as possible, and continued to work for many hours after trading resumed.
There were a number of lessons learned (the hard way), and post-mortem items were ticketed for future implementation. They are being worked on, tested, and deployed as we speak, including nights and weekends. Some of them are architectural changes that require trading halts for deployment. The halts will be quick, and we appreciate your understanding in the coming days.
Trouble Doesn’t Travel Alone (A Chinese Proverb)
While we were busy restoring our system, our support site hosted by ZenDesk became unreachable, limiting our ability to post announcements at the precise time when we needed it the most. We believed this was due to a DDoS attack on their service. It was tricky to announce this, as the FUDers would certainly exploit the situation to create a lot more FUD. Having multiple problems at once really creates doubt in people’s minds. We always choose to be upfront about it: just be direct.
Moreover, our home page, www.binance.com, also came under heavy DDoS attack at the same time. This is a typical case: when a car has a broken window, all the burglars want a piece of it. As with people, how secure a system is often shows during a crisis. Luckily, security has always been our top priority, and the hackers wasted their time, effort and money to no avail.
We know how stressful it is for our users during a time like this. The best thing we could do while the engineers were working was to keep communications open. Throughout the incident, our teams maintained constant communication, with updates at least every two hours. With ZenDesk down, we relied on Twitter and other social media channels. I have sometimes complained about having so many social platforms, but now I appreciate that they are out there. I only had the bandwidth to manage my own Twitter account. The team, including our Binance Angels (volunteers), managed the rest.
FUDers and Scammers
It was Christmas for FUDers. All kinds of conspiracy theories were dreamed up. Most of these guys were probably shorting the market at the same time. Well, it probably didn’t work out too well this time, as the market didn’t drop during our downtime and went up 20%+ right after our recovery.
There were many scammers. The latest fashion seems to be: use a profile photo of a well-known person, create a similar-looking handle, then reply to all of our tweets saying “if you send me x ETH, we will send you 2x ETH back.” We must have reported 200+ of them. Twitter was quite quick to disable those accounts. However, it must be working somehow, because they keep spending their time doing it.
Financial Times (a smaller website than Coindesk now, by traffic analysis from Alexa) promptly carried an article about us being down. Still, it was true. It also named us as the biggest exchange and brought a bit more of their traffic to us. Thanks!
The real helper was Mr. McAfee, who posted an obviously fake image about us being hacked. Everyone pitched in to help defend us. He united the community for us and rallied such support during a time when we needed it the most. Sometimes, things that look negative are actually positive. Looking at his previous posts, I now think he was completely innocent and was just asking the wrong (or right, depending on how you look at it) questions. I can understand that he hadn’t read my article about FUD and how innocent questions are its best carriers. Regardless, my thanks go to Mr. McAfee. I will buy him a drink if we ever meet.
A few people asked for free BNB as compensation. Those are probably the same people who were still not satisfied after we gave a 70% discount on future trading fees. Well, it shows they don’t trade much anyway. We thought about going to zero fees for a while, but that would not work well for people who referred users to us and earn a rebate on all trades made by their referrals. There is a balance to everything. We made the best decision we could.
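The trade-off between a fee discount and zero fees can be shown with a little arithmetic. This is only a sketch under stated assumptions: the base fee rate, the referrer’s share of the fee, and the trade size below are hypothetical placeholders, not Binance’s actual parameters. Only the 70% discount comes from the text above.

```python
def fee_and_rebate(trade_value, base_fee_rate, discount, rebate_share):
    """Return (fee paid by the trader, rebate earned by the referrer).

    The rebate is modeled as a share of the fee actually charged,
    which is an assumption made for illustration.
    """
    fee = trade_value * base_fee_rate * (1 - discount)
    rebate = fee * rebate_share
    return fee, rebate


# Hypothetical inputs: a 10,000 trade, 0.1% base fee, referrer gets 20% of the fee.
fee_70, rebate_70 = fee_and_rebate(10_000, 0.001, 0.70, 0.20)  # 70% discount
fee_0, rebate_0 = fee_and_rebate(10_000, 0.001, 1.00, 0.20)    # zero fees

print(f"70% discount: fee={fee_70:.2f}, referrer rebate={rebate_70:.2f}")
print(f"zero fees:    fee={fee_0:.2f}, referrer rebate={rebate_0:.2f}")
```

Under this model, a discount shrinks the referrer’s rebate in proportion to the fee, while zero fees remove it entirely, which is the imbalance described above.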
Other items during the week
The stars must have been in some interesting alignment this week. Sitting in front of my computer, I went through two earthquakes in Taiwan, of magnitude 6.0 and 5.7. Our hearts go out to the victims and people affected in Hualien City.
I somehow got on the front cover of Forbes, something I had never dreamed of. There were two articles about me. Unfortunately, the main article about me also said cryptocurrency is a bubble, so I chose not to tweet it (sorry Pam, I am sure you can understand). Luckily, there was a second article, which I did tweet. I was also shocked to see myself at rank 3. Even with the limited data I have, I know for a fact I am not; there are others ahead of me. Don’t ask me who they are. I respect people’s choice of privacy.
My Twitter followers increased threefold: it took 6 months to reach 30k, and the count grew to 90k in the last 2 days.
The timing of this incident definitely wasn’t good. With the Coincheck incident still looming, people were still quite edgy. But we can’t control the things we can’t control. We just have to make the best of the situation we are put in.
Our angels, tech, ops, marketing, helpdesk, and even HR teams did not sleep much. They worked closely together. There was no finger pointing, zero blame. Everyone just wanted to carry a bit more of the burden for each other. I did not ask anyone to work overtime (I never have). Everyone just did what was needed. It was surprisingly easy to organize internally. If the above sounds like fun, send your CV and location (city will do) to firstname.lastname@example.org. We are recruiting the best people for all positions, in an industry and a company that have significant growth potential.
Community is Binance’s strength. We are so fortunate to have such a strong community behind us. I would like to sincerely thank all of our supporters. Binance is still a young platform. We have a lot of room to improve and grow. We have a long list of improvements we need to make, and we will make them quickly in the coming days. Unfortunately, some of them do require trading halts for system upgrades.
We appreciate your understanding and support!