One of the hottest topics surrounding the outages this quarter is not how the problem happened, but how the company handled it. And when companies showed true transparency in the process, as GitLab and Instapaper did, they were heaped with praise instead of criticism. These cases prove that a disaster can be turned around into an opportunity to strengthen — rather than harm — one’s reputation.
That said, avoiding a disaster altogether is always better than handling one well. To learn how companies are using the cloud to implement affordable, enterprise-grade disaster recovery, check out this free white paper. And be sure to note the practical lessons from the seven incidents of system downtime below.
1. United Airlines grounds all flights due to computer outage
When: January 22
Duration: 2 hours
What Happened: A computer malfunction led to the grounding of all domestic flights for the third-largest U.S. airline. Airports across the country were forced to tweet out messages about the technical issues after receiving overwhelming numbers of complaints. While they did not share how many flights were affected, United did release a statement that they were working as quickly as possible to resolve the issue. Interestingly, international flights continued to operate on schedule during the outage.
They are saying international flights aren’t grounded. So put me on a flight to the Bahamas and I’ll parachute out over tampa. Miss my kids!
— larryboatright (@larryboatright) January 23, 2017
2. A startup with $25 million in funding is in crisis mode because an employee deleted the wrong files
When: January 31
Duration: 18 hours
What Happened: A GitLab employee's error led to one of the most talked-about outages this quarter. In an effort to fix a slowdown on the site, a system administrator accidentally ran a command that deleted the primary database. Because GitLab had to restore from a six-hour-old backup, any data created in that six-hour window may have been permanently lost. While some companies might lose their users’ trust and damage their reputation over an error like this, GitLab took the crisis as an opportunity to demonstrate its transparency. The strategy worked: thousands of users took to Twitter to praise GitLab’s transparency in handling the disaster, both in real time and in the post-mortem. Many things were learned from this crisis, but these five lessons on disaster recovery are particularly valuable.
@gitlabstatus Extremely impressed with the level of transparency- good luck getting it cleaned up.
— Adam Caudill (@adamcaudill) February 1, 2017
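The six-hour gap in the GitLab incident is an example of what disaster recovery planners call the recovery point: the age of the newest usable backup sets how much recent data a restore will lose. The following is a minimal sketch of that arithmetic. The timestamps, the one-hour target, and the function names are illustrative assumptions, not details from GitLab's post-mortem.

```python
from datetime import datetime, timedelta


def data_loss_window(last_backup: datetime, incident: datetime) -> timedelta:
    """Return how much recent data a restore from last_backup would lose."""
    if incident < last_backup:
        raise ValueError("incident time precedes the backup time")
    return incident - last_backup


def exceeds_target(window: timedelta, max_acceptable: timedelta) -> bool:
    """True when the loss window is larger than the team is willing to accept."""
    return window > max_acceptable


# Hypothetical timestamps: a backup taken six hours before the incident.
incident_time = datetime(2017, 1, 31, 23, 0)
backup_time = incident_time - timedelta(hours=6)

window = data_loss_window(backup_time, incident_time)
print(window)  # 6:00:00
print(exceeds_target(window, timedelta(hours=1)))  # True
```

Running a check like this against the timestamp of the latest verified backup, rather than the latest scheduled one, is one way to notice a growing gap before an incident does.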
3. Instapaper says it’s now fully restored after last week’s outage
When: February 9
Duration: 31 hours (returned in a limited capacity)
What Happened: Like GitLab, Instapaper earned widespread praise for its transparency in handling a massive outage. In a detailed post-mortem, Instapaper shared the root cause of the crisis, how it could be prevented going forward, action items, reflections, accountability, and more. The root of the problem turned out to be a data failure caused by a 2 TB file size limit for RDS instances, which the team had not known about. According to Instapaper, the issue itself was “both difficult to predict and prevent, and the nature of the outage is extremely rare and unlikely to recur.” The company learned many lessons from the outage, including that it lacked a disaster recovery plan for this type of scenario. It will now test its MySQL backups every month instead of every three months.
After reading a brutally honest account of a recent outage I moved to @instapaper. Any company prepared to be that frank deserves support.
— Anthony Malloy (@awmalloy) February 15, 2017
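A hard size limit like the one Instapaper hit can be watched for in advance once it is known. The sketch below shows one way to flag a file that is approaching a 2 TB ceiling. The 80 percent alert threshold, the binary interpretation of "TB", and the function names are assumptions made for illustration; they are not taken from Instapaper's post-mortem or from the RDS documentation.

```python
TB = 1024 ** 4  # assumption: treat "TB" as a binary terabyte (2**40 bytes)
FILE_SIZE_LIMIT = 2 * TB  # the 2 TB limit mentioned in the post-mortem


def limit_usage(size_bytes: int, limit_bytes: int = FILE_SIZE_LIMIT) -> float:
    """Return the fraction of the limit already consumed (1.0 means full)."""
    if size_bytes < 0:
        raise ValueError("size must be non-negative")
    return size_bytes / limit_bytes


def should_alert(size_bytes: int, threshold: float = 0.8) -> bool:
    """True once the file has used at least `threshold` of the limit."""
    return limit_usage(size_bytes) >= threshold


# Hypothetical file sizes, for illustration only.
print(should_alert(int(1.5 * TB)))  # False: 75% of the limit
print(should_alert(int(1.7 * TB)))  # True: roughly 85% of the limit
```

How the current size is obtained depends on the database and hosting setup, so that part is deliberately left out of the sketch.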
4. CD Baby is back online after glitchy weekend
When: February 16
Duration: 4 days
What Happened: The online music store with more than half a million clients and a catalog of more than seven million tracks was plagued by crashes after its database became corrupted during routine maintenance. No new music could be distributed to digital services during the system downtime, and users were unable to access their data and analytics. The company sent out updates via social media throughout the four-day crisis, assuring its customers that it was not hacked and all data remained safe. CEO Tracy Maddux wrote in a post: “As you might imagine, a database serving 500,000 clients worldwide and seven million tracks is pretty big, so the successful restore process took quite a while.” While he said the company is “working to make sure this won’t happen again,” he did not specify which safeguards are being put in place.
5. Facebook is down, sound the alarm
When: February 24
Duration: About 3 hours (but they had a bumpy 48 hours)
What Happened: Facebook users in Europe, the U.S., Brazil, and Australia couldn’t log into their accounts for the second time in a week. After forcing users off the page, Facebook wouldn’t allow them to log back in, giving the appearance that their accounts might be in use by another person. In the outage two days earlier, users could log in but had difficulty seeing the News Feed. Facebook quickly announced the outage on Twitter in an attempt to calm the masses. According to their statement, “an error designed to help prevent suspicious account access sent a small set of people to their account recovery flow unnecessarily.”
6. World’s most popular collaborative tool, Slack, is down, affecting thousands
When: March 7
Duration: 9 minutes
What Happened: Slack users around the world took to Twitter and Facebook to vent their frustration over a large but extremely short outage. During the outage, users were unable to log in, use the mobile app, or access the website. The outage was caused by a code change, which the Slack support team swiftly corrected, getting the tool up and running again in less than 10 minutes.
Slack outage was so short that twitter couldn’t even freak out about it.
— Bren Briggs (@BrenBriggs) March 7, 2017
7. Square outage forces restaurants to turn customers away
When: March 16
Duration: 2 hours
What Happened: A two-hour outage during lunchtime cost restaurants, coffee shops, and food carts around the country thousands of dollars in lost profit, as they had to turn away customers without cash. In addition to angering business owners, the outage caused many to question their reliance on technology and software. The outage was caused by a change made to one of Square’s back-end systems that resulted in capacity issues. Although Square offers an “offline” mode, this temporary fix could not be used by people already logged out of the system. Users are demanding compensation for their losses, but any benefits Square decides to dole out could be very costly.
Accepting “other forms of payment” today (barters, vinyl, craft beers, mustache wax, etc)… who else is working with this @Square outage?!
— Flatlands Coffee (@FlatlandsCoffee) March 16, 2017