For many of the big names this quarter, like WhatsApp, HBO, and even the U.S. State Department, this wasn’t their first major system downtime incident. We hope they’re busy working on a new and improved IT disaster recovery plan. We wouldn’t want to include them here again next quarter.
The past few months have proved that anyone is susceptible to an IT outage, from casinos to TV shows to the U.S. State Department. Even some of the biggest brands in high tech, such as Marketo and GitHub, went down for reasons that could have been prevented. In Marketo’s case, the fix was as simple as renewing their domain name (ouch).
But the repercussions of these IT outages are not merely developers sitting around twiddling their thumbs. In Florida, after Hurricane Irma devastated countless homes and lives, the crash of Florida Power & Light’s (FPL) website exacerbated the confusion in a very real way. Thousands of residents panicked in dark homes when they could not access FPL’s website for updates on when their power might be restored, while FPL asked the public to stop calling because the calls were overloading its system.
One lesson rings true for all of these IT outages: preparation is everything. Whether it’s for a hurricane or a Game of Thrones season premiere, IT departments must plan ahead and invest in a reliable disaster recovery plan so they can remain available during any unexpected disasters.
Related: Get best practices and essential tips for your Disaster Recovery evaluation process directly from companies who have recently changed their DR strategy in this recently published white paper: https://info.cloudendure.com/Evaluating-DR-Solutions.html?itm_source=blog_6-reasons-high-availability&itm_asset=DR-Evaluation-WP&itm_CTA=inline
1. Game of Thrones Demand So High It Caused Global Outages
When: July 17, 2017
Duration: 5 hours
What Happened: As one of the most popular TV shows on the air, one would think streaming services would have properly prepared for a massive turnout the night of Game of Thrones’ Season 7 premiere. Unfortunately for many die-hard fans, that wasn’t the case. The demand was so high that every legal streaming service around the globe with exclusive rights to the show — including HBO, Australia’s Foxtel Now, and India’s Hotstar — crashed. And it wasn’t the first time. HBO’s streaming service HBO Go has failed or been temperamental during all Game of Thrones premieres since 2014. Users across Australia bitterly attacked Foxtel on Twitter. Foxtel tried to defend themselves, explaining that the sudden 40 percent increase in traffic caused the outage. Their Identity Management System, which verifies customer entitlement to view content, usually handles 5,000 processes a day; the day of the premiere, it was hit with 70,000 transactions in a few hours.
HBO Go’s tech support dude right now pic.twitter.com/HfxQgmkBuW
— Yung Pumpkin (@MarkAgee) July 16, 2017
2. Marketo Outage Caused by Failure to Renew Its Domain
When: July 25, 2017
Duration: 2 days
What Happened: The marketing powerhouse that was acquired last year for $2 billion went offline for a pretty embarrassing reason: They forgot to renew their domain. Luckily for them, a good Samaritan stepped up and renewed it for them (see tweet below). Marketo faced a lot of heat on social media from marketing teams around the globe venting their frustration — and hilarious memes — on Twitter. To save face, CEO Steve Lucas apologized and tweeted out his personal email, inviting anyone with issues to contact him directly. When asked how the mistake happened, a Marketo spokesperson replied: “We renew thousands of domain name properties we own every year with precision, yet the auto renew process for registering our main domain, Marketo.com, failed.”
— Travis Prebble (@TravisPrebble) July 25, 2017
3. GitHub Goes Down — And Takes Developer Productivity With It
When: July 31, 2017
Duration: 1 hour
What Happened: For many developers, GitHub is the go-to platform for managing source code. What at first appeared to be a minor outage on July 31st quickly escalated to a “major service outage.” Users were unable to check in new code or make pull requests, and the web interface was also affected. The cause of the outage was not made clear. While GitHub does experience occasional downtime (most notably in 2015 and 2012, when it was the victim of DDoS attacks), it is known for generally maintaining stable uptime.
— Jeff Pierce (@Th3Technomancer) July 31, 2017
4. State Department Suffers Worldwide Email Outage
When: August 18, 2017
Duration: Several hours
What Happened: When the State Department’s entire unclassified email system went down, some initially feared an external actor was at fault, especially because back in 2014, the department shut down its unclassified email system for what it said was “routine maintenance,” which turned out to be a cover story for a Russian hack attack. This time, spokespersons were quick to declare that the outage was not caused by any external action or interference, but rather was “a glitch.” They also reminded the public that the department has other means of communication, including a classified email system and an unclassified instant messaging system. Not everyone on Twitter was convinced.
.@StateDept: “Technical glitch” caused email outage; an internal matter.
| Uh huh. Sure.
— Nick (@9Joe9) August 18, 2017
5. WhatsApp DOWN – Chat App Not Working for Hundreds of Users Suffering Server Issues
When: August 31, 2017
Duration: A few hours
What Happened: For the second time this year, WhatsApp went down. Thousands across the UK and Europe, as well as other locations around the globe, were unable to send or receive texts, photos, documents, or videos. As during the last outage, WhatsApp users took to Twitter to bemoan the connectivity problems. A WhatsApp spokesperson acknowledged the outage and said the company was working to resolve the server issues. With over 1.2 billion users, WhatsApp is one of the most popular messaging apps and a key business tool in many countries. The fact that so many companies rely on WhatsApp to keep business running smoothly makes these outages particularly problematic and costly.
— Hayley Livingstone (@Hley_Born_Jaded) August 31, 2017
6. FPL’s Website Crashes, Adds to Customer Confusion About Status of Power Outages
When: September 12, 2017
Duration: 1 day
What Happened: Hurricane Irma left millions of people without power or homes and killed 134, and on top of that, the Florida Power & Light website and app crashed. The technical downtime added to the confusion, as Florida residents were left in the dark as to when their power might be restored. The website outage also led many to panic, thinking they needed to report their power outage some other way, when in fact an FPL spokesman said the company was aware of all outages and asked customers not to call. The system went down as a result of the high volume of traffic during the hurricane.
Any chance you can get @FPL on the outage on its website for outage updates ? I give them slack on power lines but not servers
— dg (@bostondg) September 11, 2017
7. Slot Machines at Graton Resort and Casino Hit by Outage, Patrons Wait Hours for Payouts
When: September 16, 2017
Duration: 10 hours
What Happened: In a rather unexpected outage, players at the Graton Resort and Casino were unable to cash out their winnings at slot machines. The machines went down around 9 p.m., and casino employees scrambled to manually deliver payouts. Over 1,000 patrons waited for up to six hours for their winnings, with some giving up and going home without their money. A casino employee attributed the breakdown to a network error. One player who works in tech said that he doesn’t blame the casino for the network problem, but said they should have been more organized. According to him, employees were running around frantically while new people came in to play, unaware of the problem.
Only a few techs helping the “big payouts”. 🙄
— #RallyVerlander (@baseballbabe_8) September 16, 2017