5 Things They Never Told You About Downtime

If the service you provide to your users is interrupted or unavailable, it’s down. Here are five things that you need to know about managing downtime in the cloud.

Cloud providers are generously publishing aggressive Service Level Agreements (SLAs), along with promises of elasticity and scalability and success stories from Fortune 500 companies. Unfortunately, it’s very easy to fall into a false sense of security when it comes to service availability in the cloud.

According to our recent Disaster Recovery survey, IT professionals define downtime as the time during which the service to your customers is unavailable or interrupted, even if your cloud infrastructure is up and running. Download the survey report here.

Therefore, your cloud provider’s downtime is only one part of the whole downtime picture. We all know that the cloud allows organizations to focus on the core business without worrying about managing IT infrastructure. However, because a cloud deployment integrates multiple services running in geo-redundant data centers with complex architecture, managing downtime can be challenging.

The promise of the cloud is so alluring that it’s sometimes easy to forget that we’re at the beginning of a decade-long adoption curve.

Here are five things most IT managers never consider about cloud downtime:

1. SLAs are not always met

It seemed like any other Sunday on August 25, 2013. At Instagram, Facebook’s billion-dollar (and hyper-popular) mobile picture-sharing app, the offices were all but deserted. Suddenly, at 1:00 pm, a laid-back Sunday afternoon turned into an IT nightmare. For 59 whole minutes, Instagram was down. Instagram is hosted on Amazon Web Services (AWS), and the outage was triggered by a failure at the provider’s US East data center in northern Virginia. It affected other companies as well, including Vine, Airbnb and Flipboard.

Cloud uptime is typically stated in the SLA with the cloud service provider. AWS, for example, guarantees 99.95% availability, which works out to roughly 21.6 minutes of downtime in a 30-day month. But do SLAs really mean that your app is going to be up 99.95% of the time?

Not necessarily. Instagram was down for 59 minutes, and on Christmas 2012 AWS had an outage of 20 hours: tough luck for those late holiday shoppers. The SLA from your cloud service provider is therefore not set in stone. If your service is down for longer than the SLA allows, all you can do is apply for a miserly service credit of up to 30% while counting the mounting damage to your business.
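The arithmetic behind these figures is easy to check. The sketch below is only an illustration: it assumes a 30-day month, and the function names are ours rather than part of any provider's tooling.

```python
def allowed_downtime_minutes(availability_pct: float, days: float = 30.0) -> float:
    """Minutes of downtime an availability percentage permits over `days`."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


def measured_availability_pct(downtime_minutes: float, days: float = 30.0) -> float:
    """Availability actually achieved, given the downtime observed over `days`."""
    total_minutes = days * 24 * 60
    return 100 * (1 - downtime_minutes / total_minutes)


# A 99.95% SLA leaves about 21.6 minutes per 30-day month.
print(round(allowed_downtime_minutes(99.95), 2))
# A single 59-minute outage in that month already means about 99.863%.
print(round(measured_availability_pct(59), 3))
# A 20-hour outage drops the month to about 97.222%.
print(round(measured_availability_pct(20 * 60), 3))
```

Under that assumption, one 59-minute outage uses up more than two and a half months of a 99.95% downtime budget.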

2. Downtime increases with multiple services

An SLA of 99.95% is aggressive, but it applies per service or per region. It may seem that relying on more services reduces your risk, but when your app needs all of them to work, each one adds to the likelihood of downtime or poor quality of service. The availabilities multiply, so your expected downtime grows roughly in proportion to the number of services you depend on.

For example:

  • One service at 99.95% SLA: 99.95%
  • Three services at 99.95% SLA: 99.85%

Just like that, your downtime increased threefold, from 0.05% to 0.15%. When managing downtime, it is therefore important to account for all services that may interrupt or degrade your service. If you require three nines (99.9% uptime), Amazon’s promise of 99.95% does not do all of the work for you.
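The numbers in this example come from multiplying the individual availabilities. Here is a minimal sketch of that calculation, assuming the services fail independently and that your app needs every one of them to be up (both are simplifications, and the helper names are ours).

```python
from math import prod


def composite_availability(availabilities_pct):
    """Availability (%) of an app that needs every listed service to be up.

    Assumes failures are independent, so the probabilities multiply.
    """
    return 100 * prod(a / 100 for a in availabilities_pct)


def max_dependencies(per_service_pct: float, target_pct: float) -> int:
    """How many equal, required services you can use and still meet the target."""
    if not 0 < per_service_pct < 100:
        raise ValueError("per_service_pct must be between 0 and 100 (exclusive)")
    count = 0
    while composite_availability([per_service_pct] * (count + 1)) >= target_pct:
        count += 1
    return count


print(round(composite_availability([99.95]), 2))       # one service
print(round(composite_availability([99.95] * 3), 2))   # three services
print(max_dependencies(99.95, 99.9))                   # room left for three nines
```

With these assumptions, a three-nines target leaves room for only two services at 99.95% each. The third one pushes you below 99.9%.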

3. Downtime of third party services

Many apps and websites use more than one cloud-based service: hosting, code libraries, checkout and transaction services, analytics and more. As the Web becomes more integrated, outages may affect Web apps indirectly.

Whether a Web app is hosted on premises or in the public cloud, it typically uses other APIs, services and components that are hosted in the cloud. An outage at one cloud provider may therefore limit functionality or degrade the user experience for Web apps that do not use that provider directly. Your SLA with your cloud service provider does not protect you against third party downtime. In ecommerce, for example, if your third party shopping cart service is down but your website is available, this would traditionally be counted as uptime. But is it? You may suffer significant losses, as well as frustrated users calling your customer service center.

Furthermore, third party monitoring services like Pingdom or New Relic may not give you the full visibility you need to make sure that your service is fully functional. When managing your expected downtime, take all third party services into account as well.
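One way to put this into practice is to report status from the user's point of view instead of the website's. The sketch below is hypothetical: the `Dependency` type, the service names and the three status labels are our own illustration, not taken from any monitoring product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dependency:
    name: str
    critical: bool   # True if users cannot complete their task without it
    healthy: bool    # result of your own check against the third party


def user_facing_status(site_up: bool, dependencies) -> str:
    """Classify the service as 'up', 'degraded' or 'down' as users see it."""
    if not site_up:
        return "down"
    failed = [d for d in dependencies if not d.healthy]
    if any(d.critical for d in failed):
        return "down"       # the site responds, but users cannot buy anything
    if failed:
        return "degraded"   # a non-critical feature is missing
    return "up"


deps = [
    Dependency("shopping-cart", critical=True, healthy=False),
    Dependency("analytics", critical=False, healthy=True),
]
print(user_facing_status(site_up=True, dependencies=deps))
```

In this example the website is available, yet the status is "down" because the shopping cart, which is marked critical, is failing.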

4. Cloud is up, service is down

One small thing that may have slipped your mind: your cloud service provider guarantees uptime for their service, not for yours. This means that even if the cloud is up, your app may be down. The provider’s SLA only covers the availability of your virtual machine (VM). Your VM may be running while your users cannot reach the application, and that time would typically not count against the provider’s uptime SLA. There can be plenty of reasons why an app is unavailable, and some of them may be entirely the fault of the development and deployment teams.
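A probe that exercises the application itself, not just the VM, is one way to catch this case. The following is a rough sketch using only the Python standard library. The URL and the expected text are placeholders, and the `opener` parameter exists only so the check can be tried without a network.

```python
import urllib.error
import urllib.request


def check_endpoint(url, expected_text, timeout=5.0, opener=urllib.request.urlopen):
    """Return (ok, detail) based on what the application actually serves."""
    try:
        with opener(url, timeout=timeout) as resp:
            status = resp.status
            body = resp.read().decode("utf-8", errors="replace")
    except (urllib.error.URLError, OSError) as exc:
        # Covers refused connections, timeouts and HTTP error responses.
        return False, f"unreachable: {exc}"
    if status != 200:
        return False, f"unexpected status {status}"
    if expected_text not in body:
        # The server answered, but not with the page users expect.
        return False, "response missing expected content"
    return True, "ok"


if __name__ == "__main__":
    # Placeholder URL and marker text; replace with a page your users rely on.
    print(check_endpoint("https://shop.example.com/checkout", "Checkout"))
```

Checking for expected content matters here: a VM that answers every request with an error page or an empty response would still pass a simple ping.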

5. Bad quality of service

Your cloud service provider promises availability, but many boilerplate SLAs do not guarantee quality of service (QoS). If your QoS degrades unexpectedly while your application is still available, the degradation does not count against the SLA. QoS issues are also hard to monitor and may not be reported by the cloud service provider, because the service is technically available even though the user experience is degraded.
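Because the provider may not report degraded QoS, it helps to measure it yourself. Below is a minimal sketch that summarizes response times you have collected. The sample values and the 500 ms threshold are made up for illustration, and the percentile uses the simple nearest-rank method.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def degraded_fraction(samples_ms, threshold_ms):
    """Share of requests slower than the threshold you consider acceptable."""
    if not samples_ms:
        raise ValueError("samples_ms must not be empty")
    return sum(1 for s in samples_ms if s > threshold_ms) / len(samples_ms)


# Hypothetical response times in milliseconds; every request succeeded.
latencies_ms = [120, 130, 110, 900, 125, 140, 135, 2500, 115, 128]
print(percentile(latencies_ms, 50))           # typical request
print(percentile(latencies_ms, 95))           # slow tail
print(degraded_fraction(latencies_ms, 500))   # share above 500 ms
```

In this made-up sample every request counts as "available", yet one in five took longer than half a second, which an availability-only SLA would never show.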

What is your true downtime?

Managing downtime in the cloud is far more complicated than just reviewing your cloud provider’s SLA. In fact, if you count downtime as unavailable or interrupted service for your users, there are other factors to consider besides your cloud service.  The five considerations outlined here should help you manage and mitigate your downtime and optimize your business availability.