The transparency with which GitLab handled their recent 18-hour outage was a game-changer, giving all of us an incredible opportunity to learn some critical lessons about disaster recovery.
You’ve probably heard of GitLab, the company behind the open-source Git repository management platform. And if you were anywhere near the Internet on January 31 or February 1, you most likely heard about the major outage that GitLab’s online service experienced. The outage involved the accidental removal of data from a primary database server, causing the loss of customer data and service downtime of about 18 hours.
While outages, even major outages like GitLab’s, are not surprising, what was unexpected and truly impressive in the case of GitLab was the transparency with which they handled the disaster, both in real time and post-mortem. This level of transparency cannot be taken for granted, as most companies don’t supply this level of detail. That’s why this is a unique and critical learning opportunity.
Here at the CloudEndure office, we’ve been talking a lot about the GitLab outage. As developers who spend our days building disaster recovery (DR) solutions, we decided to share what we think are the most critical lessons from the GitLab outage. Let us know what you think.
1. Expect the Unexpected
In the case of the GitLab outage, one might ask, “who runs ‘rm -rf’ on a production system?” However, it’s not that simple. It seems that YP (the GitLab employee involved in the outage, according to the documentation) was handling a critical performance problem on production when he erased data from the primary database server.
Running commands on production happens every day in most companies. Even if there is a policy stating that this shouldn’t be done, in reality employees may be unaware of the policy or may decide that the urgency at hand (whatever it may be) is important enough to justify an exception to the policy.
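Since production access can rarely be eliminated outright, one low-cost mitigation is to make destructive commands harder to run on the wrong host. Here is a minimal sketch of that idea in Python; the hostnames and the confirmation variable are invented for illustration, not taken from GitLab’s setup:

```python
# Hypothetical safety check: the host inventory and the confirmation
# variable name are made up for this example.
PRODUCTION_HOSTS = {"db1.example.com", "db2.example.com"}

def guard_destructive(hostname, confirmed_env):
    """Return True if a destructive command may proceed on this host.

    On production hosts, the operator must have explicitly set a
    confirmation variable (e.g. I_KNOW_THIS_IS_PROD) to the hostname,
    so a command pasted into the wrong terminal is refused.
    """
    if hostname not in PRODUCTION_HOSTS:
        return True  # non-production hosts need no extra confirmation
    return confirmed_env == hostname

# A real wrapper script might call this as:
#   guard_destructive(socket.gethostname(), os.environ.get("I_KNOW_THIS_IS_PROD"))
```

A guard like this doesn’t prevent every mistake, but it turns a reflexive command into a deliberate one, which is exactly the gap that bites during a late-night incident.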
Moreover, a similar outage could have originated from many other sources, such as software bugs or even security breaches. (A security breach killed Code Spaces, which failed to recover from its disaster.)
Unfortunately, you can’t predict the origin of your next disaster. Even the leading public and private cloud providers have had outages of network, compute, and storage services.
The uncertainty of when and where disasters are going to strike is a major challenge in preparing for them. Therefore, GitLab is correct when they conclude that their “main focus is to improve disaster recovery.” How should they start improving their DR? We go into this in the lessons below.
2. DR Without Drills Is Not DR
One of the most important components of a strong DR strategy is the DR drill. In complex systems there are some effects that are very hard to anticipate. On top of confirming that the DR works, a DR drill gives IT teams a better sense of how much data may be lost and how much time it will take to recover.
DR drills also give the operating team hands-on experience using all the relevant DR tools, which enables them to respond more quickly during a disaster and decrease downtime. In some cases, the drill itself is complicated to perform and fails. In other cases, when drills are infrequent, the DR setup is found broken and then fixed, only to be broken again by the next drill.
In the case of GitLab, there was no owner for backup drills and, in effect, no procedure for DR drills. The fact that the S3 backup did not work would have been discovered in a drill. Regular drills would likely also have revealed that their data recovery mechanisms did not run frequently enough: the backups were daily (rather than hourly or continuous), and even the roughly six hours of data that were ultimately lost had severe consequences for the company and its customers.
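The simplest piece of a drill can even be automated. The sketch below, with illustrative paths and thresholds (not GitLab’s), checks that the latest backup exists, is non-empty, and is fresh enough; a full drill would go further and actually restore it and query the restored copy:

```python
import os
import time

def check_backup(path, max_age_hours):
    """Return a list of problems with the backup at `path` (empty list = OK)."""
    if not os.path.exists(path):
        return ["backup file is missing"]
    problems = []
    info = os.stat(path)
    if info.st_size == 0:
        # This is the failure mode GitLab hit: the S3 bucket held empty backups.
        problems.append("backup file is empty")
    age_hours = (time.time() - info.st_mtime) / 3600.0
    if age_hours > max_age_hours:
        problems.append("backup is older than %.1f hours" % max_age_hours)
    return problems
```

Run on a schedule and wired to an alert, even this trivial check would have flagged a backup pipeline that silently produced nothing.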
3. Define Your Recovery Objectives
The RPO (recovery point objective) is critical in DR planning. GitLab used four recovery procedures, but none of them had designated goals. Three of the recovery procedures were “daily snapshots” and one was online replication to a secondary DB. So what was the problem? The corruption was immediately replicated to the secondary DB, and their daily backup was not frequent enough.
When developing your DR strategy, always ask yourself how much data you are willing to lose, or, in other words, “how much will it cost my organization if I lose X hours of data?” If your organization cannot tolerate six hours of lost data, a daily backup is not enough. The only way to ensure that no data is lost is by using continuous replication technology.
Another essential goal that must be defined in DR planning is RTO (recovery time objective). When designing a DR solution, determine how much downtime is acceptable for your business and for different workloads. Some workloads may be able to tolerate 24 hours of downtime, while the downtime of others may simply be too costly. If you need rapid recovery, backup will not be good enough.
Once you have determined your RPO and RTO, and implemented a DR solution accordingly, it’s critical to conduct drills (as noted above) to make sure you can achieve your recovery objectives as your applications grow and change.
In the case of GitLab, their 18-hour recovery time was not adequate for their business needs.
4. Do-It-Yourself DR Has Limitations
GitLab had what we like to call “do-it-yourself” disaster recovery. Sure, smart developers can build some decent-looking backup or DR systems, but why risk your company’s stability, data integrity, and future with a DIY solution? The costs of a failed DR solution are too high.
In our view, building a DR solution on your own when you’re not a DR expert is like trying to fix your car engine or the electrical wiring in your home when you’re not a mechanic or electrician. The potential risks of a failed attempt (e.g. destroying your car’s engine, causing a fire in your home, or being electrocuted) are just too severe.
Ready-made, enterprise-grade DR solutions have been developed by people whose main focus is recovery. Moreover, the solutions have been tested in a wide variety of use cases, which is critical for the unpredictability of disasters.
In the GitLab case, there were multiple DIY recovery solutions. Some covered the database in question and others covered other parts of the system. In real time, it was not clear which system was working and what needed to be done to recover. DR solutions need to be global and should allow all systems to be recovered in one click, if needed. This capability is usually hard to achieve in DIY solutions.
5. Point-in-Time Recovery Is Key
Many types of disasters, such as hardware failures, power outages, and network problems, require only the latest data to recover — ideally, up to the last second. For other types of disasters, recovering only the latest data is not sufficient. This is especially true in cases of undetected bugs, data corruption, malware, and ransomware. It is also true in cases of human error, which was the cause in the GitLab disaster.
Every serious DR solution must provide recovery options to various points in time over the previous weeks. The level of granularity provided should be based on your business goals and requirements. However, in most cases once-a-day point-in-time recovery is not sufficient.
The ideal point-in-time recovery solution should not create storage overhead by cloning all the data for each point in time as this would be prohibitively expensive for most organizations. Rather, it should use incremental points in time in a staging infrastructure.
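The incremental approach can be sketched as a base snapshot plus a chain of timestamped deltas; restoring to a point in time replays only the deltas up to that moment. This is a simplified teaching model, not any particular product’s implementation:

```python
class PointInTimeStore:
    """Toy model of incremental point-in-time recovery: one full base
    snapshot plus timestamped deltas (key -> value updates) in time order."""

    def __init__(self, base):
        self.base = dict(base)
        self.deltas = []  # list of (timestamp, {key: value}) pairs

    def record(self, timestamp, changes):
        """Store only the changed keys, not a full copy of the data set."""
        self.deltas.append((timestamp, dict(changes)))

    def restore(self, timestamp):
        """Rebuild the state as of `timestamp` by replaying deltas up to it."""
        state = dict(self.base)
        for ts, changes in self.deltas:
            if ts > timestamp:
                break
            state.update(changes)
        return state

# A corrupting write at t=3 can be skipped by restoring to t=2:
store = PointInTimeStore({"users": 100})
store.record(1, {"users": 110})
store.record(2, {"users": 120})
store.record(3, {"users": 0})  # simulated corruption or accidental delete
```

Because each delta holds only what changed, storage grows with the rate of change rather than with the number of recovery points — which is what makes fine-grained point-in-time recovery affordable.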
As CloudEndure employees, we know we’re biased, but we can confidently say that CloudEndure provides homogenous, hassle-free DR with near-zero data loss, multiple recovery points in time, one-click recovery, and easy testing. We have seen how it works for our customers and know that an outage like GitLab’s would never have happened with our solution in place. If your organization needs a robust, reliable DR solution, give it a try. You can schedule a demo or request a free trial today.