Disaster recovery (DR) boils down to one critical moment
When every other piece of your application fails, initiating a failover is a last resort to ensure your service remains available. Unfortunately, many DR failover attempts fail even in drill situations. More often than not, failover success is a result of meticulous, long-term planning. Amazon Web Services has done a great job of keeping up with MySQL and providing a managed relational database service in RDS. Its most recent support for MySQL 5.6 brings robust cross-region read replica functionality to RDS.
As you can imagine, simply replicating your database to another region is just a first step. Connecting the rest of your application stack in both the production and recovery regions is the real challenge. The more components your application has, the more complex your replication and recovery process will be. With proper setup and maintenance, RDS is a hand-in-glove fit with cloud disaster recovery. However, your DR failover will only succeed if you follow the steps below. Miss any one of them, and you are liable to corrupt your database and set yourself up for a colossal failure.
1. Set up your production RDS instance in a multi-AZ architecture. This is true of any mission-critical service, but especially important for a database you intend to replicate across regions.
2. Enable RDS daily backups. Cross-region replication will simply not work without this: RDS requires automated backups (a backup retention period greater than zero) on the source instance before it will create a read replica, so replication never even begins.
3. Create a cross-region Read Replica in the recovery region of your choice. As its name might suggest, a read replica is a read-only copy of your production database, which is then updated continuously (and asynchronously) in another region. It’s essentially the critical “missing link” required to make cloud DR a viable solution.
4. Maintain your application stack in both regions. Remember that your RDS instance is just one component of your application. In order to rely on your recovery site, your entire application must be able to run in either region.
5. Maintain database settings across both regions (including parameters, size, performance, etc.). Every time you update your primary RDS instance, you MUST make the exact same change in the recovery region down to parameter names, instance size, and performance requirements.
IMPORTANT: Even the most subtle variation between the two instances is liable to render your DR system useless at best, or corrupt both recovery AND production sites at worst.
6. Configure alerts to identify errors in sync between the two regions. Amazon RDS is an excellent resource, but it does fail. Configuring alerts using a service like Amazon SNS will keep you abreast of any issues and enable you to keep a synchronized, operational recovery site.
7. Schedule your first DR drill! You won’t know you’re ready to deal with disaster without a proper simulation. The DR drill process (also known as read replica promotion) requires you to spin up your entire application stack in the recovery region to test that all your moving parts work together when you need them. Don’t forget to point your new machines at the recovery database name (a commonly overlooked but critical detail). If you don’t, your application will update the production AND the recovery database, corrupting both and leaving you in a very uncomfortable hot seat.
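Steps 5 and 6 are the easiest ones to let slip, so it can help to automate the comparison. The following is a minimal sketch in Python, not a definitive implementation. It assumes you have already collected each instance's settings into a plain dictionary; the key names mirror fields that RDS reports for an instance, but the list of tracked settings, the function names, and the 300-second lag threshold are all illustrative assumptions.

```python
from typing import Any, Dict, List, Optional, Tuple

# Settings that must match across regions (step 5). This list is an
# assumption for illustration; extend it to whatever your application needs.
TRACKED_SETTINGS = ("DBInstanceClass", "EngineVersion", "AllocatedStorage", "Iops")


def find_drift(
    primary: Dict[str, Any],
    replica: Dict[str, Any],
    tracked: Tuple[str, ...] = TRACKED_SETTINGS,
) -> List[Tuple[str, Any, Any]]:
    """Return (setting, primary_value, replica_value) for every mismatch."""
    drift = []
    for key in tracked:
        primary_value = primary.get(key)
        replica_value = replica.get(key)
        if primary_value != replica_value:
            drift.append((key, primary_value, replica_value))
    return drift


def lag_needs_alert(lag_seconds: Optional[float], threshold_seconds: float = 300) -> bool:
    """Decide whether a replica lag reading should raise an alert (step 6).

    A missing or negative reading is treated as "not replicating", which is
    an assumption of this sketch rather than documented RDS behavior.
    """
    if lag_seconds is None or lag_seconds < 0:
        return True
    return lag_seconds > threshold_seconds


if __name__ == "__main__":
    # Hypothetical settings for the two instances.
    primary = {"DBInstanceClass": "db.m3.large", "EngineVersion": "5.6.13", "AllocatedStorage": 100}
    replica = {"DBInstanceClass": "db.m3.medium", "EngineVersion": "5.6.13", "AllocatedStorage": 100}
    for setting, expected, actual in find_drift(primary, replica):
        print(f"DRIFT {setting}: primary={expected} replica={actual}")
    print("alert on lag:", lag_needs_alert(420))
```

Run on a schedule, a check like this could feed the same notification channel you use for your other alerts, so a forgotten change in the recovery region surfaces long before a drill does.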
In a perfect world, step 7 would be the high point of your DR experience: with so many “Nines” on your availability record, the CIO would be on the phone with Guinness in nanoseconds. Of course, in a perfect world nobody would need DR in the first place. Like it or not, step 8 is just a matter of time (well, downtime actually). And to be honest, don’t you secretly WANT this to happen every once in a while? After all, how else would you know your DR works?
8. Failover. Some IT departments have been known to schedule an actual disaster simulation on their own terms (how’s that for taking Chaos Monkey to the next level?). This could admittedly become a nerve-racking exercise, but it’s far better than getting caught by an unplanned downtime event. Scheduled or not, after you go through a failover of your production application, there’s no going back. Once your failover to the recovery region is complete, it’s time to redirect your live traffic and watch your hard work play out. Of course, you’ve effectively turned the replica instance into your production environment. So now it’s time to go back to steps 3 through 7 before you can truly return to business as usual.
9. Failback. Now that you’ve experienced the behavior of a controlled disaster, it’s time to gradually resume operations in the original region. In essence, you’re going through the failover exercise all over again, but this time the “recovery” region is the original production region you started out with.
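The database side of steps 8 and 9 can be sketched in a few lines of Python. This is a hedged outline rather than a complete runbook: it assumes each function receives an RDS client object such as the one returned by `boto3.client("rds", region_name=...)`, and the function names, instance identifiers, and seven-day retention value are hypothetical. It covers only the database; redirecting traffic and the rest of the application stack are left out.

```python
def fail_over(recovery_rds, replica_id, backup_retention_days=7):
    """Step 8: promote the cross-region read replica to a standalone primary.

    `recovery_rds` is an RDS client for the recovery region. Backups are kept
    enabled on the promoted instance so that it can later act as the source
    of a new read replica (see step 2).
    """
    recovery_rds.promote_read_replica(
        DBInstanceIdentifier=replica_id,
        BackupRetentionPeriod=backup_retention_days,
    )
    # Block until the promoted instance is available again.
    waiter = recovery_rds.get_waiter("db_instance_available")
    waiter.wait(DBInstanceIdentifier=replica_id)


def rebuild_replica(target_rds, source_arn, new_replica_id, instance_class):
    """Back to step 3: create a fresh replica of the new primary.

    `target_rds` is a client for the region that should host the replica.
    For a cross-region replica the source is referenced by its ARN.
    """
    target_rds.create_db_instance_read_replica(
        DBInstanceIdentifier=new_replica_id,
        SourceDBInstanceIdentifier=source_arn,
        DBInstanceClass=instance_class,
    )
```

Failback is then the same pair of calls with the regions swapped: promote the replica in the original region once it has caught up, and rebuild a replica in the recovery region behind it.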
And now that you’ve seen a controlled disaster play out and lived to tell the tale, you should feel MUCH more relaxed if (or when) an actual unplanned downtime event strikes.