If you’re using Amazon’s relational database service, Amazon RDS, you can use its replication capabilities for a variety of cloud applications, particularly disaster recovery (DR) and migration. Although AWS provides a robust feature set, even IT veterans find replication on AWS challenging to implement.
Here’s a list of common mistakes you should watch for:
1. Leave daily backups disabled
If your RPO (recovery point objective) is not aggressive, enabling automated backups is a good first step because it is the simplest replication mechanism to implement. More importantly, you can’t create read-replicas (which enable a dramatically lower RPO) unless backups are enabled on the source instance.
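As a rough illustration of the check above, the following Python sketch looks at an instance's backup retention period and turns backups on if it is zero. It assumes a boto3 RDS client is passed in as `rds`; the function name `ensure_backups_enabled` and the 7-day default are illustrative choices, not values from this article.

```python
def ensure_backups_enabled(rds, db_instance_id, retention_days=7):
    """Return True if automated backups were already on.

    Otherwise request a non-zero retention period and return False.
    `rds` is assumed to be a boto3 RDS client (or a compatible stub).
    """
    resp = rds.describe_db_instances(DBInstanceIdentifier=db_instance_id)
    current = resp["DBInstances"][0]["BackupRetentionPeriod"]
    if current > 0:
        # A retention period above zero means automated backups are enabled.
        return True
    rds.modify_db_instance(
        DBInstanceIdentifier=db_instance_id,
        BackupRetentionPeriod=retention_days,  # hypothetical default of 7 days
        ApplyImmediately=True,
    )
    return False
```

A call might look like `ensure_backups_enabled(boto3.client("rds", region_name="us-east-1"), "prod-db")`, where the region and instance name are placeholders. Switching the retention period from zero to a non-zero value can cause a brief outage, so it is worth scheduling the change rather than running it ad hoc.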
2. Ignore multi-AZ (availability zone) architecture
Neglecting multi-AZ architecture on AWS is a sure-fire way to cripple a mission-critical application, because it damages service availability. Enabling multi-AZ on RDS is also the simplest way to replicate the instance within the same region. If you’re replicating across regions, confining RDS to a single AZ will introduce downtime as a direct result of the replication mechanism (e.g. backups, read-replica, CloudEndure, or a combination).
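One way to audit this is to scan instance descriptions for the `MultiAZ` flag. The sketch below is a small pure function that assumes its input has the shape of the `DBInstances` list returned by boto3's `describe_db_instances`; the function name is hypothetical.

```python
def single_az_instances(db_instances):
    """Return the identifiers of instances that are not multi-AZ.

    `db_instances` is assumed to be a list of dicts shaped like the
    "DBInstances" entries returned by describe_db_instances.
    """
    return [
        inst["DBInstanceIdentifier"]
        for inst in db_instances
        if not inst.get("MultiAZ", False)
    ]
```

With a real client this could be fed from `rds.describe_db_instances()["DBInstances"]` (the call is paginated, so a large account needs a paginator), and a flagged instance can then be converted with `modify_db_instance(..., MultiAZ=True)`.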
3. Attempt to replicate from scratch instead of creating a read-replica in the target region
Don’t try to re-invent the wheel (or cloud for that matter). Amazon created cross-region read replicas on RDS precisely to address complex use cases such as DR and automated migration. Use read-replicas!
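A minimal sketch of creating a cross-region read-replica is shown here, assuming a boto3 RDS client that was created in the target region. For cross-region replicas the source has to be given as an ARN rather than a plain identifier; the function name and all argument values are placeholders.

```python
def create_cross_region_replica(target_rds, source_region, account_id,
                                source_db_id, replica_id, instance_class):
    """Ask RDS in the target region for a replica of a source-region instance.

    `target_rds` is assumed to be a boto3 RDS client bound to the target region.
    """
    # Cross-region sources are referenced by their full ARN.
    source_arn = f"arn:aws:rds:{source_region}:{account_id}:db:{source_db_id}"
    return target_rds.create_db_instance_read_replica(
        DBInstanceIdentifier=replica_id,
        SourceDBInstanceIdentifier=source_arn,
        DBInstanceClass=instance_class,
        SourceRegion=source_region,  # lets boto3 build the pre-signed URL when needed
    )
```

An encrypted source would also need a `KmsKeyId` valid in the target region, which this sketch leaves out.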
4. Replicate RDS only (leaving the rest of the application out of the process)
Successful replication means your application must be able to function redundantly in both source and target locations. That means replicating the rest of your application to the same target location so that every component can work with your RDS instance.
5. Configure the RDS read-replica instance differently from the source
Your RDS read-replica properties must match the source instance precisely. Even if the data is replicated perfectly, a slight change to the instance properties could stop the replication.
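A simple way to catch drift is to compare a handful of properties from the two instance descriptions. The sketch below assumes both arguments are dicts shaped like `describe_db_instances` entries; the choice of keys in `COMPARED_KEYS` is an assumption about which properties matter, not a list taken from AWS or from this article.

```python
# Properties to compare; an illustrative selection, extend as needed.
COMPARED_KEYS = ("DBInstanceClass", "AllocatedStorage", "StorageType",
                 "Iops", "EngineVersion")


def config_drift(source, replica, keys=COMPARED_KEYS):
    """Return {key: (source_value, replica_value)} for every mismatch."""
    drift = {}
    for key in keys:
        if source.get(key) != replica.get(key):
            drift[key] = (source.get(key), replica.get(key))
    return drift
```

An empty result means the compared properties agree; anything else is worth reviewing before you rely on the replica.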
6. Forget to set up replication alerts
Every service fails from time to time, including RDS replication using read-replicas. If you don’t configure alerts with a service such as Amazon SNS, you’ll have no way of knowing whether the read-replica in your target region is up to date, which directly affects your ability to meet RPO requirements.
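One possible setup is a CloudWatch alarm on the replica's `ReplicaLag` metric (reported in seconds) that notifies an SNS topic. The sketch below assumes a boto3 CloudWatch client for the replica's region and an existing SNS topic ARN; the alarm name, the 300-second threshold, and the evaluation settings are illustrative and should be derived from your own RPO.

```python
def put_replica_lag_alarm(cloudwatch, replica_id, sns_topic_arn,
                          max_lag_seconds=300):
    """Create or update an alarm that fires when replica lag stays too high.

    `cloudwatch` is assumed to be a boto3 CloudWatch client.
    """
    cloudwatch.put_metric_alarm(
        AlarmName=f"{replica_id}-replica-lag",  # hypothetical naming scheme
        Namespace="AWS/RDS",
        MetricName="ReplicaLag",
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": replica_id}],
        Statistic="Maximum",
        Period=60,                 # evaluate one-minute data points
        EvaluationPeriods=5,       # lag must persist for five minutes
        Threshold=max_lag_seconds,
        ComparisonOperator="GreaterThanThreshold",
        TreatMissingData="breaching",  # no data at all is treated as a problem
        AlarmActions=[sns_topic_arn],
    )
```

Treating missing data as breaching is a deliberate choice here: if replication stops reporting entirely, the alarm should fire rather than stay silent.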
7. Point instances in the recovery/migration site at the source RDS instance
During failover or migration, you promote the read-replica to function as part of your entire application stack. But if you don’t update the instances in the recovery region to access the read-replica instead of the original RDS instance, your replica application will continue to work with the original instance, which is liable to corrupt your production environment.
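A pre-flight check along these lines can catch leftover references before traffic is switched. The sketch assumes you can collect each component's configured database hostname into a dict; the function name, the component names, and the hostnames in the usage note are all hypothetical.

```python
def stale_db_references(app_db_hosts, source_endpoint):
    """Return the components still configured with the source endpoint.

    `app_db_hosts` maps a component name to its configured DB hostname.
    Hostnames are compared case-insensitively, as DNS names are.
    """
    source = source_endpoint.lower()
    return sorted(
        name for name, host in app_db_hosts.items()
        if host.lower() == source
    )
```

For example, with `{"web": "prod-db.example.us-east-1.rds.amazonaws.com", "worker": "prod-db-dr.example.eu-west-1.rds.amazonaws.com"}` and the first hostname as the source endpoint, the function reports `["web"]` as still pointing at the original instance.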
8. Never schedule a drill
If you don’t test your application to withstand a scheduled downtime event, what will happen when an actual crisis catches you off guard? It’s best to control the chain of events on your own terms rather than pray and hope for the best. The drill process “promotes” your read-replica, spins up replicas of the other components of your application, and then connects them to each other in the target region.
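The database part of such a drill can be sketched as follows, assuming a boto3 RDS client for the target region. It promotes the replica, waits for the instance to report as available, and returns the endpoint that the rest of the stack should connect to. The function name is hypothetical, and spinning up the other application components is out of scope here.

```python
def promote_replica_for_drill(rds, replica_id):
    """Promote a read-replica and return its endpoint address.

    `rds` is assumed to be a boto3 RDS client in the target region.
    """
    rds.promote_read_replica(DBInstanceIdentifier=replica_id)
    # Block until RDS reports the instance as available again.
    rds.get_waiter("db_instance_available").wait(DBInstanceIdentifier=replica_id)
    inst = rds.describe_db_instances(
        DBInstanceIdentifier=replica_id
    )["DBInstances"][0]
    return inst["Endpoint"]["Address"]
```

Two caveats apply. The waiter may return early if the instance has not yet left the available state when polling starts, so a production script would likely add its own status check. Promotion is also one-way: the promoted instance becomes a standalone database, so a new read-replica has to be created after the drill.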
9. Fail to plan for “failback” after failover
After you’ve successfully completed a failover, you may decide to return to your original site (e.g. once the disaster has been resolved). This means repeating the entire failover process in reverse so you can actually go back to business as usual.