It’s Monday morning and you have to add an instance to your busy cluster. As you prepare to run your script, you’re thinking this week is off to a productive start until . . . it fails. There isn’t enough coffee in the world to give you the kind of jolt you just experienced.
How did the script fail? How could Amazon Web Services, the king of the cloud, do this to you?! Maybe it’s something you forgot when writing the script?
When developing cloud automation, it is easy to concentrate on the normal flow of the code. It is much more challenging to keep in mind that code can fail, and it is even harder to remember that failure can be caused by external factors such as your cloud platform. That’s why when writing AWS cloud automation, you should always take into consideration the three biggest obstacles to an infallible script: request rate limits, eventual consistency, and API idempotency. We’ve summarized below what you need to know about each of these challenges, including a handy list of do’s and don’ts.
Amazon throttles API rates to provide overall quality of service. While at first glance this may not seem like a big deal, you probably won’t feel that way when you need to create many resources, and fast. The request rate limit in AWS is calculated per account. So multiple users and third-party services make this problem even worse.
The request rate limit is fixed, which makes requests a limited resource. Therefore, this also becomes a question of priorities, because not all scripts are equally important to production. We’re quite sure you’d rather have your monitoring system skip a beat than face a production outage, right?
Now let’s divide automation into three types: non-production scripts, production non-critical scripts, and mission-critical automation.
An example of non-production scripts is a periodic script that calls DescribeInstances, which reports the number of instances to a monitoring system. Since the script is periodic by nature, if it fails from time to time, it might be okay — next time it will work. For example, if you receive 1% request rate exceeded errors, then the script will still report the result 99% of the time.
The second type of scripts are non-critical production scripts. An example of such a script is a weekly script that creates backup images for all your EC2 instances. Though it won’t cause downtime to your production, if it fails you are left with no backup. Furthermore, it could be that running it again won’t work. For example, if you have 1,000 instances to back up and each request has a 1% chance of a request rate exceeded error, the odds that all 1,000 calls succeed in a single run are vanishingly small (about 0.99^1000, or roughly 0.004%), so some instances may never get backed up.
In other words, non-critical production scripts need to work, but it’s not the end of the world if they take more time than expected. It might also be okay for them to fail occasionally so that you can fix them. With this type of script you should run APIs in a serial manner, follow AWS guidelines, and use exponential backoff. Chances are that the library you are already using performs good-enough retries on errors.
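To make the exponential backoff recommendation concrete, here is a minimal sketch of a retry wrapper. It is not CloudEndure’s or AWS’s implementation — the callable, the retry count, and the use of `RuntimeError` as a stand-in for a “request rate exceeded” error are all assumptions for illustration (real AWS SDKs raise their own throttling exceptions, and boto3 has built-in retry configuration you should prefer when available):

```python
import random
import time

def call_with_backoff(api_call, max_retries=5, base_delay=1.0):
    """Retry a zero-argument callable with exponential backoff plus jitter.

    RuntimeError stands in for a 'request rate exceeded' error in this
    sketch; a real script would catch the SDK's throttling exception.
    """
    for attempt in range(max_retries + 1):
        try:
            return api_call()
        except RuntimeError:
            if attempt == max_retries:
                raise  # out of retries; let the caller handle it
            # Wait base_delay * 2^attempt seconds, plus random jitter
            # so that concurrent scripts don't retry in lockstep.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            time.sleep(delay)
```

The jitter matters more than it looks: without it, every script throttled at the same moment retries at the same moment, and you get throttled all over again.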
The third type of automation is mission-critical production scripts, like spinning up a pilot-light DR. In these cases, it is not only important that the script always succeeds; it should also work as fast as possible. In this type of automation, you should run commands concurrently. Plain old exponential backoff might not be good enough in these situations. For example, if you use a five-second base wait time, the seventh retry of the API will happen after more than ten minutes of cumulative waiting. Therefore you should use exponential backoff but with a small change: limit the maximum wait time between retries.
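The arithmetic behind that “more than ten minutes” claim, and the effect of capping the wait, can be seen in a few lines. The base and cap values below are illustrative assumptions, not AWS-mandated numbers:

```python
def backoff_delays(base=5.0, cap=None, retries=7):
    """Per-retry delay schedule for exponential backoff.

    With no cap, a 5-second base doubles each time: 5, 10, 20, ... 320s,
    635 seconds (over ten minutes) in total across seven retries.
    Passing a cap keeps each individual wait bounded.
    """
    delays = []
    for i in range(retries):
        delay = base * (2 ** i)
        if cap is not None:
            delay = min(delay, cap)
        delays.append(delay)
    return delays
```

With `cap=30`, the schedule becomes 5, 10, 20, 30, 30, 30, 30 — 155 seconds total instead of 635 — which is the difference between a DR script that recovers in minutes and one that stalls for most of an outage window.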
AWS follows an eventual consistency model to balance availability and consistency. When writing a robust script, you’ll need to keep in mind that your AWS resources are available “eventually” and may not be immediately visible to all subsequent commands you run.
This model can and should affect the way you manage your resources and plan your script. Say you run a command to modify or describe the resource that you just created. Its ID might not have propagated throughout the AWS system and therefore may return an error stating that the resource does not exist. Back-to-back commands will not provide enough time for the AWS environment to update and execute successfully.
Take a newly created subnet in a VPC: if your script immediately launches an instance into that subnet, the launch may fail. Eventual consistency means the subnet, though created, may not yet be visible to the launch call.
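A defensive pattern here is to poll until the new resource becomes visible before using it. The sketch below is a generic illustration, not AWS code — `LookupError` stands in for a not-found-style error such as EC2’s `InvalidSubnetID.NotFound`, and the timeout values are assumptions. (When the SDK you use offers built-in waiters, as boto3 does for many EC2 resources, prefer those.)

```python
import time

def wait_until_visible(describe_call, timeout=60.0, interval=2.0):
    """Poll a zero-argument callable until it stops raising LookupError.

    LookupError stands in for a 'resource does not exist' error caused
    by eventual consistency; a real script would catch the SDK's
    not-found exception for that resource type.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            return describe_call()
        except LookupError:
            if time.monotonic() >= deadline:
                raise  # resource never propagated; give up
            time.sleep(interval)
```

Only after this returns would the script go on to launch the instance into the subnet.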
It would be nice if AWS could tell us that “eventual” means five seconds or even five minutes, but that would be too easy. What we do know is that some APIs take longer to reach consistency than others. Changing an image’s LaunchPermission via ModifyImageAttribute, for example, can take longer to propagate, and connectivity issues between AWS data centers can stretch the wait further.
Idempotency can be a tricky subject. API idempotency refers to the ability to call an API several times while producing the same outcome as if you had called the API once. Let’s use a dog as an example. If you feed a dog once, it will be full. But if you keep feeding the dog, you will end up with a fat puppy. Therefore, feeding a dog is not idempotent. In contrast, petting a dog once will make it happy, just as petting it many times will make it happy. This means petting a dog is idempotent.
API idempotency is important when writing automation for a remote service given the increased risk of communication errors. The default solution for communication errors is to retry the API that has failed. Without API idempotency, two disastrous situations could occur.
First, the call might be performed twice. Try explaining to your boss why your script has launched two x1.32xlarge instances instead of one. That could be a very costly error.
Second, the retry might fail even though the operation succeeded. With a communication failure, your script first receives the error and then retries. The problem is that your first call was actually performed, causing the retry to fail. For example, when you try to create a subnet with a specific CIDR block and a communication error occurs, your retry may fail because the subnet already exists.
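The usual cure for both problems is a client token: the caller attaches a unique token to the request, and the service treats a repeated token as a replay of the original call rather than a new operation (EC2’s RunInstances supports this via its ClientToken parameter). The toy server below is a sketch of the mechanism only — the class and method names are invented for illustration:

```python
import uuid

class FakeEC2:
    """Toy service showing how a client token makes a create call idempotent.

    Repeating a request with the same token returns the original result
    instead of creating a second resource; this models (but is not)
    EC2's ClientToken behavior.
    """
    def __init__(self):
        self._by_token = {}   # token -> instance id already created
        self._instances = []  # every instance actually launched

    def run_instance(self, client_token):
        if client_token in self._by_token:
            # Replay of an earlier request: no new instance is created.
            return self._by_token[client_token]
        instance_id = f"i-{len(self._instances):08x}"
        self._instances.append(instance_id)
        self._by_token[client_token] = instance_id
        return instance_id
```

A retry after a timeout then becomes safe: calling `run_instance` twice with the same `uuid.uuid4()` token yields the same instance id and launches only one instance — no second x1.32xlarge surprise on the bill.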
AWS has set standards that put it at the top of the cloud market. Its platform makes powerful cloud automation possible, but a script that ignores these pitfalls leaves you open to urgent problems.
At CloudEndure we guarantee a robust application. Because our APIs are reviewed and verified for the three challenges outlined in this post, we can ensure continual success in creating AWS environments for both disaster recovery and migrations. Learn more.