After yet another cloud outage yesterday (see “AWS’s S3 outage was so bad Amazon couldn’t get into its own dashboard to warn the world”), the world (or at least its North American part) once again went crazy about how dangerous the cloud is and how you should go build your own data center because you know better what is good for your business.
Putting aside all the hype as well as some quite senseless social media posts about AWS SLAs, here is our thought process for developing highly available cloud services or, more importantly, for making conscious decisions about what our services’ SLAs should be.
I will base this post on my experience as a customer of Docker, which was impacted by the outage yesterday, but I will also walk you through the thought process for our own services. Without knowing Docker’s business strategy I will speculate a bit, but my goal is to walk you through the process and not to define Docker’s HA strategy. For those who are not familiar with what the problem with Docker was: Docker’s public repository is hosted on S3 and was not accessible during the outage.
The first thing we look at is, of course, the business impact of the service. Nothing new here! Thinking about Docker’s registry outage, here are my thoughts:
- An outage may impact all customer deployments that use Docker Hub images. Theoretically, this is every one of Docker’s customers. Based on this alone, the impact can be huge
- On the other hand, Docker’s enterprise customers (small and big) customize the images they use and most probably store them in private repositories. Docker’s outage doesn’t necessarily impact those private repositories, which means that we can lower the impact
- Docker is a new company, though, and their success is based on making developers happy. Those developers may be constantly hacking on something (as in my case yesterday :)) and using the public repository. Being down will make the developers unhappy and will have an impact on Docker’s PR
- In addition, Docker wants to establish itself as THE company for the cloud. Incidents like yesterday’s may have a negative impact on this aspiration, mainly from a PR and growth point of view
With just those simple points, one can now make a conscious decision that the impact of Docker’s public repository being down is most probably high. What to do about it?
The simplest thing you can do in such a situation is to set expectations upfront. Calculate a realistic availability SLA and publish it on your site. Unfortunately, looking at Docker Hub’s site, I was not able to find one. In general, I think cloud providers bury their SLAs so deep that it is hard for customers to find them. Thus, people search on Google or Bing and start citing the first number they find (relevant or not), which makes the PR issue even worse. I would go even further: I would publish not only the 9s of my SLA but also what those 9s equate to in time, and whether this is per week, month or year. Take, for example, Amazon’s S3 SLA: after being down for approximately 3 hours yesterday, if we consider it annually, they are still within their 8h 45min of allowed downtime.
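To make the "9s to time" conversion concrete, here is a minimal sketch. It assumes that the 8h 45min figure above corresponds to 99.9% availability measured over a 365-day year, and it uses a 30-day month; the function names are made up for illustration.

```python
# Convert an availability percentage ("the 9s") into allowed downtime.
# Assumptions: a 365-day year and a 30-day month.
HOURS_PER_PERIOD = {
    "week": 7 * 24,
    "month": 30 * 24,
    "year": 365 * 24,
}


def allowed_downtime_hours(availability_percent: float, period: str) -> float:
    """Return the downtime budget in hours for the given period."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return HOURS_PER_PERIOD[period] * (1 - availability_percent / 100)


def format_hours(hours: float) -> str:
    """Format hours as 'Xh YYmin', rounding down to whole minutes."""
    h, m = divmod(int(hours * 60), 60)
    return f"{h}h {m:02d}min"


if __name__ == "__main__":
    for period in ("week", "month", "year"):
        budget = allowed_downtime_hours(99.9, period)
        print(f"99.9% per {period}: {format_hours(budget)}")
```

Under these assumptions, a 3-hour outage fits into the yearly budget but would exceed the budget of the same 99.9% measured per month, which is exactly why the period should be published next to the 9s.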
Now that you have made sure you have a good answer for your customers, let’s think about how you can keep those SLAs intact. This doesn’t mean, however, that you should go ahead and overdesign your infrastructure and spin up a multimillion project that will provide redundancy for every component of every application you manage. We heard a lot of voices yesterday calling for you to start multi-cloud deployments immediately. You could do that, but is it the right thing?
I personally like to think about this problem gradually and revisit the HA strategy on a regular basis. During those reviews, you should look at the business requirements as well as at the next logical step for making improvements. Multi-cloud can be in your long-term strategy, but it is certainly a much bigger undertaking than providing a quick HA solution with your current provider. In yesterday’s incident, the next logical step for Docker would be to have a second copy of the repository in US West and the ability to quickly switch to it if something happens with US East (or vice versa). This is a small incremental improvement that will make a huge difference for the customers and boost Docker’s PR because they can say: “Look! We host our repository on S3 but their outage had minimal or no impact on us. And, by the way, we know how to do this cloud stuff.” After that, you can think about multi-cloud and how to implement it.
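The "second copy plus quick switch" idea can be sketched as a simple ordered failover. This is only an illustration, not how Docker or S3 actually work: the region names, the `fetch` callback and the simulated outage are all assumptions, and a real setup would also need to keep the copies in sync.

```python
from typing import Callable, List, Sequence, Tuple


class AllRegionsFailed(Exception):
    """Raised when no region could serve the request."""


def fetch_with_failover(
    key: str,
    regions: Sequence[str],
    fetch: Callable[[str, str], bytes],
) -> Tuple[str, bytes]:
    """Try each region in order and return (region, data) from the first that works."""
    errors: List[str] = []
    for region in regions:
        try:
            return region, fetch(region, key)
        except Exception as exc:  # a real client would catch narrower errors
            errors.append(f"{region}: {exc}")
    raise AllRegionsFailed("; ".join(errors))


# Hypothetical stand-in for a storage client, simulating an outage in the east.
def simulated_fetch(region: str, key: str) -> bytes:
    if region == "us-east-1":
        raise ConnectionError("simulated outage")
    return f"{key} served from {region}".encode()


if __name__ == "__main__":
    region, data = fetch_with_failover(
        "library/ubuntu", ["us-east-1", "us-west-2"], simulated_fetch
    )
    print(region, data.decode())
```

The point of the sketch is how small the first step is: an ordered list of locations and a fallback loop, long before multi-cloud enters the picture.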
Last but not least, your HA strategy should be tied not only to your monitoring, alerting and remediation but also to your customer support strategy. Monitoring and alerting are clear: you want to know if your site or parts of it are down and take the appropriate actions as described in your remediation plan. But why your customer support strategy? Well, if you haven’t noticed, the AWS Service Dashboard was also down yesterday. The question comes up: how do you notify your customers of issues with your service if your standard channel is also down? I know that a lot of IT guys don’t think of it, but Twitter turns out to be a pretty good communication tool. Maybe you should think of it next time your site is down.
Developing a solid HA strategy doesn’t need to be a big-bang approach. As with everything else, you should ask good questions, take incremental steps, fail and learn. And most importantly, take responsibility for your decisions and don’t blame the cloud for all the bad things that happen to your site.