After yet another cloud outage yesterday (see "AWS's S3 outage was so bad Amazon couldn't get into its own dashboard to warn the world"), the world (or at least its North American part) once again went crazy about how dangerous the cloud is and how you should go build your own data center because you know better what is good for your business.

Putting aside all the hype, as well as some quite senseless social media posts about AWS SLAs, here is our thought process for developing highly available cloud services or, more importantly, for making conscious decisions about what our services' SLAs should be.

I will base this post on my experience as a Docker customer impacted by yesterday's outage, but I will also walk you through the thought process we use for our own services. Without knowing Docker's business strategy I will speculate a bit, but my goal is to walk you through the process, not to define Docker's HA strategy. For those who are not familiar with the problem: Docker's public repository is hosted on S3 and was not accessible during the outage.

The first thing we look at is, of course, the business impact of the service. Nothing new here! Thinking about Docker's registry outage, here are my thoughts:

  • An outage may impact all customer deployments that use Docker Hub images. Theoretically, this is every one of Docker's customers. Based on this alone, the impact can be huge
  • On the other hand, Docker's enterprise customers (small and big) customize the images they use and most probably store them in private repositories. The outage doesn't necessarily impact those private repositories, which lowers the impact
  • Docker is a young company, though, and its success is built on making developers happy. Those developers are constantly hacking on something (like, for example, my case yesterday :)) and using the public repository. Being down makes those developers unhappy and hurts Docker's PR
  • In addition, Docker wants to establish itself as THE company for the cloud. Incidents like yesterday's may have a negative impact on this aspiration, mainly from a PR and growth point of view

With just those simple points, one can now make a conscious decision that the impact of Docker’s public repository being down is most probably high. What to do about it?

The simplest thing you can do in such a situation is to set expectations upfront. Calculate a realistic availability SLA and publish it on your site. Unfortunately, looking at Docker Hub's site I was not able to find one. In general, I think cloud providers bury their SLAs so deep that it is hard for customers to find them. Thus, people search on Google or Bing and start citing the first number they find (relevant or not), which makes the PR issue even worse. I would go even further – I would publish not only the 9s of my SLA but also what those 9s equate to in time, and whether that is per week, month or year. Taking Amazon's S3 SLA as an example: after being down for approximately 3 hours yesterday, if we consider the SLA annually, they are still within their 8h 45min of allowed downtime.
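
To put those numbers in perspective: a year has 8,760 hours, so a 99.9% availability commitment allows 8,760 × 0.001 ≈ 8.76 hours, or roughly 8h 45min, of downtime per year, while 99.99% allows only about 53 minutes.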

Now that you have made sure you have a good answer for your customers, let's think about how you can keep those SLAs intact. However, this doesn't mean that you should go ahead and overdesign your infrastructure and spin up a multimillion-dollar project that provides redundancy for every component of every application you manage. There were a lot of voices yesterday calling for you to start multi-cloud deployments immediately. You could do that, but is it the right thing?

I personally like to think about this problem gradually and to revisit the HA strategy on a regular basis. During those reviews, you should look at the business requirements as well as at the next logical step for improvement. Multi-cloud can be in your strategy long term, but it is certainly a much bigger undertaking than providing a quick HA solution with your current provider. In yesterday's incident, the next logical step for Docker would be to keep a second copy of the repository in US West and the ability to quickly switch to it if something happens to US East (or vice versa). This is a small incremental improvement that will make a huge difference for customers and boost Docker's PR, because they can say: "Look! We host our repository on S3, but their outage had minimal or no impact on us. And, by the way, we know how to do this cloud stuff." After that, you can think about multi-cloud and how to implement it.
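
To make the "second copy in US West" idea concrete, here is a minimal Terraform sketch of cross-region S3 replication. The bucket names, the aliased us-west-2 provider and the replication IAM role are hypothetical, and the exact syntax depends on your AWS provider version (newer versions move replication into a standalone aws_s3_bucket_replication_configuration resource):

    provider "aws" {
      region = "us-east-1"
    }

    provider "aws" {
      alias  = "west"
      region = "us-west-2"
    }

    # Replica bucket in US West (versioning is required on both sides)
    resource "aws_s3_bucket" "replica" {
      provider = aws.west
      bucket   = "example-registry-us-west-2"   # hypothetical name

      versioning {
        enabled = true
      }
    }

    # Primary bucket in US East, replicating every object to the replica
    resource "aws_s3_bucket" "primary" {
      bucket = "example-registry-us-east-1"     # hypothetical name

      versioning {
        enabled = true
      }

      replication_configuration {
        role = aws_iam_role.replication.arn     # IAM role assumed to be defined elsewhere

        rules {
          id     = "replicate-everything"
          prefix = ""
          status = "Enabled"

          destination {
            bucket        = aws_s3_bucket.replica.arn
            storage_class = "STANDARD"
          }
        }
      }
    }

Replication alone only keeps the data in two regions; you still need a switching mechanism (for example, DNS failover) to actually point clients at the replica during an outage.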

Last but not least, your HA strategy should also be tied to your monitoring, alerting and remediation plans, as well as to your customer support strategy. Monitoring and alerting are clear – you want to know if your site or parts of it are down and take the appropriate actions as described in your remediation plan. But why the customer support strategy? Well, if you haven't noticed – the AWS Service Health Dashboard was also down yesterday. The question comes up: how do you notify your customers of issues with your service if your standard channel is also down? I know that a lot of IT folks don't think of it, but Twitter turns out to be a pretty good communication tool – maybe you should think of it next time your site is down.

Developing a solid HA strategy doesn't need to be a big-bang effort. As with everything else, you should ask good questions, take incremental steps, fail and learn. And most importantly, take responsibility for your decisions and don't blame the cloud for every bad thing that happens to your site.

For a while now we have been working with a large enterprise client, helping them migrate their on-premises workloads to the cloud. Of course, as added value to the process, they are also migrating their legacy development processes to the modern, better, agile DevOps approach. And of course, they have built a modern Continuous Integration/Continuous Delivery (CI/CD) pipeline consisting of Bitbucket, Jenkins, Artifactory, Puppet and some relevant testing frameworks. "It is all great!" you would say, "so what is the problem?"

Because I am on all kinds of mailing lists for this client, I noticed recently that my dedicated email inbox started getting more and more emails related to the CI/CD pipeline: unexpected Jenkins build failures, artifacts that cannot be downloaded, server outages and so on and so on. You already guessed it – emails that report problems with the CI/CD pipeline and prevent the development teams from doing their job.

I don't want to go into details about what exactly went wrong with this client; I will only say that a year ago, when we designed the pipeline, there were a few things in the design that never made it into the implementation. The more surprising part for me, though, is that if you search the internet for CI/CD pipelines you will get exactly the picture of what our client has in production. The problem is that all the literature about CI/CD is narrowly focused on how the code is delivered to its destination, while the operational, security and business sides of the CI/CD pipeline are completely neglected.

Let's step back and take a look at how a CI/CD pipeline is currently implemented in the enterprise. Looking at the picture below, there are a few typical components included in the pipeline:

  • Code repository – typically this is some Git flavor
  • Build tools like Maven
  • Artifacts repository – most of the time this is Nexus or Artifactory
  • CI automation or orchestration server – typically Jenkins in the enterprise
  • Configuration management and deployment automation tools like Puppet, Chef or Ansible
  • Various test automation tools depending on the project requirements and frameworks used
  • Internal or external cloud for deploying the code

Typical CI/CD Pipeline Components

The above set of CI/CD components is absolutely sufficient for getting the code from the developer's desktop to the front-end servers and can be completely automated. But those components do not answer a few very important questions:

  • Are all components of my CI/CD pipeline up and running?
  • Are the components performing according to my initial expectations (sometimes documented as SLAs)?
  • Who is committing code, scheduling builds and deployments?
  • Is the feature quality increasing from one test to another or is it decreasing?
  • How much does each component cost me?
  • What is the overall cost of operating the pipeline? Per day, per week or per month? What about per deployment?
  • Which components can be optimized in order to achieve faster time to deployment and lower cost?

These are questions that none of the typical components listed above can answer holistically. Jenkins may be able to send you a notification if a particular job fails, but it will not tell you how much the build costs you. Artifactory may store all your artifacts, but it will not tell you if you are running out of storage or what that storage costs. The test tools can give you individual test reports but rarely build trends by feature or product.

Hence, in our implementations of CI/CD pipelines we always include three additional components, as shown in the picture below:

  • Monitoring and Alerting Component, used to collect data from every other component of the pipeline. Its purpose is to make sure the pipeline runs uninterrupted and to gather the data used for business reporting. If there are anomalies, alerts are sent to the affected parties
  • Security Component, used not only to ensure consistent access policies but also to provide auditing capabilities for compliance requirements like HIPAA, PCI, SOX, etc.
  • Business Dashboarding and Reporting Component, used to provide financial and project information to business users and management

Advanced CI/CD Pipeline Components

The way CI/CD pipelines are currently designed and implemented is yet another proof that we as technologists neglect important aspects of the technologies we design. Security, reliability and business (project and financial) reporting are very important to CI/CD pipeline users, and we should make sure they are included in the design from the get-go rather than implemented as an afterthought.

Recently I had to design the backup infrastructure for a client's cloud workloads in order to ensure compliance with the Business Continuity and Disaster Recovery standards they have set. However, following traditional IT practices in the cloud quite often poses certain challenges. The scenario we had to satisfy is best shown in the picture below:

Agent-Based Backup Architecture

The picture is quite simple:

  1. Application servers have a backup agent installed
  2. The backup agent submits the data that needs to be backed up to the media server in the cloud
  3. The cloud media server submits the data to the backup infrastructure on premise, where the backups are stored on long-term storage according to the policy

This is a very standard architecture for many of the current backup tools and technologies.

Some of the specifics in the architecture above are that:

  • The application servers and the cloud media server live in different accounts or VPCs (to use AWS terminology), or in different subscriptions or virtual networks (in Microsoft Azure terminology)
  • The connectivity between the cloud and on-premises is established through Direct Connect or ExpressRoute, and logically those links are also considered separate VPCs or virtual networks

This architecture would be perfectly fine if the application servers were long-lived. However, we were transitioning the application team to a more agile DevOps process, which meant they would use automation to replace the application servers with every new deployment (for more information take a look at the Blue/Green Deployment White Paper published on our company's website). This, though, didn't fit well with the traditional process used by the IT team managing the on-premises NetBackup infrastructure. The main issue was that every time one of the application servers was terminated, somebody from the on-prem IT team would get paged for a failed backup and trigger an unnecessary investigation.

One option for solving the problem, presented to us by the on-premises IT team, was to use a traditional job scheduling solution to trigger a script that creates the backup and submits it to the media server. This approach doesn't require them to manually whitelist the IP addresses of the application servers in their centralized backup tool and will not generate error events, but it involves additional tools that would require much more infrastructure and licensing fees. Another option was to keep the old application servers running longer so that the backup team has enough time to remove their IPs from the whitelist. This, though, required manual intervention on both sides (ours and the on-prem IT team's) and was prone to errors.

The approach we decided to go with required a little bit more infrastructure but was fully automatable and was relatively cheap compared to the other two options. The picture below shows the final architecture.

The only difference here is that instead of running the backup agents on the actual application instances, we run just one backup agent on a separate instance that has an unlimited lifespan and doesn't get terminated with every release. This can be a much smaller instance than the ones used for hosting the application, which saves some cost, and its only role is to host the backup agent, hence no other connections to it should be allowed. The daily backups for the applications are stored on a shared drive that is accessible from the instance hosting the agent, and this shared drive is automatically mounted on the new application instances during each deployment. Depending on whether you deploy this architecture in AWS or Azure, you can use EFS or Azure Files for the implementation, as sketched below.
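
For the AWS flavor, the shared drive can be a small EFS file system with a mount target in the application subnet. Here is a minimal Terraform sketch; the subnet and security group variables are hypothetical placeholders for your own network setup:

    # Shared file system that holds the daily backups
    resource "aws_efs_file_system" "backup_share" {
      creation_token = "backup-share"

      tags = {
        Name = "backup-share"
      }
    }

    # Mount target in the application subnet so that both the application
    # instances and the long-lived backup-agent instance can mount the share
    resource "aws_efs_mount_target" "backup_share" {
      file_system_id  = aws_efs_file_system.backup_share.id
      subnet_id       = var.app_subnet_id      # hypothetical variable
      security_groups = [var.backup_sg_id]     # hypothetical variable
    }

On Azure, the equivalent would be an Azure Files share mounted over SMB on the application and backup-agent instances.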

Here are the benefits that we achieved with this architecture:

  • Complete automation of the process that supports Blue/Green deployments
  • No changes in the already existing backup infrastructure managed by the IT team using traditional IT processes
  • Predictable, relatively low cost for the implementation

This was a good case study where we bridged the modern DevOps practices and the traditional IT processes to achieve a common goal of continuous application backup.

Over the last few days, I was looking for a way to automate our deployment environments on Azure and also investigating automation frameworks for a customer. The debate was between Terraform and Ansible, and the following article from Gruntwork did a really good job of tilting the scale towards Terraform. We have considerations similar to the folks at Gruntwork, so everything matched well. Now the task was to get Terraform working with Azure, which turned out to be a small challenge compared to AWS.

For those of you interested in the background, here is the Terraform documentation for the Azure provider, which is pretty good but misses a small piece about assigning a role, as described in this StackOverflow post. Of course, the post points to the Azure documentation about using the CLI to assign the role to the principal, which for me ended up with the following error:

Principals of type Application cannot validly be used in role assignments.

At the end of the day, because of time pressure, I wasn't able to figure out how to do that with the CLI, but there is a way to do it through the Azure Portal, so here are the steps with visuals:

Create Application Registration in Your Azure Subscription

  • Go to the new Azure Portal at http://portal.azure.com and select Azure Active Directory in the navigation pane on the left:

  • Select App Registrations from the tasks blade

  • Click on the Add button at the top of the blade and fill in the information for the Terraform app. You can choose any name for the Name field as well as any valid URL string for the Sign-on URL field. Click on the Create button to create the app.

  • Click on the newly created app and in the Settings blade select Required Permissions

  • Click on the Add button at the top of the blade

  • In Step 1 Select an API select the Windows Azure Service Management API and click on the Select button

  • In Step 2 Select Permissions select Access Azure Service Management as organization users (preview) and click on the Select button

  • Click on the Done button to complete the flow

Now your App Registration is complete; however, you still need to assign a role to your application. Here is how this is done.

Assign a Role for Terraform App to Use ARM

Assigning a role to your application is done at the subscription level in the Azure Portal.

  • Select Subscriptions in the navigation pane on the left

  • Select the subscription where you have registered the app and select Access Control (IAM) in the task blade

  • Click on the Add button at the top of the blade and in Step 1 Select a role choose the most appropriate role for your Terraform application

Although you may be tempted to choose Owner in this step, I would suggest thinking your security policies through and selecting a role with more restrictive access. For example, if you have DevOps people running Terraform scripts, you may want to give them the Contributor role and prevent them from managing user access. Also, if you have a database team that only needs to manage Azure SQL and DocumentDB, you may restrict them to the SQL DB Contributor and DocumentDB Account Contributor roles. The list of built-in RBAC roles for Azure is available here.

  • In Step 2 Add Users type the name of your app in the search field and select it from the list. Click on the Select button to confirm

  • Click the OK button to complete the flow

Collecting ARM Credentials Information for Terraform

In order for Terraform to connect to Azure and manage the resources using Azure Resource Manager you need to collect the following information:

  • Subscription ID
  • Client ID – also known as the Application ID in Azure terminology
  • Client Secret – also known as a Key in Azure terminology
  • Tenant ID – also known as the Directory ID in Azure terminology

Here is where to obtain this information.

Azure Subscription ID

Click on Subscriptions in the navigation pane -> Select the subscription where you created the Terraform app and copy the GUID highlighted in the picture below.

Azure Client ID

What Terraform refers to as Client ID is actually the Application ID for the app that you just registered. You can get it by selecting Azure Active Directory -> App registrations -> select the name of the app you just registered and copy the GUID highlighted in the picture below.

Azure Client Secret

What Terraform refers to as Azure Client Secret is a Key that you create in your App registration. Follow these steps to create the key:

  • From Azure Active Directory -> App registrations select the application that you just created and then select Keys in the Settings blade

  • Fill in the Key description, select Duration and click on the Save button at the top of the blade. The Key value will be shown after you click the Save button.

Note: Copy and save the key value immediately. If you navigate away from the blade you will not be able to see the value anymore. You can delete the key and create a new one in the future if you lose the value.

Azure Tenant ID

The last piece of information you will need to connect Terraform to Azure Resource Manager is a Tenant ID, which is also known as Directory ID in Azure terminology. This is actually the GUID used to identify your Azure Active Directory.

Select Azure Active Directory and scroll down to Properties in the tasks blade. Select Properties and copy the GUID highlighted in the picture below.

The Terraform documentation describes a different method for obtaining the Tenant ID, which involves showing the OAuth Authorization Endpoint for the application you just created and copying the GUID from the URL. I think their approach is a bit more error-prone, but if you feel confident in your copy/paste abilities you may want to give it a try.
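
With all four values collected, wiring them into Terraform is straightforward. Here is a minimal sketch of the azurerm provider block with placeholder values; note that newer versions of the provider also require an empty features block:

    provider "azurerm" {
      # Required on azurerm provider 2.x and newer; omit on very old versions
      features {}

      subscription_id = "00000000-0000-0000-0000-000000000000"  # Subscription ID
      client_id       = "00000000-0000-0000-0000-000000000000"  # Application ID of the Terraform app
      client_secret   = "<key value you saved earlier>"         # Key created under the App Registration
      tenant_id       = "00000000-0000-0000-0000-000000000000"  # Directory ID of your Azure AD
    }

Alternatively, you can leave the block empty and export the same values through the ARM_SUBSCRIPTION_ID, ARM_CLIENT_ID, ARM_CLIENT_SECRET and ARM_TENANT_ID environment variables, which keeps the secret out of your source code.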

I hope that by describing this somewhat convoluted registration process I have helped you be more productive managing your resources on Azure.