For some time now we have been working with a large enterprise client, helping them migrate their on-premises workloads to the cloud. As added value to the process, they are also migrating their legacy development processes to a modern, agile DevOps approach. And of course, they have built a modern Continuous Integration/Continuous Delivery (CI/CD) pipeline consisting of Bitbucket, Jenkins, Artifactory, Puppet and the relevant testing frameworks. “It is all great!”, you would say, “so what is the problem?”

Because I am on all kinds of mailing lists for this client, I noticed recently that my dedicated email inbox started getting more and more emails related to the CI/CD pipeline: unexpected Jenkins build failures, artifacts that cannot be downloaded, server outages and so on. You have already guessed it – emails reporting problems with the CI/CD pipeline, problems that prevent the development teams from doing their job.

I don’t want to go into the details of what exactly went wrong with this client; I will only say that a year ago, when we designed the pipeline, there were a few things in the design that never made it into the implementation. The more surprising part for me, though, is that if you search the internet for CI/CD pipelines, you will get the exact picture of what our client has in production. The problem is that all the literature about CI/CD is narrowly focused on how the code is delivered to its destination, while the operational, security and business sides of the CI/CD pipeline are completely neglected.

Let’s step back and take a look at how a CI/CD pipeline is currently implemented in the enterprise. Looking at the picture below, there are a few typical components included in the pipeline (a simplified sketch of the automated flow follows below):

  • Code repository – typically this is some Git flavor
  • Build tools like Maven
  • Artifacts repository – most of the time this is Nexus or Artifactory
  • CI automation or orchestration server – typically Jenkins in the enterprise
  • Configuration management and deployment automation tools like Puppet, Chef or Ansible
  • Various test automation tools depending on the project requirements and frameworks used
  • Internal or external cloud for deploying the code

Typical CI/CD Pipeline Components
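
To make the flow concrete, below is a deliberately minimal Python sketch of the automation these components implement – check out the code, build and test it, publish the artifact and deploy it. In a real pipeline this logic lives in Jenkins jobs and configuration-management code rather than a standalone script, and the repository URL, Maven project and playbook names here are placeholders:

    import subprocess


    def run(step: str, *cmd: str) -> None:
        """Run one pipeline step and fail the whole flow if it fails."""
        print(f"--- {step} ---")
        subprocess.run(cmd, check=True)


    if __name__ == "__main__":
        # Hypothetical repository URL and file layout.
        run("checkout", "git", "clone", "https://bitbucket.example.com/scm/app/app.git", "app")
        run("build and unit tests", "mvn", "-f", "app/pom.xml", "clean", "verify")
        run("publish artifact", "mvn", "-f", "app/pom.xml", "deploy")        # to Nexus/Artifactory
        run("deploy and configure", "ansible-playbook", "app/deploy.yml")    # or Puppet/Chef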

The above set of CI/CD components is absolutely sufficient for getting the code from the developer’s desktop to the front-end servers, and the process can be completely automated. But those components do not answer a few very important questions:

  • Are all components of my CI/CD pipeline up and running?
  • Are the components performing according to my initial expectations (sometimes documented as SLAs)?
  • Who is committing code, scheduling builds and deployments?
  • Is the feature quality increasing from one test to another or is it decreasing?
  • How much does each component cost me?
  • What is the overall cost of operating the pipeline? Per day, per week or per month? What about per deployment?
  • Which components can be optimized in order to achieve faster time to deployment and lower cost?

These are too many questions that none of the typical components listed above can answer in a holistic way. Jenkins may be able to send you a notification if a particular job fails, but it will not tell you how much the build costs you. Artifactory may store all your artifacts, but it will not tell you if you are running out of storage or what that storage costs. The test tools can give you individual test reports but rarely build trends by feature or product.

Hence, in our implementations of CI/CD pipelines we always include three additional components, as shown in the picture below:

  • Monitoring and Alerting Component, used to collect data from every other component of the pipeline. Its purpose is to make sure the pipeline is running uninterrupted and to collect the data used for business reporting; if there are anomalies, alerts are sent to the affected parties (a minimal health-check sketch follows the picture below)
  • Security Component, used not only to enforce consistent access policies but also to provide auditing capabilities when compliance requirements such as HIPAA, PCI or SOX apply
  • Business Dashboarding and Reporting Component, used to provide financial and project information to business users and management

Advanced CI/CD Pipeline Components
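
As an illustration of the first component, here is a minimal health-check sketch in Python. The endpoint URLs are placeholders for your own Bitbucket, Jenkins and Artifactory instances, and the alert() function is a stub you would wire to email, Slack, PagerDuty or whatever your teams already use:

    import urllib.error
    import urllib.request

    # Placeholder health endpoints – replace with the URLs of your own components.
    PIPELINE_COMPONENTS = {
        "bitbucket": "https://bitbucket.example.com/status",
        "jenkins": "https://jenkins.example.com/login",
        "artifactory": "https://artifactory.example.com/artifactory/api/system/ping",
    }


    def alert(component: str, reason: str) -> None:
        """Stub – integrate with your paging or chat tool here."""
        print(f"ALERT: {component} is unhealthy ({reason})")


    def check(component: str, url: str, timeout: int = 5) -> None:
        """Consider the component healthy if its endpoint answers successfully."""
        try:
            with urllib.request.urlopen(url, timeout=timeout):
                print(f"OK: {component}")
        except (urllib.error.URLError, OSError) as exc:
            alert(component, str(exc))


    if __name__ == "__main__":
        for name, endpoint in PIPELINE_COMPONENTS.items():
            check(name, endpoint)

The same collected data (response times, failure counts, build durations, storage usage) can then feed the business dashboards described above.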

The way CI/CD pipelines are currently designed and implemented is yet another proof that we as technologists neglect important aspects of the technologies we design. Security, reliability and business (project and financial) reporting are very important to the users of a CI/CD pipeline, and we should make sure they are included in the design from the get-go, not bolted on as an afterthought.

Recently I had to design the backup infrastructure for cloud workloads for a client, in order to ensure compliance with the Business Continuity and Disaster Recovery standards they have set. However, following traditional IT practices in the cloud quite often poses certain challenges. The scenario that we had to satisfy is best shown in the picture below:

Agent-Based Backup Architecture

The picture is quite simple:

  1. The application servers have a backup agent installed
  2. The backup agent submits the data that needs to be backed up to the media server in the cloud
  3. The cloud media server submits the data to the on-premises backup infrastructure, where the backups are stored on long-term storage according to the policy

This is a very standard architecture for many of the current backup tools and technologies.

Some of the specifics in the architecture above are that:

  • The application servers and the cloud media server live in different accounts or VPCs (in AWS terminology), or in different subscriptions or virtual networks (in Microsoft Azure terminology)
  • The connectivity between the cloud and the on-premises environment is established through Direct Connect or ExpressRoute, and logically those links are also considered separate VPCs or virtual networks

This architecture would be perfectly fine if the application servers were long-lived. However, we were transitioning the application team to a more agile DevOps process, which meant they would use automation to replace the application servers with every new deployment (for more information, take a look at the Blue/Green Deployment White Paper published on our company’s website). This, though, didn’t fit well with the traditional process used by the IT team managing the on-premises NetBackup infrastructure. The main issue was that every time one of the application servers got terminated, somebody from the on-premises IT team would get paged for a failed backup and trigger an unnecessary investigation.

One option for solving the problem, presented to us by the on-premises IT team, was to use a traditional job-scheduling solution to trigger a script that would create the backup and submit it to the media server. This approach would not require them to manually whitelist the IP addresses of the application servers in their centralized backup tool and would not generate error events, but it involved additional tools that would require much more infrastructure and license fees. Another option was to keep the old application servers running longer, so that the backup team has enough time to remove their IPs from the whitelist. This, though, required manual intervention on both sides (ours and the on-premises IT team’s) and was prone to errors.

The approach we decided to go with required a little bit more infrastructure, but it was fully automatable and relatively cheap compared to the other two options. The picture below shows the final architecture.

The only difference here is that instead of running backup agents on the actual application instances, we run a single backup agent on a separate instance that has an unlimited lifespan and doesn’t get terminated with every release. This can be a much smaller instance than the ones used for hosting the application, which saves some cost, and because its only role is to host the backup agent, no other connections to it should be allowed. The daily backups for the applications are stored on a shared drive that is accessible from the instance hosting the agent, and this shared drive is automatically mounted on the new application instances during each deployment. Depending on whether you deploy this architecture in AWS or Azure, you can use EFS or Azure Files for the implementation.
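
For illustration, here is a minimal sketch of the deployment step that mounts the shared drive on a freshly provisioned application instance. It assumes AWS EFS with the amazon-efs-utils mount helper installed; the file system ID and mount point are placeholders (on Azure the equivalent step would mount an Azure Files share):

    import pathlib
    import subprocess

    EFS_FILE_SYSTEM_ID = "fs-12345678"   # placeholder – your EFS file system ID
    MOUNT_POINT = "/mnt/app-backups"     # shared drive also visible to the backup-agent instance


    def mount_backup_share() -> None:
        """Create the mount point and mount the shared EFS file system."""
        pathlib.Path(MOUNT_POINT).mkdir(parents=True, exist_ok=True)
        # 'mount -t efs' is provided by amazon-efs-utils; plain NFSv4 mount
        # options work as well if the helper is not installed.
        subprocess.run(
            ["mount", "-t", "efs", f"{EFS_FILE_SYSTEM_ID}:/", MOUNT_POINT],
            check=True,
        )


    if __name__ == "__main__":
        mount_backup_share()

Because the mount happens as part of the deployment automation, every new Blue/Green instance writes its backups to the same shared drive, and the long-lived agent instance picks them up without any manual steps.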

Here are the benefits that we achieved with this architecture:

  • Complete automation of the process that supports Blue/Green deployments
  • No changes in the already existing backup infrastructure managed by the IT team using traditional IT processes
  • Predictable, relatively low cost for the implementation

This was a good case study where we bridged the modern DevOps practices and the traditional IT processes to achieve a common goal of continuous application backup.

It is surprising to me that every day I meet developers who do not have a basic understanding of how computers work. Recently I got into an argument about whether this understanding is necessary to become a good cloud software engineer, and the main point of my opponent was that “modern languages and frameworks take care of lots of stuff behind the scenes, hence you don’t need to know about those”. Although the latter is true, it does not release us (the people who develop software) from the responsibility to think when we write software.

The best analogy I can think of is the recent stories about Tesla’s Autopilot – just because it is called “autopilot” doesn’t mean it will not run you into a wall. Like the Tesla driver, as a software engineer it is your responsibility to understand how your code is executed (i.e. where your car is taking you), and if you don’t have a basic understanding of how computers work (i.e. how to drive a car, or common sense in general), you will not know whether it runs well.

If you want to become an advanced Cloud Software Engineer, there are certain things you need to understand in order to develop applications that run on multiple machines in parallel, across many geographical regions, using third-party services and so on. Here is an initial list of things that, I believe, are essential for Cloud Software Engineers to know.

First of all, every Software Engineer (Cloud, Web, Desktop, Mobile etc.) needs to understand the fundamentals of computing. Things like numeric systems, character encoding, bits and bytes are essential knowledge for Software Engineers. You also need to understand how operating on bits and bytes is different from operating on decimal digits, and what issues you can face.
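
A few lines of Python are enough to illustrate the kind of surprises hiding behind these fundamentals:

    # The same value in decimal, binary and hexadecimal.
    n = 42
    print(bin(n), hex(n))             # 0b101010 0x2a

    # Binary floating point cannot represent every decimal fraction exactly.
    print(0.1 + 0.2 == 0.3)           # False
    print(0.1 + 0.2)                  # 0.30000000000000004

    # One character is not necessarily one byte.
    text = "héllo"
    print(len(text))                  # 5 characters
    print(len(text.encode("utf-8")))  # 6 bytes – 'é' takes two bytes in UTF-8

    # Fixed-width integers (common outside Python) silently wrap around.
    print((255 + 1) & 0xFF)           # 0 – an unsigned 8-bit value overflows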

Understanding the computer hardware is also important for you as a Software Engineer. In the age of virtualization one could call this nonsense, but knowing that data is fetched from permanent storage into operating memory before your application can process it may be quite important for data-heavy applications. It will also help you decide what size of virtual machine you need – one with more CPU, more memory, more local storage, or all of the above.

Basic knowledge of operating systems – particularly processes, execution threads and environment settings – is another thing Software Engineers must learn; otherwise, how would you be able to implement or configure an application that supports multiple concurrent users?
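
As a small, hypothetical example, the sketch below reads its configuration from an environment variable (APP_WORKER_THREADS is a made-up name) and serves several concurrent “users” on a pool of threads inside a single process:

    import os
    import threading
    from concurrent.futures import ThreadPoolExecutor

    # Environment settings typically drive this kind of configuration.
    WORKERS = int(os.environ.get("APP_WORKER_THREADS", "4"))


    def handle_request(user_id: int) -> str:
        """Each concurrent request is served by one of the pool's threads."""
        return (f"user {user_id} handled in process {os.getpid()}, "
                f"thread {threading.current_thread().name}")


    if __name__ == "__main__":
        with ThreadPoolExecutor(max_workers=WORKERS) as pool:
            for result in pool.map(handle_request, range(8)):
                print(result)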

Networking basics like IP addresses, the Domain Name System (DNS), routing and load balancing are used every day in the cloud. Not knowing those terms and how they work is quite often the reason websites and services go down.
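
For example, a single DNS name can resolve to several IP addresses, which is the basis of simple DNS-based load balancing. The short sketch below uses example.com as a stand-in for your own service:

    import socket

    host = "example.com"
    # Collect the unique IP addresses the name resolves to.
    addresses = sorted({info[4][0] for info in socket.getaddrinfo(host, 443, proto=socket.IPPROTO_TCP)})
    print(f"{host} resolves to: {addresses}")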

Last but not least, security is very important in order to protect your users from malicious activities. Things like encryption and certificate management are must-have knowledge for Software Engineers developing cloud-based applications.
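
An expired certificate is one of the most common self-inflicted outages, so even a basic check like the sketch below (again using example.com as a stand-in for your own endpoint) is worth knowing how to write:

    import socket
    import ssl
    from datetime import datetime, timezone

    host = "example.com"
    context = ssl.create_default_context()

    # Open a TLS connection and fetch the server certificate.
    with socket.create_connection((host, 443), timeout=5) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()

    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(cert["notAfter"]), tz=timezone.utc)
    days_left = (expires - datetime.now(timezone.utc)).days
    print(f"Certificate for {host} expires on {expires:%Y-%m-%d} ({days_left} days from now)")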

You don’t need to be an expert in the topics above in order to be a good Cloud Software Engineer, but you need to understand how each of them impacts your application and be able to tweak your code accordingly. In the next few posts, I will go over the minimum knowledge you must obtain in order to have a solid background as a Cloud Software Engineer. For more in-depth information, you can research each of the topics using your favorite search engine.