
In my last post, How to Define Your Enterprise Cloud Strategy?, I wrote about the thought process you need to go through when starting to think about your cloud strategy. But how can you make your cloud strategy document engaging and worth remembering? If you have ever attended presentation skills training, you have learned that stories are much easier to remember than raw facts (I should tell that to my Biology teacher :)). So, let’s try to think of a story that we can tell in our Cloud Strategy document. A story that is familiar to the audience and that they can relate to.

For our project, I decided to go with the typical “Learn a New Technology” story. Here is how it goes…

Cloud Overview

Start with an overview of the current cloud landscape including deployment and service models, market trends and cloud vendors. The goal of this section is for the reader to learn the basic terminology, get familiar with the cloud computing market and vendors, understand the trends and in general, be able to speak the cloud language that you will use later on in the document.

Things that you may want to cover in this section are:

  • Definition of cloud computing
  • Service models (traditional and emerging)
  • Deployment models
  • Overview of the vendors and the services they offer
  • Market trends

This section should give a good background for the reader to build upon.

Cloud Deep Dive

Now that you have covered the basics, you can jump into a deeper analysis of the cloud technologies. The goal here is to provide the reader with enough information about the technologies for her to be able to make an informed decision about the cloud.

You can pick different pivots here. One I like is using the service models and covering the following topics:

  • History and maturity of each service model (IaaS, PaaS, etc.)
    Where does it come from, how has it evolved, and where is it now? You may even want to use Gartner’s Magic Quadrant to gauge maturity here
  • Advantages and disadvantages of each service model
    Don’t use only the technical answer to this question. Think from other points of view like finance, people or market
  • Applicability of the model
    You don’t need to go into too many details, but you may want to hint at when to use each service model. The typical answer here is: “Use IaaS for legacy applications and PaaS for new development.” You may want to talk about the gotchas of each service model (portability anyone?)
  • Industry adoption
    In the previous section, you hinted at the trends, but the latest and greatest may not have been adopted widely. In this section, you can provide a deeper analysis of service model adoption
  • Pricing models
    Last but not least, describe the pricing details for each service model and do some basic cost analysis between models

With this information, your reader should be able to feel confident in her knowledge about the cloud and prepared to participate in technical discussions (or continue reading your Cloud Strategy Document).

Vendors Overview

No cloud strategy can go without giving a comparison of the leading cloud vendors. Your readers should get a good overview of the main players in the cloud market, understand their market share, strengths and weaknesses and the services they offer.

Pick and choose your vendors here depending on what your enterprise is looking for. You may want to go deep into the top 3 choices and provide thorough analysis, but you should also mention a few others for comparison (in meetings, you should always anticipate the question: “What about this vendor? This service of theirs is very good!” from somebody who did some searching on Google).

Here are the things you may want to cover:

  • Market share and trends for the last three years and projections for the next 3-5 years
    You need to prove even to yourself that your choice of vendor(s) is on the right track
  • Feature richness
    Do they have everything you need to put your whole application portfolio on their cloud?
  • Maturity
    Are most of their features in beta or are they generally available? Do you feel comfortable using beta features for your production workloads?
  • Regions availability
    Even if you sell only in one country, you may want to have some redundancy. Do they offer datacenters where you are and where your customers are?
  • On-Premise connectivity
    Even startups have their own data centers (sometimes you may call them closets), and you will need to establish secure and very often fast connectivity to your premises.
  • Legacy migration support
    Traditional enterprises need to migrate a lot of legacy applications. You will certainly need vendor support when it comes to migration. The more tools they offer, the easier it will be for you to get on their cloud
  • Cost comparison
    The price wars are still going on (as of the time of this publishing), but you should have a good sense of the way vendors charge for their services – believe me, they differ
  • Compliance
    It is hard to believe that there is an industry right now that doesn’t have to comply with any regulations. With this in mind, give an overview of the cloud vendors’ compliance certifications
  • Operational capabilities
    Having the capability to automate workloads is a very small part of the operational capabilities a vendor should enable. Monitoring, alerting, reporting, and support are an essential part of the operational capabilities for the cloud
  • Innovation trends for each vendor
    You can go wild here. Starting with their R&D budget to counting the number of new services (or even features) they release quarterly can give you a hint of how good they are at innovation. (Be careful with counting services and features though – sometimes vendors bundle a few existing services under a new marketing name :))
  • Weak areas
    Despite their maturity, each cloud vendor will have some weak areas your audience should be made aware of. Whether it is the newcomer or the incumbent in the market, knowing their weaknesses will help you identify the risks in your strategy
  • Unique services
    Each vendor will have services that they excel in. Sooner or later you will need to think about multi-cloud solutions, and this will help you to better position your strategy

A comprehensive overview of the cloud vendors you choose will help your audience make better choices for their workloads and will support your further development of the strategy. If you are looking for an easy way to compare the services between different cloud vendors, I’ve started a Top 5 Cloud Vendors Service Comparison Page on GitHub that I plan to populate soon and keep up to date (feedback and contributions are welcome :))

Starting with the basics is always the first step everybody takes when learning. Giving an overview of the cloud terminology, market and players will help your audience easily understand and get onboard with the rest of your cloud strategy. In the next post, I will go over the next logical step in the learning process – what application patterns to use for developing your cloud application.

Recently, I was asked to consult on the creation of a Cloud Strategy document for a client of ours. The effort had been going on for a couple of months already, but it seemed that instead of converging, the scope of the document was constantly growing – adding new tools and technologies, creating separate documents for certain areas, and so on. The strategy was suffering from the same sickness as any other technology project – scope creep.

As always, when we, the techies, concentrate on technology, we lose sight of the actual problems and start blowing things out of proportion. And all this, just because tech is cool :)

This effort prompted the idea to create a series of posts and lay out my thoughts about an enterprise cloud strategy. Also, with the research I regularly do for clients, I thought it would be good to take all my notes out of OneNote and share them widely for feedback and suggestions.

So, let’s get started with what a strategy is and what it should entail. First of all, any “strategy” that doesn’t get people on board is a waste. Also, any strategy that is too abstract and does not give people a clear idea of how to implement it is doomed to fail (or it should be called a vision and not a strategy). Hence, I think a technology strategy (not only an enterprise cloud one) should have the following characteristics:

  • Defines an audience
    It should be clear who the people are who will need to follow or implement this strategy
  • Gives background on the technology
    This is a dumbed-down explanation of the technology including an introduction of the basic terminology and the technical jargon
  • It is easy to understand and follows some common sense logic
    The easiest way to get people onboard is to explain things in a simple, logical fashion
  • Explains the (business) value
    If there is no value, then why bother implementing it
  • Provides guidance for implementation
    If there is no guidance on how to implement the strategy, everyone will have their own interpretation and you will end up with many different implementations
  • Provides reasoning of choices
    If the strategy recommends a certain path of implementation, it should also explain why that path was chosen

With all this in mind, let’s go through the thought process for an enterprise cloud technology strategy.

Who will be the audience for your cloud strategy? The most common answer is: “The IT team!”, because they are the ones who will implement it. While this is correct, it is only a small part of the answer. Unfortunately, most enterprise cloud strategies are defined from the IT point of view and fail to get other actors onboard. Now, here is a little bit more comprehensive list:

  • IT team
  • Security team (often folded into IT but rarely thought of)
  • Line of Business (LOB) Application Development teams
  • Business Owners (LOB) including leadership and Program/Project/Product managers

Anybody else? Well, you can add Finance and Accounting, because the cloud uses a completely new business model that may change the way the company’s revenue and assets are calculated. You can also add HR, because they will need to change their approach to hiring tech personnel. And you can add your Sales and Marketing teams to show them how fast products will reach the market by moving to the cloud.

Now that you have a better understanding of who your audience is, explaining what the cloud is may not be so trivial. When you explain the technology, you should refrain from using dense technical language full of acronyms and techy jargon. Even tech-savvy people are often confused by the abundance of specialized terms and technologies that have sprawled lately. Start with the basics – explain what the cloud is, what it is good for, the deployment models, the service models, and who the vendors are, and lay the ground for everybody to understand.

Next comes the flow – will you start with the technology and tools (and risk losing most of your audience from the beginning) or will you start with something that everybody understands, like a business process or an application? The best way to approach it, I think, is to look at the types of applications available within your company and walk your audience through the journey of implementing those in the cloud. Throughout this journey, talk about the value first, then outline the options (tools, vendors, technologies) and provide guidance on which one to choose. Giving just choices is not a strategy; it is market research. It is important to make choices when you define your strategy – this way you avoid ambiguity and don’t let people wonder which way they should go. But don’t forget to justify your choices, else you risk losing your audience’s buy-in.

Last but not least, look at the new application trends and types and add those to the list to show the audience what the opportunities are. It will be easy to get everyone onboard if they see what they will miss if they don’t implement the strategy.

As with everything else, your strategy should follow a story. In my next post, I will talk about the story I find easy to follow when defining an enterprise cloud strategy. It will by no means be the only possible story, but it is one that can bring things together.

In my last post, I discussed the basics of the secure Web protocol HTTPS (or HTTP-over-SSL), which should give you some idea of what HTTPS is and why it is important. In this post, though, I would like to go into the details of how the SSL (or TLS – I will use the latter in this post so that the new name sticks with you) part of HTTPS works, what cipher suites are, and why they are important. This post is targeted more towards software developers or IT professionals, although I will try to use language simple enough for everybody to understand.

The first thing I would like to ask you to do is to go to https://www.ssllabs.com/ssltest/analyze.html and get the SSL report for your website (or any other site for that matter). Keep the page open because I will refer to portions of it throughout the post. If you scroll down the page, you will see a Configuration section and a subsection about cipher suites – this is what I will explain. I have analyzed my own domain toddysm.com on the SSLLabs site and will use it as an example. One of the cipher suites that you will see for my domain (and most probably for yours) is TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384, which will be the concrete example that we will decipher.
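
If you prefer to check this from code rather than through the SSL Labs page, here is a minimal Python sketch (standard library only, assuming Python 3) that connects to a host and prints the cipher suite your client negotiates with it. Note that OpenSSL reports suite names in a slightly different format than SSL Labs (for example ECDHE-RSA-AES256-GCM-SHA384).

import socket
import ssl

hostname = "toddysm.com"  # replace with your own domain

context = ssl.create_default_context()
with socket.create_connection((hostname, 443)) as sock:
    with context.wrap_socket(sock, server_hostname=hostname) as tls_sock:
        # cipher() returns (cipher_suite_name, protocol_version, secret_bits)
        name, version, bits = tls_sock.cipher()
        print(f"Negotiated {name} over {version} using {bits}-bit keys")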

Now, let’s look at the so-called TLS handshake that happens when you try to load an HTTPS page from a website. Although the actual TLS 1.2 RFC-5246 has a pretty detailed explanation of the handshake protocol, I find IBM’s Knowledge Center overview of the TLS handshake much simpler and clearer to understand. Let’s look deeper into the steps one by one (the italicized step text is credited to IBM’s Overview of SSL or TLS Handshake):

TLS Protocol Handshake Sequence

Step 1: The client sends a “client hello” message that lists information like TLS version and the cipher suites supported by the client in the client’s order of preference. The message also contains a random byte string that is used in subsequent computations.

The most important thing to note is that the information sent as part of the “client hello” message is not encrypted (i.e. it is in plain text). Thus, if somebody is sniffing your traffic, they will know two pieces of information: 1.) the cipher suites your browser supports (in order of preference) and 2.) the random byte string used to create the master secret. It is also unfortunate that not all modern browsers (or TLS clients) tell you what their “order of preference” is, which is a shame because the negotiated cipher suite may not be the most secure one, and having the ability to change the order of preference may increase your security.

Step 2: The server responds with a “server hello” message that contains the cipher suite chosen by the server from the list provided by the client, the session ID, and another random byte string. The server also sends its digital certificate.

Once again, the message exchanged in this step is not encrypted. For somebody listening to your traffic, all this information may be of some value.

Step 3: The client verifies the server’s certificate.

This is done through a trusted third party called a Certificate Authority. Another option is if the server certificate is already added as a trusted certificate on the client’s machine.

Step 4: The client sends the random byte string that enables both the client and the server to compute the secret key to be used for encrypting subsequent message data. The random byte string itself is encrypted with the server’s public key.

The first thing to note here is that this is the first message that is encrypted (eventually) – in this case with the server’s public key, and only the server can decrypt it. The next thing to note is the importance of yet another random byte string. This random byte string is also called the pre-master secret, and it is used together with the previous two random values to generate the so-called master secret. Depending on the sophistication of the algorithm used to generate this pre-master secret, your connection with the server may or may not be vulnerable.

Let’s look back at our SSLLabs scan and the different ciphers that your server supports (see picture below).

List of supported cipher suites

The example I took above is the fifth from the list and means the following (credits to Daniel Szpisjak for his explanation of TLS-RSA vs TLS-ECDHE-RSA vs static DH on StackExchange):

  • Ephemeral Elliptic-Curve Diffie-Hellman (aka ECDHE) is the algorithm used to generate key pairs on both the client and the server. The public keys are the random values exchanged in steps 1 and 2 above.
    One important thing to note here is that the use of the ECDHE algorithm provides the so-called forward secrecy, which means that if future communication channels between those parties are compromised, the keys cannot be used to decrypt previous conversations.
  • The pre-master secret in this case is NOT the shared secret key as per the Diffie-Hellman algorithm, as some may think. The shared secret key is never sent over the wire. This random number is encrypted using the server’s RSA public key (according to IBM’s explanation) and is used to generate a uniform shared secret using a hashing algorithm. The encryption, in this case, is not so important because this is not the actual key used for encryption.
    The RSA part of the cipher suite also denotes two more things:

    • The public key type used to authenticate the server in step 3 (and the client if client authentication is required – part of step 4)
    • And the public key type used to sign the ECDHE public keys during the exchange

In this step the client uses the following information to generate the master secret:

  • Client’s private key
  • Server’s public key

Then the client uses the random value to generate a uniform hash of the master secret that is used to encrypt the traffic. In our example, the hash is generated using SHA384 (the last part of the cipher suite).

On the other side, the server uses the following information to come up with the same master secret:

  • Client’s public key
  • Server’s private key

Then the server uses the random value to generate the same uniform hash of the master secret that is used to encrypt the traffic.

It is important to note that, by using the ECDHE algorithm for key exchange, both the client and the server generate the master secret independently and never share it over the wire, which limits the opportunities for sniffing the secret.
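
To make the “derive the same secret without ever sending it” idea concrete, here is a minimal Python sketch of an ECDH exchange using a recent version of the cryptography package. It illustrates the principle rather than the exact TLS key schedule; the curve and the HKDF parameters are arbitrary choices for the example.

from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import ec
from cryptography.hazmat.primitives.kdf.hkdf import HKDF

# Each side generates an ephemeral key pair (the "E" in ECDHE)
client_private = ec.generate_private_key(ec.SECP384R1())
server_private = ec.generate_private_key(ec.SECP384R1())

# Only the public keys are exchanged over the wire
client_shared = client_private.exchange(ec.ECDH(), server_private.public_key())
server_shared = server_private.exchange(ec.ECDH(), client_private.public_key())
assert client_shared == server_shared  # same secret on both sides, never transmitted

# Both sides then derive the actual session key material with a hash-based KDF
session_key = HKDF(
    algorithm=hashes.SHA384(),
    length=48,
    salt=None,
    info=b"illustrative handshake",
).derive(client_shared)
print(session_key.hex())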

Now, let’s move forward and look at the remaining steps in the handshake and decipher the rest of the cipher suite…

Step 5: If the TLS server sent a “client certificate request”, the client sends a random byte string encrypted with the client’s private key, together with the client’s digital certificate, or a “no digital certificate alert”.

This step is used only for authentication purposes, for the server to make sure that it communicates with the correct client. When browsing, this step is not required because browsers are not authenticated; however, if you develop services that need to authenticate their clients using certificates, this is an important step.

Note that the cipher suite is already agreed upon and in this step, the client should send a public key in the previously agreed format. In our case RSA.

Step 6: The TLS server verifies the client certificate.

This step is exactly the same as step 3 but on the server side.

Step 7: The TLS client sends the server a “finished” message.

Step 8: The TLS server sends the client a “finished” message.

Those two steps just confirm from both sides that the handshake is complete and both parties can start exchanging traffic securely.

Step 9: For the duration of the TLS session, the server and the client can now exchange messages that are symmetrically encrypted with the shared secret key.

The communication between the TLS server and the TLS client is now encrypted using the symmetric key (the master secret hash) that both parties generated independently. For the purpose of this communication, the AES-256 symmetric key encryption algorithm is used in our example. This is contrary to the popular belief that the traffic is encrypted with the public key of the server on the client side and decrypted with the private key on the server side.

To complete the example, we need to explain two more parts:

  • GCM is a mode of operation for symmetric key block ciphers like AES. Another common approach is CBC mode combined with HMAC. Both are used to provide data authenticity and confidentiality, but GCM has proven to be quite efficient and is widely used (a minimal sketch of AES-GCM is shown after this list).
  • SHA384 is the Secure Hash Algorithm used to hash every message in the above mode of operation in order to ensure the integrity of the message.
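
Here is a minimal Python sketch of AES-256 in GCM mode using the cryptography package, just to illustrate that a single symmetric key (plus a fresh nonce per message) is what protects the traffic after the handshake. The key, nonce, and messages are made up for the example.

import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key = AESGCM.generate_key(bit_length=256)  # stands in for the shared session key
aesgcm = AESGCM(key)

nonce = os.urandom(12)                # must be unique per message under the same key
plaintext = b"GET / HTTP/1.1"
associated_data = b"record header"    # authenticated but not encrypted

ciphertext = aesgcm.encrypt(nonce, plaintext, associated_data)
# Decryption raises an exception if the ciphertext or the associated data
# was tampered with - that is the authenticity part of GCM.
assert aesgcm.decrypt(nonce, ciphertext, associated_data) == plaintext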

Keep in mind that the flow above describes the superset of steps for establishing a TLS secure channel. Depending on the TLS version and the agreed-upon cipher suite, some information may be omitted during the exchange, which may make your communication vulnerable.

If you want to get really technical about how the handshake works, you can read Moserware’s post that dates back to 2009, http://www.moserware.com/2009/06/first-few-milliseconds-of-https.html (not a lot has changed though), or watch the Khan Academy video on the Diffie-Hellman Key Exchange.

With all the cybersecurity reports that we hear about lately, increasing our awareness and knowledge of online security rises higher on the task list. Unfortunately, by using crazy acronyms and fancy words, cybersecurity experts do not make it easy for normal people to understand online security. Hence, I thought a series of posts that explain online security concepts could be beneficial not only for normal people but also for some IT professionals.

In this post, I would like to explain the basics of secure communication on the Web and the terminology around it, as well as give you some guidance on why you should communicate securely on the Internet.

Let’s start with the acronyms and some terminology:

  • A protocol is a procedure or sequence of steps used to communicate over the Internet (or in general). For example, it is a “protocol” when you introduce yourself in a business meeting in Western countries to 1.) say your name 2.) do a handshake and (optional) 3.) hand over your business card.
  • HTTP stands for HyperText Transfer Protocol and is the main protocol used for the exchange of information on the Web. Note that the Web is not the Internet! Although more and more of the communication on the Internet goes over HTTPS, there are other applications on the Internet (like email, for example) that do not use HTTP as a protocol for communication. Data sent over HTTP is not secure, and everybody who can sniff (listen in on) your traffic can read what you are exchanging with the server. It is like chatting on the street with your friend – everyone passing by can hear and understand what you are talking about.
  • HTTPS stands for Secure HTTP (also called HTTP-over-SSL, which is the more accurate name) and is (as the name implies) the secure version of HTTP. Data sent over HTTPS is encrypted, and only the sender (your browser) and the receiver (the server) can understand it. It is like chatting on the street with your friend but in a language that the two of you have invented and only you two can understand. Keep in mind that everybody else can still hear you. If you don’t want everybody else to understand what you are talking about, you use this language.
  • SSL stands for Secure Sockets Layer, and it is used to secure the communication for many different protocols on the Internet (not only HTTP, which is used for browsing). Using the street chat analogy, imagine that instead of only you and your friend, there are two more friends with you. You (HTTP) whisper a message to one of your friends in plain English, but only she can hear it. She (SSL) then uses a special language that she and the second of your friends have invented (and only the two of them can understand! Note that even you don’t understand the language they are speaking) and communicates your message to that second friend of yours (SSL again). The second friend then translates the message from the special language into plain English and then whispers it to your third friend (HTTP) so quietly that only he can hear. Here is a visual of that process:
    HTTPS Analogy
  • TLS stands for Transport Layer Security and it is the new, better, more secure, more advanced (and many more superlatives you will hear) version of SSL.

From the explanations above, it may be obvious to you why you should use HTTPS when you browse the Web, but if it is not, think about the following. Sites are collecting all kinds of information about you. They use it to provide more targeted information (or advertisement), but they also integrate with third-party sites including advertisement sites, Facebook, Twitter, and Google (the latter may also be used for authentication purposes). The information they collect includes but is not limited to things like your location, IP address, browsing patterns, laptop details, and quite often information that is used to automatically sign you into certain services. This information is automatically exchanged between your browser and the website you are visiting without your knowledge. Thus, if the website you are visiting doesn’t use the HTTPS protocol, your information will be easily readable by every hacker that monitors your Web traffic.

If you own a website, you should care even more about HTTPS and make sure you configure your site to use only the HTTPS protocol for communication. The reason is that the browser vendors are starting to explicitly notify users if the site they are visiting doesn’t support HTTPS and mark it as insecure. In addition, Google will start ranking sites that do not support HTTPS lower in their search results, which may have a significant impact on your business.

With this, I hope you understand the importance of HTTPS and the implications of not using it. In the next post, targeted more to IT professionals and software developers, I plan to go more technical and explain how the TLS encryption works.

The Kubernetes website has quite useful guides for troubleshooting issues with your application, service, or cluster, but sometimes those may not be very helpful, especially if your containers are constantly failing and restarting. Quite often the cause is a permissions issue. The overall goal is for containers to run with least privilege; however, this doesn’t work well if you have mounted persistent volumes. In this case, you will need to run the containers in privileged mode or use runAsUser in the security context.

While deploying Elasticsearch on Kubernetes and trying to use Azure File as a persistent volume for my containers, I, of course, encountered this issue yet again and started thinking of a way to figure out what is going on. It would be nice if the scheduler had an option to pause the container (in the debugging sense) for troubleshooting purposes and allow the developer to connect to the container and look around.

Well, the solution is quite simple. You just need to “pause” the start of the container by yourself. The simplest way to do that is to add the sleep command as a start command for your container. This is done in the containers section of your deployment YAML as follows:

command:
- "sleep"
args:
- "300"

This way, the container will simply sleep for 5 minutes (300 seconds) instead of starting your service, and you can easily attach to the container or connect to it by executing a supported shell. For example:

$ kubectl exec -it [your-pod-id] -- bash

From there on, you can execute the command starting your service as it is set in the image’s Dockerfile.

Simple trick but can save you some time.

 

After yet another cloud outage yesterday (see AWS’s S3 outage was so bad Amazon couldn’t get into its own dashboard to warn the world), the world (or at least its North American part) once again went crazy about how dangerous the cloud is and how you should go build your own data center because you know better what is good for your business.

Putting aside all the hype, as well as some quite senseless social media posts about AWS SLAs, here is our thought process for developing highly available cloud services or, more importantly, making conscious decisions about what our services’ SLAs should be.

I will base this post on my experience as a Docker customer impacted by the outage yesterday, but I will also walk you through the thought process for our own services. Without knowing Docker’s business strategy, I will speculate a bit, but my goal is to walk you through the process, not to define Docker’s HA strategy. For those who are not familiar with what the problem with Docker was: Docker’s public repository is hosted on S3 and was not accessible during the outage.

The first thing we look at is, of course, the business impact of the service. Nothing new here! Thinking about Docker’s registry outage, here are my thoughts:

  • An outage may impact all customer deployments that use Docker Hub images. Theoretically, this is every one of Docker’s customers. Based on this alone, the impact can be huge
  • On the other hand, though, Docker’s enterprise customers (small and big) customize the images they use and most probably store them in private repositories. Docker’s outage doesn’t necessarily impact those private repositories, which means that we can lower the impact
  • Docker is a new company, though, and their success is based on making developers happy. Those developers may be constantly hacking something (like, for example, my case yesterday :)) and using the public repository. Being down will make the developers unhappy and will have an impact on Docker’s PR
  • In addition, Docker wants to establish itself as THE company for the cloud. Incidents like yesterday’s may have a negative impact on this aspiration, mainly from a PR and growth point of view

With just those simple points, one can now make a conscious decision that the impact of Docker’s public repository being down is most probably high. What to do about it?

The simplest thing you can do in such a situation is to set the expectations upfront. Calculate a realistic availability SLA and publish it on your site. Unfortunately, looking at Docker Hub’s site, I was not able to find one. In general, I think cloud providers bury their SLAs so deep that it is hard for customers to find them. Thus, people search on Google or Bing and start citing the first number they find (relevant or not), which makes the PR issue even worse. I would go even further – I would publish not only the 9s of my SLA but also what those 9s equate to in time, and whether this is per week, month, or year. Taking Amazon’s S3 SLA as an example: after being down for approximately 3 hours yesterday, if we consider the SLA annually, they are still within their 8h 45min of allowed downtime.
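
Translating the 9s into actual time is a few lines of arithmetic. The Python sketch below computes the allowed downtime for a given availability figure; 99.9% considered annually works out to roughly 8 hours and 46 minutes, which is where the 8h 45min ballpark above comes from.

def allowed_downtime_hours(sla_percent, period_hours=365 * 24):
    """Return how many hours of downtime a given SLA allows over a period."""
    return period_hours * (1 - sla_percent / 100)

for sla in (99.0, 99.9, 99.99):
    annual = allowed_downtime_hours(sla)
    monthly = allowed_downtime_hours(sla, period_hours=30 * 24)
    print(f"{sla}%: {annual:.2f} h/year, {monthly * 60:.1f} min/month")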

Now that you have made sure you have a good answer for your customers, let’s think about how you can make sure that you keep those SLAs intact. However, this doesn’t mean that you should go ahead and overdesign your infrastructure and spin up a multimillion-dollar project that will provide redundancy for every component of every application you manage. There were a lot of voices yesterday calling for you to start multi-cloud deployments immediately. You could do that, but is it the right thing?

I personally like to think about this problem gradually and revisit the HA strategy on a regular basis. During those reviews, you should look at the business requirements as well as what the next logical step is to make improvements. Multi-cloud can be in your strategy long term, but this is certainly a much bigger undertaking than providing a quick HA solution with your current provider. In yesterday’s incident, the next logical step for Docker would be to have a second copy of the repository in US West and the ability to quickly switch to it if something happens with US East (or vice versa). This is a small incremental improvement that will make a huge difference for the customers and boost Docker’s PR because they can say: “Look! We host our repository on S3, but their outage had minimal or no impact on us. And, by the way, we know how to do this cloud stuff.” After that, you can think about multi-cloud and how to implement it.

Last, but not least, your HA strategy should also be tied to your monitoring, alerting, and remediation strategy, as well as to your customer support strategy. Monitoring and alerting are clear – you want to know if your site or parts of it are down and take the appropriate actions as described in your remediation plan. But why your customer support strategy? Well, if you haven’t noticed – the AWS Service Dashboard was also down yesterday. The question comes up: how do you notify your customers of issues with your service if your standard channel is also down? I know that a lot of IT guys don’t think of it, but Twitter turns out to be a pretty good communication tool – maybe you should think of it next time your site is down.

Developing a solid HA strategy doesn’t need to be a big bang approach. As with everything else, you should ask good questions, take incremental steps, fail, and learn. And most importantly, take responsibility for your decisions and don’t blame the cloud for all the bad things that happen to your site.

For a while already we have been working with a large enterprise client, helping them migrate their on-premise workloads to the cloud. Of course, as added value to the process, they are also migrating their legacy development processes to the modern, better, agile DevOps approach. And of course, they have built a modern Continuous Integration/Continuous Delivery (CI/CD) pipeline consisting of Bitbucket, Jenkins, Artifactory, Puppet, and some relevant testing frameworks. “It is all great!” you would say, “so what is the problem?”

Because I am on all kinds of mailing lists for this client, I noticed recently that my dedicated email inbox started getting more and more emails related to the CI/CD pipeline. Things like unexpected Jenkins build failures, artifacts that cannot be downloaded, server outages, and so on. You already guessed it – emails that report problems with the CI/CD pipeline and prevent development teams from doing their job.

I don’t want to go into details about what exactly went wrong with this client, but I will only say that a year ago, when we designed the pipeline, there were a few things in the design that never made it into the implementation. The more surprising part for me, though, is that if you search on the internet for CI/CD pipelines, you will get the exact picture of what our client has in production. The problem is that all the literature about CI/CD is narrowly focused on how the code is delivered to its destination, while the operational, security, and business sides of the CI/CD pipeline are completely neglected.

Let’s step back and take a look at how a CI/CD pipeline is currently implemented in the enterprise. Looking at the picture below, there are a few typical components included in the pipeline:

  • Code repository – typically this is some Git flavor
  • Build tools like Maven
  • Artifacts repository – most of the time this is Nexus or Artifactory
  • CI automation or orchestration server – typically Jenkins in the enterprise
  • Configuration management and deployment automation tools like Puppet, Chef or Ansible
  • Various test automation tools depending on the project requirements and frameworks used
  • Internal or external cloud for deploying the code

Typical CI/CD Pipeline Components

The above set of CI/CD components is absolutely sufficient for getting the code from the developer’s desktop to the front-end servers and can be completely automated. But those components do not answer a few very important questions:

  • Are all components of my CI/CD pipeline up and running?
  • Are the components performing according to my initial expectations (sometimes documented as SLAs)?
  • Who is committing code, scheduling builds and deployments?
  • Is the feature quality increasing from one test to another or is it decreasing?
  • How much does each component cost me?
  • What is the overall cost of operating the pipeline? Per day, per week or per month? What about per deployment?
  • Which components can be optimized in order to achieve faster time to deployment and lower cost?

These are too many questions, and none of the typical components listed above provides a holistic approach to answering them. Jenkins may be able to send you a notification if a particular job fails, but it will not tell you how much the build costs you. Artifactory may store all your artifacts but will not tell you if you are out of storage or give you the cost of the storage. The test tools can give you individual test reports but rarely build trends based on feature or product.
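
To give one concrete example of the kind of glue a reporting component provides, below is a rough Python sketch that pulls build durations from Jenkins’ JSON API and turns them into an approximate cost figure. The Jenkins URL, job name, and hourly rate are made-up values, and a real implementation would need authentication and error handling.

import requests

JENKINS_URL = "https://jenkins.example.com"   # hypothetical Jenkins instance
JOB_NAME = "my-app-build"                     # hypothetical job name
AGENT_COST_PER_HOUR = 0.50                    # assumed hourly cost of the build agent

# Jenkins exposes build metadata as JSON; 'duration' is reported in milliseconds
response = requests.get(
    f"{JENKINS_URL}/job/{JOB_NAME}/api/json",
    params={"tree": "builds[number,duration,result]"},
)
response.raise_for_status()

total_cost = 0.0
for build in response.json().get("builds", []):
    hours = build["duration"] / 1000 / 3600
    cost = hours * AGENT_COST_PER_HOUR
    total_cost += cost
    print(f"#{build['number']} {build.get('result')}: {hours:.2f} h -> ${cost:.2f}")

print(f"Approximate cost of all recorded builds: ${total_cost:.2f}")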

Hence, in our implementations of CI/CD pipelines, we always include three additional components, as shown in the picture below:

  • Monitoring and Alerting Component that is used to collect data from every other component of the pipeline. The purpose of this component is to make sure the pipeline is running uninterrupted, as well as to collect data used for the business reporting. If there are anomalies, alerts are sent to the affected parties
  • Security Component used not only to ensure consistent policies for access but also to provide auditing capabilities if there are compliance requirements like HIPAA, PCI, SOX, etc.
  • Business Dashboarding and Reporting Component used for providing financial and project information to business users and management

Advanced CI/CD Pipeline Components

The way CI/CD pipelines are currently designed and implemented is yet another proof that we as technologists neglect important aspects of the technologies we design – security, reliability, and business (project and financial) reporting are very important to the CI/CD pipeline users, and we should make sure that those are included in the design from the get-go and not implemented as an afterthought.

Recently I had to design the backup infrastructure for cloud workloads for a client in order to ensure that we comply with the Business Continuity and Disaster Recovery standards they have set. However, following traditional IT practices in the cloud quite often poses certain challenges. The scenario that we had to satisfy is best shown in the picture below:

Agent-Based Backup Architecture

The picture is quite simple:

  1. Application servers have a backup agent installed
  2. The backup agent submits the data that needs to be backed up to the media server in the cloud
  3. The cloud media server submits the data to the backup infrastructure on premise, where the backups are stored on long-term storage according to the policy

This is a very standard architecture for many of the current backup tools and technologies.

Some of the specifics in the architecture above are that:

  • The application servers and the cloud media server exist in different accounts or VPCs if we use AWS terminology, or in different virtual networks or subscriptions if you consider Microsoft Azure terminology
  • The connectivity between the cloud and on-premise is established through DirectConnect or ExpressRoute and logically those are also considered separate VPCs or virtual networks

This architecture would be perfectly fine if the application servers were long-lived; however, we were transitioning the application team to a more agile DevOps process, which meant that they would use automation to replace the application servers with every new deployment (for more information, take a look at the Blue/Green Deployment White Paper published on our company’s website). This, though, didn’t fit well with the traditional process used by the IT team managing the on-premise NetBackup infrastructure. The main issue was that every time one of the application servers got terminated, somebody from the on-prem IT team would get paged for a failed backup and trigger an unnecessary investigation.

One option for solving the problem, presented to us by the on-premise IT team, was to use traditional job scheduling solutions to trigger a script that would create the backup and submit it to the media server. This approach doesn’t require them to manually whitelist the IP addresses of the application servers in their centralized backup tool and will not generate error events, but it involved additional tools that would require much more infrastructure and license fees. Another option was to keep the old application servers running longer so that the backup team had enough time to remove the IPs from the whitelist. This, though, required manual intervention on both sides (ours and the on-prem IT team’s) and was prone to errors.

The approach we decided to go with required a little bit more infrastructure but was fully automatable and was relatively cheap compared to the other two options. The picture below shows the final architecture.

The only difference here is that instead of running the backup agents on the actual application instances, we run just one backup agent on a separate instance that has an unlimited lifespan and doesn’t get terminated with every release. This can be a much smaller instance than the ones used for hosting the application, which will save some cost, and its role is only for hosting the backup agent, hence no other connections to it should be allowed. The daily backups for the applications will be stored on a shared drive that is accessible on the instance hosting the agent, and this shared drive is automatically mounted on the new instances during each deployment. Depending on whether you deploy this architecture in AWS or Azure, you can use EFS or Azure Files for the implementation.

Here are the benefits that we achieved with this architecture:

  • Complete automation of the process that supports Blue/Green deployments
  • No changes in the already existing backup infrastructure managed by the IT team using traditional IT processes
  • Predictable, relatively low cost for the implementation

This was a good case study where we bridged the modern DevOps practices and the traditional IT processes to achieve a common goal of continuous application backup.

It is surprising to me that every day I meet developers who do not have a basic understanding of how computers work. Recently I got into an argument about whether this is necessary to become a good cloud software engineer, and the main point of my opponent was that “modern languages and frameworks take care of lots of stuff behind the scenes, hence you don’t need to know about those”. Although the latter is true, it does not release us (the people who develop software) from the responsibility to think when we write software.

The best analogy I can think of is the recent stories with Tesla’s autopilot – just because it is called “autopilot” doesn’t mean that it will not run you into a wall. Similar to the Tesla driver, as a software engineer it is your responsibility to understand how your code is executed (i.e. where your car is taking you), and if you don’t have a basic understanding of how computers work (i.e. how to drive a car, or common sense in general :)), you will not know whether it runs well.

If you want to become an advanced Cloud Software Engineer, there are certain things that you need to understand in order to be able to develop applications that run on multiple machines in parallel, across many geographical regions, using third-party services, and so on. Here is an initial list of things that, I believe, are essential for Cloud Software Engineers to know.

First of all, every Software Engineer (Cloud, Web, Desktop, Mobile etc.), needs to understand the fundamentals of computing. Things like numeric systems and character encoding, bits and bytes are essential knowledge for Software Engineers. Also, you need to understand how operating on bits and bytes is different from operating on decimal digits, and what issues you can face.
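
Two small Python examples of the kind of surprises I mean: binary floating point cannot represent most decimal fractions exactly, and the same text can take a different number of bytes depending on the character encoding.

from decimal import Decimal

# Decimal arithmetic vs. binary floating point
print(0.1 + 0.2)          # 0.30000000000000004 - not exactly 0.3
print(0.1 + 0.2 == 0.3)   # False
print(Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))  # True

# Characters vs. bytes: the encoding determines the size on the wire
text = "naïve"
print(len(text))                   # 5 characters
print(len(text.encode("utf-8")))   # 6 bytes - 'ï' takes two bytes in UTF-8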

Understanding the computer hardware is also important for you as a Software Engineer. In the ages of virtualization, one would say that this is nonsense, but knowing that data is fetched from the permanent storage and stored in the operating memory before your application can process it may be quite important for data-heavy applications. Also, this will help you decide what size virtual machine you need – one with more CPU, more memory, more local storage, or all of the above.

Basic knowledge of Operating Systems, and particularly processes, execution threads, and environment settings, is another thing that Software Engineers must learn; otherwise, how would you be able to implement or configure an application that supports multiple concurrent users?

Networking basics like IP addresses, Domain Name Service (DNS), routing and load balancing are used every day in the cloud. Not knowing those terms and how they work is quite often the reason web sites and services go down.

Last, but not least, security is very important in order to protect your users from malicious activities. Things like encryption and certificate management are must-knows for Software Engineers developing cloud-based applications.

You don’t need to be an expert in the topics above in order to be a good Cloud Software Engineer, but you need to be able to understand how each of them impacts your application, and tweak your code accordingly. In the next few posts, I will go over the minimum knowledge that you must obtain in order to have a solid background as a Cloud Software Engineer. For more in-depth information, you can research each one of the topics using your favorite search engine.

 

Since the sprawl of mobile apps and web services began, the need to create new usernames and passwords for each app or service started to become annoying and, as it turned out, decreases overall security. Hence, we decided to base our authentication on the popular social media platforms (Facebook, Twitter, and Google) but wanted to make sure that we protect the authentication tokens on our side. Maybe in a later post I will go into more detail about the pros and cons of this approach, but for now I would like to concentrate on the technical side.

Here are the constraints or internal requirements we had to work with:

  • We need to support multiple social media platforms for authentication (Facebook, Twitter, and Google at minimum)
  • We need to support the web as well as mobile clients
  • We need to pass authentication information to our APIs but we also need to follow the REST guidelines for not maintaining state on the server side.
  • We need to make sure that we validate the social media auth token when it first reaches our APIs
  • We need to invalidate our own token after some time

The flow of events is shown in the following picture, and the step-by-step explanations are below.

Authenticate with the Social Media site

The first step (step 1.) in the flow is to authenticate with the Social Media site. They all use OAuth; however, each implementation varies and the information you receive back differs quite a lot. For details on how to implement the OAuth authentication with each one of the platforms, take a look at the platform documentation. Here are links to some:

Note that those describe the authentication with their APIs, but in general the process is the same with clients. The ultimate goal here is to retrieve an authentication token that can be used to verify the user is who she or he claims to be.

We use the Tornado Web server, which has built-in authentication handlers for the above services as well as a generic OAuth handler that can be used to implement authentication with other services supporting OAuth.

Once the user authenticates with the service, the client receives information about the user as well as an access token (step 2. in the diagram) that can be used to validate the identity of the user. As mentioned above, each social media platform returns different information in the form of a JSON object. Here are anonymized examples for the three services:

It is worth mentioning some differences related to the expiration times. Depending on how you do the authentication you may receive short-lived or long-lived tokens, and you should pay attention to the expiration times. For example, Twitter may respond with an access token that never expires ("x_auth_expires":"0"), while long-lived tokens for Facebook expire in ~60 days. The expiration time is given in seconds and it is approximate, which means it may not be exactly 60 mins or 60 days but a bit less.

Authenticate with the API

Now that the user has authenticated with the Social Media site, we need to make sure that she also exists in our user database before we issue a standardized token that we can handle in our APIs.

We created login APIs for each one of the Social Media platforms as follows:

GET https://api.ourproducturl.com/v1.0/users/facebook/{facebook_user_id}
GET https://api.ourproducturl.com/v1.0/users/google/{google_user_id}
GET https://api.ourproducturl.com/v1.0/users/twitter/{twitter_user_id}

Based on which Social Media service was used to authenticate the user, the client submits a GET request to one of those APIs, including the authorization response from step 2 as part of the Authorization header for the request (step 3 in the diagram). It is important that the communication for this request is encrypted (i.e. use HTTPS) because the access token should not be revealed to the public.

On the server side, a few things happen. After extracting the Authorization header from the request, we validate the token with the Social Media service (step 4).

Here are the URLs that you can use to validate the tokens:

  • Facebook (as well as documentation)
    https://graph.facebook.com/debug_token?input_token={token-to-inspect}&access_token={app-token-or-admin-token}
  • Google (as well as documentation)
    https://www.googleapis.com/oauth2/v3/tokeninfo?access_token={token-to-inspect}
  • Twitter (as well as documentation)
    https://api.twitter.com/1/account/verify_credentials.json?oauth_access_token={token-to-inspect}

If the token is valid, we compare the ID extracted from the Authorization header with the one specified in the URL. If either of those two checks fails, we return a 401 Unauthorized response to the client. If we pass those two checks, we do a lookup in our user database to find the user with the specified Social Media ID (step 5. in the diagram) and retrieve her record. We also retrieve information about her group participation so that we can do authorization later on for each one of the functional calls. If we cannot find the user in our database, we return a 404 Not Found response to the client.
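
As an illustration of step 4, here is a rough Python sketch of the validation call for a Facebook access token using the debug_token endpoint listed above. The app access token and the helper name are placeholders, and the other two services follow the same pattern with their respective URLs.

import requests

FB_DEBUG_TOKEN_URL = "https://graph.facebook.com/debug_token"
APP_ACCESS_TOKEN = "app-id|app-secret"  # placeholder; keep the real value out of source code

def validate_facebook_token(access_token, facebook_user_id):
    """Return True if the token is valid and belongs to the user specified in the URL."""
    response = requests.get(
        FB_DEBUG_TOKEN_URL,
        params={"input_token": access_token, "access_token": APP_ACCESS_TOKEN},
    )
    if response.status_code != 200:
        return False
    data = response.json().get("data", {})
    # The token must be valid AND issued for the same user the request claims to be for
    return data.get("is_valid", False) and data.get("user_id") == facebook_user_id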

Create API Token

For the purposes of our APIs, we decided to use encrypted JWT tokens. We include the following information in the JWT token:

  • User information like ID, first name and last name, email, address, city, state, zip code
  • Group membership for the user including roles
  • The authentication token for the Social Media service the user authenticated with
  • Expiration time (we settled on 60 minutes expiration)

Before we send this information back to the client (step 8 in the diagram), we encrypt it (step 7) using an encryption key or secret that we keep in Azure Key Vault (step 6). The JWT token is sent back to the client in the Authorization header.
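
Here is a minimal sketch of how such a token could be produced with the PyJWT library mentioned at the end of this post. Note that PyJWT signs the token rather than encrypting it, so treat this as an outline; the payload fields and the secret are placeholders, and in our case the secret comes from Azure Key Vault.

from datetime import datetime, timedelta, timezone

import jwt  # https://github.com/jpadilla/pyjwt

SECRET = "secret-from-azure-key-vault"  # placeholder; fetched from Key Vault in practice

def create_api_token(user, groups, social_access_token):
    payload = {
        "sub": user["id"],
        "name": f"{user['first_name']} {user['last_name']}",
        "email": user["email"],
        "groups": groups,                      # group membership including roles
        "social_token": social_access_token,   # kept so the session can be extended later
        "exp": datetime.now(timezone.utc) + timedelta(minutes=60),
    }
    return jwt.encode(payload, SECRET, algorithm="HS256")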

Call the functional APIs

Now we have replaced the access token the client received from the Social Media site with a JWT token that our application can understand and use for authentication and authorization purposes. Each request to the functional APIs (step 9 in the diagram) is required to have the JWT token in the Authorization header. Each API handler has access to the encryption key that is used to decrypt the token and extract the information from it (step 10).

Here are the checks we do before every request is handled (step 11):

  • If the token is missing we return¬†401 Unauthorized¬†to the client
  • If the user ID in the URL doesn’t match the user ID stored in the JWT token we return 401 Unauthorized¬†to the client. All API requests for our product are executed in the context of the user
  • If the JWT token has expired we return 401 Unauthorized to the client. For now, we decided to expire the JWT token every 60 mins and request the client to re-authenticate with the APIs. In the future, we may decide to extend the token for another 60 mins or until the Social Media access token expires, so that we can avoid the user dissatisfaction from frequent logins. Hence we overdesigned the JWT token to store the Social Media access token also
  • If the user has no right to perform certain operation we return 403 Forbidden to the client denoting that the operation is forbidden for this user

A few notes on the implementation. Because we use Python, we can easily implement all the authentication and authorization checks using decorators, which makes our API handlers much easier to read and also enables easy extension in the future (like, for example, extending the validity of the JWT token). Python also has an easy-to-use JWT library available on GitHub at https://github.com/jpadilla/pyjwt. A rough sketch of such a decorator is shown below.
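
For completeness, here is a sketch of what such a decorator could look like for a Tornado handler. The handler structure and error responses are simplified and the names are made up, so treat it as an outline rather than our exact implementation.

import functools

import jwt  # https://github.com/jpadilla/pyjwt

SECRET = "secret-from-azure-key-vault"  # placeholder

def authenticated(handler_method):
    """Decorator for Tornado handler methods that require a valid JWT token."""
    @functools.wraps(handler_method)
    def wrapper(self, *args, **kwargs):
        auth_header = self.request.headers.get("Authorization", "")
        token = auth_header.replace("Bearer ", "", 1)
        if not token:
            self.set_status(401)
            self.finish({"error": "missing token"})
            return
        try:
            # Decoding verifies the signature and the 'exp' claim in one call
            self.current_claims = jwt.decode(token, SECRET, algorithms=["HS256"])
        except jwt.ExpiredSignatureError:
            self.set_status(401)
            self.finish({"error": "token expired"})
            return
        except jwt.InvalidTokenError:
            self.set_status(401)
            self.finish({"error": "invalid token"})
            return
        return handler_method(self, *args, **kwargs)
    return wrapper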

Some additional resources that you may find useful when implementing JWT web tokens are: