Blog

If you have missed the news lately, cybersecurity is one of the most discussed topics nowadays. From supply chain exploits to data leaks to business email compromise (BEC) there is no break – especially during the pandemic. Many (if not all) start with an account compromise. And if you ask any cybersecurity expert, they will tell you that the best way to protect your account is to use two-factor (or multi-factor) authentication. Well, let me tell you a secret – MFA sucks! Ask the Okta guys! Even they think MFA sucks. And they are a mobile security company. Though, Randall and I have different motives to make that claim.

By the way: 2FA stands for “two-factor authentication” while MFA stands for “multi-factor authentication”. I will use those two acronyms to save on some typing. And one more, TLA means a “three-letter acronym”.

Randall goes on and on in his post about why MFA sucks. Most of his points are valid! It is an annoying, and frustrating experience. I don’t know about slow, but I would argue against being pointless – it serves a purpose, a very good purpose. Where he is mainly wrong is thinking that the solution is yet another technology (Well, the whole point of his post is to market Okta’s new technology, so, he will get a pass for that 😉 ). This new technology will not address the source of the issue – that people are scared to use MFA. Take a look at the Twitter Account Security survey – why do you think only 2.3% (at the time of this writing) of all Twitter users have MFA enabled? Here are what I think the reasons are:

  • Complexity and lack of understanding of the technology
  • Fear of losing access to the accounts

I believe people are smart enough to grasp the benefits without too many explanations. What they are not clear is how to set it up and how to make sure they don’t lose access to their accounts. In general, my frustration is with how the technology vendors have implemented MFA – without any thought about the user experience. Let me illustrate what I mean with my own experience.

The Problem With Too Many MFAs

I have set up MFA on all my important accounts. The list is long: bank accounts, credit card accounts, stock brokerage accounts, government accounts (like taxes, DMV, etc), email accounts (like Office 365 and GMail), GitHub, Twitter, Facebook, you name it. Have been doing this for years already and the list continues to grow. I am also required by former and current employers to use MFA for my work accounts – also a long list. Here is a list of MFA methods I use (and I don’t claim this to be a comprehensive list):

  • SMS
  • email (several different emails)
  • authenticators apps (here the things are getting crazy)
    • Microsoft Authenticator
    • Google Authenticator (I started with this one)
    • Entrust
    • VIP Access
    • Lastpass Authenticator
  • other vendor-specific implementations (I really don’t know how to call those, but they have their own way to do it)
    • Apple
    • WordPress
    • Facebook
  • Yubikey (I have three of those and I will ignore those in this post because the hardware key experience has been the same since the … 90s or so)

You’ll think I am crazy, right? Why do I use so many authenticators and MFA methods? I don’t want to, but the problem is that I have to!

First of all, this all evolved over the last 7-8 years since I started using MFA. It started with Google Authenticator; then Microsoft Authenticator and the Lastpass one; then a few banks added email and SMS (not very secure but interestingly some very big financial institutions still don’t offer other options!!!) while others struck deals with specific vendors to use their own authenticators – hence, I had to install one-off apps for those accounts. A couple of years ago I bought Yubukeys to secure my password managers and some important work and bank accounts.

At some point in time, I decided to unify on a single method! Or a single authenticator app and the Yubikey. That turns out to be impossible! Despite the fact that there is a standard, the authenticator vendors do not give you an easy way to transfer the seeds from one app to another. The only way to do that is to go over tens of accounts and register the new authenticator. Who wants to do that! Also, a few of my financial institutions do not offer a way to use a different MFA method than the one they approved. So, I am stuck with the one-offs. Unless I want to change to different financial institutions. But… who wants to do that!

There is one more problem with the use of so many tools. Very often the systems that are set up with MFA ask you to “enter the code from your MFA app”. The question is: which MFA app have I registered for that system? There is an argument whether having details about the app used to generate the code will compromise the security. Some companies provide that information, while others don’t. If I am allowed to use a single authenticator app for all my accounts, I don’t mind not having information on what tool have I registered. But in the current situation, giving me a hint is a requirement for usability. Anyway, if my authenticator app is compromised, not telling me what app to use will make absolutely no difference.

Another problem with the authenticator apps is that (and this is prevalent in technology) the app thinks it knows best how to name things. If I for example have two accounts for GitHub and want to use MFA, the authenticators will show them both as GitHub. What if I have ten? Which code should I use for which account is really hard to figure out.

The Problem With Switching Phones

Wait! Should I say: “The Problem With Losing Phones?” No, the problem with losing phones is next!

This is where the complexity of the MFA approach starts to show up! I don’t think any of the above-mentioned vendors really thought the whole user experience thoroughly through. I will add Duo to that list because I used it and it also sucks. Also, I will make a note here that colleagues recommended me Twillio’s Authy but I am already so deep in my current (diverse) ecosystem that I have no desire to try one more of those.

My wife (who is not in technology) has an iPhone 7 and she strongly refuses to change it because of all the trouble she needs to go through to set up her apps on a new phone. I spent a lot of time convincing her to set up at least SMS-based MFA for her bank and credit card accounts. And I think that will be the extend that she will go to. After I switched from iPhone 7 to the latest (and greatest) iPhone 13, I completely understand her fear. (As a side note: changing your phone every year is really, really, really bad for the environment. Be environmentally friendly and use your gadgets for a longer time! I have set a goal to use my phones and other gadgets for at least 5 years.) It has been a few months already and I continue to use some of the authenticators on my old phone because it is such a pain to migrate them to the new one.

Let me quickly go over the experience one by one.

Moving MFA Apps From Phone to Phone

Moving Apple MFA to my new phone was fluent. At the end of the day they are the kings of the experiences and this was expected. Moving WordPress and Facebook was also relatively simple – as long as you manage to sign in to your WordPress and Facebook accounts on your new phone, you will start getting the prompts there.

Moving the Lastpass Authenticator should have been easy but they really screwed up the flow between the actual Lastpass app and the authenticator app. I was clicking like crazy back and forth, going in circles for quite some time until something magical happened and it started working on my new phone. For the accounts where I used Entrust I had to go and register the new phone. Inconvenient, but at least I had a self-service. The problems started appearing when I got to VIP Access – I have to call my financial institution because they are the only ones that can register it. This will mean at least one hour on the phone.

Now, let’s get to the big ones!

Google Authenticator apparently has export functionality that allows you to export the seeds and import them in your new phone. If you know about that, it works like a charm too but… I just recently learned about it from Dave Bittner and Joe Carrigan from The Cyberwire.

Microsoft Authenticator should have been the easiest one (they claim). As long as you are signed in with your Microsoft Account, you should be able to get all the codes on your new phone. Well, king of! This works for other MFA accounts except for Microsoft work and school accounts. With all due respect to my colleagues from Azure Active Directory – the work and school account move sucks! You just need to go and register the new phone with those. Really disappointing!

“Insecure” MFA Methods When Switching Phones

Let me write about the non-secure ways to use MFA!

As I mentioned above, I also use email and SMS for certain accounts. The email experience also sucks! OK, I will be honest – this is certainly my own problem and few people may have this one but this is my rant so I will go with it. I have created many email accounts collected over time (about the reasons for that in some other post). One or another email account is used for MFA depending on what email address I’ve used for registration on a particular website. Those emails are synchronized to a single email account but… the synchronization is on schedule – about 30 mins or so. Now, every MFA code normally expires within 10 mins. Either, I miss the time window to enter the code or I need to login into that particular email account to get the code or I need to force the sync manually (yeah, I can do that but it is annoying. Right, Randall?!). Switching my phone has nothing to do with my emails so – there is no impact.

And the last one – SMS! Well, SMS doesn’t suck! You heard me! SMS doesn’t suck! … most of the time. Sometimes you will not get the text message on time due to networking issues but it works perfectly 99.999% of the time. Oh, and if I switch my phone, it continues to work without any additional configurations, QR codes, or calling my telco – like magic 😉

The Problem With Losing Your Phone

Now, here is where things get serious! Or serious if you use mobile apps for MFA. If you lose your phone, you are screwed. You will be locked out of all your accounts or most of them.

Apple is fine if you are in the Apple ecosystem. I have a few MacBooks, iPad, and AppleWatch – at least one of them will get me the MFA code.

With Facebook, I am screwed unless I am signed in to one of my browsers. For a person like me who uses Facebook once in a blue moon, the probability is low, so if I lose my phone, I am screwed. (Maybe that will push me over the edge to finally stop using them 🙂 ). I assume the WordPress story will be the same as with Facebook. Oh, and have you ever tried to get Facebook support on the phone to help you unlock your account? 🙂

About the other ones…

Backup Codes (or The Problem With Backup Codes)

Well, most of the systems allow you to print a bunch of backup codes that you can store “safely” so if you get locked out you can “easily” sign back in. I emphasize the words “safely” and “easily”. Here is why!

Storing Backup Codes Safely

Define “safely”! The experts recommend that you print your backup codes on paper and store them “safely” offline. I assume that they mean putting it in my safe deposit box at home, right? Because I will need to have easy access to those when I get locked out. It is questionable how safe is that because robberies are not uncommon. I had a colleague who got robbed and he found his safe deposit box in the bushes behind his house – of course, empty! Also, most of those safe deposit boxes are not fire- and waterproof. So, you need to buy a fireproof safe deposit box and cement it in your basement so no grown person (I mean teenager or older) can take it out with a crowbar.

Or, they mean to put it in the safe deposit box in the bank. Where there is security and the probability of robbery is minuscule. But then, I need to run to the bank every time I get locked out.

In both cases, this is not easy. From both a logistics point of view and a usability point of view. At the end of the day, what if I am on a trip and lose my phone (which is a quite realistic scenario).

To avoid all this hassle, most of us find some workarounds. Here are a few for you (and be honest and admit that you do those too):

  • Saving the backup codes as files and putting them in a folder on your laptop.
  • Copying the backup codes and storing them in your password manager (together with your password – how secure is that? 🙂 )
  • Saving the backup codes as files and keeping them on a thumb drive in your drawer.
  • Saving them as files on your Dropbox, OneDrive, Google Drive, or another cloud drive.
  • You see where I am going with those…

To be honest, I even didn’t bother printing/saving the backup codes for some of my accounts (and not all of the systems offer that option), which I assume many of us do.

Even if I print them and store them in my safe, I need to print details of what account they belong to and if they get stolen, all my accounts will be compromised.

Storing the Seeds in the Cloud

Some of the authenticators like Microsoft Authenticator keep your seeds in the cloud. Authe was recommended to me for the same reason. The idea is that if you lose your phone, you can sign in to your authenticator on your new phone and it will sync your seeds on the new phone. Magical, right? Yes, if you are able to sign in to the authenticator on your new phone… without your MFA code. So, you are caught in this vicious circle that if you lose your phone, you will need an MFA code to sign in to your authenticator but you have no way to get the MFA code.

The Solution (Backed by the Numbers)

What are you left with if you lose your phone? The only two MFA methods that work for a lost phone are email and SMS (because even if I lose my phone, I can easily keep my number). They are the most insecure ones but have the lowest risk to get you locked out from your accounts.

I am not promoting the use of SMS and email as the second factor for authentication. But the numbers show that majority of the users who use MFA use SMS instead of an app or a hardware key (see the Twitter report). Let’s run this simple math:

  • Twitter has about 396.5M users.
  • 2.3% (at the time of this writing) use MFA for their Twitter account. This is ~9.12M MFA users.
  • 0.5% of those 9.12M use a hardware key. This is just 456K hardware key users.
  • 30.9% of those 9.12M use an auth app. This is 2.82M auth app users.
  • 79.6% of those 9.12M use SMS. This is 7.26M SMS users.

It would be nice if Twitter had a way to break those numbers down by occupation (although it will be a violation of privacy). Pretty sure they will show that the majority of people who use an auth app or a hardware key work in technology. The normal users who deem their account important protect them with SMS because SMS offers the easiest user experience.

One more thing about SMS. Because everybody is scared to lock themselves out from their accounts, people set up their authenticator app as the primary MFA tool but then they have SMS as the backup. This way, if they lose their phone, they can still gain access to their account using SMS. But as we know, security is as strong as the weakest link – in this case the SMS. The setup of an authenticator app just gives the false illusion of security.

Summary

Using more than one factor for authentication is a MUST. Using stronger authenticators would be nice but with the current experience will be hard to achieve. To convince more people to do that, companies need to offer a much friendlier experience to their users:

  • Freedom to choose authenticator app
  • Self-service
  • Easy recovery

Without those, the usage will be at the current Twitter numbers.

MFA is yet another technology developed without the user in mind. But unfortunately, a technology that is at the core of cybersecurity. It is a shame that the security vendors continue to produce all kinds of new technologies (with fancy names like SOAR, SIEM, EDR, XDR, ML, AI) without fixing the basic user experience.

Friendly face on parachute

How often the following happens to you? You write your client code, you call an API, and receive a 404 Not found response. You start investigating the issue in your code; change a line here or there; spend hours troubleshooting just to find out that the issue is on the server-side, and you can’t do anything about it. Well, welcome to the microservices world! A common mistake I often see developers make is returning an improper response code or passing through the response code from another service.

Let’s see how we can avoid this. But first, a crash course on modern applications implemented with microservices and HTTP status response codes.

How Modern Microservices Applications Work?

I will try to avoid going deep into the philosophical reasons why we need microservices and the benefits (or disadvantages) of using them. This is not the point of this post.

We will start with a simple picture.

Microservices ApplicationAs you can see in the picture, we have a User that interacts with the Client Application that calls Microservice #1 to retrieve some information from the server (aka the cloud 🙂). The Client Application may need to call multiple (micro)services to retrieve all the information the User needs. Still, the part we will concentrate on is that Microservice #1 itself can call other services (Microservice #2 in this simple example) on the backend to perform its business logic. In a complex application (especially if not well architected), the chain of service calls may go well beyond two. But let’s stick with two for now. Also, let’s assume that Microservice #1 and Microservice #2 use REST, and their responses use the HTTP response status codes.

A basic call flow can be something like this. I also include the appropriate HTTP status response codes in each step.

  1. The User clicks on a button in the Client Application.
  2. The Client Application makes an HTTP request to Microservice #1.
  3. Microservice #1 needs additional business logic to complete the request and make an HTTP call to Microservice #2.
  4. Microservice #2 performs the additional business logic and responds to Microservice #1 using a 200 OK response code.
  5. Microservice #1 completes the business logic and responds to the Client Application with a 200 OK response code.
  6. The Client Application performs the action that is attached to the button, and the User is happy.

This is the so-called happy path. Everybody expects the flow to be executed as described above. If everything goes as planned, we don’t need to think anymore and implement the functionality behind the next button. Unfortunately, things often don’t go as planned.

What Can Go Wrong?

Many things! Or at a minimum, the following:

  1. The Client Application fails because of a bug before it even calls Microservice #1.
  2. The Client Application sends invalid input when calling Microservice #1.
  3. Microservice #1 fails before calling Microservice #2.
  4. Microservice #1 sends invalid input when calling Microservice #2.
  5. Microservice #2 fails while performing its business logic.
  6. Microservice #1 fails after calling Microservice #2.
  7. The Client Application fails after Microservice #1 responds.

For those cases (non-happy path? or maybe sad-path? 😉 ) the designers of the HTTP protocol wisely specified two separate sets of response codes:

The guidance for those is quite simple:

  • Client errors should be returned if the client did something wrong. In such cases, the client can change the parameters of the request and fix the issue. The important thing to remember is that the client can fix the issue without any changes on the server-side.
    A typical example is the famous 404 Not found error. If you (the user) mistype the URL path in the browser address bar, the browser (the client application) will request the wrong resource from the server (Microservice #1 in this case). The server (Microservice #1) will respond with a 404 Not found error to the browser (the client application) and the browser will show you “Oops, we couldn’t find the page” message. Well, in the past the browser just showed you the 404 Not found error but we learned a long time ago that this is not user-friendly (You see where I am going with this, right?).
  • Server errors should be returned if the issue occurred on the server-side and the client (and the user) cannot do anything to fix it.
    A simple example is a wrong connection string in the service configuration (Microservice #1 in our case). If the connection string used to configure Microservice #1 with the endpoint and credentials for Microservice #2 is wrong, the client application and the user cannot do anything to fix it. The most appropriate error to return in this case would be 500 Internal server error.

Pretty simple and logical, right? Though, one thing, we as engineers often forget, is who the client and who the server is.

So, Who Is the Client and Who Is the Server?

First, the client and server are two system components that interact directly with each other (think, no intermediaries). If we take the picture from above and change the labels of the arrows, it becomes pretty obvious.

Microservices Application Clients and Servers

We have three clients and three servers:

  • The user is a client of the client application, and the client application is a server for the user.
  • The client application is a client of Microservice #1, and Microservice #1 is a server for the client application.
  • Microservice #1 is a client of Microservice #2, and Microservice #2 is a server for Microservice #1.

Having this picture in mind, the engineers implementing each one of the microservices should think about the most appropriate response code for their immediate client using the guidelines above. It is better if we use examples to explain what response codes each service should return in different situations.

What HTTP Response Codes Should Microservices Return?

A few days ago I was discussing the following situation with one of our engineers. Our service, Azure Container Service (ACR), has a security feature allowing customers to encrypt their container images using customer-managed keys (CMK). For this feature to work, customers need to upload a key in Azure Key Vault (AKV). When the Docker client tries to pull an image, ACR retrieves the key from AKV, decrypts the image, and sends it back to the Docker client. (BTW, I know that ACR and AKV are not microservices 🙂 ) Here is a visual:

Docker pull encrypted image from ACR

In the happy-path scenario, everything works as expected. However, a customer submitted a support request complaining that he is not able to pull his images from ACR. When he tries to pull an image using the Docker client, he receives a 404 Not found error, but when he checks in the Azure Portal, he is able to see the image in the list.

Because the customer couldn’t figure it out by himself, he submitted a support request. The support engineer was also not able to figure out the issue, and had to escalate to the product group. It turned out that the customer deleted the Key Vault and ACR was not able to retrieve the key to decrypt the image. However, the implemented flow looked like this:

  1. Docker client requests an image from ACR.
  2. ACR sees that the image is encrypted and requests the key from the Key Vault.
  3. The Azure Key Vault service looks up the key and figures out that the key (or the whole Key Vault) is missing.
  4. Azure Key Vault returns 404 Not found to ACR for the key ACR tries to access.
  5. ACR passes through the 404 Not found error to the Docker client.
  6. Docker client shows a message to the user that the image cannot be found.

The end result: everybody is confused! Why?

Where the Confusion Comes From?

The investigation chain goes from left to right: Docker client –> ACR –> AKV. Both the customer and the support engineer were concentrated on figuring out why the image is missing in ACR. They were looking only at the Docker client –> ACR part of the chain. The customer’s assumption was that the Docker client is doing something wrong, i.e. requesting the wrong image. This would be the correct assumption because 404 Not found is a client error telling the client that is requesting something that doesn’t exist. Hence, the customer checked the portal and when he saw the image in the list, he was puzzled. The next assumption is that something is wrong on the ACR side. Here is where the customer decided to submit a support request for somebody to check if the data in ACR is corrupted. The support engineer checked the ACR backend and all the data was in sync.

This is a great example where the wrong HTTP response code can send the whole investigation into a rabbit hole. To avoid that, here is the guidance! Microservices should return response codes that are relevant to the business logic they implement and ones that help the client take appropriate actions. “Well”, you will say: “Isn’t that the whole point of HTTP status response codes?” It is! But for whatever reasons, we continue to break this rule. The key words in the above guidance are “the business logic they implement”, not the business logic of the services they call. (By the way, this is the same with exceptions. You don’t catch generic Exception, you catch SpecificException. You don’t pass through exceptions, you catch them and wrap them in a useful way for the calling code).

Business Logic and Friendly HTTP Response Codes

Think about the business logic of each one of the services above!

One way to decide which HTTP response code to return is to think about the resource your microservice is handling. ACR is the service responsible for handling the container images. The business logic that ACR implements should provide status codes relavant to the “business” of images. Azure Key Vault implement business logic that handles keys, secrets, and certificates (not images). Key Vault should return status codes that are relevant to the keys, secrets, and certificates. Azure Key Vault is a downstream service and cannot know what the key is used for, hence cannot provide details to the upstream client (Docker) what the error is. It is responsibility of the ACR to provide the approapriate status code to the upstream client.

Here is how the flow in the above scenario should be implemented:

  1. Docker client requests an image from ACR.
  2. ACR sees that the image is encrypted and requests the key from the Key Vault.
  3. The Azure Key Vault service looks up the key and figures out that the key (or the whole Key Vault) is missing.
  4. Azure Key Vault returns 404 Not found to ACR for the key ACR tries to access.
  5. ACR handles the 404 Not found from Azure Key Vault but wraps it in a error that is relevant to the requested image.
  6. Instead 404 Not found, ACR returns 500 Internal server error with a message clarifying the issue.
  7. Docker client shows a message to the user that it cannot pull the image because of an issue on the server.

The Q&A Approach

Another way that you can use to decide what response code to return is to take the Questions-and-Answers approach and build a simple IF-THEN logic (aka. decition tree). Here is how this can work for our example:

  • Docker: Pull image from ACR
    • ACR: Q: Is the image ivailable?
      • A: Yes
        (Note to myself: Requesting the image cannot be a client error anymore.)

        • Q: Is the image encrypted?
          • A: Yes
            • ACR: Request the key from Key Vault
              • AKV: Q: Is the key available?
                • A: Yes
                  • AKV: Return the key to ACR
                • A: No
                  • AKV: Return 404 [key] Not found error
            • ACR: Q: Did I get a key?
              • A: Yes
                • ACR: Decrypt the image
                • ACR: Return 200 OK with the image payload
              • A: No (I got 404 [key] Not found)
                • ACR: I cannot decrypt the image
                  (Note to myself: There is nothing the client did wrong! It is all the server fault)
                • ACR: Return 500 Internal server error “I cannot decrypt the image”
          • A: No (image is not encrypted)
            • ACR: Return 200 OK with the image payload
      • A: No (image does not exist)
        • ACR: Return 404 [image] Not found error

Note that the above flow is simplified. For example, in a real implementation, you may need to check if the client is authenticated and authorized to pull the image. Nevertheless, the concept is the same – you will just need to have more Q&As.

Summary

As you can see, it is important to be careful what HTTP response codes you return from your microservices. If you return the wrong message, you may end up with more work than you expect. Here are the main points that is worth remembering:

  • Return 400 errors only if the client can do something to fix the issue. If the client cannot do anything to fix it, 500 errors are the only appropriate ones.
  • Do not pass through the response codes you receive from upstream services. Handle each response from upstream services and wrap it according to the business logic you are implementing.
  • When implementing your services, think about the resource you are handling in those services. Return HTTP status response codes that are relevant to the resource you are handling.
  • Use the Q&A approach to decide what is the appropriate response code to return for your service and the resource that is requested by the client.

By using those guidelines, your microservices will become more friendly and easier to troubleshoot.

Featured image by Nick Page on Unsplash

In the last few months, I started seeing more and more customers using Azure Container Registry (or ACR) for storing their Helm charts. However, many of them are confused about how to properly push and use the charts stored in ACR. So, in this post, I will document a few things that need the most clarifications. Let’s start with some definitions!

Helm 2 and Helm 3 – what are those?

Before we even start!

Helm 2 is NOT supported and you should not use it! Period! If you need more details just read Helm’s blog post Helm 2 and the Charts Project Are Now Unsupported from Fri, Nov 13, 2020.

A nice date to choose for that announcement 🙂

OK, really – what are Helm 2 and Helm 3? When somebody says Helm 2 or Helm 3, they most often mean the version of the Helm CLI (i.e., Command Line Interface). The easiest way to check what version of the Helm CLI you have is to type:

$ helm version

in your Terminal. If the version is v2.x.x , then you have Helm (CLI) 2; if the version is v3.x.x then, you have Helm (CLI) 3. But, it is not that simple! You should also consider the API version. The API version is the version that you specify at the top of your chart (i.e. Chart.yaml) – you can read more about it in Helm Charts documentation. Here is a table that can come in handy for you:

wdt_ID apiVersion Helm 2 CLI Support Helm 3 CLI Support
1 v1 Yes Yes
2 v2 No Yes

What this table tells you is that Helm 2 CLI supports apiVersion V1, while Helm 3 CLI supports apiVersion V1 and V2. You should check the Helm Charts documentation linked above if you need more details about the differences, but the important thing to remember here is that Helm 3 CLI supports old charts, and (once again) there is no reason for you to use Helm 2.

We’ve cleared (I hope) the confusion around Helm 2 and Helm 3. Let’s see how ACR handles the Helm charts. For each one of those experiences, I will walk you step-by-step.

ACR and Helm 2

Azure Container Registry allows you to store Helm charts using the Helm 2 way (What? I will explain in a bit). However:

Helm 2 is NOT supported and you should not use it!

Now that you have been warned (twice! or was it three times?) let’s see how ACR handles Helm 2 charts. To avoid any ambiguity, I will use Helm CLI v2.17.0 for this exercise. At the time of this writing, it is the last published version of Helm 2.

$ helm version
Client: &version.Version{SemVer:"v2.17.0", GitCommit:"a690bad98af45b015bd3da1a41f6218b1a451dbe", GitTreeState:"clean"}

Initializing Helm and the Repository List

If you have a brand new installation of the Helm 2 CLI, you should initialize Helm and add the ACR to your repository list. You start with:

$ helm init
$ helm repo list
NAME     URL
stable   https://charts.helm.sh/stable
local    http://127.0.0.1:8879/charts

to initialize Helm and see the list of available repositories. Then you can add your ACR to the list by typing:

$ helm repo add --username <acr_username> --password <acr_password> <repo_name> https://<acr_login_server>/helm/v1/repo

For me, this looked like this:

$ helm repo add --username <myacr_username> --password <myacr_password> acrrepo https://tsmacrtestwus2acrhelm.azurecr.io/helm/v1/repo

Here is something very important: you must use the /helm/v1/repo path! If you do not specify the path you will see either 404: Not found error (for example, if you use the root URL without a path) or 403: Forbidden error (if you decide that you want to rename the repo part to something else).

I also need to make a side note here because the authentication can be a bit tricky. The following section applies to both Helm 2 and Helm 3.

Signing In To ACR Using the Helm (any version) CLI

Before you push and pull charts from ACR, you need to sign in. There are few different options that you can use to sign in to ACR using the CLI:

  • Using ACR Admin user (not recommended)
    If you have ACR Admin user enabled, you can use the Admin user username and password to sign in to ACR by simply specifying the --username and --password parameters for the Helm command.
  • Using a Service Principal (SP)
    If you need to push and pull charts using automation, you have most probably already set up a service principal for that. You can authenticate using the SP credentials by passing the app ID in the  --username and the client secret in the --password parameters for the Helm command. Make sure you assign the appropriate role to the service principal to allow access to your registry.
  • Using your own (user) credentials
    This one is the tricky one, and it is described in the ACR docs in az acr login with –expose-token section of the Authentication overview article. For this one, you must use the Azure CLI to obtain the token. Here are the steps:

    • Use the Azure CLI to sign in to Azure using your own credentials:
      $ az login

      This will pop up a browser window or give you an URL with a special code to use.

    • Next, sign in to ACR using the Azure CLI and add the --expose-token parameter:
      $ az acr login --name <acr_name_or_login_server> --expose-token

      This will sign you into ACR and will print an access token that you can use to sign in with other tools.

    • Last, you can sign in using the Helm CLI by passing a GUID-like string with zeros only (exactly this string 00000000-0000-0000-0000-000000000000) in the  --username parameter and the access token in the --password parameter. Here is how the command to add the Helm repository will look like:
      $ helm repo add --username "00000000-0000-0000-0000-000000000000" --password "eyJhbGciOiJSUzI1NiIs[...]24V7wA" <repo_name> https://<acr_login_server>/helm/v1/repo

Creating and Packaging Charts with Helm 2 CLI

Helm 2 doesn’t have out-of-the-box experience for pushing charts to a remote chart registry. You may wrongly assume that the helm-push plugin is the one that does that, but you will be wrong. This plugin will only allow you to push charts to Chartmuseum (although I can use it to try to push to any repo but will fail – a topic for another story). Helm’s guidance on how chart repositories should work is described in the documentation (… and this is the Helm 2 way that I mentioned above):

  • According to Chart Repositories article in Helm documentation, the repository is a simple web server that serves the index.yaml file that points to the chart TAR archives. The TAR archives can be served by the same web server or from other locations like Azure Storage.
  • In Store charts in your chart repository they describe the process to generate the index.yaml file and how to upload the necessary artifacts to static storage to serve them.

Disclaimer: the term Helm 2 way is my own term based on my interpretation of how things work. It allows me to refer to the two different approaches charts are saved. It is not an industry term not something that Helm refers to or uses.

I have created a simple chart called helm-test-chart-v2 on my local machine to test the push. Here is the output from the commands:

$ $ helm create helm-test-chart-v2
Creating helm-test-chart-v2

$ ls -al ./helm-test-chart-v2/
total 28
drwxr-xr-x 4 azurevmuser azurevmuser 4096 Aug 16 16:44 .
drwxr-xr-x 6 azurevmuser azurevmuser 4096 Aug 17 16:29 ..
-rw-r--r-- 1 azurevmuser azurevmuser 342 Aug 16 16:44 .helmignore
-rw-r--r-- 1 azurevmuser azurevmuser 114 Aug 16 16:44 Chart.yaml
drwxr-xr-x 2 azurevmuser azurevmuser 4096 Aug 16 16:44 charts
drwxr-xr-x 3 azurevmuser azurevmuser 4096 Aug 16 16:44 templates
-rw-r--r-- 1 azurevmuser azurevmuser 1519 Aug 16 16:44 values.yaml

$ helm package ./helm-test-chart-v2/
Successfully packaged chart and saved it to: /home/azurevmuser/helm-test-chart-v2-0.1.0.tgz

$ ls -al
total 48
drwxr-xr-x 6 azurevmuser azurevmuser 4096 Aug 17 16:31 .
drwxr-xr-x 3 root root 4096 Aug 14 14:12 ..
-rw------- 1 azurevmuser azurevmuser 780 Aug 15 22:48 .bash_history
-rw-r--r-- 1 azurevmuser azurevmuser 220 Feb 25 2020 .bash_logout
-rw-r--r-- 1 azurevmuser azurevmuser 3771 Feb 25 2020 .bashrc
drwx------ 2 azurevmuser azurevmuser 4096 Aug 14 14:15 .cache
drwxr-xr-x 6 azurevmuser azurevmuser 4096 Aug 15 21:46 .helm
-rw-r--r-- 1 azurevmuser azurevmuser 807 Feb 25 2020 .profile
drwx------ 2 azurevmuser azurevmuser 4096 Aug 14 14:12 .ssh
-rw-r--r-- 1 azurevmuser azurevmuser 0 Aug 14 14:18 .sudo_as_admin_successful
-rw------- 1 azurevmuser azurevmuser 1559 Aug 14 14:26 .viminfo
drwxr-xr-x 4 azurevmuser azurevmuser 4096 Aug 16 16:44 helm-test-chart-v2
-rw-rw-r-- 1 azurevmuser azurevmuser 3269 Aug 17 16:31 helm-test-chart-v2-0.1.0.tgz

Because Helm 2 doesn’t have a push chart functionality, the implementation is left up to the vendors. ACR has provided proprietary implementation (already deprecated, which is another reason to not use Helm 2) of the push chart functionality that is built into the ACR CLI.

Pushing and Pulling Charts from ACR Using Azure CLI (Helm 2)

Let’s take a look at how you can push Helm 2 charts to ACR using the ACR CLI. First, you need to sign in to Azure, and then to your ACR. Yes, this is correct; you need to use two different commands to sign into the ACR. Here is how this looks like for my ACR registry:

$ az login
To sign in, use a web browser to open the page https://microsoft.com/devicelogin and enter the code AABBCCDDE to authenticate.
[
  {
    "cloudName": "AzureCloud",
    "homeTenantId": "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
    "id": "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
    "isDefault": true,
    "managedByTenants": [],
    "name": "ToddySM Sandbox",
    "state": "Enabled",
    "tenantId": "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
    "user": {
        "name": "toddysm_XXXXXXXX@outlook.com",
        "type": "user"
    }
  }
]
$ az acr login --name tsmacrtestwus2acrhelm.azurecr.io
The login server endpoint suffix '.azurecr.io' is automatically omitted.
You may want to use 'az acr login -n tsmacrtestwus2acrhelm --expose-token' to get an access token, which does not require Docker to be installed.
An error occurred: DOCKER_COMMAND_ERROR
Please verify if Docker client is installed and running.

Well, I do not have Docker running, but this is OK – you don’t need Docker installed for pushing the Helm chart. Though, it may be confusing because it leaves the impression that you may not be signed in to the ACR registry.

We will push the chart that we packaged already. Pushing it is done with the (deprecated) built-in command in ACR CLI. Here is the output:

$ az acr helm push --name tsmacrtestwus2acrhelm.azurecr.io helm-test-chart-v2-0.1.0.tgz
This command is implicitly deprecated because command group 'acr helm' is deprecated and will be removed in a future release. Use 'helm v3' instead.
The login server endpoint suffix '.azurecr.io' is automatically omitted.
{
  "saved": true
}

This seems to be successful, and I have a Helm chart pushed to ACR using the Helm 2 way (i.e. using the proprietary and deprecated ACR CLI implementation). The problem here is that it is hard to verify that the chart is pushed to the ACR registry. If you go to the portal, you will not see the repository that contains the chart. Here is a screenshot of my registry view in the Azure portal after I pushed the chart:

Azure Portal Not Listing Helm 2 ChartsAs you can see, the Helm 2 chart repository doesn’t appear in the list of repositories in the Azure portal, and you will not be able to browse the charts in that repository using the Azure portal. However, if you use the Helm command to search for the chart, the result will include the ACR repository. Here is the output from the command in my environment:

$ helm search helm-test-chart-v2
NAME                           CHART VERSION        APP VERSION        DESCRIPTION
acrrepo/helm-test-chart-v2     0.1.0                1.0                A Helm chart for Kubernetes
local/helm-test-chart-v2       0.1.0                1.0                A Helm chart for Kubernetes

Summary of the ACR and Helm 2 Experience

To summarize the ACR and Helm 2 experience, here are the main takeaways:

  • First, you should not use Helm 2 CLI and the proprietary ACR CLI implementation for working with Helm charts!
  • There is no push functionality for charts in the Helm 2 client and each vendor is implementing their own CLI for pushing charts to the remote repositories.
  • When you add ACR repository using the Helm 2 CLI you should use the following URL format https://<acr_login_server>/helm/v1/repo
  • If you push a chart to ACR using the ACR CLI implementation you will not see the chart in Azure Portal. The only way to verify that the chart is pushed to the ACR repository is to use the helm search command.

ACR and Helm 3

Once again, to avoid any ambiguity, I will use Helm CLI v3.6.2 for this exercise. Here is the complete version string:

PS C:> helm version
version.BuildInfo{Version:"v3.6.2", GitCommit:"ee407bdf364942bcb8e8c665f82e15aa28009b71", GitTreeState:"clean", GoVersion:"go1.16.5"}

Yes, I run this one in PowerShell terminal 🙂 And, of course, not in the root folder 😉 You can convert the commands to the corresponding Linux commands and prompts.

Let’s start with the basic thing!

Creating and Packaging Charts with Helm 3 CLI

There is absolutely no difference between the Helm 2 and Helm 3 experience for creating and packaging a chart. Here is the output:

PS C:> helm create helm-test-chart-v3
Creating helm-test-chart-v3

PS C:> ls .\helm-test-chart-v3\

    Directory: C:\Users\memladen\Documents\Development\Local\helm-test-chart-v3

Mode         LastWriteTime         Length         Name
----         -------------         ------         ----
d----    8/17/2021 9:42 PM                        charts
d----    8/17/2021 9:42 PM                        templates
-a---    8/17/2021 9:42 PM            349         .helmignore
-a---    8/17/2021 9:42 PM           1154         Chart.yaml
-a---    8/17/2021 9:42 PM           1885         values.yaml

PS C:> helm package .\helm-test-chart-v3\
Successfully packaged chart and saved it to: C:\Users\memladen\Documents\Development\Local\helm-test-chart-v3-0.1.0.tgz

PS C:> ls helm-test-*

    Directory: C:\Users\memladen\Documents\Development\Local

Mode         LastWriteTime         Length         Name
----         -------------         ------         ----
d----    8/17/2021 9:42 PM                        helm-test-chart-v3
-a---    8/17/2021 9:51 PM           3766         helm-test-chart-v3-0.1.0.tgz

From here on, though, things can get confusing! The reason is that you have two separate options to work with charts using Helm 3.

Using Helm 3 to Push and Pull Charts the Helm 2 Way

You can use Helm 3 to push the charts the same way you do that with Helm 2. First, you add the repo:

PS C:> helm repo add --username <myacr_username> --password <myacr_password> acrrepo https://tsmacrtestwus2acrhelm.azurecr.io/helm/v1/repo
"acrrepo" has been added to your repositories

PS C:> helm repo list
NAME         URL
microsoft    https://microsoft.github.io/charts/repo
acrrepo      https://tsmacrtestwus2acrhelm.azurecr.io/helm/v1/repo

Then, you can update the repositories and search for a chart:

PS C:> helm repo update
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "microsoft" chart repository
...Successfully got an update from the "acrrepo" chart repository
Update Complete. ⎈Happy Helming!⎈

PS C:> helm search repo helm-test-chart
NAME                          CHART VERSION     APP VERSION     DESCRIPTION
acrrepo/helm-test-chart-v2    0.1.0             1.0             A Helm chart for Kubernetes

Ha, look at that! I can see the chart that I pushed using the ACR CLI in the ACR and Helm 2 section above – notice the chart name and the version. Also, notice that the Helm 3 search command has a bit different syntax – it wants you to clarify what you want to search (repo in our case).

I can use the ACR CLI to push the new chart that I just created using the Helm 3 CLI (after signing in to Azure):

PS C:> az acr helm push --name tsmacrtestwus2acrhelm.azurecr.io .\helm-test-chart-v3-0.1.0.tgz
This command is implicitly deprecated because command group 'acr helm' is deprecated and will be removed in a future release. Use 'helm v3' instead.
The login server endpoint suffix '.azurecr.io' is automatically omitted.
{
  "saved": true
}

By doing this, I have pushed the V3 chart to ACR and can pull it from there but, remember, this is the Helm 2 Way and the following are still true:

  • You will not see the chart in Azure Portal.
  • The only way to verify that the chart is pushed to the ACR repository is to use the helm search command.

Here is the result of the search command after updating the repositories:

PS C:> helm repo update
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "microsoft" chart repository
...Successfully got an update from the "acrrepo" chart repository
Update Complete. ⎈Happy Helming!⎈

PS C:> helm search repo helm-test-chart
NAME                           CHART VERSION     APP VERSION     DESCRIPTION
acrrepo/helm-test-chart-v2     0.1.0             1.0             A Helm chart for Kubernetes
acrrepo/helm-test-chart-v3     0.1.0             1.16.0          A Helm chart for Kubernetes

You can see both charts, the one created with Helm 2 and the one created with Helm 3, available. This is understandable though because I pushed both charts the same way – by using the az acr helm command. Remember, though – both charts are stored in ACR using the Helm 2 way.

Using Helm 3 to Push and Pull Charts the OCI Way

Before proceeding, I changed the version property in the Chart.yaml to 0.2.0 to be able to differentiate between the charts I pushed. This is the same chart that I created in the previous section Creating and Packaging Charts with Helm 3 CLI.

You may have noticed that Helm 3 has a new chart command. This command allows you to (from the help text) “push, pull, tag, list, or remove Helm charts”. The subcommands under the chart command are experimental and you need to set the HELM_EXPERIMENTAL_OCI environment variable to be able to use them. Once you do that, you can save the chart. You can save the chart to the local registry cache with or without a registry FQDN (Fully Qualified Domain Name). Here are the commands:

PS C:> $env:HELM_EXPERIMENTAL_OCI=1

PS C:> helm chart save .\helm-test-chart-v3\ helm-test-chart-v3:0.2.0
ref:     helm-test-chart-v3:0.2.0
digest:  b6954fb0a696e1eb7de8ad95c59132157ebc061396230394523fed260293fb19
size:    3.7 KiB
name:    helm-test-chart-v3
version: 0.2.0
0.2.0: saved
PS C:> helm chart save .\helm-test-chart-v3\ tsmacrtestwus2acrhelm.azurecr.io/charts/helm-test-chart-v3:0.2.0
ref:     tsmacrtestwus2acrhelm.azurecr.io/charts/helm-test-chart-v3:0.2.0
digest:  ff5e9aea6d63d7be4bb53eb8fffacf12550a4bb687213a2edb07a21f6938d16e
size:    3.7 KiB
name:    helm-test-chart-v3
version: 0.2.0
0.2.0: saved

If you list the charts using the new chart command, you will see the following:

PS C:> helm chart list
REF                                                                  NAME                   VERSION     DIGEST     SIZE     CREATED
helm-test-chart-v3:0.2.0                                             helm-test-chart-v3     0.2.0       b6954fb    3.7 KiB  About a minute
tsmacrtestwus2acrhelm.azurecr.io/charts/helm-test-chart-v3:0.2.0     helm-test-chart-v3     0.2.0       ff5e9ae    3.7 KiB  About a minute

Few things to note here:

  • Both charts are saved in the local registry cache. Nothing is pushed yet to a remote registry.
  • You see only charts that are saved the OCI way. The charts saved the Helm 2 way are not listed using the helm chart list command.
  • The REF (or reference) for a chart can be truncated and you may not be able to see the full reference.

Let’s do one more thing! Let’s save the same chart with FQDN for the ACR as above but under a different repository. Here are the commands to save and list the charts:

PS C:> helm chart save .\helm-test-chart-v3\ tsmacrtestwus2acrhelm.azurecr.io/my-helm-charts/helm-test-chart-v3:0.2.0
ref: tsmacrtestwus2acrhelm.azurecr.io/my-helm-charts/helm-test-chart-v3:0.2.0
digest: daf106a05ad2fe075851a3ab80f037020565c75c5be06936179b882af1858e6a
size: 3.7 KiB
name: helm-test-chart-v3
version: 0.2.0
0.2.0: saved

PS C:> helm chart list
REF                                                                  NAME                   VERSION     DIGEST     SIZE     CREATED
helm-test-chart-v3:0.2.0                                             helm-test-chart-v3     0.2.0       b6954fb    3.7 KiB  11 minutes
tsmacrtestwus2acrhelm.azurecr.io/charts/helm-test-chart-v3:0.2.0     helm-test-chart-v3     0.2.0       ff5e9ae    3.7 KiB  11 minutes
tsmacrtestwus2acrhelm.azurecr.io/my-helm-charts/helm-test-chart...   helm-test-chart-v3     0.2.0       daf106a    3.7 KiB  About a minute

After doing this, we have three charts in the local registry:

  • helm-test-chart-v3:0.2.0 that is available only locally.
  • tsmacrtestwus2acrhelm.azurecr.io/charts/helm-test-chart-v3:0.2.0 that can be pushed to the remote ACR registry tsmacrtestwus2acrhelm.azurecr.io and saved in the charts repository.
  • and tsmacrtestwus2acrhelm.azurecr.io/my-helm-charts/helm-test-chart-v3:0.2.0 that can be pushed to the remote ACR registry tsmacrtestwus2acrhelm.azurecr.io and saved in the my-helm-charts repository.

Before we can push the charts to the ACR registry, we need to sign in using the following command:

PS C:> helm registry login tsmacrtestwus2acrhelm.azurecr.io --username <myacr_username> --password <myacr_password>

You can use any of the sign-in methods described in Signing in to ACR Using the Helm CLI section. And make sure you use your own ACR registry login server.

If we push the two charts that have the ACR FQDN, we will see them appear in the Azure portal UI. Here are the commands:

PS C:> helm chart push tsmacrtestwus2acrhelm.azurecr.io/charts/helm-test-chart-v3:0.2.0
The push refers to repository [tsmacrtestwus2acrhelm.azurecr.io/charts/helm-test-chart-v3]
ref: tsmacrtestwus2acrhelm.azurecr.io/charts/helm-test-chart-v3:0.2.0
digest: daf106a05ad2fe075851a3ab80f037020565c75c5be06936179b882af1858e6a
size: 3.7 KiB
name: helm-test-chart-v3
version: 0.2.0
0.2.0: pushed to remote (1 layer, 3.7 KiB total)

PS C:> helm chart push tsmacrtestwus2acrhelm.azurecr.io/my-helm-charts/helm-test-chart-v3:0.2.0
The push refers to repository [tsmacrtestwus2acrhelm.azurecr.io/my-helm-charts/helm-test-chart-v3]
ref: tsmacrtestwus2acrhelm.azurecr.io/my-helm-charts/helm-test-chart-v3:0.2.0
digest: daf106a05ad2fe075851a3ab80f037020565c75c5be06936179b882af1858e6a
size: 3.7 KiB
name: helm-test-chart-v3
version: 0.2.0
0.2.0: pushed to remote (1 layer, 3.7 KiB total)

And here is the result:

An important thing to note here is that:

  • Helm charts saved to ACR using the OCI way will appear in the Azure portal.

The approach here is a bit different than the Helm 2 way. You don’t need to package the chart into a TAR – saving the chart to the local registry is enough.

We need to do one last thing and we are ready to summarize the experience. Let’s use the helm search command to find our charts (of course using Helm 3). Here is the result of the search:

PS C:> helm search repo helm-test-chart 
NAME                           CHART VERSION     APP VERSION     DESCRIPTION 
acrrepo/helm-test-chart-v2     0.1.0             1.0             A Helm chart for Kubernetes 
acrrepo/helm-test-chart-v3     0.1.0             1.16.0          A Helm chart for Kubernetes

It yields the same result like the one we saw in Using Helm 3 to Push and Pull Charts the Helm 2 Way. The reason is that the helm search command doesn’t work for charts stored the OCI way. This is one limitation that the Helm team is working on fixing and is documented in Support for OCI registries in helm search #9983 issue on GitHub.

Summary of the ACR and Helm 3 Experience

To summarize the ACR and Helm 3 experience, here are the main takeaways:

  • First, you can use the Helm 3 CLI in conjunction with the az acr helm command to push and pull charts the Helm 2 way. Those charts will not appear in the Azure portal.
  • You can also use the Helm 3 CLI to (natively) push charts to ACR the OCI way. Those charts will appear in the Azure portal.
  • OCI features are experimental in the Helm 3 client and certain functionalities like helm search and helm repo do not work for charts saved and pushed the OCI way.

Conclusion

To wrap it up, when working with Helm charts and ACR (as well as other OCI compliant registries), you need to be careful which commands you use. As a general rule, always use the Helm 3 CLI and make a conscious decision whether you want to store the charts as OCI artifacts (the OCI way) or using the legacy Helm approach (the Helm 2 way). This should be a transition period and hopefully, at some point in the future, Helm will improve the support for OCI compliant charts and support the same scenarios that are currently enabled for legacy chart repositories.

Here is a summary table that gives a quick overview of what we described in this post.

wdt_ID Functionality Helm 2 CLI (legacy) Helm 3 (legacy) Helm 3 (OCI)
1 helm add repo Yes Yes No
2 helm search Yes Yes No
3 helm chart push No No Yes
4 helm chart list No No Yes
5 az acr helm push Yes Yes No
6 Chart appears in Azure portal No No Yes
7 Example chart helm-test-chart-v2 helm-test-chart-v3 helm-test-chart-v3
8 Example chart version 0.1.0 0.1.0 0.2.0

In the last two posts Configuring a hierarchy of IoT Edge Devices at Home Part 1 – Configuring the IT Proxy and Configuring a hierarchy of IoT Edge Devices at Home Part 2 – Configuring the Enterprise Network (IT) we have set up the proxy and the top layer of the hierarchical IoT Edge network. This post will describe how to set up and configure the first nested layer in the Purdue network architecture for manufacturing – the Business Planning and Logistics (IT) layer. So, let’s get going!

A lot of the steps to configure the IoT Edge runtime on the device are repeating from the previous post. Nevertheless, though, let’s go over them again.

Configuring the Network for Layer 4 Device

The tricky part with the Layer 4 device is that it should have no Internet access. The problem with that is that you will need Internet access to install the IoT Edge runtime. In a typical scenario, the Layer 4 device will come with the IoT Edge runtime pre-installed and you will only need to plug in the device and configure it to talk to its parent. For our scenario though, we need to have the device Internet-enabled while installing the IoT Edge and lock it down after that. Here is the initial dhcpcd.conf configuration for the Layer 4 device:

interface eth0
static ip_address=10.16.6.4/16
static routers=10.16.8.4
static domain_name_servers=1.1.1.1

The only difference in the network configuration is the IP address of the device. This configuration will allow us to download the necessary software to the device before we restrict its access.

While speaking of necessary software, let’s install the netfilter-persistent and iptables-persistent and have them ready for locking the device networking later on:

sudo DEBIAN_FRONTEND=noninteractive apt install -y netfilter-persistent iptables-persistent

Create the Azure Cloud Resources

We already have created the resource group and the IoT Hub resource in Azure. The only new resource that we need to create is the IoT Edge device for Layer 4. One difference here is that, unlike the Layer 5 device, the Layer 4 device will have a parent. And this parent will be the Layer 5 device. Use the following command to create the Layer 4 device

az iot hub device-identity create --device-id L4-edge-pi --edge-enabled --hub-name tsm-ioth-eus-test
az iot hub device-identity parent set --device-id L4-edge-pi --parent-device-id L5-edge-pi --hub-name tsm-ioth-eus-test
az iot hub device-identity connection-string show --device-id L5-edge-pi --hub-name tsm-ioth-eus-test

The first command creates the IoT Edge device for Layer 4. The second command sets the Layer 5 device as the parent for the Layer 4 device. And, the third command returns the connection string for the Layer 4 device that we can use for the IoT Edge runtime configuration.

Installing Azure IoT Edge Runtime on the Layer 4 Device

Surprisingly to me, it the time between posting my last post and starting this one, the Tutorial to Create a Hierarchy of IoT Edge Devices changed. I will still use the steps from my previous post to provide consistency.

Creating Certificates

We will start again with the creation of certificates – this time for the Layer 4 device. We already have the root and the intermediate certificates for the chain. The only certificate that we need to create is the Layer 4 device certificate. We will use the following command for that:

./certGen.sh create_edge_device_ca_certificate "l4_certificate"

You should see the new certificates in the <WORKDIR>/certs folder:

drwxrwxrwx 1 toddysm toddysm 4096 Apr 16 10:46 .
drwxrwxrwx 1 toddysm toddysm 4096 Apr 16 10:46 ..
...
-rwxrwxrwx 1 toddysm toddysm 5891 Apr 16 10:46 iot-edge-device-ca-l4_certificate-full-chain.cert.pem
-r-xr-xr-x 1 toddysm toddysm 1931 Apr 16 10:46 iot-edge-device-ca-l4_certificate.cert.pem
-rwxrwxrwx 1 toddysm toddysm 7240 Apr 16 10:46 iot-edge-device-ca-l4_certificate.cert.pfx
...

The private keys are in the <WORKDIR>/private folder together with the root and intermediate keys:

drwxrwxrwx 1 toddysm toddysm 4096 Apr 16 10:46 .
drwxrwxrwx 1 toddysm toddysm 4096 Apr 16 10:46 ..
-r-xr-xr-x 1 toddysm toddysm 3326 Apr  8 14:54 azure-iot-test-only.intermediate.key.pem
-r-xr-xr-x 1 toddysm toddysm 3326 Apr  8 14:54 azure-iot-test-only.root.ca.key.pem
-r-xr-xr-x 1 toddysm toddysm 3243 Apr 16 10:46 iot-edge-device-ca-l4_certificate.key.pem
...

As before, we need to upload the relevant certificates to the Layer 4 device. From the <WORKDIR> folder on the workstation use the following commands:

scp ./certs/azure-iot-test-only.root.ca.cert.pem pi@pi-l4:.
scp ./certs/iot-edge-device-ca-l4_certificate-full-chain.cert.pem pi@pi-l4:.
scp ./private/iot-edge-device-ca-l4_certificate.key.pem pi@pi-l4:.

Connect to the Layer 4 device with ssh pi@pi-l4 and install the root CA using:

sudo cp ~/azure-iot-test-only.root.ca.cert.pem /usr/local/share/ca-certificates/azure-iot-test-only.root.ca.cert.pem.crt
sudo update-ca-certificates

Verify that the cert is installed with:

ls /etc/ssl/certs/ | grep azure

We will also move the device certificate chain and the private key to the /var/secrets folder:

sudo mkdir /var/secrets
sudo mkdir /var/secrets/aziot
sudo mv iot-edge-device-ca-l4_certificate-full-chain.cert.pem /var/secrets/aziot/
sudo mv iot-edge-device-ca-l4_certificate.key.pem /var/secrets/aziot/

We are all set to install and configure the Azure IoT Edge runtime.

Configuring the IoT Edge Runtime

As mentioned before the steps are described in Install or uninstall Azure IoT Edge for Linux and here is the configuration that we need to use in /etc/aziot/config.tomlfile:

hostname = "10.16.6.4"

parent_hostname = "10.16.7.4"

trust_bundle_cert = "file:///etc/ssl/certs/azure-iot-test-only.root.ca.cert.pem.pem"

[provisioning]
source = "manual"
connection_string = "HostName=tsm-ioth-eus-test.azure-devices.net;DeviceId=L4-edge-pi;SharedAccessKey=s0m#VeRYcRYpT1c$tR1nG"

[agent]
name = "edgeAgent"
type = "docker"
[agent.config]
image = "10.16.7.4:<port>/azureiotedge-agent:1.2"

[edge_ca]
cert = "file:///var/secrets/aziot/iot-edge-device-ca-l4_certificate-full-chain.cert.pem"
pk = "file:///var/secrets/aziot/iot-edge-device-ca-l4_certificate.key.pem"

Note two differences in this configuration compared to the configuration for the Layer 5 device:

  1. There is a parent_hostname configuration that has the Layer 5 device’s IP address as a value.
  2. The agent image is pulled from the parent registry 10.16.7.4:<port>/azureiotedge-agent:1.2 instead from mcr.microsoft.com (don’t forget to remove/change the <port> in the configuration depending on how you set up the parent registry).

Also, if the registry running on your Layer 5 device requires authentication, you will need to provide credentials in the [agent.config.auth] section.

Restricting the Network for the Layer 4 Device

Now, that we have the IoT Edge runtime configured, we need to restrict the traffic to and from the Layer 4 device. Here is what are we going to do:

  • We will allow SSH connectivity only from the IP addresses in the Demo/Support network, i.e. where our workstation is. However, we will use a stricter CIDR to allow only  10.16.9.x addresses. Here are the commands that will make that configuration:
    sudo iptables -A INPUT -i eth0 -p tcp --dport 22 -s 10.16.9.0/24 -m state --state NEW,ESTABLISHED -j ACCEPT
    sudo iptables -A OUTPUT -p tcp --sport 22 -d 10.16.9.0/24 -m state --state ESTABLISHED -j ACCEPT
  • We will enable connectivity on the following ports 8080 (or whatever the registry port is), 8883, 443, and 5671 to the Layer 5 network only, i.e. 10.16.7.x addresses. Here the commands that will make that configuration:
    sudo iptables -A OUTPUT -p tcp --dport <registry-port> -d 10.16.7.0/24 -m state --state NEW,ESTABLISHED -j ACCEPT
    sudo iptables -A INPUT -i eth0 -p tcp --sport <registry-port> -s 10.16.7.0/24 -m state --state ESTABLISHED -j ACCEPT
    sudo iptables -A OUTPUT -p tcp --dport 8883 -d 10.16.7.0/24 -m state --state NEW,ESTABLISHED -j ACCEPT
    sudo iptables -A INPUT -i eth0 -p tcp --sport 8883 -s 10.16.7.0/24 -m state --state ESTABLISHED -j ACCEPT
    sudo iptables -A OUTPUT -p tcp --dport 5671 -d 10.16.7.0/24 -m state --state NEW,ESTABLISHED -j ACCEPT
    sudo iptables -A INPUT -i eth0 -p tcp --sport 5671 -s 10.16.7.0/24 -m state --state ESTABLISHED -j ACCEPT
    sudo iptables -A OUTPUT -p tcp --dport 443 -d 10.16.7.0/24 -m state --state NEW,ESTABLISHED -j ACCEPT
    sudo iptables -A INPUT -i eth0 -p tcp --sport 443 -s 10.16.7.0/24 -m state --state ESTABLISHED -j ACCEPT
    
  • We will also enable connectivity on the following ports 8080 (or whatever the registry port is), 8883, 443, and 5671 from the OT Proxy network only, i.e. 10.16.5.x addresses. Here the commands that will make that configuration:
    sudo iptables -A OUTPUT -p tcp --dport <registry-port> -d 10.16.5.0/24 -m state --state ESTABLISHED -j ACCEPT
    sudo iptables -A INPUT -i eth0 -p tcp --sport <registry-port> -s 10.16.5.0/24 -m state --state NEW,ESTABLISHED -j ACCEPT
    sudo iptables -A OUTPUT -p tcp --dport 8883 -d 10.16.5.0/24 -m state --state ESTABLISHED -j ACCEPT
    sudo iptables -A INPUT -i eth0 -p tcp --sport 8883 -s 10.16.5.0/24 -m state --state NEW,ESTABLISHED -j ACCEPT
    sudo iptables -A OUTPUT -p tcp --dport 5671 -d 10.16.5.0/24 -m state --state ESTABLISHED -j ACCEPT
    sudo iptables -A INPUT -i eth0 -p tcp --sport 5671 -s 10.16.5.0/24 -m state --state NEW,ESTABLISHED -j ACCEPT
    sudo iptables -A OUTPUT -p tcp --dport 443 -d 10.16.5.0/24 -m state --state ESTABLISHED -j ACCEPT
    sudo iptables -A INPUT -i eth0 -p tcp --sport 443 -s 10.16.5.0/24 -m state --state NEW,ESTABLISHED -j ACCEPT
    
  • Last, we need to disable any other traffic to and from the Layer 4 device. Here the commands for that:
    sudo iptables -P INPUT DROP
    sudo iptables -P OUTPUT DROP
    sudo iptables -P FORWARD DROP
    

Don’t forget to save the configuration and restart the device with:

sudo netfilter-persistent save
sudo systemctl reboot

Deploying Modules on the Layer 4 Device

Similar to the deployment of modules on the Layer 5 Device, we need to deploy the standard $edgeAgent and $edgeHub modules as well as a registry module. For this layer, I experimented with the API proxy module. Hence, the configurations include also the $IoTEdgeApiProxy module. There is also an API Proxy configuration that you need to set on the $IoTEdgeApiProxy module’s twin – for details, take a look at Configure the API proxy module for your gateway hierarchy scenario. Here are the Gists that you can use for deploying the modules on the Layer 4 device:

By using the API proxy, you can also leverage the certificates deployed on the device to connect to the registry over HTTPS. Thus, you don’t need to configure Docker on the clients to connect to an insecured registry.

Testing the Registry Module

To test the registry module, use the following commands:

docker login -u <sync_token_name> -p <sync_token_password> 10.16.6.4:<port_if_any>
docker pull 10.16.6.4:<port_if_any>/azureiotedge-simulated-temperature-sensor:1.0

Now, we have set up the first nested layer (Layer 4) for our nested IoT Edge infrastructure. In the next post, I will describe the steps to set up the OT Proxy and the second nested layer for IoT Edge (Layer 3).

Image by ejaugsburg from Pixabay

In my last post, I started explaining how to configure the IoT Edge device hierarchy’s IT Proxy. This post will go one layer down and set up the Layer 5 device from the Purdue model for manufacturing networks.

Reconfiguring The Network

While implementing network segregation in the cloud is relatively easy, implementing it with a limited number of devices and a consumer-grade network switch requires a bit more design. In Azure, the routing between subnets within a VNet is automatically configured using the subnet gateways. In my case, though, the biggest challenge was connecting each Raspberry Pi device to two separate subnets using the available interfaces – one WiFi and one Ethernet. In essence, I couldn’t use the WiFi interface because then I was limited to my Eero’s (lack of) capabilities (Well, I have an idea how to make this work, but this will be a topic of a future post :)). Thus, my only option was to put all devices in the same subnet and play with the firewall on each device to restrict the traffic. Here is how the picture looks like.

To be able to connect to each individual device from my laptop (i.e. playing the role of the jumpbox from the Azure IoT Edge for Industrial IoT sample), I had to configure a second network interface on it and give it an IP address from the 10.16.0.0/16 network (the Workstation in the picture above). There are multiple ways to do that with easiest one to buy a USB networking dongle and connect it to the switch with the rest of the devices. One more thing that will be helpful to do to speed up the work is to edit the /etc/hosts file on my laptop and add DNS names for each of the devices:

10.16.8.4       pi-itproxy
10.16.7.4       pi-l5
10.16.6.4       pi-l4
10.16.5.4       pi-otproxy
10.16.4.4       pi-l3
10.16.3.4       pi-opcua

The next thing I had to do is to go back to the IT Proxy and change the subnet mask to enable a broader addressing range. Connect to the IT Proxy device that we configured in Configuring a hierarchy of IoT Edge Devices at Home Part 1 – Configuring the IT Proxy using ssh pi@10.16.8.4. Then edit the DHCP configuration file with sudo vi /etc/dhcpcd.conf and change the subnet mask from /24 to /16:

interface eth0
static ip_address=10.16.8.4/16

Now, the IT Proxy is configured to address the broader 10.16.0.0/16 network. To restrict the communication between the devices, we will configure each individual device’s firewall.

Now, getting back to the Layer 5 configuration. Normally, the Layer 5 device is configured to have access to the Internet via the IT Proxy as a gateway. Edit the DHCP configuration file with sudo vi /etc/dhcpcd.conf and add the following at the end:

interface eth0
static ip_address=10.16.7.4/16
static routers=10.16.8.4
static domain_name_servers=1.1.1.1

Note that I added the Cloudflare’s DNS server to the list of DNS servers. The reason for that is that the proxy device will not do DNS resolution. You can also configure it with your home network’s DNS server. This should be enough for now, and we can start installing the IoT Edge runtime on the Layer 5 device.

Testing from my laptop, I am able to connect to both devices using their 10.16.0.0/16 network IP addresses:

ssh pi@pi-itproxy

connects me to the IT Proxy, and:

ssh pi@pi-l5

connects me to the Layer 5 Iot Edge device.

Create the Azure Cloud Resources

Before we start installing the Azure IoT Edge runtime on the device, we need to create an Azure IoT Hub and register the L5 device with it. This is well described in Microsoft’s documentation explaining how to deploy code to a Linux device. Here are the Azure CLI commands that will create the resource group and the IoT Hub resources:

az group create --name tsm-rg-eus-test --location eastus
az iot hub create --resource-group tsm-rg-eus-test --name tsm-ioth-eus-test --sku S1 --partition-count 4

Next is to register the IoT Edge device and get the connection string. Here the commands:

az iot hub device-identity create --device-id L5-edge-pi --edge-enabled --hub-name tsm-ioth-eus-test
az iot hub device-identity connection-string show --device-id L5-edge-pi --hub-name tsm-ioth-eus-test

The last command will return the connection string that we should use to connect the new device to Azure IoT Hub. It has the following format:

{
  "connectionString": "HostName=tsm-ioth-eus-test.azure-devices.net;DeviceId=L5-edge-pi;SharedAccessKey=s0m#VeRYcRYpT1c$tR1nG"
}

Save the connection string because we will need it in the next section to configure the Layer 5 device.

Installing Azure IoT Edge Runtime on the Layer 5 Device

Installing the Azure IoT Edge is described in the Tutorial: Create a hierarchy of IoT Edge devices article. The tutorial describes how to build a hierarchy with two devices only. The important part of the nested configuration is to generate the certificates and transfer them to the devices. So, let’s go over this step by step for the Layer 5 device.

Create Certificates

The first thing we need to do on the workstation, after cloning the IoT Edge GitHub repository with

git clone https://github.com/Azure/iotedge.git

,is to generate the root and the intermediate certificates (check folder /tools/CACertificates):

./certGen.sh create_root_and_intermediate

Those certificates will be used to generate the individual devices’ certificates. For now, we will only create a certificate for the Layer 5 device.

./certGen.sh create_edge_device_ca_certificate "l5_certificate"

After those two command, you should have the following in your <WORKDIR>/certs folder:

drwxrwxrwx 1 toddysm toddysm 4096 Apr  8 15:04 .
drwxrwxrwx 1 toddysm toddysm 4096 Apr  8 15:04 ..
-rwxrwxrwx 1 toddysm toddysm 3960 Apr  8 14:54 azure-iot-test-only.intermediate-full-chain.cert.pem
-r-xr-xr-x 1 toddysm toddysm 1976 Apr  8 14:54 azure-iot-test-only.intermediate.cert.pem
-rwxrwxrwx 1 toddysm toddysm 5806 Apr  8 14:54 azure-iot-test-only.intermediate.cert.pfx
-r-xr-xr-x 1 toddysm toddysm 1984 Apr  8 14:54 azure-iot-test-only.root.ca.cert.pem
-rwxrwxrwx 1 toddysm toddysm 5891 Apr  8 15:04 iot-edge-device-ca-l5_certificate-full-chain.cert.pem
-r-xr-xr-x 1 toddysm toddysm 1931 Apr  8 15:04 iot-edge-device-ca-l5_certificate.cert.pem
-rwxrwxrwx 1 toddysm toddysm 7240 Apr  8 15:04 iot-edge-device-ca-l5_certificate.cert.pfx

The content of the <WORKDIR>/private folder should be the following:

drwxrwxrwx 1 toddysm toddysm 4096 Apr  8 15:04 .
drwxrwxrwx 1 toddysm toddysm 4096 Apr  8 15:04 ..
-r-xr-xr-x 1 toddysm toddysm 3326 Apr  8 14:54 azure-iot-test-only.intermediate.key.pem
-r-xr-xr-x 1 toddysm toddysm 3326 Apr  8 14:54 azure-iot-test-only.root.ca.key.pem
-r-xr-xr-x 1 toddysm toddysm 3243 Apr  8 15:04 iot-edge-device-ca-l5_certificate.key.pem

We need to upload the relevant certificates to the Layer 5 device. From the <WORKDIR> folder on the workstation issue the following commands:

scp ./certs/azure-iot-test-only.root.ca.cert.pem pi@pi-l5:.
scp ./certs/iot-edge-device-ca-l5_certificate-full-chain.cert.pem pi@pi-l5:.

The above two commands will upload the public key for the root certificate and the certificate chain to the Layer 5 device. The following command will upload the device’s private key:

scp ./private/iot-edge-device-ca-l5_certificate.key.pem pi@pi-l5:.

Connect to the Layer 5 device with ssh pi@pi-l5 and install the root CA using:

sudo cp ~/azure-iot-test-only.root.ca.cert.pem /usr/local/share/ca-certificates/azure-iot-test-only.root.ca.cert.pem.crt
sudo update-ca-certificates

The response from this command should be:

Updating certificates in /etc/ssl/certs...
1 added, 0 removed; done.
Running hooks in /etc/ca-certificates/update.d...
done.

To verify that the cert is installed, you can use:

ls /etc/ssl/certs/ | grep azure

We should also move the device certificate chain and the private key to the /var/secrets folder:

sudo mkdir /var/secrets
sudo mkdir /var/secrets/aziot
sudo mv iot-edge-device-ca-l5_certificate-full-chain.cert.pem /var/secrets/aziot/
sudo mv iot-edge-device-ca-l5_certificate.key.pem /var/secrets/aziot/

Configuring the IoT Edge Runtime

Installation and configuration of the IoT Edge runtime is described in the official Microsoft documentation (see Install or uninstall Azure IoT Edge for Linux ) but here the values you should set for the Layer 5 device configuration when editing the /etc/aziot/config.toml file:

hostname = "10.16.7.4"

trust_bundle_cert = "file:///etc/ssl/certs/azure-iot-test-only.root.ca.cert.pem.pem"

[provisioning]
source = "manual"
connection_string = "HostName=tsm-ioth-eus-test.azure-devices.net;DeviceId=L5-edge-pi;SharedAccessKey=s0m#VeRYcRYpT1c$tR1nG"

[agent.config]
image = "mcr.microsoft.com/azureiotedge-agent:1.2.0-rc4"
[edge_ca]
cert = "file:///var/secrets/aziot/iot-edge-device-ca-l5_certificate-full-chain.cert.pem"
pk = "file:///var/secrets/aziot/iot-edge-device-ca-l5_certificate.key.pem"

Apply the IoT Edge configuration with sudo iotedge config apply and check it with sudo iotedge check. At this point, the IoT Edge runtime should be running on the Layer 5 device. You can check the status with sudo iotedge system status.

Deploying Modules on the Layer 5 Device

The last thing we need to do is to deploy the required modules that will support the lower-layer device. In addition to the standard IoT Edge modules $edgeAgent and $edgeHub we need also to deploy a registry module.  The registry module is intended to serve the container images for the Layer 4 device. You should also deploy the API proxy module to enable a single endpoint for all services deployed on the IoT Edge device. Here are several Gists that you can use for deploying the modules on the Layer 5 device:

Testing the Registry Module

Before setting up the next layer of the hierarchical IoT edge infrastructure, we need to make sure that Layer 5 registry module is working properly. Without it, you will not be able to set up the next layer. From your laptop, you should be able to pull Docker images from the registry module. Depending on how you have deployed the registry module (using one of the Gists above), the pull commands should look something like this:

docker login -u <registry_username> -p <registry_password> 10.16.8.4:<port_if_any>
docker pull 10.16.8.4:<port_if_any>/azureiotedge-simulated-temperature-sensor:1.0

Now, we have set up the top layer (Layer 5) for our nested IoT Edge infrastructure. In the next post, I will describe the steps to set up the first nested layer of the hierarchical infrastructure – Layer 4.

Image by falco from Pixabay.

To provide support for the hierarchical Azure IoT Edge scenarios we started working on a connected registry implementation that will allow extension of the Azure container registry functionality to on-premises. For those of you who are not familiar with what a hierarchical IoT Edge scenario is, take a look at the Purdue network model used in the ISA 95 and ISA 99 standards – TL;DR: it is the network architecture that allows segregation of OT and IT traffic in manufacturing networks. While the Azure IoT team has provided a sample of the hierarchical IoT Edge environment, I wanted to reproduce it using physical ARM-based devices.

The problem with configuring a hierarchy of IoT Edge devices at home is that home networking devices do not allow advanced network configurations that you would normally expect from an enterprise-grade switch or router. While my Eero is good at its mesh WiFi capabilities, it is a very poor implementation of a switch and doesn’t allow the creation of multiple virtual networks or WiFI SSIDs. The routing capabilities are also quite limited, which to be honest, impact the ability to create secure IoT networks at home. However, I derail… To implement my configuration, I gathered a bunch of Raspberry Pi 4s, Pi Zeros, and an Nvidia Jetson Nano and got to work.

Here is the big picture of what are we implementing:

Hierarchy of IoT Edge devices - Purdue Networks

 

In this post, I will walk over the configuration of the IT Proxy in the IT DMZ layer shown in the picture above. For that, I decided to use a Raspberry Pi 4 device and configure it with the Squid proxy similar to the Azure IoT sample linked above. Before that though, let’s look at the device configuration.

Configuring the Raspberry Pi Network Interfaces

Raspberry Pi 4 has two network interfaces – WiFi one and Ethernet one. The idea here is to configure the WiFi interface to connect to my home network using the home network IP range  192.168.0.0/24 and wire the Ethernet interface to a simple switch and assign it a static IP address from the  10.16.8.0/24 network. Routing between the two interfaces should also be established so traffic from the  10.16.8.0/24 network can flow to the 192.168.0.0/24 network. Pulling some details from the Setting up a Raspberry Pi as a routed wireless access point article on the Raspberry Pi’s official website, I ended up with the following configuration.

Warning: Be careful here and don’t copy the commands directly from the Pi’s article! We are routing in the reverse direction – from the Ethernet interface to the WiFi interface.

  • Install the netfilter-persistent and the iptables-persistent plugin to be able to persist the routing rules between reboots:
    sudo DEBIAN_FRONTEND=noninteractive apt install -y netfilter-persistent iptables-persistent
  • Configure a static IP address for the Ethernet interface. To follow the Azure IoT sample, I choose 10.16.8.4. Edit the DHCP configuration file with sudo vi /etc/dhcpcd.conf and add the following at the end:
    interface eth0
    static ip_address=10.16.8.4/24
  • Enable routing by creating a new file with sudo vi /etc/sysctl.d/routed-proxy.conf and add the following in it:
    # Enable IPv4 routing
    net.ipv4.ip_forward=1
  • Next, create a firewall rule with:
    sudo iptables -t nat -A POSTROUTING -o wlan0 -j MASQUERADE
  • Last, save the configuration and reboot the device:
    sudo netfilter-persistent save
    sudo systemctl reboot

Checking the configuration with ifconfig after reboot should give you something similar to:

pi@raspberrypi:~ $ ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.16.8.4  netmask 255.255.255.0  broadcast 10.16.8.255
        inet6 fe80::6816:2aa8:ab8d:9f93  prefixlen 64  scopeid 0x20<link>
        ether dc:a6:32:77:fb:99  txqueuelen 1000  (Ethernet)
        RX packets 5565  bytes 1359978 (1.2 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 131  bytes 14399 (14.0 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 22  bytes 1848 (1.8 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 22  bytes 1848 (1.8 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

wlan0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.0.101  netmask 255.255.0.0  broadcast 192.168.255.255
        inet6 fe80::6788:d8e2:4e8a:558  prefixlen 64  scopeid 0x20<link>
        ether dc:a6:32:77:fb:9a  txqueuelen 1000  (Ethernet)
        RX packets 7619  bytes 1671937 (1.5 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 1494  bytes 346279 (338.1 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

To test the connectivity through the newly configured router, I had to set up my Mac with a wired connection using a USB Ethernet adapter. Here the configuration:

traceroute yields satisfactory results showing that the traffic goes through the Raspberry Pi:

toddysm@MacBook-Pro ~ % traceroute 192.168.0.101
traceroute to 192.168.0.101 (192.168.0.101), 64 hops max, 52 byte packets
 1  192.168.0.101 (192.168.0.101)  9.146 ms  7.669 ms  10.125 ms
toddysm@MacBook-Pro ~ % traceroute 172.217.3.164
traceroute to 172.217.3.164 (172.217.3.164), 64 hops max, 52 byte packets
 1  10.16.8.4 (10.16.8.4)  2.555 ms  0.661 ms  0.377 ms
 2  * *^C
toddysm@MacBook-Pro ~ %

Installing and Configuring Squid Proxy

Installing Squid proxy is as trivial as typing the following command:

sudo apt install squid

However, there are two configuration settings you need to do to make sure that the proxy can be used from machines on the local network. Those settings are available in the Squid configuration file /etc/squid/squid.conf:

  • First, the port configuration for the proxy is set to bind to any IP address but when the proxy starts, it binds to the IPv6 addresses instead of the IPv4 ones. You need to change the following line:
    http_port 3128

    and add the IP address of the Ethernet port like this:

    http_port 10.16.8.4:3128
  • Second, access to the proxy is enabled only from the localhost. Uncomment the following line to enable access from the localnet:
    http_access allow localnet

    Also, make sure that the localnet is defined as:

    acl localnet src 0.0.0.1-0.255.255.255  # RFC 1122 "this" network (LAN)
    acl localnet src 10.0.0.0/8             # RFC 1918 local private network (LAN)
    acl localnet src 100.64.0.0/10          # RFC 6598 shared address space (CGN)
    acl localnet src 169.254.0.0/16         # RFC 3927 link-local (directly plugged) machines
    acl localnet src 172.16.0.0/12          # RFC 1918 local private network (LAN)
    acl localnet src 192.168.0.0/16         # RFC 1918 local private network (LAN)
    acl localnet src fc00::/7               # RFC 4193 local private network range
    acl localnet src fe80::/10              # RFC 4291 link-local (directly plugged) machines

    The latter should already be done later in the file.

To make sure that the proxy is properly configured, I changed the networking configuration in my Firefox browser on my Mac as follows:

I also turned off my Mac’s WiFi to make sure that the traffic goes through the wired interface and uses the proxy. The test was successful and now I have the top layer of the hierarchy of IoT Edge network configured.

In the next post, I will go over configuring the L5 of the Purdue network architecture and installing Azure IoT Edge runtime on it.

With the recent Solorigate incident, a lot of emphasis is put on determining the origin of the software running in an enterprise. For Docker container images, this will mean to embed in the image the Dockerfile the image was built from. However, tracking down the software origin is not so trivial to do. For closed-source software, we blindly trust the vendors and if we are lucky enough, we may get a signed piece of code. For open-source one, we rarely check the SHA signature and never even think of verifying what source code this binary was produced from. In talks with customers, I quite often hear them asking, how can they verify what sources a container image is built from. They want to attribute each image with metadata that links to the Dockerfile used to build the image as well as the Git commit and the developer who triggered the build.

There are many articles that discuss this problem. Here are two recent examples. Richard Lander from the Microsoft .NET team writes in his blog post Staying safe with .NET containers about the pedigree and provenance of the software we run and how to think about it. Josh Hendrick in his post Embedding source code version information in Docker images offers one solution to the problem.

Josh Hendrick’s proposal is in the direction I would go, but one problem I have with it is that it requires special handling in the application that runs in the container to obtain this information. I would prefer to have this information readily available without the need to run the container image. Docker images and the Open Container Initiative already have specified ways to do that without adding special files to your image. In this post, I will outline another way you can embed this information into your images and easily retrieve it without any changes to your application.

Using Docker Image Labels

Docker images spec has already built-in functionality to add labels to the image. Labels are intended to be set during build time. They also show up when inspecting the image using docker image inspect, which makes them the right choice to specify the Dockerfile and the other build origin details. One more argument that makes them the right choice for this information is that the labels are layers in the image, and thus immutable. If you change the label in an image the resulting image SHA will change.

To demonstrate how labels can be used to embed the Dockerfile and other origin information into the Docker image, I have published a dynamic labels sample on GitHub. The sample uses a base Python image and implements a simple functionality to print the container’s environment variables. Let’s walk through it step by step.

The Dockerfile is quite simple.

FROM python:slim
ARG IMAGE_COMMITTER
ARG IMAGE_DOCKERFILE
ARG IMAGE_COMMIT_SHA
LABEL "build.user"=${IMAGE_COMMITTER}
LABEL "build.sha"=${IMAGE_COMMIT_SHA}
LABEL "build.dockerfile"=${IMAGE_DOCKERFILE}
ADD ./samples/dynamic-labels/source /
CMD ["python", "/show_environment.py"]

Lines 2-4 define the build arguments that need to be set during the build of the image. Lines 5-7 set the three labels build.user, build.sha, and build.dockerfilethat we want to embed in the image. build.dockerfile is the URL to the Dockerfile in the GitHub repository, while the build.sha is the Git commit that triggers the build. If you build the image locally with some dummy build arguments you will see that new layers are created for each of the lines 5-7.

toddysm@MacBook-Pro ~ % docker build -t test --build-arg IMAGE_COMMITTER=toddysm --build-arg IMAGE_DOCKERFILE=https://test.com --build-arg IMAGE_COMMIT_SHA=12345 -f .\samples\dynamic-labels\Dockerfile .
Sending build context to Docker daemon  376.3kB
Step 1/9 : FROM python:slim
 ---> 8c84baace4b3
Step 2/9 : ARG IMAGE_COMMITTER
 ---> Running in 71ad05f20d20
Removing intermediate container 71ad05f20d20
 ---> fe56c62b9903
Step 3/9 : ARG IMAGE_DOCKERFILE
 ---> Running in fe468c44e9fc
Removing intermediate container fe468c44e9fc
 ---> b776dca57bd7
Step 4/9 : ARG IMAGE_COMMIT_SHA
 ---> Running in 849a82225c31
Removing intermediate container 849a82225c31
 ---> 3a4c6c23a699
Step 5/9 : LABEL "build.user"=${IMAGE_COMMITTER}
 ---> Running in fd4bfb8d5b5b
Removing intermediate container fd4bfb8d5b5b
 ---> 2e9be17c48ff
Step 6/9 : LABEL "build.sha"=${IMAGE_COMMIT_SHA}
 ---> Running in 892323d73495
Removing intermediate container 892323d73495
 ---> b7bc6559629d
Step 7/9 : LABEL "build.dockerfile"=${IMAGE_DOCKERFILE}
 ---> Running in 98687b8dd9fb
Removing intermediate container 98687b8dd9fb
 ---> 35e97d273cbc
Step 8/9 : ADD ./samples/dynamic-labels/source /
 ---> 9e71859892b1
Step 9/9 : CMD ["python", "/show_environment.py"]
 ---> Running in 366b1b6c3bea
Removing intermediate container 366b1b6c3bea
 ---> e7cb39a21c2a
Successfully built e7cb39a21c2a
Successfully tagged test:latest

You can inspect the image and see the labels by issuing the command docker image inspect --format='{{json .Config.Labels}}' <imagename>.

toddysm@MacBook-Pro ~ % docker image inspect --format='{{json .Config.Labels}}' test | jq
{
  "build.dockerfile":"https://test.com",
  "build.sha":"12345",
  "build.user":"toddysm"
}

Now, let’s automate the process with the help of GitHub Actions. I have created one GitHub Action to build and push the image to DockerHub and another to build and push to Azure Container Registry (ACR). Both actions are similar in the steps they use. The first two steps are the same for both actions. They will build the URL to the Dockerfile using the corresponding GitHub Actions variables:

- name: 'Set environment variable for Dockerfile URL for push'
  if: ${{ github.event_name == 'push' }}
  run: echo "DOCKERFILE_URL=${GITHUB_SERVER_URL}/${GITHUB_REPOSITORY}/blob/${GITHUB_REF#refs/*/}/samples/dynamic-labels/Dockerfile" >> $GITHUB_ENV

- name: 'Set environment variable for Dockerfile URL for pull request'
  if: ${{ github.event_name == 'pull_request' }}
  run: echo "DOCKERFILE_URL=${GITHUB_SERVER_URL}/${GITHUB_REPOSITORY}/blob/${GITHUB_BASE_REF#refs/*/}/samples/dynamic-labels/Dockerfile" >> $GITHUB_ENV

Then, there will be specific steps to sign into DockerHub or Azure. After that, the build steps are the ones where the labels are set. Here, for example, is the build step that buildx and automatically pushes the image to DockerHub:

- name: Build and push
  id: docker_build
  uses: docker/build-push-action@v2
  with:
    context: ./
    file: ./samples/dynamic-labels/Dockerfile
    push: true
    tags: ${{ secrets.DOCKER_HUB_REPONAME }}:build-${{ github.run_number }}
    build-args: |
      IMAGE_COMMITTER=${{ github.actor }}
      IMAGE_DOCKERFILE=${{ env.DOCKERFILE_URL }}
      IMAGE_COMMIT_SHA=${{ github.sha }}

The build step for building the image and pushing to Azure Container Registry uses the traditional docker build approach:

- name: Build and push
  id: docker_build
  uses: azure/docker-login@v1
  with:
    login-server: ${{ secrets.ACR_REGISTRY_LOGIN_SERVER }}
    username: ${{ secrets.ACR_REGISTRY_USERNAME }}
    password: ${{ secrets.ACR_REGISTRY_PASSWORD }}
- run: |
    docker build -f ./samples/dynamic-labels/Dockerfile -t ${{ secrets.ACR_REGISTRY_LOGIN_SERVER }}/${{ secrets.ACR_REPOSITORY_NAME }}:build-${{ github.run_number }} --build-arg IMAGE_COMMITTER=${{ github.actor }} --build-arg IMAGE_DOCKERFILE=${{ env.DOCKERFILE_URL }} --build-arg IMAGE_COMMIT_SHA=${{ github.sha }} .
    docker push ${{ secrets.ACR_REGISTRY_LOGIN_SERVER }}/${{ secrets.ACR_REPOSITORY_NAME }}:build-${{ github.run_number }}

After the actions complete, the images are available in DockerHub and Azure Container Registry. Here is how the image looks like in DockerHub:

Docker container image with labels

If you scroll down a little, you will see the labels that appear in the list of layers:

The URL points you to the Dockerfile that was used to create the image while the commit SHA can be used to identify the latest changes that are done on the project that is used to build the image. If you pull the image locally, you can also see the labels using the command:

toddysm@MacBook-Pro ~ % docker pull toddysm/tmstests:build-36
build-36: Pulling from toddysm/tmstests
45b42c59be33: Already exists
8cd3485318db: Already exists
2f564129f025: Pull complete
cf1573f5a21e: Pull complete
ceec8aed2dab: Pull complete
78b1088f77a0: Pull complete
Digest: sha256:7862c2a31970916fd50d3ab38de0dad74a180374d41625f014341c90c4b55758
Status: Downloaded newer image for toddysm/tmstests:build-36
docker.io/toddysm/tmstests:build-36
toddysm@MacBook-Pro ~ % docker image inspect --format='{{json .Config.Labels}}' toddysm/tmstests:build-36
{
  "build.dockerfile":"https://github.com/CrimsonPinnacle/container-image-inspector/blob/development/samples/dynamic-labels/Dockerfile",
  "build.sha":"e80e6ef86f86a11d6a73aea8d8c41700c4d3d7c5",
  "build.user":"toddysm"
}

To summarize, the benefit of using labels for embedding the Dockerfile and other origin information into the container images is that those are considered immutable layers of the image. Thus, they cannot be changed without changing the image.

Who is Using Docker Image Labels?

Unfortunately, labels are not widely used if at all 🙁 Checking several popular images from DockerHub yields the following results:

toddysm@MacBook-Pro ~ % docker image inspect --format='{{json .Config.Labels}}' busybox | jq
null
toddysm@MacBook-Pro ~ % docker image inspect --format='{{json .Config.Labels}}' alpine | jq 
null
toddysm@MacBook-Pro ~ % docker image inspect --format='{{json .Config.Labels}}' ubuntu | jq 
null

Tracking down the sources from which the Alpine image is built would require much higher effort.

What is Next for Checking Docker Image Origins?

There are a couple of community initiatives that will play a role in determining the origin of container images.

  • Notary V2 will allow images to be signed. Having the origin information embedded into the image and adding an official signature to the image will increase the confidence in the legitimacy of the image.
  • OCI manifest specification allows artifacts (i.e. images) to be annotated with arbitrary metadata. Unfortunately, Docker doesn’t support those yet. Hopefully, in the future, Docker images will add support for arbitrary metadata that can be included in the image manifest.
  • An implementation of metadata service (see metadata service draft from Steve Lasker) as part of the registry will enable additional capabilities to provide origin information for the images.

Summary

While image metadata is great to annotate images with useful information and enable search and querying capabilities, the metadata is kept outside of the image layers and can mutate over time. Verifying the authenticity of the metadata and keeping a history of the changes will be a harder problem to solve. Docker already provides a way to embed the Dockerfile and other image origin information as immutable layers of the image itself. Using dynamically populated Docker image labels, developers can right now provide origin information and increase the supply chain confidence for their images.

In my previous post, Learn More About Your Home Network with Elastic SIEM – Part 1: Setting Up Elastic SIEM, I explained how you could set up Elastic SIEM on a Raspberry Pi[ad]. The next thing you would want to do is to collect the logs from your firewall and analyze them. Before I jump into the technical details, I should warn you that you may… not be able to do the steps below if you rely on consumer products or you use the equipment provided by your ISP.

Let me go on a short rant here! Every self-respected router vendor should allow firewall logs to be sent to an external system. A common approach is to use the SYSLOG protocol to collect the logs. If your router does not have this capability… well, I would suggest you buy a new, more advanced one.

I personally invested in a tiny Netgate SG-1100 box that runs the open-source PFSense router/firewall. You can, of course, install PFSense on your own hardware if you don’t want to buy a new device. PFSense allows you to configure up to three external log servers. Logstash, that we have configured in the previous post, can play the role of an SYSLOG server and send the events to Elasticsearch. Here is how simple the configuration of the PFSense log shipping looks:

The IP address 192.168.11.72 is the address of the Raspberry Pi, where the ELK SIEM is installed and 5140 is the port that Logstash uses to listen for incoming events. Thas is all you need to configure PFSense to send the logs to the ELK SIEM.

Our next step is to configure Logstash to collect the events from PFSense and feed them into an index in Elastic. The following project from Patrick Jennings will help you with the Logstash configuration. If you follow the instructions, you will see the new index show up in Kibana like this:

The last thing we need to do is to create a dashboard in Kibana to show the data collected from the firewall. Patrick Jennings’ project has pre-configured visualizations and a dashboard for the PFSense data. Unfortunately, when you import those, Kibana warns you that those need to be updated. The reason is that they use the old JSON format used by Kibana, and the latest versions require all objects to be described using the Newline Delimited NDJSON format (for more details, visit ndjson.org). The pfSense dashboard and visualization are available in my GitHub repository for Home SIEM.

Now, keep in mind that the pfSense logs will not feed into the SIEM functionality of the Elastic stack because it is not in the Elastic Common Schema (ECS) format. What we have created is just a dashboard that visualizes the firewall log data. Also, the dashboard and the visualizations are built using the pfSense data. If you use a different router/firewall, you will need to update the configuration to visualize the data, and things may not work out of the box. I would be curious to hear feedback on how other routers can send data to ELK.

In subsequent posts, I will describe how you can use Beats to get data from the machines connected to your local network and how you can dig deeper into the collected data.

 

 

 

 

 

Last night I had some free time to play with my network, and I ran  tcpdump out of curiosity. For a while, I’ve been interested to analyze what traffic is going through my home network, and the result of my test pushed me to get to work. I have a bunch of Raspberry Pi devices in my drawers, so, the simplest thing that I can do is get one and install Elastic SIEM on it. For those of you, who don’t know what SIEM is, it stands for Security Information and Event Management. My hope was that with it, I will be able to get a better understanding of the traffic on my home network.

Installing Elastic SIEM on Raspberry Pi

The first thing I had to do is to install the ELK stack on a Raspberry Pi. There are not too many good articles that explain how to set up Elastic SIEM on your Pi. According to Elastic, Elastic SIEM is generally available in Elastic Stack 7.6. So, installing the Elastic Stack should do the work.

A couple of notes before that:

  1. The first thing to keep in mind is that 8GB is the minimum requirement for the ELK stack. You can get around with 2GB Pi, but if you want to run the whole stack (Elasticsearch, Logstash, and Kibana) on a single device, make sure that you order Raspberry Pi 4 Model B Quad Core 64 Bit with 4GB[ad]. Even this one is a stretch if you collect a lot of information. A good option would be to split the stack over two devices: one for Elasticsearch and Kibana, and another one for the Logstash service
  2. Elastic has no builds for Raspbian. Hence, in the instructions below, I will use Debian packages and will describe the steps to install those on the Pi. This will require some custom configs and scripts, so be prepared for that. Well, this article is how to hack the installation and no warranties are provided 🙂
  3. You will not be able to use the ML functionality of Elasticsearch because it is not supported on the low-powered Raspbian device
  4. Below steps assume version 7.7.0 of the ELK stack. If you are installing a different version, make sure that you replace it accordingly in the commands below
  5. Last but not least (in the spirit of no warranties), Elasticsearch has a dependency on libc6that will be ignored and will break future updates. You have to deal with this at your own risk

Here the steps to follow.

Installing Elasticsearch on Raspberry Pi

  1. Set up your Raspberry Pi first. Here are the steps to set up your Rasberry Pi. The Raspberry Pi Imager makes it even easier to set up the OS. Once again, I would recommend using Raspberry Pi 4 Model B Quad Core 64 Bit with 4GB[ad] and a larger SD card[ad] to save logs for longer.
  2. Make sure all packages are up to date, and your Raspbian OS is fully patched.
    sudo apt-get update
    sudo apt-get upgrade
  3. Install the ZIP utility, we will need that later on for the Logstash configuration.
    sudo apt-get install zip
  4. Then, install the Java JRE because Elastic Stack requires it. You can use the open-source JRE to avoid licensing troubles with Oracle’s one.
    sudo apt-get install default-jre
  5. Once this is done, go ahead and download the Debian package for Elasticsearch. Make sure that you download the package with no JDK in it.
    wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.7.0-no-jdk-amd64.deb
  6. Once this is done, go ahead and install Elasticsearch using the package manager.
    sudo dpkg -i --force-all --ignore-depends=libc6 elasticsearch-7.7.0-no-jdk-amd64.deb
  7. Next, we need to configure Elasticsearch to use the installed JDK.
    sudo vi /etc/default/elasticsearch

    Set the JAVA_HOMEto the location of the JDK. Normally this is /usr/lib/jvm/default-java. You can also set the JAVA_HOMEto the same path in the /etc/environmentfile but this is not required.

  8. Last thing you need to do if to disable the ML XPack for Elasticsearch. Change the access mode to the /etc/elasticsearchdirectory first and edit the Elasticsearch configuration file.
    sudo chmod g+w /etc/elasticsearch
    sudo vi /etc/elasticsearch/elasticsearch.yml

    Change the xpack.ml.enabledsetting to falseas follows:

    xpack.ml.enabled: false

The above steps install and configure the Elasticsearch service on your Raspberry Pi. You can start the service with:

sudo service elasticsearch start

Or check its status with:

sudo service elasticsearch status

Installing Logstash on Raspberry Pi

Installing Logstash on the Raspberry Pi turned out to be a bit more problematic than Elasticsearch. Again, Elastic doesn’t have a Logstash package that targets ARM architecture and you need to install it manually. StackOverflow posts and GitHub issues were particularly helpful for that – I listed the two I used in the References at the end of this article. Here the steps:

  1. Download the Logstash Debian package from Elastic’s repository.
    wget https://artifacts.elastic.co/downloads/logstash/logstash-7.7.0.deb
  2. Install the downloaded package using the dpkg package installer.
    sudo dpkg -i logstash-7.7.0.deb
  3. If you run Logstash at this point and encounter error similar to logstash load error: ffi/ffi -- java.lang.NullPointerException: null get Alexandre Alouit’s fix from GitHub using.
    wget https://gist.githubusercontent.com/toddysm/6b4b9c63f32a3dfc476a725561fc23af/raw/06a2409df3eba5054d7266a8227b991a87837407/fix.sh
  4. Go to /usr/share/logstash/logstash-core/lib/jars and check the version of the jruby-complete-X.X.X.X.jarJAR
  5. Open the downloaded fix.sh, and replace the version of the jruby-complete-X.X.X.X.jaron line 11 with the one from your distribution. In my case, that was  jruby-complete-9.2.11.1.jar
  6. Change the permissions of the downloaded fix.shscript, and run it.
    chmod 755 fix.sh
    sudo ./fix.sh
  7. You can run Logstash with.
    sudo service logstash start

You can check the Logstash logs in /var/log/logstash/logstash-plain.logfor information on whether Logstash is successfully started.

Installing Kibana on Raspberry Pi

Installing Kibana had different challenges. The problem is that Kibana requires an older version of NodeJS, but the one that is packed with the Debian package doesn’t run on Raspbian. What you need to do is to replace the NodeJS version after you install the Debian package. Here the steps:

  1. Download the Kinabna Debian package from Elastic’s repository.
    wget https://artifacts.elastic.co/downloads/kibana/kibana-7.7.0-amd64.deb
  2. Install the downloaded package using the dpkg package installer.
    sudo dpkg -i --force-all kibana-7.7.0-amd64.deb
  3. Move the redistributed NodeJS to another folder (or delete it completely) and create a new empty directory nodein the Kibana installation directory.
    sudo mv /usr/share/kibana/node /usr/share/kibana/node.OLD
    sudo mkdir /usr/share/kibana/node
  4. Next, download version 10.19.0 of NodeJS. This is the required version of NodeJS for Kibana 7.7.0. If you are installing another version of Kibana, you may want to check what NodeJS version it requires. The best way to do that is to start the Kibana service and it will tell you.
    wget https://nodejs.org/download/release/v10.19.0/node-v10.19.0-linux-armv7l.tar.xz
  5. Unpack the TAR and move the content to the nodedirectory under the Kibana installation directory.
    sudo tar -xJvf node-v10.19.0-linux-armv7l.tar.xz
    sudo mv ./node-v10.19.0-linux-armv7l.tar.xz/* /usr/share/kibana/node
  6. You may also want to create symlinks for the NodeJS executable and its tools.
    sudo ln -s /usr/share/kibana/node/bin/node /usr/bin/node
    sudo ln -s /usr/share/kibana/node/bin/npm /usr/bin/npm
    sudo ln -s /usr/share/kibana/node/bin/npx /usr/bin/npx
  7. Configure Kibana to accept requests on any IP address on the device.
    sudo vi /etc/kibana/kibana.yml

    Set the server.hostsetting to 0.0.0.0like this:

    server.host: "0.0.0.0"
  8. You can run Kibana with.
    sudo service kibana start

Conclusion

Although not supported, you can run the complete ELK stack on a Raspberry Pi 4[ad] device. It is not the most trivial installation, but it is not so hard either. In the following posts, I will explain how you can use the Elastic SIEM to monitor the traffic on your network.

References

Here are some additional links that you may find useful:

For a while, I’ve been planning to build a cybersecurity research environment in the cloud that I can use to experiment with and research malicious cyber activities. Well, yesterday I received the following message on my cell phone:

Hello mate, your FEDEX package with tracking code GB-6412-GH83 is waiting for you to set delivery preferences: <url_goes_here>

My curiosity to follow the link was so tempting that I said: “Now is the time to get this sandbox working!” I will write about the scam above in another blog post, but in this one, I would like to explain what I needed in the cloud and how did I set it up.

Cybersecurity Research Needs

What (I think) I need from this sandbox environment? I know that my requirements will change over time when I get deeper in the various scenarios but for now, here is what I wanted to have:

  • First, and foremost, a dedicated network where I can click on malicious links and browse dark web sites without the fear that my laptop or local network will get infected. I also need to have the ability to split the network into subnets for different purposes
  • Next, I needed pre-built VM images for various purposes. Here some examples:
    • A Windows client machine to act as an unsuspicious user. Most probably, this VM will need to have Microsoft Office and other software like Acrobat Reader installed. Other more advanced software that will help track URLs and monitor process may also be required on this machine. I will go deeper into this in a separate post
    • Linux machine with networking tools that will allow me to better understand and monitor network traffic. Kali Linux may be the right choice, but Ubuntu and CentOS may also be helpful as different agents
    • I may also need some Windows Server and Linux server images to simulate enterprise environments
  • Very important for me was to be able to create the VM I need quickly, do the work I need, and tear it down after that. Automating the process of VM creation and set up was high up on the list
  • Also, I wanted to make sure that if I forget the VM running, it will be shut down automatically to save money

With those basic requirements, I got to work setting up the cloud environment. For my experiments, I choose Microsoft Azure because I have a good relationship with the Azure team and also some credits that I can use.

Segregating Network Access

As mentioned in the previous section, I need a separate network for the VMs to avoid any possibility of infecting my laptop. Now, the question that comes to mind is: Is a single virtual network with several subnets OK or not? Having in mind that I can destroy this environment at any time and recreate it (yes, this is the automation part), I decided to go with a single VNet and the following subnets in it:

  • Sandbox subnet to be used to spin up virtual machines that can simulate user behavior. Those will be either single VMs running Windows client and Microsoft Office or set of those if I want to observe the lateral movement within the network. I may also have a Linux machine with Wireshark installed on it to watch network traffic to the web sites that host the malicious pages
  • Honeypot subnet to be used to expose vulnerable devices to the internet. Those may be Windows Server Datacenter VMs or Linux servers without outdated patches and weaker or compromised passwords
  • Frontend subnet to be used to host exploit frameworks for red team scenarios. One example can be the Social Engineering Toolkit (SET). Also, simple redirection services or other apps can be placed in this subnet
  • Public subnet that is required in Azure if I need to set up any load balancers for the exploit apps

With all this in mind, I need to make sure that traffic between those subnets is not allowed. Hence, I had to set up a Network Security Group for each subnet to disable inbound and outbound VNet traffic.

Cybersecurity Virtual Machine Images

This can be an evergrowing list depending on the scenarios but below is the list of the first few I would like to start with. I will give them specific names based on the purpose I plan to use them for:

User VM

The purpose of the User VM is to simulate an office worker (think of receptionist, accountant or, admin). Typically such machines have a Windows Client OS installed on it as well as other office software like Word, Excel, PowerPoint, Outlook and Acrobat Reader. Different browsers like Edge, Chrome, and Firefox will be also useful to test.

At this point, the question that I need to answer is whether I would like to have any other software installed on this machine that will allow me to analyze the memory, reverse engineer binaries or, monitor network traffic on it. I decided to go with a clean user machine to be able to see the exact behavior the user sees and not impact it with the availability of other tools. One other reason I thought this would be a better approach is to avoid malware that checks for the existence of specialized software.

When I started building this VM image, I also had to decide whether I want to have any anti-virus software installed on it. And of course, the question is: “What anti-virus software?” My company is a Sophos Partner, and this would be my obvious choice. I decided to build two VM images to go with: one without anti-virus and one with.

User Analysis VM

This one is the same as the User VM but with added software for malware analysis. The minimum I need installed is:

  • Wireshark for network scanning
  • Cygwin for the necessary Linux tools
  • HEXDump viewer/editor
  • Decompiler

I am pretty sure the list will grow over time and I will keep a tap on what software I have installed or prefer to use. What my intent is to build Kali Linux type of VM but for Windows 🙂

Kali VM

This one is the obvious choice. It can be used not only for offensive capabilities but also to cover your identity when accessing malicious sites. It has all the necessary tools for hackers and it is available from the Microsoft Azure Marketplace.

Tor VM

One last VM type I would like to have is a machine on which I can install the Tor browser for private browsing. Similar to the Kali VM, it can be used for hiding the identity of the machine and the user who accesses the malicious sites. It can also be used to gain access to Dark Web sites and forums for cybersecurity research purposes.

Those are the VM images I plan for now. Later on, I may decide on more.

Automating the Security Research VM Creation

Ideally, I would like to be able to create a whole security research environment with a single script. While this is certainly possible, I will not be able to do it in an hour to load the above URL. I will slowly implement the automation based on my needs going forward. However, one thing that I will do immediately is to save regular snapshots of my VMs once I install new software. I will also need to version those.

I can use those snapshots to quickly spin up new VM with the required software if the previous one gets infected (ant this will certainly happen). So, for now, my automation will be only to create a VM from an Azure Disk snapshot.

Shutting Down the Security Research VMs

My last requirement was to shut down the security research VMs when I don’t need them. Like every other person in the world, I forget things, and if I forget the VMs running, I can incur some expenses that I would not be happy to pay. While I am working on full-fledged scheduling capability for Azure resources for customers, it is still in the works and I cannot use it yet. Hence, I will leverage the built-in Azure functionality for Dev/Test workloads and will schedule a daily shutdown at 10:00 PM PST time. This way, if I forget to turn off any VM, it will not continue to run all the time.

With all this, my plan is ready and I can move on to build my security research environment.