Blog

In my last post, Implementing Quarantine Pattern for Container Images, I wrote about how to implement a quarantine pattern for container images and how to use policies to prevent the deployment of an image that doesn’t meet certain criteria. In that post, I also mentioned that the quarantine flag (not to be confused with the quarantine pattern 🙂) has certain disadvantages. Since then, Steve Lasker has convinced me that the quarantine flag could be useful in certain scenarios. Many of those scenarios are new and will play a role in the containers’ secure supply chain improvements. Before we look at the scenarios, let’s revisit how the quarantine flag works.

What is the Container Image Quarantine Flag?

As you remember from the previous post, the quarantine flag is set on an image at the time the image is pushed to the registry. The expected workflow is shown in the flow diagram below.

The quarantine flag stays set on the image until the Quarantine Processor completes the required actions and removes the image from quarantine. We will go into detail about what those actions can be later in the post. The important thing to remember is that, while in quarantine, the image can be pulled only by the Quarantine Processor. Neither the Publisher, the Consumer, nor any other actor should be able to pull the image from the registry while it is in quarantine. This is achieved through special permissions assigned to the Quarantine Processor that the other actors do not have. Such permissions can be quarantine pull and quarantine push, which allow pulling artifacts from and pushing artifacts to the registry while the image is in quarantine.

Inside the registry, you will have a mix of images that are in quarantine and images that are not. The quarantined ones can only be pulled by the Quarantine Processor, while others can be pulled by anybody who has access to the registry.

Quarantining images is a capability that needs to be implemented in the container registry. Because this is not a standard capability, very few, if any, container registries implement it. Azure Container Registry (ACR) has a quarantine feature that is in preview (a sketch of how to enable it follows the list of limitations below). As explained in the previous post, the quarantine flag’s limitations still apply. Mainly, those are:

  • If you need more than one Quarantine Processor, you need to figure out a way to synchronize their operations. The Quarantine Processor that completes the last action should remove the quarantine flag.
  • Asynchronous processing is hard to manage. The Quarantine Processor manages all the actions and changes the flag. If an action requires asynchronous processing, the Quarantine Processor needs to wait for the action to complete before it can evaluate the result and change the flag.
  • Last, you should not set the quarantine flag again once you have removed it. If you do, you may break a lot of functionality and bring down your workloads. The problem is that you do not have granular control over who can and cannot pull the image, other than granting them the Quarantine Processor role.
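For completeness, here is a rough sketch of how the ACR preview feature can be switched on using the generic Azure CLI. The property path reflects the ACR resource schema at the time of writing and may change while the feature is in preview; the registry name is a placeholder:

# Enable the quarantine policy on an existing ACR registry (preview; property names may change)
az resource update \
  --ids $(az acr show --name myregistry --query id --output tsv) \
  --set properties.policies.quarantinePolicy.status=enabled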

With all that said, though, if you have a single Quarantine Processor, the quarantine flag can be used to prepare the image for use. This can be very helpful in the secure supply chain scenarios for containers, where the CI/CD pipelines not only push the images to the registries but also produce additional artifacts related to the images. Let’s look at a new build scenario for container images that you may want to implement.

Quarantining Images in the CI/CD Pipeline

The one place where the quarantine flag can prove useful is in the CI/CD pipeline used to produce a compliant image. Let’s assume that for an enterprise, a compliant image is one that is signed, has an SBOM that is also signed, and passed a vulnerability scan with no CRITICAL or HIGH severity vulnerabilities. Here is the example pipeline that you may want to implement.

In this case, the CI/CD agent plays the Quarantine Processor role and manages the quarantine flag. As you can see, the quarantine flag is automatically set in step 4 when the image is pushed. Steps 5, 6, 7, and 8 are the different actions performed on the image while it is in quarantine. Until those actions are complete, the image should not be pullable by any consumer. Some of those actions, like the vulnerability scan, may take a long time to complete, and you don’t want a developer to accidentally pull the image before the scan is done. If any of those actions fails, the image should stay in quarantine as non-compliant.
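Here is a minimal sketch of what those quarantine steps could look like in a pipeline script, assuming Trivy is used for the vulnerability scan, Syft for the SBOM, and Cosign for signing; the image name is illustrative, and the exact flags depend on the versions of the tools you use:

IMAGE=registry.example.com/team/app:1.0.0

# Step 4: build and push; the registry sets the quarantine flag automatically on push
docker build -t $IMAGE .
docker push $IMAGE

# Steps 5-8: actions performed while the image is in quarantine
trivy image --severity CRITICAL,HIGH --exit-code 1 $IMAGE          # fail on CRITICAL/HIGH findings
syft $IMAGE -o spdx-json > sbom.spdx.json                          # generate the SBOM
cosign attest --key cosign.key --type spdxjson --predicate sbom.spdx.json $IMAGE   # sign and attach the SBOM
cosign sign --key cosign.key $IMAGE                                # sign the image itself

# Step 9: all checks passed - the CI/CD agent removes the quarantine flag using the
# registry-specific API (this is not part of the standard OCI distribution API)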

Protecting developers from pulling non-compliant images is just one of the scenarios that a quarantine flag can help with. Another one is avoiding triggers for workflows that are known to fail if the image is not compliant.

Using Events to Trigger Image Workflows

Almost every container registry has an eventing mechanism that allows you to trigger workflows based on events in the registry. Typically, you would use the image push event to trigger the deployment of your image for testing or production. In the above case, if your enterprise has a policy of only deploying images with signatures, SBOMs, and vulnerability reports, your deployment will fail if it is triggered right after step 4. Instead, the deployment should be triggered after step 9, which ensures that all the required actions on the image are performed before the deployment starts.
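With ACR, for example, you can scope a webhook to specific actions so that the deployment workflow reacts only to the regular push event and not to quarantine events; a hedged sketch with illustrative names:

# Trigger deployments only on the regular image push event
az acr webhook create \
  --registry myregistry \
  --name deployhook \
  --actions push \
  --uri https://example.com/hooks/deploy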

To avoid triggering the deployment prematurely, the image push event should be delayed until after step 9. A separate quarantine push event can be emitted in step 4 and used to trigger actions related to the quarantine of the image. A note of caution here, though! As mentioned previously, synchronizing multiple actors that act on the quarantine flag can be tricky. If the CI/CD pipeline is your Quarantine Processor, you may be tempted to use the quarantine push event to trigger some other workflow or long-running action. An example of such an action is asynchronous malware scanning and detonation, which cannot run as part of the CI/CD pipeline. The things to be aware of are:

  • To be able to pull the image, the malware scanner must also have the Quarantine Processor role assigned. This means that you will have more than one concurrent Quarantine Processor acting on the image.
  • The Quarantine Processor that finishes first would remove the quarantine flag prematurely, so it needs to wait for all other Quarantine Processors to complete. This, of course, adds complexity to managing the concurrency and the various race conditions.

I would strongly suggest that you have only one Quarantine Processor and manage all activities from it. Otherwise, you can end up with images in inconsistent states that do not meet your compliance criteria.

When Should Events be Fired?

We already mentioned in the previous section the various events you may need to implement in the registry:

  • A quarantine push event is used to trigger workflows that are related to images in quarantine.
  • An image push event is the standard event triggered when an image is pushed to the registry.

Here is a flow diagram of how those events should be fired.

This flow offers a logical sequence of events that can be used to trigger the relevant workflows. The quarantine workflow should be triggered by the quarantine push event, while all other workflows should be triggered by the image push event.

If you look at the current implementation of the quarantine feature in ACR, you will notice that both events are fired if the registry quarantine is not enabled (note that the feature is in preview, and functionality may change in the future). I find this behavior confusing. The reason, albeit philosophical, is simple – if the registry doesn’t support quarantine, then it should not send quarantine push events. The behavior should be consistent with any other registry that doesn’t have quarantine capability, and only the image push event should be fired.

What Data Should the Events Contain?

The consumers of the events should be able to make a decision on how to proceed based on the information in the event. The minimum information that needs to be provided in the event should be:

  • Timestamp
  • Event Type: quarantine or push
  • Repository
  • Image Tag
  • Image SHA
  • Actor

This information will allow the event consumers to subscribe to registry events and properly handle them.

Audit Logging for Quarantined Images

Because we are discussing a secure supply chain for containers, we should also think about traceability. For quarantine-enabled registries, a log message should be added at every point the status of the image is changed. Once again, this is something that needs to be implemented by the registry, and it is not standard behavior. At a minimum, you should log the following information:

  • When the image is put into quarantine (initial push)
    • Timestamp
    • Repository
    • Image Tag
    • Image SHA
    • Actor/Publisher
  • When the image is removed from quarantine (quarantine flag is removed)
    Note: if the image is removed from quarantine, the assumption is that it passed all the quarantine checks.

    • Timestamp
    • Repository
    • Image Tag
    • Image SHA
    • Actor/Quarantine Processor
    • Details
      Details can be free-form or semi-structured data that can be used by other tools in the enterprise.

One question that remains is whether a message should be logged if the image does not pass quarantine after all actions are completed by the Quarantine Processor. It would be good to get the complete picture from the registry log and understand why certain images stay in quarantine forever. On the other hand, though, the image doesn’t change its state (it stays in quarantine anyway), and the registry needs to provide an API just to log the message. Because the API to remove the quarantine flag is not a standard OCI registry API, a single API can be provided that both removes the quarantine flag and logs the audit message when quarantine doesn’t pass. ACR’s quarantine feature uses the custom ACR API to do both.

Summary

To summarize, if implemented by a registry, the quarantine flag can be useful in preparing the image before allowing its wider use. The quarantine activities on the image should be done by a single Quarantine Processor to avoid concurrency and inconsistencies in the registry. The quarantine flag should be used only during the initial setup of the image before it is released for wider use. Reverting to a quarantine state after the image is published for wider use can be dangerous due to the lack of granularity for actor permissions. Customized policies should continue to be used for images that are published for wider use.

One important step in securing the supply chain for containers is preventing the use of “bad” images. I intentionally use the word “bad” here. For one enterprise, “bad” may mean “vulnerable”; for another, it may mean containing software with an unapproved license; for a third, it may be an image with a questionable signature; possibilities are many. Also, “bad” images may be OK to run in one environment (for example, the local machine of a developer for bug investigation) but not in another (for example, the production cluster). Lastly, the use of “bad” images needs to be verified in many phases of the supply chain – before release for internal use, before build, before deployment, and so on. The decision of whether a container image is “bad” cannot be made in advance and depends on the consumer of the image. One common way to prevent the use of “bad” images is the so-called quarantine pattern. The quarantine pattern prevents an image from being used unless certain conditions are met.

Let’s look at a few scenarios!

Scenarios That Will Benefit from a Quarantine Pattern

Pulling images from public registries and using them in your builds or deployments is risky. Such public images may contain vulnerabilities or malware. Using them as base images or deploying them to your production clusters bypasses any possible security checks, compromising your containers’ supply chain. For that reason, many enterprises ingest images from a public registry into an internal registry where they can perform additional checks like vulnerability or malware scans. In the future, they may sign the images with an internal certificate, generate a Software Bill of Materials (SBOM), add provenance data, or something else. Once those checks are done (or the additional data about the image is generated), the image is released for internal use. The public images sit in the quarantine registry before they are made available in the internal registry with the “golden” (or “blessed” 🙂) images.
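A minimal sketch of such an ingestion flow, assuming a Docker-compatible toolchain, Trivy as the scanner, and illustrative registry names:

# Ingest the public image into the internal quarantine registry
docker pull docker.io/library/debian:bullseye-20220228
docker tag debian:bullseye-20220228 quarantine.example.com/library/debian:bullseye-20220228
docker push quarantine.example.com/library/debian:bullseye-20220228

# Run the internal checks (vulnerability scan shown; malware scan, signing, SBOM, etc. would follow)
trivy image --severity CRITICAL,HIGH quarantine.example.com/library/debian:bullseye-20220228

# Only after the checks pass, promote the image to the registry with the "golden" images
docker tag quarantine.example.com/library/debian:bullseye-20220228 golden.example.com/library/debian:bullseye-20220228
docker push golden.example.com/library/debian:bullseye-20220228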

Another scenario is where an image is used as a base image to build a new application image. Let’s say that two development teams use debian:bullseye-20220228 as a base image for their applications. The first application uses libc-bin, while the second one doesn’t. libc-bin in that image has several critical and high severity vulnerabilities. The first team may not want to allow the use of debian:bullseye-20220228 as a base image for their engineers, while the second one may be OK with it because the libc-bin vulnerabilities may not impact their application. You need to selectively allow the image to be used in the second team’s CI/CD pipeline but not in the first.
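To ground the example, both teams can look at the same scan results and reach different decisions; a sketch assuming Trivy is the scanner:

# List CRITICAL and HIGH vulnerabilities in the shared base image
trivy image --severity CRITICAL,HIGH debian:bullseye-20220228

# The first team cares about findings in libc-bin; the second team may choose to ignore them
trivy image --severity CRITICAL,HIGH --format json debian:bullseye-20220228 | grep -i libc-bin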

In the deployment scenario, teams may be OK deploying images with the developers’ signatures in the DEV environments, while the PROD ones should only accept images signed with the enterprise keys.

As you can see, deciding whether an image should be allowed for use or not is not a binary decision, and it depends on the intention of its use. In all scenarios above, an image has to be “quarantined” and restricted for certain use but allowed for another.

Options for Implementing the Quarantine Pattern

So, what are the options to implement the quarantine pattern for container images?

Using a Quarantine Flag and RBAC for Controlling the Access to an Image

This is the most basic but least flexible way to implement the quarantine pattern. Here is how it works!

When the image is pushed to the registry, it is immediately quarantined, i.e., the quarantine flag on the image is set to TRUE. A separate role like QuarantineReader is created in the registry and assigned to the actor or system allowed to perform tasks on the image while it is in quarantine. This role allows the actor or system to pull the image to perform the needed tasks. It also allows changing the quarantine flag from TRUE to FALSE when the tasks are completed.

The problem with this approach becomes obvious in the scenarios above. Take, for example, the ingestion of public images. In this scenario, you have more than one actor that needs to act on the quarantine flag: the vulnerability scanner, the malware scanner, the signer, etc., before the images are released for internal use. All those tasks are done outside the registry, and some of them may run on a schedule or take a long time to complete (vulnerability and malware scans, for example). All those systems need to be assigned the QuarantineReader role and allowed to flip the flag when done. The problem, though, is that you need to synchronize those services and change the quarantine flag from TRUE to FALSE only after all the tasks are completed.

Managing concurrency between tasks is a non-trivial job. It complicates the implementation logic for the registry clients because they need to interact with each other or with an external system that synchronizes all tasks and keeps track of their state – unless you want to implement this logic in the registry itself, which I would not recommend.

One additional issue with this approach is its extensibility. What if you need to add one more task to the list of things you want to do on the image before it is allowed for use? You need to crack open the code and implement hooks to the new system.

Lastly, some of the scenarios above are not possible at all. If you need to restrict access to the image to one team and not another, the only way to do it is to assign the QuarantineReader role to the first team. This is not optimal, though, because the role is meant to be assigned only to systems that perform the tasks needed to take the image out of quarantine, not to be used for other purposes. Also, if you want to make decisions based on the content of vulnerability reports or SBOMs, the quarantine flag approach is not applicable at all.

Using Declarative Policy Approach

A more flexible approach is to use a declarative policy. The registry can be used to store all necessary information about the image, including vulnerability and malware reports, SBOMs, provenance information, and so on. Well, soon, registries will be able to do that 🙂 If your registry supports ORAS reference types, you can start saving those artifacts right now. In the future, thanks to the Reference Types OCI Working Group, every OCI-compliant registry should be able to do the same. How does that work?

When the image is initially pushed to the registry, no other artifacts are attached to it. Each system that needs to perform a task on the image can run on its own schedule. Once it completes the task, it pushes a reference-type artifact to the registry with the image in question as its subject. Every time the image is pulled from the registry, the policy evaluates whether the required reference artifacts are available; if not, the image is not allowed for use. You can define different policies for different situations as long as the policy engine understands the artifact types. Not only that, but you can even make decisions based on the content of the artifacts, as long as the policy engine is intelligent enough to interpret them.
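If your registry already supports ORAS reference types, attaching and discovering such artifacts can look roughly like this; the artifact type and names are illustrative, and the exact syntax depends on your ORAS CLI version:

# Attach the SBOM as a reference artifact whose subject is the application image
oras attach --artifact-type application/vnd.example.sbom.v1+json \
  registry.example.com/team/app:1.0.0 sbom.spdx.json

# A policy engine (or a human) can later discover what is attached to the image
oras discover registry.example.com/team/app:1.0.0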

Using the declarative policy approach, the same image will be allowed for use by clients with different requirements. Extending this is as simple as implementing a new policy, which in most cases doesn’t require coding.

Where Should the Policy Engine be Implemented?

Of course, the question that gets raised is where the policy engine should be implemented – as part of the registry or outside of it. I think registries are intended to store information, not to make policy decisions. Think of a registry as yet another storage system – it has access control implemented, but the only business logic it holds is how to manage the data. Besides that, there are already many policy engines available – OPA is the one that immediately comes to mind – that are flexible enough to enable this functionality relatively easily. Policy engines are already available, and different systems are already integrated with them. Adding one more engine as part of the registry will just increase the overhead of managing policies.

Summary

To summarize, using a declarative, policy-based approach to control who should and shouldn’t be able to pull an artifact from the registry is more flexible and extensible. Adding capabilities to the policy engines to understand the artifact types and act on them will allow enterprises to develop custom controls tailored to their own needs. In the future, when policy engines can understand the content of each artifact, those policies will be able to evaluate SBOMs, vulnerability reports, and other content. This will open new opportunities to define fine-grained controls for the use of registry artifacts.

While working on a process for improving the container secure supply chain, I often need to go over the current challenges of patching container vulnerabilities. With the introduction of Automatic VM Patching, having those conversations is even more challenging because there is always the question: “Why can’t we patch containers the same way we patch VMs?” Really, why can’t we? First, let’s look at how VM and container workloads differ.

How do VM and Container Workloads Differ?

VM-based applications are considered legacy applications, and VMs fall under the category of Infrastructure-as-a-Service (IaaS) compute services. One of the main characteristics of IaaS compute services is the persistent local storage that can be used to save data on the VM. Typically, the way you use VMs for your application is as follows:

  • You choose a VM image from the cloud vendor’s catalog. The VM image specifies the OS and its version you want to run on the VM.
  • You create the VM from that image and specify the size of the VM. The size includes the vCPUs, memory, and persistent storage to be used by the VM.
  • You install the additional software you need for your application on the VM.

From this point onward, the VM workload state is saved to the persistent storage attached to the VM. Any changes to the OS (like patches) are also committed to the persistent storage, and next time the VM workload needs to be spun up, those are loaded from there. Here are the things to remember for VM-based workloads:

  • VM image is used only once when the VM workload is created.
  • Changes to the VM workload are saved to the persistent storage; the next time the VM is started, those changes are automatically loaded.
  • If a VM workload is moved to a different hardware, the changes will still be loaded from the persistent storage.

How do containers differ, though?

Whenever a new container workload is started, the container image is used to create the container (similar to the VM). If the container workload is stopped and started on the same VM or hardware, any changes to the container will also be automatically loaded. However, because orchestrators do not know whether the new workload will end up on the same VM or hardware (due to resource constraints), they do not stop containers but destroy them, and if a new one needs to be spun up, they use the container image again to create it.

That is a major distinction between VMs and containers. While the VM image is used only once, when the VM workload is created, the container image is used repeatedly to re-create the container workload when it moves from one place to another or when capacity is increased. Thus, when a VM is patched, the patches are saved to the VM’s persistent storage, while container patches need to be made available in the container image for the workloads to always be patched.

The bottom line is, unlike VMs, when you think of how to patch containers, you should target improvements in updating the container images.

A Timeline of a Container Image Patch

For this example, we will assume that we have an internal machine learning team that builds their application image using python:3.10-bullseye as a base image. We will concentrate on the timelines for fixing the OpenSSL vulnerabilities CVE-2022-0778 and CVE-2022-1292. The internal application team’s dependency chain is OpenSSL <– Debian <– Python. Those are all Open Source Software (OSS) projects driven by their respective communities. Here is the timeline of fixes for those vulnerabilities by the OSS community.

2022-03-08: python:3.10.2-bullseye Released

Python publishes the python:3.10.2-bullseye container image. This is the last Python image before the CVE-2022-0778 OpenSSL vulnerability was fixed.

2022-03-15: OpenSSL CVE-2022-0778 Fixed

OpenSSL publishes fix for CVE-2022-0778 impacting versions 1.0.2 – 1.0.2zc, 1.1.1 – 1.1.1m, and 3.0.0 – 3.0.1.

2022-03-16: debian:bullseye-20220316 Released

Debian publishes  debian:bullseye-20220316 container image that includes a fix for CVE-2022-0778.

2022-03-18: python:3.10.3-bullseye Released

Python publishes  python:3.10.3-bullseye container image that includes a fix for CVE-2022-0778.

2022-05-03: OpenSSL CVE-2022-1292 Fixed

OpenSSL publishes fix for CVE-2022-1292 impacting versions 1.0.2 – 1.0.2zd, 1.1.1 – 1.1.1n, and 3.0.0 – 3.0.2.

2022-05-09: debian:bullseye-20220509 Released

Debian publishes the debian:bullseye-20220509 container image, which DOES NOT include a fix for CVE-2022-1292.

2022-05-27: debian:bullseye-20220527 Released

Debian publishes  debian:bullseye-20220527 container image that includes a fix for CVE-2022-1292.

2022-06-02: python:3.10.4-bullseye Released

Python publishes  python:3.10.4-bullseye container image that includes a fix for CVE-2022-1292.

There are a few important things to notice in this timeline:

  • CVE-2022-0778 was fixed in the whole chain within three days only.
  • In comparison, CVE-2022-1292 took 30 days to fix in the whole chain.
  • Also, in the case of CVE-2022-1292, Debian released a container image after the fix from OpenSSL was available, but that image DID NOT contain the fix.

The bottom line is:

  • Timelines for fixes by the OSS communities are unpredictable.
  • The latest releases of container images do not necessarily contain the latest software patches.

SLAs and the Typical Process for Fixing Container Vulnerabilities

The typical process teams use to fix vulnerabilities in container images is to wait for the fixes to appear in the upstream images. In our example, the machine learning team must wait for the fixes to appear in the python:3.10-bullseye image first, then rebuild their application image, test the new image, and re-deploy to their production workloads if the tests pass. Let’s call this process wait-rebuild-test-redeploy (or WRTR if you like acronyms 🙂).

The majority of enterprises have established SLAs for fixing vulnerabilities. For those that have not yet done so, things will soon change due to the Executive Order on Improving the Nation’s Cybersecurity. Many enterprises model their patching processes on the FedRAMP 30/90/180 rules specified in the FedRAMP Continuous Monitoring Strategy Guide. According to the FedRAMP rules, high severity vulnerabilities must be remediated within 30 days. CISA’s Operational Directive for Reducing the Risk of Known Exploited Vulnerabilities has much more stringent timelines of two weeks for vulnerabilities published in CISA’s Known Exploited Vulnerabilities Catalog.

Let’s see how the timelines for patching the abovementioned OpenSSL vulnerabilities fit into those SLAs for the machine learning team using the typical process for patching containers.

CVE-2022-0778 was published on March 15th, 2022. It is a high severity vulnerability, and according to the FedRAMP guidelines, the machine learning team has till April 14th, 2022, to fix the vulnerability in their application image. Considering that the python:3.10.3-bullseye image was published on March 18th, 2022, the machine learning team has 27 days to rebuild, test, and redeploy the image. This sounds like a reasonable time for those activities. Luckily, CVE-2022-0778 is not in the CISA’s catalog, but the team would still have 11 days for those activities if it was.

The picture with CVE-2022-1292 does not look so good, though. The vulnerability was published on May 3rd, 2022. It is a critical severity vulnerability, and according to the FedRAMP guidelines, the machine learning team has till June 2nd, 2022, to fix it. Unfortunately, the python:3.10.4-bullseye image was also published on June 2nd, 2022. This means that the team needs to do the re-build, testing, and re-deployment on the same day the community published the image. Either the team needs to be very efficient with their processes or work around the clock that day to complete all the activities (while hoping the community publishes a fix for the Python image before the SLA deadline). That is a very unrealistic expectation and also impacts the team’s morale. If, by any chance, the vulnerability had appeared in CISA’s catalog (which luckily it did not), the team would not have been able to fix it within the two-week SLA.

That proves that the wait-rebuild-test-redeploy (WRTR) process is ineffective in meeting the SLAs for fixing vulnerabilities in container images. But, what can you currently do to improve this and take control of the timelines?

Using Multi-Stage Builds to Fix Container Vulnerabilities

Until container technology evolves and a more declarative way of patching container images becomes available, teams can use multi-stage builds to build their application images and fix the base image vulnerabilities themselves. This is easily done in the CI/CD pipeline. This approach also allows teams to control the timelines for vulnerability fixes and meet their SLAs. Here is an example of how you can solve the patching issue from the example above:

# Stage 1: update the base image with the latest patches available from the distribution
FROM python:3.10.2-bullseye as baseimage

RUN apt-get update; \
    apt-get upgrade -y

# Create a non-root user for the application (the extra flags keep the command non-interactive)
RUN adduser --disabled-password --gecos "" appuser

# Stage 2: build the application on top of the patched base image and run it as the non-root user
FROM baseimage

USER appuser

WORKDIR /app

CMD [ "python", "--version" ]

In the above Dockerfile, the first stage of the build updates the base image with the latest patches. The second stage builds the application and runs it with the appropriate user permissions. Using this approach, you avoid the wait part of the WRTR process above, and you can always meet your SLAs with a simple rebuild of the image.
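One practical note: if you rebuild on a schedule to pick up new patches, make sure the apt-get upgrade layer is not served from the build cache; otherwise the rebuild will happily reuse the stale, unpatched layer. For example:

# Rebuild without the layer cache so apt-get upgrade actually pulls the latest patches
docker build --no-cache -t myapp:$(date +%Y%m%d) .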

Of course, this approach also has drawbacks. One of the biggest issues is the level of control teams have over which patches are applied. Another is that some teams do not want to include layers in their images that do not belong to the application (i.e., they don’t want to modify the base image layers). Those are all topics for another post 🙂

Photo by Webstacks on Unsplash

In Part 1 of the series Signatures, Key Management, and Trust in Software Supply Chains, I wrote about the basic concepts of identities, signatures, and attestation. In this one, I will expand on the house-buying scenario that I hinted at in Part 1 and describe a few ways to exploit it in the physical world. Then, I will map this scenario to the digital world and delve into a few possible exploits. Throughout, I will also suggest a few possible mitigations in both the physical and the digital world. The whole process, as you may already know, is called threat modeling.

Exploiting Signatures Without Attestation in the Offline World

For the purpose of this scenario, we will assume that the parties involved are me and the title company. The document that needs to be signed is the deed (we can also call it the artifact). Here is a visual representation of the scenario:

Here is how the trust is established:

  • The title company has an inherent trust in the government.
  • This means that the title company will trust any government-issued identification like a driving license.
  • In my meeting with the title company, I present my driving license.
  • The title company verifies the driving license is legit and establishes trust in me.
  • Last, the title company trusts the signature that I use to sign the deed in front of them.
  • From here on, the title company trusts the deed to proceed with the transaction.

As we can see, establishing trust between the parties involves two important conditions – implicit trust in a central authority and verification of identity. This process, though, is easily exploitable with fake IDs (like a fake driving license), as shown in the picture below.

In this case, an imposter can obtain a fake driving license and impersonate me in the transaction. If the title company can be fooled that the driving license is issued by the government, they can falsely establish trust in the imposter and allow him to sign the deed. From there on, the title company considers the deed trusted and continues with the transaction.

The problem here is with the verification step – the title company does not do a real-time check of whether the driving license is legitimate. The verification is done manually and offline by an employee of the title company, who relies on her or his experience to recognize forged driving licenses. If this “gate” is passed, the signature on the deed becomes official and is not verified again later in the process.

There is one important step in this process that we haven’t mentioned yet. When the title company employee verifies the driving license, she or he also takes a photocopy of it and attaches it to the documentation. This photocopy becomes part of the audit trail for the transaction if it is later discovered that the transaction needs to be reverted.

Exploiting Signatures Without Attestation in the Digital World

The above process is easily transferable to the digital world. In the following GitHub project I have an example of signing a simple text file artifact.txt. The example uses self-signed certificates for verifying the identity and the signature.

There are two folders in the repository. The real folder contains the files used to generate a key and an X.509 certificate tied to my real identity, verified using my real domain name toddysm.com. The fake folder contains the files used to generate a key and an X.509 certificate tied to an imposter identity that can be verified with a look-alike (or fake) domain. The look-alike domain uses homographs to replace certain characters in my domain name. If the imposter owns the imposter domain, obtaining a trusted certificate with that domain name is easily achievable.
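Producing such a look-alike certificate takes a single command. Here is a hedged sketch using openssl to generate a self-signed certificate; an imposter would simply substitute the homograph characters in the CN and keep everything else identical:

# Generate a self-signed certificate with a subject that is visually identical to the real one
openssl req -x509 -newkey rsa:4096 -sha256 -days 365 -nodes \
  -keyout fake.key -out fake.crt \
  -subj "/C=US/ST=WA/L=Seattle/O=Toddy Mladenov/CN=toddysm.com/emailAddress=me@toddysm.com"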

The dilemma you are presented with is, which certificate to trust – the one here or the one here. When you verify both certificates using the following commands:

openssl x509 -nameopt lname,utf8 -in [cert-file].crt -text -noout | grep Subject:
openssl x509 -nameopt lname,utf8 -in [cert-file].crt -text -noout | grep Issuer:

they both return visually indistinguishable information:

Subject: countryName=US, stateOrProvinceName=WA, localityName=Seattle, organizationName=Toddy Mladenov, commonName=toddysm.com, emailAddress=me@toddysm.com
Issuer: countryName=US, stateOrProvinceName=WA, localityName=Seattle, organizationName=Toddy Mladenov, commonName=toddysm.com, emailAddress=me@toddysm.com

It is the same as looking at two identical driving licenses, a legitimate one and a forged one, that have no visible differences.

The barrier for this exploit using PGP keys and SSH keys is even lower. While X.509 certificates need to be issued by a trusted certificate authority (CA), PGP and SSH keys can be issued by anybody. Here is a corresponding example of a valid PGP key and an imposter PGP key. Once again, which one would you trust?
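Creating a PGP key that claims someone else’s name and email requires no authority at all; a minimal sketch with GnuPG:

# Anybody can generate a PGP key pair claiming any name and email address
gpg --quick-generate-key "Toddy Mladenov <me@toddysm.com>" rsa4096 sign 2y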

Compromising CAs is also not something that we can ignore – there are numerous examples where forged certificates issued by legitimate CAs have been used.

Let’s also not forget that the Stuxnet malware was signed with compromised JMicron and Realtek private keys. In the case of a compromised CA, malicious actors don’t even need to use homographs to deceive the public – they can issue the certificate with the real name and domain.

Unlike the physical world, though, the digital one misses the very important step of collecting audit information when the signature is verified. I will come back to that in the next post of the series, where I plan to explore the various controls that can be put in place to increase security.

Based on the above, though, it is obvious that trust, whether in a single entity or a central certificate authority (CA), has greatly diminished in recent years.

Oh, and don’t trust the keys that I published on GitHub! 🙂 Anybody can copy them or generate new ones with my information – unfortunately obtaining that information is quite easy nowadays.

Exploiting Signatures With Attestation in the Offline World

Let’s look at the example I introduced in the previous post where more parties are involved in the process of selling my house. Here is the whole scenario!

Because I am unable to attend the signing of the documents, I need to issue a power of attorney for somebody to represent me. This person will be able to sign the documents on my behalf. First and foremost, I need to trust that person. But my trust in this person doesn’t automatically transfer to the title company that will handle the transaction. For the title company to trust my representative, the power of attorney needs to be attested by a certified notary. Only then will the title company trust the power of attorney document and accept the signature of my representative.

Here is the question: “How does the introduction of the notary increase security?” Note that I used the term “increase security”. While there is no 100% guarantee that this process will not fail…

By adding one more step to the process, we introduce an additional obstacle that reduces the probability for malicious activity to happen, which increases the security.

What the notary ultimately prevents is my “representative” forcing me to sign the power of attorney – a scenario in which my security is compromised and my evil representative can use the power of attorney to sell my house to himself for just a dollar. The purpose of the notary is to attest that I willfully signed the document and was present (and in good health) during the signing. Of course, this can easily be exploited if both the representative and the notary are evil, as shown in the diagram below.

As you can see in this scenario, all parties have valid government-issued IDs that the title company trusts. However, the process is compromised if there is collusion between the malicious actor (evil representative) and the notary.

Other ways to exploit this process are if the notary, my representative, or both are impersonated. Impersonation is described in the section above – Exploiting Signatures Without Attestation in the Offline World.

Exploiting Signatures With Attestation in the Digital World

There has been a lot of talk recently about implementing attestation systems that save signature receipts in an immutable ledger. This is presented as the silver-bullet solution for signing software artifacts (check out the Sigstore project). Similar to the notary example in the previous section, this approach may increase security, but it may also have a negative impact. Because they compare themselves to Let’s Encrypt, let me take a stab at how Let’s Encrypt impacted security on the Web.

Before Let’s Encrypt, only site owners willing to invest money in valid certificates had HTTPS enabled on their websites. More importantly, though, browsers showed a clear indicator when a site was using the plain HTTP protocol and not the secure one. From a user’s point of view, it was easy to make the decision that if the browser address bar was red, you should not enter your username and password or your credit card. Recognizing malicious sites was relatively easy because malicious actors didn’t want to spend the money and time to get a valid certificate.

Let’s Encrypt (and the browser vendors) changed that paradigm. Being free, Let’s Encrypt allows anybody to obtain a valid (and “trusted”??? 🤔) certificate and enable HTTPS for their site. Not only that, but Let’s Encrypt made it so easy that you can get the certificate issued and deployed to your web server using automation within seconds. The only proof you need to provide is ownership of the domain name for your server. At the same time, Google led the campaign to change the browser indicators to show a very mediocre lock icon in the address bar that nobody, except maybe a few, pays any attention to anymore. As a result, every malicious website now has HTTPS enabled, and there is no indication in the browser to tell you that it is malicious. In essence, the lock gives you a false sense of security.

I would argue that Let’s Encrypt (and the browser vendors) in fact decreased security on the web instead of increasing it. Let me be clear! While I think Let’s Encrypt (and the browser vendors) decreased security, what they provide had a tremendous impact on privacy. Privacy should not be discounted! Though, in marketing messages, those two terms are used interchangeably, and this is not to the benefit of users.

In the digital world, the CA can play the role of the notary from the physical world. The CA verifies the identity of the entity that wants to sign artifacts and issues a “trusted” certificate. Similar to a physical-world notary, the CA will issue a certificate for both legitimate and malicious actors, and unlike the physical world, the CA has very basic means to verify identities. In the case of Let’s Encrypt, this is domain ownership. In the case of Sigstore, it will be a GitHub account. Anyone can easily buy a domain or register a GitHub account and get a valid certificate. That doesn’t mean, though, that you should trust it.

Summary

The takeaway from this post should be that every system can be exploited. We learn and create systems that reduce the opportunities for exploitation, but that doesn’t make them bulletproof. Also, when evaluating technologies, we should not only look at the shortcomings of the previous technology but also at the shortcomings of the new shiny one. Just adding attestation to signatures will not be enough to make them more secure.

In the next post, I will look at some techniques that we can employ to make signatures and attestations more secure.

Photo by Erik Mclean on Unsplash

For the past few months, I’ve been working on a project for a secure software supply chain, and one topic that always seems to start passionate discussions is software signatures. The President’s Executive Order on Improving the Nation’s Cybersecurity (EO) is a pivotal point for the industry. One of its requirements is for vendors to document the supply chain for software artifacts. Proving the provenance of a piece of software is a crucial part of the software supply chain, and signatures play a main role in the process. Though, there are conflicting views on how signatures should work. There is the traditional PKI (Public Key Infrastructure) approach that is well established in enterprises, but there are other traditional and emerging technologies that are brought up in discussions. These include PGP key signatures, SSH key signatures, and the emerging ephemeral key (or keyless) signatures (here, here, and lately here).

While PKI is well established, its shortcomings were outlined by Bruce Schneier and Carl Ellison more than 20 years ago in their paper. The new approaches are trying to overcome those shortcomings and democratize signatures the same way Let’s Encrypt democratized HTTPS for websites. Though, the question is whether those new technologies improve security over PKI. And if so, how? In a series of posts, I will lay out my view of the problem and the pros and cons of using one or another signing approach, how trust is established, and how to manage the signing keys. I will start with the basics, using simple examples that relate to everyday life, and map those to the world of digital signatures.

In this post, I will go over the identity, signature, and attestation concepts and explain why those matter when establishing trust.

What is Identity?

Think about your own experience. Your identity is you! You are identified by your gender, skin color, facial and body characteristics, thumbprint, iris print, hair color, DNA, etc. Unless you have an identical twin, you are unique in the world. Even if you are an identical twin, there are differences like thumbprints and iris prints that make you unique. The same is true for other entities like enterprises, organizations, etc. Organizations have names, tax numbers, government registrations, addresses, etc. As a general rule, changing your identity is hard if not impossible. You can have plastic surgery, but you cannot change your DNA. The story may be a bit different for organizations, which can rename themselves, get bought or sold, change headquarters, etc., but it is still pretty easy to uniquely identify organizations.

All the above points that identities are:

  • unique
  • and impossible (or very hard) to change

In the digital world, identities are an abstract concept. In my opinion, it is wrong to think that identities can be changed, in either the physical or the digital world. Although we tend to think that they can be changed, this is not true – what can be changed is the way we prove our identity. We will cover that shortly, but before that, let’s talk about trust.

If you are a good friend of mine, you may be willing to trust me, but if you just met me, your level of trust will be pretty low. Trust is established based on historical evidence. The longer you know me, and the longer I behave honestly, the more you will be willing to trust me. Sometimes I may not be completely honest, or I may borrow some money from you and not return it. But I may buy you a beer every time we go out to offset that cost, and you may be willing to forgive me. It is important to note that trust is very subjective, and while you may be very forgiving, another friend of mine may not be. He may decide that I am not worth his trust and never lend me money again.

How do We Prove Our Identity?

In the physical world, we prove our identity using papers like a driving license, a passport, an ID card, etc. Each one of those documents is issued for a purpose:

  • The driving license is mainly used to prove you can drive a motorized vehicle on US streets. Unless it is an enhanced driving license, you (soon) will not be able to use it to board a domestic flight. However, you cannot cross borders with your driving license, and you cannot use it even to rent a car in Europe (unless you have an international driving license).
  • To cross borders you need a passport. The passport is the only document that is recognized by border authorities in other countries that you visit. You cannot use your US driving license to cross the borders in Europe. The interesting part is that you do not need a driving license to get a passport or vice versa.
  • You also have your work badge. Your work badge identifies you as an employee of a particular organization. Despite the fact that you have a driving license and a passport, you cannot enter your employer’s buildings without your badge. However, to prove to your employer who you are so they can issue you the badge, you must have a driving license or a passport.

In the digital world, there are similar concepts to prove our identity.

  • You can use a username, a password, and another factor (2FA/MFA token) to prove your identity to a particular system.
  • An app secret that you generate in a system can also be used to prove your identity.
  • An OAuth or SSO (single sign-on) token issued by a third party is another way to prove your identity to a particular system. That system, though, needs to trust the third party.
  • An SSH key can be an alternative way to prove your identity. You can use it in conjunction with a username/password combination or separately.
  • You can use a PGP key to prove your identity to an email recipient.
  • Or use a TLS certificate to prove the identity of your website.
  • And finally, you can use an X.509 certificate to prove your identity.

As you can see, similar to the physical world, in the digital world you have multiple ways to prove your identity to a system. You can even use more than one way for a single system. The example that comes to mind is GitHub – you can use an app secret or an SSH key to push your changes to your repository.

How Does Trust Tie to the Concepts Above?

Let’s say that I am a good developer. My code published on GitHub has a low level of bugs, it is well structured, well documented, easy to use, and updated regularly. You decide that you can trust my GitHub account. However, I also have a DockerHub account that I am negligent with – I don’t update the containers regularly, they have a lot of vulnerabilities, and they are sloppily built. Although you are my friend and you trust my GitHub account, you are not willing to trust my DockerHub account. This example shows that trust is not only subjective but also based on context.

OK, What Are Signatures?

Here is where things become interesting! In the physical world, a signature is a person’s name written in that person’s handwriting. Just the signature does not prove my identity. Wikipedia’s entry for signature defines the traditional function of a signature as follows:

…to permanently affix to a document a person’s uniquely personal, undeniable self-identification as physical evidence of that person’s personal witness and certification of the content of all, or a specified part, of the document.

The keyword above is self-identification. This word in the definition has a lot of implications:

  • First, as a signer, I can have multiple signatures that I would like to use for different purposes. I.e. my identity may use different signatures for different purposes.
  • Second, nobody attests to my signature. This means that the trust is put in a single entity – the signer.
  • Third, a malicious person can impersonate me and use my signature for nefarious purposes.

Interestingly though, we are willing to accept a signature as proof of identity depending on the level of trust we have in the signer. For example, if I borrow $50 from you and give you a receipt with my signature stating that I will pay you back in 30 days, you may be willing to accept it even if you don’t know me well (i.e., your level of trust is relatively low). This is understandable because we decide to lower our required level of trust to just self-identification. I can increase your level of trust if I show you my driving license, which has my signature printed on it, so you can compare both signatures. However, showing you my driver’s license is actually an attestation, which is covered in detail below.

In the digital world, to create a signature you need a private key, and to verify a signature you need a public key (check the Digital Signature article on Wikipedia). The private and the public key are related and work in tandem – the private key signs the content, and the public key verifies the signature. You own both, but you keep the private key secret and publish the public key for everybody to use (a short openssl sketch at the end of this section illustrates the mechanics). From the examples above, you can use PGP, SSH, and X.509 to sign content. However, they have differences:

  • PGP is a self-generated key-pair with additional details like name and email address included in the public certificate, that can be used for (pseudo)identification of the entity that signs the content. You can think of it as similar to a physical signature, where, in addition to the signature you verbally provide your name and home address as part of the signing process.
  • SSH is also a self-generated key pair but has no additional information attached. Think of it as the plain physical signature.
  • With X.509 you have a few options:
    • Self-generated key-pair similar to the PGP approach but you can provide more self-identifying information. When signing with such a private key you can assume that it is similar to the physical signature, where you verbally provide your name, address, and date of birth.
    • Domain Validated (DV) certificate that validates your ownership of a particular domain (this is exactly what Let’s Encrypt does). Think of this as similar to a physical signature where you verbally provide your name, address, and date of birth as well as show a utility bill with your name and address as part of the signing process.
    • Extended Validation (EV) certificate that validates your identity using legal documents. For example, this can be your passport as an individual or your state and tax registrations as an organization.
      Both DV and EV X.509 certificates are issued by Certificate Authorities (CAs), which are trusted authorities on the Internet or within the organization.

Note: X.509 is actually an ITU standard defining the format of public-key certificates and is at the basis of the PKI. The key pair can be generated using different algorithms. Though, the term X.509 is used (maybe incorrectly) as a synonym for the key-pair also.

Without any other variables in the mix, the level of trust that you can put in the above digital approaches would most probably rank as follows: (1-Lowest) SSH, (2) PGP and self-signed X.509, (3) DV X.509, and (4-Highest) EV X.509. Keep in mind that DV and EV X.509 are actually based on attestation, which is described next.
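To make the sign-with-the-private-key, verify-with-the-public-key mechanics described above concrete, here is a minimal openssl sketch; the file names are illustrative:

# Generate a key pair, sign the artifact with the private key, verify with the public key
openssl genpkey -algorithm RSA -pkeyopt rsa_keygen_bits:3072 -out private.pem
openssl pkey -in private.pem -pubout -out public.pem
openssl dgst -sha256 -sign private.pem -out artifact.sig artifact.txt
openssl dgst -sha256 -verify public.pem -signature artifact.sig artifact.txt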

So, What is Attestation?

We finally came to it! Attestation, according to the Merriam-Webster dictionary, is an official verification of something as true or authentic. In the physical world, one can increase the level of trust in a signature by having a notary attest to the signature (lower level of trust) or by adding a government apostille (higher level of trust, used internationally). In many states, notaries are required (or highly encouraged) to keep a log for tracking purposes. While you may be OK with having only my signature on a paper for a $50 loan, you certainly would want to have a notary attesting to a contract for selling your house to me for $500K. The level of trust in a signature increases when you add additional parties who attest to the signing process.

In the digital world, attestation is also present. As we’ve mentioned above, CAs act as the digital notaries who verify the identity of the signer and issue digital certificates. This is done only for DV and EV X.509 certificates, though. There is no attestation for PGP, SSH, and self-signed X.509 certificates. For digital signatures, there is one more traditional method of attestation – the Timestamp Authority (TSA). The TSA’s role is to provide an accurate timestamp of the signing to avoid tampering with the time by changing the clock on the computer where the signing occurs. Note that the TSA attests only to the accuracy of the timestamp of signing and not to the identity of the signer. One important thing to remember here is that without attestation you cannot fully trust the signature.

Here is a summary of the signing approaches and the level of trust we discussed so far.

Signing Keys and Trust

Signing Approach      Level of Trust
SSH Key               1 - Lowest
PGP Key               2 - Low
X.509 Self-Signed     2 - Low
X.509 DV              3 - Medium
X.509 EV              4 - High

Now that we’ve established the basics, let’s talk about the validity period and why it matters.

Validity Period and Why it Matters?

Every identification document that you own in the physical world has an expiration date. OK, I lied! I have a German driving license that doesn’t have an expiration date. But this is an exception, and I can claim that I am one of the last who had that privilege – newer driving licenses in Germany have an expiration date. US driving licenses have an expiration date and an issue date. In the US, you need to renew your passport every ten years (every five years for minors). Different factors determine why an identification document may expire. For a driving license, the reason may be that you lost some of your vision and are not capable of driving anymore. For a passport, it may be because you moved to another country, became a citizen there, and forfeited your US citizenship.

Now, let’s look at physical signatures. Let’s say that I want to issue a power of attorney to you to represent me in the sale of my house while I am on a business trip for four weeks in Europe. I have two options:

  • Write you a power of attorney without an expiration date and have a notary attest to it (else nobody will believe you that you can represent me).
  • Write you a power of attorney that expires four weeks from today and have a notary attest to it.

Which one do you think is more “secure” for me? Of course, the second one! The second power of attorney gives you only a limited period to sell my house. While this does not prevent you from selling it in a completely different transaction than the one I want, you are still given some time constraints. The counterparts in the transaction will check the power of attorney and note the expiration date. If the final meeting that requires you to sign the final papers for the transaction happens four weeks and a day from now, they should not allow you to sign because the power of attorney is no longer valid.

Now, here is an interesting situation that often gets overlooked. Let’s say that I sign the power of attorney on Jan 1st, 2022. The power of attorney is valid till the end of day Jan 28th, 2022. I use my driving license to identify myself to the notary. My driving license has an expiration date of Jan 21st, 2022. Also, the notary’s license expires on Jan 24th, 2022. What is the last date that the power of attorney is valid? I will leave this exploration for one of the subsequent posts.

Time constraints are a basic measure to increase my security and prevent you from selling my house and pocketing the money later in the year. I will expand on this example in my next post, where I will look at different ways to exploit signatures. But the basic lesson here is: the more time you have to exploit something, the higher the probability that you will do so. Another lesson is: put an expiration date on all of your powers of attorney!

How does this look in the digital world?

  • SSH keys do not have expiration dates. Unless you provide an expiration date in the signature itself, the signature will be valid forever.
  • PGP keys have expiration dates a few years in the future. I just created a new key and it is set to expire on Jan 8th, 2026. If I sign an artifact with it and don’t provide an expiration date for the signature, it will be considered valid until Jan 8th, 2026.
  • X.509 certificates also have long validity periods – 3, 12, or 24 months. Let’s Encrypt certificates are valid for 3 months. Root CA certificates have even longer validity periods, which can be dangerous, as we will explore in the future. Let’s Encrypt was the first to shorten the validity of their certificates to limit the impact of a certificate compromise, because domains change hands quite often. Enterprises followed suit because the number of stolen enterprise certificates is growing. (A quick way to check those dates from the command line is shown after this list.)
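
If you want to check those dates yourself, the standard CLI tools expose them. Here is a minimal sketch – the name, email, and cert.pem file are just placeholders, and I assume you have gpg and openssl installed:

$ gpg --list-keys                                   # the "expires:" field shows when each PGP key expires
$ gpg --quick-generate-key "Jane Doe <jane@example.com>" default default 3y    # create a PGP key that expires in three years
$ openssl x509 -noout -dates -in cert.pem           # print the notBefore/notAfter dates of an X.509 certificate

Plain SSH keys, on the other hand, carry no expiration date at all, so there is nothing to inspect there.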

Note: In the next post, I will expand a little bit more on the relationships between keys and signatures, but for now, you can think of them in terms of the example above, where I mention the various validity periods of the documents used for the power of attorney.

Summary

If nothing else, here are the main takeaways that you should remember from this post:

  • Signatures alone cannot establish identity. Signatures can be forged even in the digital world.
  • One identity can have many signatures. Those signatures can be used for different purposes.
  • For a period of time, a signature can establish identity if it is attested to. However, the more time passes, the lower the trust in this signature should be. Also, that period of time is subjective and depends on the risk level of the signature consumer.
  • To increase security, signatures must expire. The shorter the expiration period, the higher the security (but also other constraints should be put in place).
  • Before trusting a signature, you should verify if the signed asset is still trustable. This is in line with the zero-trust principle for security: “Never trust, always verify!”.

Note that in the last bullet point, I intentionally use the term “asset is trustable” and not “signature is valid”. In the next post, I will go into more detail about what that means, how signatures can be exploited, and how context can provide value.

Featured image by StockSnap.

If you have missed the news lately, cybersecurity is one of the most discussed topics nowadays. From supply chain exploits to data leaks to business email compromise (BEC), there is no break – especially during the pandemic. Many (if not all) of those start with an account compromise. And if you ask any cybersecurity expert, they will tell you that the best way to protect your account is to use two-factor (or multi-factor) authentication. Well, let me tell you a secret – MFA sucks! Ask the Okta guys! Even they think MFA sucks. And they are an identity security company. Though, Randall and I have different motives for making that claim.

By the way: 2FA stands for “two-factor authentication” while MFA stands for “multi-factor authentication”. I will use those two acronyms to save on some typing. And one more, TLA means a “three-letter acronym”.

Randall goes on and on in his post about why MFA sucks. Most of his points are valid! It is an annoying and frustrating experience. I don’t know about slow, but I would argue against it being pointless – it serves a purpose, a very good purpose. Where he is mainly wrong is in thinking that the solution is yet another technology (well, the whole point of his post is to market Okta’s new technology, so he will get a pass for that 😉 ). This new technology will not address the source of the issue – that people are scared to use MFA. Take a look at the Twitter Account Security survey – why do you think only 2.3% (at the time of this writing) of all Twitter users have MFA enabled? Here is what I think the reasons are:

  • Complexity and lack of understanding of the technology
  • Fear of losing access to the accounts

I believe people are smart enough to grasp the benefits without too many explanations. What they are not clear about is how to set it up and how to make sure they don’t lose access to their accounts. In general, my frustration is with how the technology vendors have implemented MFA – without any thought for the user experience. Let me illustrate what I mean with my own experience.

The Problem With Too Many MFAs

I have set up MFA on all my important accounts. The list is long: bank accounts, credit card accounts, stock brokerage accounts, government accounts (like taxes, DMV, etc.), email accounts (like Office 365 and GMail), GitHub, Twitter, Facebook, you name it. I have been doing this for years already, and the list continues to grow. I am also required by former and current employers to use MFA for my work accounts – also a long list. Here is a list of MFA methods I use (and I don’t claim this to be a comprehensive list):

  • SMS
  • email (several different emails)
  • authenticator apps (here is where things get crazy)
    • Microsoft Authenticator
    • Google Authenticator (I started with this one)
    • Entrust
    • VIP Access
    • Lastpass Authenticator
  • other vendor-specific implementations (I really don’t know what to call those, but they have their own way of doing it)
    • Apple
    • WordPress
    • Facebook
  • Yubikey (I have three of those and I will ignore those in this post because the hardware key experience has been the same since the … 90s or so)

You’ll think I am crazy, right? Why do I use so many authenticators and MFA methods? I don’t want to, but the problem is that I have to!

First of all, this all evolved over the last 7-8 years since I started using MFA. It started with Google Authenticator; then Microsoft Authenticator and the Lastpass one; then a few banks added email and SMS (not very secure, but interestingly some very big financial institutions still don’t offer other options!!!) while others struck deals with specific vendors to use their own authenticators – hence, I had to install one-off apps for those accounts. A couple of years ago I bought Yubikeys to secure my password managers and some important work and bank accounts.

At some point in time, I decided to unify on a single method! Or a single authenticator app and the Yubikey. That turned out to be impossible! Despite the fact that there is a standard (TOTP), the authenticator vendors do not give you an easy way to transfer the seeds from one app to another. The only way to do that is to go over tens of accounts and register the new authenticator with each one. Who wants to do that! Also, a few of my financial institutions do not offer a way to use a different MFA method than the one they approved. So, I am stuck with the one-offs. Unless I want to change to different financial institutions. But… who wants to do that!
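
To illustrate why moving the seeds matters: the “seed” is just a shared secret defined by the TOTP standard (RFC 6238), and any compliant tool produces the same six-digit codes from the same seed. A quick demonstration with oathtool – the base32 seed below is a made-up example, not a real one:

$ oathtool --totp -b "JBSWY3DPEHPK3PXP"    # prints the current six-digit code for this (example) seed

This is exactly why the vendors could offer an export/import path between apps – the math is the same everywhere; they just choose not to make it easy.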

There is one more problem with the use of so many tools. Very often the systems that are set up with MFA ask you to “enter the code from your MFA app”. The question is: which MFA app have I registered for that system? There is an argument about whether revealing which app was used to generate the code compromises security. Some companies provide that information, while others don’t. If I were allowed to use a single authenticator app for all my accounts, I wouldn’t mind not knowing which tool I have registered. But in the current situation, giving me a hint is a requirement for usability. Anyway, if my authenticator app is compromised, not telling me which app to use will make absolutely no difference.

Another problem with the authenticator apps is that (and this is prevalent in technology) the app thinks it knows best how to name things. If, for example, I have two GitHub accounts and want to use MFA, the authenticators will show them both as GitHub. What if I have ten? Figuring out which code to use for which account is really hard.

The Problem With Switching Phones

Wait! Should I say: “The Problem With Losing Phones?” No, the problem with losing phones is next!

This is where the complexity of the MFA approach starts to show! I don’t think any of the above-mentioned vendors really thought the whole user experience through. I will add Duo to that list because I used it and it also sucks. Also, I will make a note here that colleagues recommended Twilio’s Authy to me, but I am already so deep in my current (diverse) ecosystem that I have no desire to try one more of those.

My wife (who is not in technology) has an iPhone 7, and she strongly refuses to change it because of all the trouble she needs to go through to set up her apps on a new phone. I spent a lot of time convincing her to set up at least SMS-based MFA for her bank and credit card accounts. And I think that is the extent she will go to. After I switched from the iPhone 7 to the latest (and greatest) iPhone 13, I completely understand her fear. (As a side note: changing your phone every year is really, really, really bad for the environment. Be environmentally friendly and use your gadgets for a longer time! I have set a goal to use my phones and other gadgets for at least 5 years.) It has been a few months already, and I continue to use some of the authenticators on my old phone because it is such a pain to migrate them to the new one.

Let me quickly go over the experience one by one.

Moving MFA Apps From Phone to Phone

Moving Apple MFA to my new phone was seamless. At the end of the day, they are the kings of user experience, and this was expected. Moving WordPress and Facebook was also relatively simple – as long as you manage to sign in to your WordPress and Facebook accounts on your new phone, you will start getting the prompts there.

Moving the Lastpass Authenticator should have been easy, but they really screwed up the flow between the actual Lastpass app and the authenticator app. I was clicking like crazy back and forth, going in circles for quite some time, until something magical happened and it started working on my new phone. For the accounts where I used Entrust, I had to go and register the new phone. Inconvenient, but at least it was self-service. The problems started appearing when I got to VIP Access – I have to call my financial institution because they are the only ones who can register it. That will mean at least one hour on the phone.

Now, let’s get to the big ones!

Google Authenticator apparently has export functionality that allows you to export the seeds and import them in your new phone. If you know about that, it works like a charm too but… I just recently learned about it from Dave Bittner and Joe Carrigan from The Cyberwire.

Microsoft Authenticator should have been the easiest one (they claim). As long as you are signed in with your Microsoft Account, you should be able to get all the codes on your new phone. Well, kind of! This works for other MFA accounts except for Microsoft work and school accounts. With all due respect to my colleagues from Azure Active Directory – the work and school account move sucks! You just need to go and register the new phone with those. Really disappointing!

“Insecure” MFA Methods When Switching Phones

Let me write about the non-secure ways to use MFA!

As I mentioned above, I also use email and SMS for certain accounts. The email experience also sucks! OK, I will be honest – this is certainly my own problem, and few people may have it, but this is my rant so I will go with it. I have many email accounts collected over time (more about the reasons for that in some other post). One or another email account is used for MFA depending on what email address I used for registration on a particular website. Those emails are synchronized to a single email account but… the synchronization is on a schedule – about 30 mins or so. Now, every MFA code normally expires within 10 mins. Either I miss the time window to enter the code, or I need to log in to that particular email account to get the code, or I need to force the sync manually (yeah, I can do that, but it is annoying. Right, Randall?!). Switching my phone has nothing to do with my emails, so – there is no impact.

And the last one – SMS! Well, SMS doesn’t suck! You heard me! SMS doesn’t suck! … most of the time. Sometimes you will not get the text message on time due to networking issues but it works perfectly 99.999% of the time. Oh, and if I switch my phone, it continues to work without any additional configurations, QR codes, or calling my telco – like magic 😉

The Problem With Losing Your Phone

Now, here is where things get serious! Or serious if you use mobile apps for MFA. If you lose your phone, you are screwed. You will be locked out of all your accounts or most of them.

Apple is fine if you are in the Apple ecosystem. I have a few MacBooks, an iPad, and an Apple Watch – at least one of them will get me the MFA code.

With Facebook, I am screwed unless I am signed in to one of my browsers. For a person like me who uses Facebook once in a blue moon, the probability is low, so if I lose my phone, I am screwed. (Maybe that will push me over the edge to finally stop using them 🙂 ). I assume the WordPress story will be the same as with Facebook. Oh, and have you ever tried to get Facebook support on the phone to help you unlock your account? 🙂

About the other ones…

Backup Codes (or The Problem With Backup Codes)

Well, most of the systems allow you to print a bunch of backup codes that you can store “safely” so if you get locked out you can “easily” sign back in. I emphasize the words “safely” and “easily”. Here is why!

Storing Backup Codes Safely

Define “safely”! The experts recommend that you print your backup codes on paper and store them “safely” offline. I assume they mean putting them in my safe at home, right? Because I will need easy access to them when I get locked out. It is questionable how safe that is, because robberies are not uncommon. I had a colleague who got robbed, and he found his safe in the bushes behind his house – of course, empty! Also, most home safes are not fire- and waterproof. So, you need to buy a fireproof safe and cement it into your basement so no grown person (I mean teenager or older) can take it out with a crowbar.

Or they mean the safe deposit box at the bank, where there is security and the probability of a robbery is minuscule. But then, I need to run to the bank every time I get locked out.

In both cases, this is not easy – from both a logistics point of view and a usability point of view. At the end of the day, what if I am on a trip and lose my phone (which is quite a realistic scenario)?

To avoid all this hassle, most of us find some workarounds. Here are a few for you (and be honest and admit that you do those too):

  • Saving the backup codes as files and putting them in a folder on your laptop.
  • Copying the backup codes and storing them in your password manager (together with your password – how secure is that? 🙂 )
  • Saving the backup codes as files and keeping them on a thumb drive in your drawer.
  • Saving them as files on your Dropbox, OneDrive, Google Drive, or another cloud drive.
  • You see where I am going with those…

To be honest, I didn’t even bother printing or saving the backup codes for some of my accounts (and not all of the systems offer that option) – and I assume many of us do the same.

Even if I print them and store them in my safe, I need to note which account each belongs to, and if they get stolen, all my accounts will be compromised.

Storing the Seeds in the Cloud

Some of the authenticators, like Microsoft Authenticator, keep your seeds in the cloud. Authy was recommended to me for the same reason. The idea is that if you lose your phone, you can sign in to your authenticator on your new phone and it will sync your seeds to the new phone. Magical, right? Yes, if you are able to sign in to the authenticator on your new phone… without your MFA code. So, you are caught in a vicious circle: if you lose your phone, you need an MFA code to sign in to your authenticator, but you have no way to get that MFA code.

The Solution (Backed by the Numbers)

What are you left with if you lose your phone? The only two MFA methods that work for a lost phone are email and SMS (because even if I lose my phone, I can easily keep my number). They are the most insecure ones but carry the lowest risk of locking you out of your accounts.

I am not promoting the use of SMS and email as the second factor for authentication. But the numbers show that the majority of users who use MFA use SMS instead of an app or a hardware key (see the Twitter report). Let’s run the simple math:

  • Twitter has about 396.5M users.
  • 2.3% (at the time of this writing) use MFA for their Twitter account. This is ~9.12M MFA users.
  • 0.5% of those 9.12M use a hardware key. This is just ~45.6K hardware key users (see the quick check after this list).
  • 30.9% of those 9.12M use an auth app. This is ~2.82M auth app users.
  • 79.6% of those 9.12M use SMS. This is ~7.26M SMS users.
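
A quick sanity check of that math from the terminal (note that the percentages add up to more than 100%, presumably because an account can have more than one 2FA method enabled):

$ echo "396.5 * 0.023" | bc -l    # ≈ 9.12M Twitter accounts with 2FA enabled
$ echo "9.12 * 0.005" | bc -l     # ≈ 0.046M, i.e. roughly 45-46K hardware key users
$ echo "9.12 * 0.309" | bc -l     # ≈ 2.82M auth app users
$ echo "9.12 * 0.796" | bc -l     # ≈ 7.26M SMS users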

It would be nice if Twitter had a way to break those numbers down by occupation (although that would be a violation of privacy). I am pretty sure it would show that the majority of people who use an auth app or a hardware key work in technology. The normal users who deem their accounts important protect them with SMS because SMS offers the easiest user experience.

One more thing about SMS. Because everybody is scared of locking themselves out of their accounts, people set up their authenticator app as the primary MFA tool but then keep SMS as the backup. This way, if they lose their phone, they can still gain access to their account using SMS. But as we know, security is only as strong as the weakest link – in this case, the SMS. The authenticator app just gives a false sense of security.

Summary

Using more than one factor for authentication is a MUST. Using stronger authenticators would be nice, but with the current experience, it will be hard to achieve. To convince more people to do that, companies need to offer a much friendlier experience to their users:

  • Freedom to choose authenticator app
  • Self-service
  • Easy recovery

Without those, usage will stay at the current Twitter numbers.

MFA is yet another technology developed without the user in mind. But unfortunately, a technology that is at the core of cybersecurity. It is a shame that the security vendors continue to produce all kinds of new technologies (with fancy names like SOAR, SIEM, EDR, XDR, ML, AI) without fixing the basic user experience.


How often does the following happen to you? You write your client code, you call an API, and receive a 404 Not found response. You start investigating the issue in your code, change a line here or there, and spend hours troubleshooting, just to find out that the issue is on the server side and you can’t do anything about it. Well, welcome to the microservices world! A common mistake I often see developers make is returning an improper response code or passing through the response code from another service.

Let’s see how we can avoid this. But first, a crash course on modern applications implemented with microservices and HTTP status response codes.

How Modern Microservices Applications Work?

I will try to avoid going deep into the philosophical reasons why we need microservices and the benefits (or disadvantages) of using them. This is not the point of this post.

We will start with a simple picture.

Microservices Application

As you can see in the picture, we have a User that interacts with the Client Application that calls Microservice #1 to retrieve some information from the server (aka the cloud 🙂). The Client Application may need to call multiple (micro)services to retrieve all the information the User needs. Still, the part we will concentrate on is that Microservice #1 itself can call other services (Microservice #2 in this simple example) on the backend to perform its business logic. In a complex application (especially if not well architected), the chain of service calls may go well beyond two. But let’s stick with two for now. Also, let’s assume that Microservice #1 and Microservice #2 use REST, and their responses use the HTTP response status codes.

A basic call flow can be something like this. I also include the appropriate HTTP status response codes in each step.

  1. The User clicks on a button in the Client Application.
  2. The Client Application makes an HTTP request to Microservice #1.
  3. Microservice #1 needs additional business logic to complete the request and make an HTTP call to Microservice #2.
  4. Microservice #2 performs the additional business logic and responds to Microservice #1 using a 200 OK response code.
  5. Microservice #1 completes the business logic and responds to the Client Application with a 200 OK response code.
  6. The Client Application performs the action that is attached to the button, and the User is happy.

This is the so-called happy path. Everybody expects the flow to be executed as described above. If everything goes as planned, we don’t need to think anymore and can move on to implementing the functionality behind the next button. Unfortunately, things often don’t go as planned.

What Can Go Wrong?

Many things! Or at a minimum, the following:

  1. The Client Application fails because of a bug before it even calls Microservice #1.
  2. The Client Application sends invalid input when calling Microservice #1.
  3. Microservice #1 fails before calling Microservice #2.
  4. Microservice #1 sends invalid input when calling Microservice #2.
  5. Microservice #2 fails while performing its business logic.
  6. Microservice #1 fails after calling Microservice #2.
  7. The Client Application fails after Microservice #1 responds.

For those cases (non-happy path? or maybe sad-path? 😉 ) the designers of the HTTP protocol wisely specified two separate sets of response codes:

  • 4xx – client error responses
  • 5xx – server error responses

The guidance for those is quite simple:

  • Client errors (4xx) should be returned if the client did something wrong. In such cases, the client can change the parameters of the request and fix the issue. The important thing to remember is that the client can fix the issue without any changes on the server side.
    A typical example is the famous 404 Not found error. If you (the user) mistype the URL path in the browser address bar, the browser (the client application) will request the wrong resource from the server (Microservice #1 in this case). The server (Microservice #1) will respond with a 404 Not found error to the browser (the client application), and the browser will show you an “Oops, we couldn’t find the page” message. Well, in the past the browser just showed you the 404 Not found error, but we learned a long time ago that this is not user-friendly (you see where I am going with this, right?). You can also see the raw status code yourself with a simple curl call, shown right after this list.
  • Server errors (5xx) should be returned if the issue occurred on the server side and the client (and the user) cannot do anything to fix it.
    A simple example is a wrong connection string in the service configuration (Microservice #1 in our case). If the connection string used to configure Microservice #1 with the endpoint and credentials for Microservice #2 is wrong, the client application and the user cannot do anything to fix it. The most appropriate error to return in this case would be 500 Internal server error.
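
If you want to see the raw status code a service returns (instead of the friendly page a browser renders), curl makes that easy. The URL below is just a placeholder:

$ curl -s -o /dev/null -w "%{http_code}\n" https://api.example.com/does-not-exist
# prints only the numeric status code, e.g. 404 for a missing resource or 500 for a server-side failure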

Pretty simple and logical, right? Though, one thing we as engineers often forget is who the client is and who the server is.

So, Who Is the Client and Who Is the Server?

First, the client and server are two system components that interact directly with each other (think, no intermediaries). If we take the picture from above and change the labels of the arrows, it becomes pretty obvious.

Microservices Application Clients and Servers

We have three clients and three servers:

  • The user is a client of the client application, and the client application is a server for the user.
  • The client application is a client of Microservice #1, and Microservice #1 is a server for the client application.
  • Microservice #1 is a client of Microservice #2, and Microservice #2 is a server for Microservice #1.

Having this picture in mind, the engineers implementing each one of the microservices should think about the most appropriate response code for their immediate client using the guidelines above. It is better if we use examples to explain what response codes each service should return in different situations.

What HTTP Response Codes Should Microservices Return?

A few days ago, I was discussing the following situation with one of our engineers. Our service, Azure Container Registry (ACR), has a security feature allowing customers to encrypt their container images using customer-managed keys (CMK). For this feature to work, customers need to upload a key to Azure Key Vault (AKV). When the Docker client tries to pull an image, ACR retrieves the key from AKV, decrypts the image, and sends it back to the Docker client. (BTW, I know that ACR and AKV are not microservices 🙂 ) Here is a visual:

Docker pull encrypted image from ACR

In the happy-path scenario, everything works as expected. However, a customer submitted a support request complaining that he was not able to pull his images from ACR. When he tried to pull an image using the Docker client, he received a 404 Not found error, but when he checked in the Azure Portal, he was able to see the image in the list.

Because the customer couldn’t figure it out by himself, he submitted a support request. The support engineer was also not able to figure out the issue and had to escalate to the product group. It turned out that the customer had deleted the Key Vault, and ACR was not able to retrieve the key to decrypt the image. The implemented flow looked like this:

  1. Docker client requests an image from ACR.
  2. ACR sees that the image is encrypted and requests the key from the Key Vault.
  3. The Azure Key Vault service looks up the key and figures out that the key (or the whole Key Vault) is missing.
  4. Azure Key Vault returns 404 Not found to ACR for the key ACR tries to access.
  5. ACR passes through the 404 Not found error to the Docker client.
  6. Docker client shows a message to the user that the image cannot be found.

The end result: everybody is confused! Why?

Where the Confusion Comes From?

The investigation chain goes from left to right: Docker client –> ACR –> AKV. Both the customer and the support engineer concentrated on figuring out why the image was missing in ACR. They were looking only at the Docker client –> ACR part of the chain. The customer’s assumption was that the Docker client was doing something wrong, i.e. requesting the wrong image. This would be the correct assumption because 404 Not found is a client error telling the client that it is requesting something that doesn’t exist. Hence, the customer checked the portal, and when he saw the image in the list, he was puzzled. The next assumption was that something was wrong on the ACR side. This is where the customer decided to submit a support request for somebody to check if the data in ACR was corrupted. The support engineer checked the ACR backend, and all the data was in sync.

This is a great example of how the wrong HTTP response code can send the whole investigation into a rabbit hole. To avoid that, here is the guidance! Microservices should return response codes that are relevant to the business logic they implement and that help the client take appropriate actions. “Well”, you will say, “isn’t that the whole point of HTTP status response codes?” It is! But for whatever reason, we continue to break this rule. The key words in the above guidance are “the business logic they implement”, not the business logic of the services they call. (By the way, this is the same with exceptions. You don’t catch the generic Exception, you catch a SpecificException. You don’t pass exceptions through, you catch them and wrap them in a way that is useful for the calling code.)

Business Logic and Friendly HTTP Response Codes

Think about the business logic of each one of the services above!

One way to decide which HTTP response code to return is to think about the resource your microservice is handling. ACR is the service responsible for handling container images. The business logic that ACR implements should provide status codes relevant to the “business” of images. Azure Key Vault implements business logic that handles keys, secrets, and certificates (not images), so Key Vault should return status codes that are relevant to keys, secrets, and certificates. Azure Key Vault sits further down the call chain and cannot know what the key is used for; hence, it cannot tell the Docker client what the error means for the image. It is the responsibility of ACR to provide the appropriate status code to its own client.

Here is how the flow in the above scenario should be implemented:

  1. Docker client requests an image from ACR.
  2. ACR sees that the image is encrypted and requests the key from the Key Vault.
  3. The Azure Key Vault service looks up the key and figures out that the key (or the whole Key Vault) is missing.
  4. Azure Key Vault returns 404 Not found to ACR for the key ACR tries to access.
  5. ACR handles the 404 Not found from Azure Key Vault but wraps it in an error that is relevant to the requested image.
  6. Instead of 404 Not found, ACR returns 500 Internal server error with a message clarifying the issue.
  7. Docker client shows a message to the user that it cannot pull the image because of an issue on the server (a sketch of what that could look like on the wire follows this list).
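
To make step 7 concrete, here is a sketch of what the wrapped response could look like on the wire. The registry URL is a placeholder, and the error code and message body are purely illustrative – this is not ACR’s actual error format:

$ curl -s -i https://myregistry.example.com/v2/encrypted-image/manifests/latest
HTTP/1.1 500 Internal Server Error
Content-Type: application/json

{"errors":[{"code":"INTERNAL_ERROR","message":"the image is encrypted and the registry cannot access the key required to decrypt it"}]}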

The Q&A Approach

Another way to decide what response code to return is to take the Questions-and-Answers approach and build simple IF-THEN logic (aka a decision tree). Here is how this can work for our example:

  • Docker: Pull image from ACR
    • ACR: Q: Is the image available?
      • A: Yes
        (Note to myself: Requesting the image cannot be a client error anymore.)

        • Q: Is the image encrypted?
          • A: Yes
            • ACR: Request the key from Key Vault
              • AKV: Q: Is the key available?
                • A: Yes
                  • AKV: Return the key to ACR
                • A: No
                  • AKV: Return 404 [key] Not found error
            • ACR: Q: Did I get a key?
              • A: Yes
                • ACR: Decrypt the image
                • ACR: Return 200 OK with the image payload
              • A: No (I got 404 [key] Not found)
                • ACR: I cannot decrypt the image
                  (Note to myself: There is nothing the client did wrong! It is all the server’s fault)
                • ACR: Return 500 Internal server error “I cannot decrypt the image”
          • A: No (image is not encrypted)
            • ACR: Return 200 OK with the image payload
      • A: No (image does not exist)
        • ACR: Return 404 [image] Not found error

Note that the above flow is simplified. For example, in a real implementation, you may need to check if the client is authenticated and authorized to pull the image. Nevertheless, the concept is the same – you will just need to have more Q&As.

Summary

As you can see, it is important to be careful about which HTTP response codes you return from your microservices. If you return the wrong code, you may end up with more work than you expect. Here are the main points that are worth remembering:

  • Return 4xx (client) errors only if the client can do something to fix the issue. If the client cannot do anything to fix it, 5xx (server) errors are the only appropriate ones.
  • Do not pass through the response codes you receive from the services you call. Handle each of those responses and wrap it according to the business logic you are implementing.
  • When implementing your services, think about the resource you are handling in those services. Return HTTP status response codes that are relevant to the resource you are handling.
  • Use the Q&A approach to decide what is the appropriate response code to return for your service and the resource that is requested by the client.

By using those guidelines, your microservices will become more friendly and easier to troubleshoot.

Featured image by Nick Page on Unsplash

In the last few months, I have started seeing more and more customers using Azure Container Registry (or ACR) to store their Helm charts. However, many of them are confused about how to properly push and use the charts stored in ACR. So, in this post, I will document a few things that need the most clarification. Let’s start with some definitions!

Helm 2 and Helm 3 – what are those?

Before we even start!

Helm 2 is NOT supported and you should not use it! Period! If you need more details just read Helm’s blog post Helm 2 and the Charts Project Are Now Unsupported from Fri, Nov 13, 2020.

A nice date to choose for that announcement 🙂

OK, really – what are Helm 2 and Helm 3? When somebody says Helm 2 or Helm 3, they most often mean the version of the Helm CLI (i.e., Command Line Interface). The easiest way to check what version of the Helm CLI you have is to type:

$ helm version

in your Terminal. If the version is v2.x.x, then you have Helm (CLI) 2; if the version is v3.x.x, then you have Helm (CLI) 3. But it is not that simple! You should also consider the API version. The API version is the version that you specify at the top of your chart (i.e. in Chart.yaml) – you can read more about it in the Helm Charts documentation. Here is a table that can come in handy:

apiVersion    Helm 2 CLI Support    Helm 3 CLI Support
v1            Yes                   Yes
v2            No                    Yes

What this table tells you is that Helm 2 CLI supports apiVersion V1, while Helm 3 CLI supports apiVersion V1 and V2. You should check the Helm Charts documentation linked above if you need more details about the differences, but the important thing to remember here is that Helm 3 CLI supports old charts, and (once again) there is no reason for you to use Helm 2.

We’ve cleared up (I hope) the confusion around Helm 2 and Helm 3. Let’s see how ACR handles Helm charts. For each one of those experiences, I will walk you through step by step.

ACR and Helm 2

Azure Container Registry allows you to store Helm charts using the Helm 2 way (What? I will explain in a bit). However:

Helm 2 is NOT supported and you should not use it!

Now that you have been warned (twice! or was it three times?) let’s see how ACR handles Helm 2 charts. To avoid any ambiguity, I will use Helm CLI v2.17.0 for this exercise. At the time of this writing, it is the last published version of Helm 2.

$ helm version
Client: &version.Version{SemVer:"v2.17.0", GitCommit:"a690bad98af45b015bd3da1a41f6218b1a451dbe", GitTreeState:"clean"}

Initializing Helm and the Repository List

If you have a brand new installation of the Helm 2 CLI, you should initialize Helm and add the ACR to your repository list. You start with:

$ helm init
$ helm repo list
NAME     URL
stable   https://charts.helm.sh/stable
local    http://127.0.0.1:8879/charts

to initialize Helm and see the list of available repositories. Then you can add your ACR to the list by typing:

$ helm repo add --username <acr_username> --password <acr_password> <repo_name> https://<acr_login_server>/helm/v1/repo

For me, this looked like this:

$ helm repo add --username <myacr_username> --password <myacr_password> acrrepo https://tsmacrtestwus2acrhelm.azurecr.io/helm/v1/repo

Here is something very important: you must use the /helm/v1/repo path! If you do not specify the path, you will see either a 404: Not found error (for example, if you use the root URL without a path) or a 403: Forbidden error (if you decide that you want to rename the repo part to something else).

I also need to make a side note here because the authentication can be a bit tricky. The following section applies to both Helm 2 and Helm 3.

Signing In To ACR Using the Helm (any version) CLI

Before you push and pull charts from ACR, you need to sign in. There are a few different options that you can use to sign in to ACR using the CLI:

  • Using ACR Admin user (not recommended)
    If you have ACR Admin user enabled, you can use the Admin user username and password to sign in to ACR by simply specifying the --username and --password parameters for the Helm command.
  • Using a Service Principal (SP)
    If you need to push and pull charts using automation, you have most probably already set up a service principal for that. You can authenticate using the SP credentials by passing the app ID in the  --username and the client secret in the --password parameters for the Helm command. Make sure you assign the appropriate role to the service principal to allow access to your registry.
  • Using your own (user) credentials
    This one is the tricky one, and it is described in the az acr login with --expose-token section of the ACR Authentication overview article. For this one, you must use the Azure CLI to obtain the token. Here are the steps:

    • Use the Azure CLI to sign in to Azure using your own credentials:
      $ az login

      This will pop up a browser window or give you a URL with a special code to use.

    • Next, sign in to ACR using the Azure CLI and add the --expose-token parameter:
      $ az acr login --name <acr_name_or_login_server> --expose-token

      This will sign you into ACR and will print an access token that you can use to sign in with other tools.

    • Last, you can sign in using the Helm CLI by passing a GUID-like string with zeros only (exactly this string 00000000-0000-0000-0000-000000000000) in the  --username parameter and the access token in the --password parameter. Here is how the command to add the Helm repository will look like:
      $ helm repo add --username "00000000-0000-0000-0000-000000000000" --password "eyJhbGciOiJSUzI1NiIs[...]24V7wA" <repo_name> https://<acr_login_server>/helm/v1/repo

Creating and Packaging Charts with Helm 2 CLI

Helm 2 doesn’t have an out-of-the-box experience for pushing charts to a remote chart registry. You may assume that the helm-push plugin does that, but you would be wrong. This plugin only allows you to push charts to ChartMuseum (you can try to use it to push to any repo, but it will fail – a topic for another story). Helm’s guidance on how chart repositories should work is described in the documentation (… and this is the Helm 2 way that I mentioned above):

  • According to Chart Repositories article in Helm documentation, the repository is a simple web server that serves the index.yaml file that points to the chart TAR archives. The TAR archives can be served by the same web server or from other locations like Azure Storage.
  • In Store charts in your chart repository, they describe the process to generate the index.yaml file and how to upload the necessary artifacts to static storage to serve them (a one-line example of generating the index follows below).
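
For reference, generating that index.yaml is a one-liner with the Helm CLI itself – shown here only to illustrate the “Helm 2 way”; the storage URL is a placeholder:

$ helm repo index . --url https://mystorageaccount.blob.core.windows.net/charts    # scans the packaged .tgz files in the current folder and writes index.yaml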

Disclaimer: the term Helm 2 way is my own term, based on my interpretation of how things work. It allows me to refer to the two different approaches to saving charts. It is not an industry term, nor something that Helm refers to or uses.

I have created a simple chart called helm-test-chart-v2 on my local machine to test the push. Here is the output from the commands:

$ helm create helm-test-chart-v2
Creating helm-test-chart-v2

$ ls -al ./helm-test-chart-v2/
total 28
drwxr-xr-x 4 azurevmuser azurevmuser 4096 Aug 16 16:44 .
drwxr-xr-x 6 azurevmuser azurevmuser 4096 Aug 17 16:29 ..
-rw-r--r-- 1 azurevmuser azurevmuser 342 Aug 16 16:44 .helmignore
-rw-r--r-- 1 azurevmuser azurevmuser 114 Aug 16 16:44 Chart.yaml
drwxr-xr-x 2 azurevmuser azurevmuser 4096 Aug 16 16:44 charts
drwxr-xr-x 3 azurevmuser azurevmuser 4096 Aug 16 16:44 templates
-rw-r--r-- 1 azurevmuser azurevmuser 1519 Aug 16 16:44 values.yaml

$ helm package ./helm-test-chart-v2/
Successfully packaged chart and saved it to: /home/azurevmuser/helm-test-chart-v2-0.1.0.tgz

$ ls -al
total 48
drwxr-xr-x 6 azurevmuser azurevmuser 4096 Aug 17 16:31 .
drwxr-xr-x 3 root root 4096 Aug 14 14:12 ..
-rw------- 1 azurevmuser azurevmuser 780 Aug 15 22:48 .bash_history
-rw-r--r-- 1 azurevmuser azurevmuser 220 Feb 25 2020 .bash_logout
-rw-r--r-- 1 azurevmuser azurevmuser 3771 Feb 25 2020 .bashrc
drwx------ 2 azurevmuser azurevmuser 4096 Aug 14 14:15 .cache
drwxr-xr-x 6 azurevmuser azurevmuser 4096 Aug 15 21:46 .helm
-rw-r--r-- 1 azurevmuser azurevmuser 807 Feb 25 2020 .profile
drwx------ 2 azurevmuser azurevmuser 4096 Aug 14 14:12 .ssh
-rw-r--r-- 1 azurevmuser azurevmuser 0 Aug 14 14:18 .sudo_as_admin_successful
-rw------- 1 azurevmuser azurevmuser 1559 Aug 14 14:26 .viminfo
drwxr-xr-x 4 azurevmuser azurevmuser 4096 Aug 16 16:44 helm-test-chart-v2
-rw-rw-r-- 1 azurevmuser azurevmuser 3269 Aug 17 16:31 helm-test-chart-v2-0.1.0.tgz

Because Helm 2 doesn’t have push chart functionality, the implementation is left up to the vendors. ACR has provided a proprietary implementation (already deprecated, which is another reason not to use Helm 2) of the push chart functionality that is built into the ACR CLI.

Pushing and Pulling Charts from ACR Using Azure CLI (Helm 2)

Let’s take a look at how you can push Helm 2 charts to ACR using the ACR CLI. First, you need to sign in to Azure and then to your ACR. Yes, this is correct; you need to use two different commands to sign in to ACR. Here is how this looks for my ACR registry:

$ az login
To sign in, use a web browser to open the page https://microsoft.com/devicelogin and enter the code AABBCCDDE to authenticate.
[
  {
    "cloudName": "AzureCloud",
    "homeTenantId": "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
    "id": "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
    "isDefault": true,
    "managedByTenants": [],
    "name": "ToddySM Sandbox",
    "state": "Enabled",
    "tenantId": "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
    "user": {
        "name": "toddysm_XXXXXXXX@outlook.com",
        "type": "user"
    }
  }
]
$ az acr login --name tsmacrtestwus2acrhelm.azurecr.io
The login server endpoint suffix '.azurecr.io' is automatically omitted.
You may want to use 'az acr login -n tsmacrtestwus2acrhelm --expose-token' to get an access token, which does not require Docker to be installed.
An error occurred: DOCKER_COMMAND_ERROR
Please verify if Docker client is installed and running.

Well, I do not have Docker running, but this is OK – you don’t need Docker installed to push a Helm chart. It may be confusing, though, because it leaves the impression that you may not be signed in to the ACR registry.

We will push the chart that we already packaged. Pushing is done with the (deprecated) built-in command in the ACR CLI. Here is the output:

$ az acr helm push --name tsmacrtestwus2acrhelm.azurecr.io helm-test-chart-v2-0.1.0.tgz
This command is implicitly deprecated because command group 'acr helm' is deprecated and will be removed in a future release. Use 'helm v3' instead.
The login server endpoint suffix '.azurecr.io' is automatically omitted.
{
  "saved": true
}

This seems to be successful, and I have a Helm chart pushed to ACR using the Helm 2 way (i.e. using the proprietary and deprecated ACR CLI implementation). The problem here is that it is hard to verify that the chart is pushed to the ACR registry. If you go to the portal, you will not see the repository that contains the chart. Here is a screenshot of my registry view in the Azure portal after I pushed the chart:

Azure Portal Not Listing Helm 2 Charts

As you can see, the Helm 2 chart repository doesn’t appear in the list of repositories in the Azure portal, and you will not be able to browse the charts in that repository using the Azure portal. However, if you use the Helm command to search for the chart, the result will include the ACR repository. Here is the output from the command in my environment:

$ helm search helm-test-chart-v2
NAME                           CHART VERSION        APP VERSION        DESCRIPTION
acrrepo/helm-test-chart-v2     0.1.0                1.0                A Helm chart for Kubernetes
local/helm-test-chart-v2       0.1.0                1.0                A Helm chart for Kubernetes

Summary of the ACR and Helm 2 Experience

To summarize the ACR and Helm 2 experience, here are the main takeaways:

  • First, you should not use Helm 2 CLI and the proprietary ACR CLI implementation for working with Helm charts!
  • There is no push functionality for charts in the Helm 2 client and each vendor is implementing their own CLI for pushing charts to the remote repositories.
  • When you add an ACR repository using the Helm 2 CLI, you should use the following URL format: https://<acr_login_server>/helm/v1/repo
  • If you push a chart to ACR using the ACR CLI implementation, you will not see the chart in the Azure portal. The only way to verify that the chart was pushed to the ACR repository is to use the helm search command.

ACR and Helm 3

Once again, to avoid any ambiguity, I will use Helm CLI v3.6.2 for this exercise. Here is the complete version string:

PS C:> helm version
version.BuildInfo{Version:"v3.6.2", GitCommit:"ee407bdf364942bcb8e8c665f82e15aa28009b71", GitTreeState:"clean", GoVersion:"go1.16.5"}

Yes, I run this one in a PowerShell terminal 🙂 And, of course, not in the root folder 😉 You can convert the commands to the corresponding Linux commands and prompts.

Let’s start with the basic thing!

Creating and Packaging Charts with Helm 3 CLI

There is absolutely no difference between the Helm 2 and Helm 3 experience for creating and packaging a chart. Here is the output:

PS C:> helm create helm-test-chart-v3
Creating helm-test-chart-v3

PS C:> ls .\helm-test-chart-v3\

    Directory: C:\Users\memladen\Documents\Development\Local\helm-test-chart-v3

Mode         LastWriteTime         Length         Name
----         -------------         ------         ----
d----    8/17/2021 9:42 PM                        charts
d----    8/17/2021 9:42 PM                        templates
-a---    8/17/2021 9:42 PM            349         .helmignore
-a---    8/17/2021 9:42 PM           1154         Chart.yaml
-a---    8/17/2021 9:42 PM           1885         values.yaml

PS C:> helm package .\helm-test-chart-v3\
Successfully packaged chart and saved it to: C:\Users\memladen\Documents\Development\Local\helm-test-chart-v3-0.1.0.tgz

PS C:> ls helm-test-*

    Directory: C:\Users\memladen\Documents\Development\Local

Mode         LastWriteTime         Length         Name
----         -------------         ------         ----
d----    8/17/2021 9:42 PM                        helm-test-chart-v3
-a---    8/17/2021 9:51 PM           3766         helm-test-chart-v3-0.1.0.tgz

From here on, though, things can get confusing! The reason is that you have two separate options to work with charts using Helm 3.

Using Helm 3 to Push and Pull Charts the Helm 2 Way

You can use Helm 3 to push the charts the same way you do that with Helm 2. First, you add the repo:

PS C:> helm repo add --username <myacr_username> --password <myacr_password> acrrepo https://tsmacrtestwus2acrhelm.azurecr.io/helm/v1/repo
"acrrepo" has been added to your repositories

PS C:> helm repo list
NAME         URL
microsoft    https://microsoft.github.io/charts/repo
acrrepo      https://tsmacrtestwus2acrhelm.azurecr.io/helm/v1/repo

Then, you can update the repositories and search for a chart:

PS C:> helm repo update
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "microsoft" chart repository
...Successfully got an update from the "acrrepo" chart repository
Update Complete. ⎈Happy Helming!⎈

PS C:> helm search repo helm-test-chart
NAME                          CHART VERSION     APP VERSION     DESCRIPTION
acrrepo/helm-test-chart-v2    0.1.0             1.0             A Helm chart for Kubernetes

Ha, look at that! I can see the chart that I pushed using the ACR CLI in the ACR and Helm 2 section above – notice the chart name and the version. Also, notice that the Helm 3 search command has a slightly different syntax – it wants you to clarify what you want to search (repo in our case).

I can use the ACR CLI to push the new chart that I just created using the Helm 3 CLI (after signing in to Azure):

PS C:> az acr helm push --name tsmacrtestwus2acrhelm.azurecr.io .\helm-test-chart-v3-0.1.0.tgz
This command is implicitly deprecated because command group 'acr helm' is deprecated and will be removed in a future release. Use 'helm v3' instead.
The login server endpoint suffix '.azurecr.io' is automatically omitted.
{
  "saved": true
}

By doing this, I have pushed the V3 chart to ACR and can pull it from there but, remember, this is the Helm 2 Way and the following are still true:

  • You will not see the chart in Azure Portal.
  • The only way to verify that the chart is pushed to the ACR repository is to use the helm search command.

Here is the result of the search command after updating the repositories:

PS C:> helm repo update
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "microsoft" chart repository
...Successfully got an update from the "acrrepo" chart repository
Update Complete. ⎈Happy Helming!⎈

PS C:> helm search repo helm-test-chart
NAME                           CHART VERSION     APP VERSION     DESCRIPTION
acrrepo/helm-test-chart-v2     0.1.0             1.0             A Helm chart for Kubernetes
acrrepo/helm-test-chart-v3     0.1.0             1.16.0          A Helm chart for Kubernetes

You can see both charts, the one created with Helm 2 and the one created with Helm 3, available. This is understandable though because I pushed both charts the same way – by using the az acr helm command. Remember, though – both charts are stored in ACR using the Helm 2 way.

Using Helm 3 to Push and Pull Charts the OCI Way

Before proceeding, I changed the version property in the Chart.yaml to 0.2.0 to be able to differentiate between the charts I pushed. This is the same chart that I created in the previous section Creating and Packaging Charts with Helm 3 CLI.

You may have noticed that Helm 3 has a new chart command. This command allows you to (from the help text) “push, pull, tag, list, or remove Helm charts”. The subcommands under the chart command are experimental and you need to set the HELM_EXPERIMENTAL_OCI environment variable to be able to use them. Once you do that, you can save the chart. You can save the chart to the local registry cache with or without a registry FQDN (Fully Qualified Domain Name). Here are the commands:

PS C:> $env:HELM_EXPERIMENTAL_OCI=1

PS C:> helm chart save .\helm-test-chart-v3\ helm-test-chart-v3:0.2.0
ref:     helm-test-chart-v3:0.2.0
digest:  b6954fb0a696e1eb7de8ad95c59132157ebc061396230394523fed260293fb19
size:    3.7 KiB
name:    helm-test-chart-v3
version: 0.2.0
0.2.0: saved
PS C:> helm chart save .\helm-test-chart-v3\ tsmacrtestwus2acrhelm.azurecr.io/charts/helm-test-chart-v3:0.2.0
ref:     tsmacrtestwus2acrhelm.azurecr.io/charts/helm-test-chart-v3:0.2.0
digest:  ff5e9aea6d63d7be4bb53eb8fffacf12550a4bb687213a2edb07a21f6938d16e
size:    3.7 KiB
name:    helm-test-chart-v3
version: 0.2.0
0.2.0: saved

If you list the charts using the new chart command, you will see the following:

PS C:> helm chart list
REF                                                                  NAME                   VERSION     DIGEST     SIZE     CREATED
helm-test-chart-v3:0.2.0                                             helm-test-chart-v3     0.2.0       b6954fb    3.7 KiB  About a minute
tsmacrtestwus2acrhelm.azurecr.io/charts/helm-test-chart-v3:0.2.0     helm-test-chart-v3     0.2.0       ff5e9ae    3.7 KiB  About a minute

A few things to note here:

  • Both charts are saved in the local registry cache. Nothing is pushed yet to a remote registry.
  • You see only charts that are saved the OCI way. The charts saved the Helm 2 way are not listed using the helm chart list command.
  • The REF (or reference) for a chart can be truncated and you may not be able to see the full reference.

Let’s do one more thing! Let’s save the same chart with FQDN for the ACR as above but under a different repository. Here are the commands to save and list the charts:

PS C:> helm chart save .\helm-test-chart-v3\ tsmacrtestwus2acrhelm.azurecr.io/my-helm-charts/helm-test-chart-v3:0.2.0
ref: tsmacrtestwus2acrhelm.azurecr.io/my-helm-charts/helm-test-chart-v3:0.2.0
digest: daf106a05ad2fe075851a3ab80f037020565c75c5be06936179b882af1858e6a
size: 3.7 KiB
name: helm-test-chart-v3
version: 0.2.0
0.2.0: saved

PS C:> helm chart list
REF                                                                  NAME                   VERSION     DIGEST     SIZE     CREATED
helm-test-chart-v3:0.2.0                                             helm-test-chart-v3     0.2.0       b6954fb    3.7 KiB  11 minutes
tsmacrtestwus2acrhelm.azurecr.io/charts/helm-test-chart-v3:0.2.0     helm-test-chart-v3     0.2.0       ff5e9ae    3.7 KiB  11 minutes
tsmacrtestwus2acrhelm.azurecr.io/my-helm-charts/helm-test-chart...   helm-test-chart-v3     0.2.0       daf106a    3.7 KiB  About a minute

After doing this, we have three charts in the local registry:

  • helm-test-chart-v3:0.2.0 that is available only locally.
  • tsmacrtestwus2acrhelm.azurecr.io/charts/helm-test-chart-v3:0.2.0 that can be pushed to the remote ACR registry tsmacrtestwus2acrhelm.azurecr.io and saved in the charts repository.
  • and tsmacrtestwus2acrhelm.azurecr.io/my-helm-charts/helm-test-chart-v3:0.2.0 that can be pushed to the remote ACR registry tsmacrtestwus2acrhelm.azurecr.io and saved in the my-helm-charts repository.

Before we can push the charts to the ACR registry, we need to sign in using the following command:

PS C:> helm registry login tsmacrtestwus2acrhelm.azurecr.io --username <myacr_username> --password <myacr_password>

You can use any of the sign-in methods described in the Signing In To ACR Using the Helm (any version) CLI section above. And make sure you use your own ACR registry login server.

If we push the two charts that have the ACR FQDN, we will see them appear in the Azure portal UI. Here are the commands:

PS C:> helm chart push tsmacrtestwus2acrhelm.azurecr.io/charts/helm-test-chart-v3:0.2.0
The push refers to repository [tsmacrtestwus2acrhelm.azurecr.io/charts/helm-test-chart-v3]
ref: tsmacrtestwus2acrhelm.azurecr.io/charts/helm-test-chart-v3:0.2.0
digest: daf106a05ad2fe075851a3ab80f037020565c75c5be06936179b882af1858e6a
size: 3.7 KiB
name: helm-test-chart-v3
version: 0.2.0
0.2.0: pushed to remote (1 layer, 3.7 KiB total)

PS C:> helm chart push tsmacrtestwus2acrhelm.azurecr.io/my-helm-charts/helm-test-chart-v3:0.2.0
The push refers to repository [tsmacrtestwus2acrhelm.azurecr.io/my-helm-charts/helm-test-chart-v3]
ref: tsmacrtestwus2acrhelm.azurecr.io/my-helm-charts/helm-test-chart-v3:0.2.0
digest: daf106a05ad2fe075851a3ab80f037020565c75c5be06936179b882af1858e6a
size: 3.7 KiB
name: helm-test-chart-v3
version: 0.2.0
0.2.0: pushed to remote (1 layer, 3.7 KiB total)

And here is the result:

An important thing to note here is that:

  • Helm charts saved to ACR using the OCI way will appear in the Azure portal.

The approach here is a bit different from the Helm 2 way. You don’t need to package the chart into a TAR archive – saving the chart to the local registry cache is enough. (A quick sketch of pulling such a chart back is shown below.)
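
For completeness, pulling a chart stored the OCI way uses the same experimental chart command set. A minimal sketch of what consuming the chart can look like (same registry and repository as above; the destination folder is arbitrary):

PS C:> helm chart pull tsmacrtestwus2acrhelm.azurecr.io/charts/helm-test-chart-v3:0.2.0
PS C:> helm chart export tsmacrtestwus2acrhelm.azurecr.io/charts/helm-test-chart-v3:0.2.0 --destination .\pulled-charts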

We need to do one last thing and we are ready to summarize the experience. Let’s use the helm search command to find our charts (of course using Helm 3). Here is the result of the search:

PS C:> helm search repo helm-test-chart 
NAME                           CHART VERSION     APP VERSION     DESCRIPTION 
acrrepo/helm-test-chart-v2     0.1.0             1.0             A Helm chart for Kubernetes 
acrrepo/helm-test-chart-v3     0.1.0             1.16.0          A Helm chart for Kubernetes

It yields the same result as the one we saw in Using Helm 3 to Push and Pull Charts the Helm 2 Way. The reason is that the helm search command doesn’t work for charts stored the OCI way. This is one limitation that the Helm team is working on fixing, and it is documented in the Support for OCI registries in helm search #9983 issue on GitHub.

Summary of the ACR and Helm 3 Experience

To summarize the ACR and Helm 3 experience, here are the main takeaways:

  • First, you can use the Helm 3 CLI in conjunction with the az acr helm command to push and pull charts the Helm 2 way. Those charts will not appear in the Azure portal.
  • You can also use the Helm 3 CLI to (natively) push charts to ACR the OCI way. Those charts will appear in the Azure portal.
  • OCI features are experimental in the Helm 3 client and certain functionalities like helm search and helm repo do not work for charts saved and pushed the OCI way.

Conclusion

To wrap it up, when working with Helm charts and ACR (as well as other OCI compliant registries), you need to be careful which commands you use. As a general rule, always use the Helm 3 CLI and make a conscious decision whether you want to store the charts as OCI artifacts (the OCI way) or using the legacy Helm approach (the Helm 2 way). This should be a transition period and hopefully, at some point in the future, Helm will improve the support for OCI compliant charts and support the same scenarios that are currently enabled for legacy chart repositories.

Here is a summary table that gives a quick overview of what we described in this post.

Functionality                     Helm 2 CLI (legacy)     Helm 3 (legacy)        Helm 3 (OCI)
helm repo add                     Yes                     Yes                    No
helm search                       Yes                     Yes                    No
helm chart push                   No                      No                     Yes
helm chart list                   No                      No                     Yes
az acr helm push                  Yes                     Yes                    No
Chart appears in Azure portal     No                      No                     Yes
Example chart                     helm-test-chart-v2      helm-test-chart-v3     helm-test-chart-v3
Example chart version             0.1.0                   0.1.0                  0.2.0

In the last two posts Configuring a hierarchy of IoT Edge Devices at Home Part 1 – Configuring the IT Proxy and Configuring a hierarchy of IoT Edge Devices at Home Part 2 – Configuring the Enterprise Network (IT) we have set up the proxy and the top layer of the hierarchical IoT Edge network. This post will describe how to set up and configure the first nested layer in the Purdue network architecture for manufacturing – the Business Planning and Logistics (IT) layer. So, let’s get going!

Many of the steps to configure the IoT Edge runtime on the device are the same as in the previous post. Nevertheless, let’s go over them again.

Configuring the Network for Layer 4 Device

The tricky part with the Layer 4 device is that it should have no Internet access. The problem is that you will need Internet access to install the IoT Edge runtime. In a typical scenario, the Layer 4 device will come with the IoT Edge runtime pre-installed, and you will only need to plug in the device and configure it to talk to its parent. For our scenario, though, we need to keep the device Internet-enabled while installing IoT Edge and lock it down after that. Here is the initial dhcpcd.conf configuration for the Layer 4 device:

interface eth0
static ip_address=10.16.6.4/16
static routers=10.16.8.4
static domain_name_servers=1.1.1.1

The only difference in the network configuration is the IP address of the device. This configuration will allow us to download the necessary software to the device before we restrict its access.

While speaking of necessary software, let’s install the netfilter-persistent and iptables-persistent packages and have them ready for locking down the device’s networking later on:

sudo DEBIAN_FRONTEND=noninteractive apt install -y netfilter-persistent iptables-persistent
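
If you want to double-check that everything is in place before adding any rules, a quick look at the persistence service and the (still empty) rule set should be enough:

# Confirm the persistence service is installed
systemctl status netfilter-persistent

# Show the current rule set; at this point the default policies are still ACCEPT and no rules are defined
sudo iptables -S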

Create the Azure Cloud Resources

We already have created the resource group and the IoT Hub resource in Azure. The only new resource that we need to create is the IoT Edge device for Layer 4. One difference here is that, unlike the Layer 5 device, the Layer 4 device will have a parent, and this parent will be the Layer 5 device. Use the following commands to create the Layer 4 device, set its parent, and retrieve its connection string:

az iot hub device-identity create --device-id L4-edge-pi --edge-enabled --hub-name tsm-ioth-eus-test
az iot hub device-identity parent set --device-id L4-edge-pi --parent-device-id L5-edge-pi --hub-name tsm-ioth-eus-test
az iot hub device-identity connection-string show --device-id L4-edge-pi --hub-name tsm-ioth-eus-test

The first command creates the IoT Edge device for Layer 4. The second command sets the Layer 5 device as the parent for the Layer 4 device. And, the third command returns the connection string for the Layer 4 device that we can use for the IoT Edge runtime configuration.
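
If you want to confirm that the parent/child relationship was set correctly, you can, for example, query the Layer 4 device identity; its parentScopes property should point to the Layer 5 device’s scope:

# Verify that the Layer 4 device has the Layer 5 device as its parent
az iot hub device-identity show --device-id L4-edge-pi --hub-name tsm-ioth-eus-test --query parentScopes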

Installing Azure IoT Edge Runtime on the Layer 4 Device

Surprisingly to me, in the time between publishing my last post and starting this one, the Tutorial to Create a Hierarchy of IoT Edge Devices changed. I will still use the steps from my previous post to provide consistency.

Creating Certificates

We will start again with the creation of certificates – this time for the Layer 4 device. We already have the root and the intermediate certificates for the chain. The only certificate that we need to create is the Layer 4 device certificate. We will use the following command for that:

./certGen.sh create_edge_device_ca_certificate "l4_certificate"

You should see the new certificates in the <WORKDIR>/certs folder:

drwxrwxrwx 1 toddysm toddysm 4096 Apr 16 10:46 .
drwxrwxrwx 1 toddysm toddysm 4096 Apr 16 10:46 ..
...
-rwxrwxrwx 1 toddysm toddysm 5891 Apr 16 10:46 iot-edge-device-ca-l4_certificate-full-chain.cert.pem
-r-xr-xr-x 1 toddysm toddysm 1931 Apr 16 10:46 iot-edge-device-ca-l4_certificate.cert.pem
-rwxrwxrwx 1 toddysm toddysm 7240 Apr 16 10:46 iot-edge-device-ca-l4_certificate.cert.pfx
...

The private keys are in the <WORKDIR>/private folder together with the root and intermediate keys:

drwxrwxrwx 1 toddysm toddysm 4096 Apr 16 10:46 .
drwxrwxrwx 1 toddysm toddysm 4096 Apr 16 10:46 ..
-r-xr-xr-x 1 toddysm toddysm 3326 Apr  8 14:54 azure-iot-test-only.intermediate.key.pem
-r-xr-xr-x 1 toddysm toddysm 3326 Apr  8 14:54 azure-iot-test-only.root.ca.key.pem
-r-xr-xr-x 1 toddysm toddysm 3243 Apr 16 10:46 iot-edge-device-ca-l4_certificate.key.pem
...

As before, we need to upload the relevant certificates to the Layer 4 device. From the <WORKDIR> folder on the workstation use the following commands:

scp ./certs/azure-iot-test-only.root.ca.cert.pem pi@pi-l4:.
scp ./certs/iot-edge-device-ca-l4_certificate-full-chain.cert.pem pi@pi-l4:.
scp ./private/iot-edge-device-ca-l4_certificate.key.pem pi@pi-l4:.

Connect to the Layer 4 device with ssh pi@pi-l4 and install the root CA using:

sudo cp ~/azure-iot-test-only.root.ca.cert.pem /usr/local/share/ca-certificates/azure-iot-test-only.root.ca.cert.pem.crt
sudo update-ca-certificates

Verify that the cert is installed with:

ls /etc/ssl/certs/ | grep azure

We will also move the device certificate chain and the private key to the /var/secrets folder:

sudo mkdir /var/secrets
sudo mkdir /var/secrets/aziot
sudo mv iot-edge-device-ca-l4_certificate-full-chain.cert.pem /var/secrets/aziot/
sudo mv iot-edge-device-ca-l4_certificate.key.pem /var/secrets/aziot/

We are all set to install and configure the Azure IoT Edge runtime.

Configuring the IoT Edge Runtime

As mentioned before, the steps are described in Install or uninstall Azure IoT Edge for Linux, and here is the configuration that we need to use in the /etc/aziot/config.toml file:

hostname = "10.16.6.4"

parent_hostname = "10.16.7.4"

trust_bundle_cert = "file:///etc/ssl/certs/azure-iot-test-only.root.ca.cert.pem.pem"

[provisioning]
source = "manual"
connection_string = "HostName=tsm-ioth-eus-test.azure-devices.net;DeviceId=L4-edge-pi;SharedAccessKey=s0m#VeRYcRYpT1c$tR1nG"

[agent]
name = "edgeAgent"
type = "docker"
[agent.config]
image = "10.16.7.4:<port>/azureiotedge-agent:1.2"

[edge_ca]
cert = "file:///var/secrets/aziot/iot-edge-device-ca-l4_certificate-full-chain.cert.pem"
pk = "file:///var/secrets/aziot/iot-edge-device-ca-l4_certificate.key.pem"

Note two differences in this configuration compared to the configuration for the Layer 5 device:

  1. There is a parent_hostname configuration that has the Layer 5 device’s IP address as a value.
  2. The agent image is pulled from the parent registry 10.16.7.4:<port>/azureiotedge-agent:1.2 instead of from mcr.microsoft.com (don’t forget to remove or change the <port> in the configuration depending on how you set up the parent registry).

Also, if the registry running on your Layer 5 device requires authentication, you will need to provide credentials in the [agent.config.auth] section.
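
As with the Layer 5 device, don’t forget to apply the configuration and verify it before moving on:

# Apply the IoT Edge configuration and run the built-in checks
sudo iotedge config apply
sudo iotedge check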

Restricting the Network for the Layer 4 Device

Now that we have the IoT Edge runtime configured, we need to restrict the traffic to and from the Layer 4 device. Here is what we are going to do:

  • We will allow SSH connectivity only from the IP addresses in the Demo/Support network, i.e. where our workstation is. However, we will use a stricter CIDR to allow only 10.16.9.x addresses. Here are the commands that will make that configuration:
    sudo iptables -A INPUT -i eth0 -p tcp --dport 22 -s 10.16.9.0/24 -m state --state NEW,ESTABLISHED -j ACCEPT
    sudo iptables -A OUTPUT -p tcp --sport 22 -d 10.16.9.0/24 -m state --state ESTABLISHED -j ACCEPT
  • We will enable connectivity on ports 8080 (or whatever the registry port is), 8883, 443, and 5671 to the Layer 5 network only, i.e. 10.16.7.x addresses. Here are the commands that will make that configuration:
    sudo iptables -A OUTPUT -p tcp --dport <registry-port> -d 10.16.7.0/24 -m state --state NEW,ESTABLISHED -j ACCEPT
    sudo iptables -A INPUT -i eth0 -p tcp --sport <registry-port> -s 10.16.7.0/24 -m state --state ESTABLISHED -j ACCEPT
    sudo iptables -A OUTPUT -p tcp --dport 8883 -d 10.16.7.0/24 -m state --state NEW,ESTABLISHED -j ACCEPT
    sudo iptables -A INPUT -i eth0 -p tcp --sport 8883 -s 10.16.7.0/24 -m state --state ESTABLISHED -j ACCEPT
    sudo iptables -A OUTPUT -p tcp --dport 5671 -d 10.16.7.0/24 -m state --state NEW,ESTABLISHED -j ACCEPT
    sudo iptables -A INPUT -i eth0 -p tcp --sport 5671 -s 10.16.7.0/24 -m state --state ESTABLISHED -j ACCEPT
    sudo iptables -A OUTPUT -p tcp --dport 443 -d 10.16.7.0/24 -m state --state NEW,ESTABLISHED -j ACCEPT
    sudo iptables -A INPUT -i eth0 -p tcp --sport 443 -s 10.16.7.0/24 -m state --state ESTABLISHED -j ACCEPT
    
  • We will also enable connectivity on ports 8080 (or whatever the registry port is), 8883, 443, and 5671, but this time from the OT Proxy network only, i.e. 10.16.5.x addresses. Because these connections are initiated from the OT side, the destination port is matched on the INPUT chain and the source port on the OUTPUT chain. Here are the commands that will make that configuration:
    sudo iptables -A INPUT -i eth0 -p tcp --dport <registry-port> -s 10.16.5.0/24 -m state --state NEW,ESTABLISHED -j ACCEPT
    sudo iptables -A OUTPUT -p tcp --sport <registry-port> -d 10.16.5.0/24 -m state --state ESTABLISHED -j ACCEPT
    sudo iptables -A INPUT -i eth0 -p tcp --dport 8883 -s 10.16.5.0/24 -m state --state NEW,ESTABLISHED -j ACCEPT
    sudo iptables -A OUTPUT -p tcp --sport 8883 -d 10.16.5.0/24 -m state --state ESTABLISHED -j ACCEPT
    sudo iptables -A INPUT -i eth0 -p tcp --dport 5671 -s 10.16.5.0/24 -m state --state NEW,ESTABLISHED -j ACCEPT
    sudo iptables -A OUTPUT -p tcp --sport 5671 -d 10.16.5.0/24 -m state --state ESTABLISHED -j ACCEPT
    sudo iptables -A INPUT -i eth0 -p tcp --dport 443 -s 10.16.5.0/24 -m state --state NEW,ESTABLISHED -j ACCEPT
    sudo iptables -A OUTPUT -p tcp --sport 443 -d 10.16.5.0/24 -m state --state ESTABLISHED -j ACCEPT

  • Last, we need to disable any other traffic to and from the Layer 4 device. Here are the commands for that:
    sudo iptables -P INPUT DROP
    sudo iptables -P OUTPUT DROP
    sudo iptables -P FORWARD DROP
    

Don’t forget to save the configuration and restart the device with:

sudo netfilter-persistent save
sudo systemctl reboot
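
After the device comes back up, it is a good idea to review the active rule set and confirm that the DROP policies and the rules above were persisted as expected:

# List the active rules with packet counters and the default policies
sudo iptables -L -v -n --line-numbers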

Deploying Modules on the Layer 4 Device

Similar to the deployment of modules on the Layer 5 device, we need to deploy the standard $edgeAgent and $edgeHub modules as well as a registry module. For this layer, I experimented with the API proxy module; hence, the configurations also include the $IoTEdgeApiProxy module. There is also an API proxy configuration that you need to set on the $IoTEdgeApiProxy module’s twin – for details, take a look at Configure the API proxy module for your gateway hierarchy scenario. Here are the Gists that you can use for deploying the modules on the Layer 4 device:

By using the API proxy, you can also leverage the certificates deployed on the device to connect to the registry over HTTPS. Thus, you don’t need to configure Docker on the clients to connect to an insecure registry.
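
For comparison, if you decide not to use the API proxy and expose the registry module over plain HTTP instead, every client would have to trust it as an insecure registry. Here is a minimal sketch of that configuration, assuming the registry module listens on port 8080 (adjust the address and port to your setup, and merge with any existing daemon.json settings):

# Only needed if you skip the API proxy and expose the registry over plain HTTP
sudo tee /etc/docker/daemon.json > /dev/null <<'EOF'
{
  "insecure-registries": ["10.16.6.4:8080"]
}
EOF
sudo systemctl restart docker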

Testing the Registry Module

To test the registry module, use the following commands:

docker login -u <sync_token_name> -p <sync_token_password> 10.16.6.4:<port_if_any>
docker pull 10.16.6.4:<port_if_any>/azureiotedge-simulated-temperature-sensor:1.0

Now, we have set up the first nested layer (Layer 4) for our nested IoT Edge infrastructure. In the next post, I will describe the steps to set up the OT Proxy and the second nested layer for IoT Edge (Layer 3).

Image by ejaugsburg from Pixabay

In my last post, I started explaining how to configure the IoT Edge device hierarchy’s IT Proxy. This post will go one layer down and set up the Layer 5 device from the Purdue model for manufacturing networks.

Reconfiguring The Network

While implementing network segregation in the cloud is relatively easy, implementing it with a limited number of devices and a consumer-grade network switch requires a bit more design. In Azure, the routing between subnets within a VNet is automatically configured using the subnet gateways. In my case, though, the biggest challenge was connecting each Raspberry Pi device to two separate subnets using the available interfaces – one WiFi and one Ethernet. In essence, I couldn’t use the WiFi interface because then I was limited to my Eero’s (lack of) capabilities (well, I have an idea how to make this work, but this will be a topic of a future post :)). Thus, my only option was to put all devices in the same subnet and play with the firewall on each device to restrict the traffic. Here is how the picture looks.

To be able to connect to each individual device from my laptop (i.e. playing the role of the jumpbox from the Azure IoT Edge for Industrial IoT sample), I had to configure a second network interface on it and give it an IP address from the 10.16.0.0/16 network (the Workstation in the picture above). There are multiple ways to do that, the easiest one being to buy a USB networking dongle and connect it to the switch with the rest of the devices. One more thing that helps speed up the work is to edit the /etc/hosts file on my laptop and add host names for each of the devices:

10.16.8.4       pi-itproxy
10.16.7.4       pi-l5
10.16.6.4       pi-l4
10.16.5.4       pi-otproxy
10.16.4.4       pi-l3
10.16.3.4       pi-opcua

The next thing I had to do was to go back to the IT Proxy and change the subnet mask to enable a broader addressing range. Connect to the IT Proxy device that we configured in Configuring a hierarchy of IoT Edge Devices at Home Part 1 – Configuring the IT Proxy using ssh pi@10.16.8.4. Then edit the DHCP configuration file with sudo vi /etc/dhcpcd.conf and change the subnet mask from /24 to /16:

interface eth0
static ip_address=10.16.8.4/16
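
For the change to take effect, restart the DHCP client daemon on the IT Proxy (or simply reboot the device):

# Apply the new network configuration on the IT Proxy
sudo systemctl restart dhcpcd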

Now, the IT Proxy is configured to address the broader 10.16.0.0/16 network. To restrict the communication between the devices, we will configure each individual device’s firewall.

Now, getting back to the Layer 5 configuration. Normally, the Layer 5 device is configured to have access to the Internet via the IT Proxy as a gateway. Edit the DHCP configuration file with sudo vi /etc/dhcpcd.conf and add the following at the end:

interface eth0
static ip_address=10.16.7.4/16
static routers=10.16.8.4
static domain_name_servers=1.1.1.1

Note that I added Cloudflare’s DNS server to the list of DNS servers. The reason is that the proxy device will not do DNS resolution. You can also configure it with your home network’s DNS server. This should be enough for now, and we can start installing the IoT Edge runtime on the Layer 5 device.
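
Before that, after restarting dhcpcd (or rebooting) on the Layer 5 device, you can quickly confirm that eth0 picked up the static address and that the default route points to the IT Proxy:

# Run on the Layer 5 device: confirm the static IP and the default route via 10.16.8.4
ip addr show eth0
ip route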

Testing from my laptop, I am able to connect to both devices using their 10.16.0.0/16 network IP addresses:

ssh pi@pi-itproxy

connects me to the IT Proxy, and:

ssh pi@pi-l5

connects me to the Layer 5 IoT Edge device.

Create the Azure Cloud Resources

Before we start installing the Azure IoT Edge runtime on the device, we need to create an Azure IoT Hub and register the L5 device with it. This is well described in Microsoft’s documentation explaining how to deploy code to a Linux device. Here are the Azure CLI commands that will create the resource group and the IoT Hub resources:

az group create --name tsm-rg-eus-test --location eastus
az iot hub create --resource-group tsm-rg-eus-test --name tsm-ioth-eus-test --sku S1 --partition-count 4

Next, we need to register the IoT Edge device and get its connection string. Here are the commands:

az iot hub device-identity create --device-id L5-edge-pi --edge-enabled --hub-name tsm-ioth-eus-test
az iot hub device-identity connection-string show --device-id L5-edge-pi --hub-name tsm-ioth-eus-test

The last command will return the connection string that we should use to connect the new device to Azure IoT Hub. It has the following format:

{
  "connectionString": "HostName=tsm-ioth-eus-test.azure-devices.net;DeviceId=L5-edge-pi;SharedAccessKey=s0m#VeRYcRYpT1c$tR1nG"
}

Save the connection string because we will need it in the next section to configure the Layer 5 device.
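
If you only need the bare connection string, for example to paste it into the IoT Edge configuration later, you can extract it directly:

# Return only the connection string value, without the surrounding JSON
az iot hub device-identity connection-string show --device-id L5-edge-pi --hub-name tsm-ioth-eus-test --query connectionString --output tsv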

Installing Azure IoT Edge Runtime on the Layer 5 Device

Installing the Azure IoT Edge is described in the Tutorial: Create a hierarchy of IoT Edge devices article. The tutorial describes how to build a hierarchy with two devices only. The important part of the nested configuration is to generate the certificates and transfer them to the devices. So, let’s go over this step by step for the Layer 5 device.

Create Certificates

The first thing we need to do on the workstation, after cloning the IoT Edge GitHub repository with

git clone https://github.com/Azure/iotedge.git

, is to generate the root and the intermediate certificates (check the /tools/CACertificates folder):

./certGen.sh create_root_and_intermediate

Those certificates will be used to generate the individual devices’ certificates. For now, we will only create a certificate for the Layer 5 device.

./certGen.sh create_edge_device_ca_certificate "l5_certificate"

After those two commands, you should have the following in your <WORKDIR>/certs folder:

drwxrwxrwx 1 toddysm toddysm 4096 Apr  8 15:04 .
drwxrwxrwx 1 toddysm toddysm 4096 Apr  8 15:04 ..
-rwxrwxrwx 1 toddysm toddysm 3960 Apr  8 14:54 azure-iot-test-only.intermediate-full-chain.cert.pem
-r-xr-xr-x 1 toddysm toddysm 1976 Apr  8 14:54 azure-iot-test-only.intermediate.cert.pem
-rwxrwxrwx 1 toddysm toddysm 5806 Apr  8 14:54 azure-iot-test-only.intermediate.cert.pfx
-r-xr-xr-x 1 toddysm toddysm 1984 Apr  8 14:54 azure-iot-test-only.root.ca.cert.pem
-rwxrwxrwx 1 toddysm toddysm 5891 Apr  8 15:04 iot-edge-device-ca-l5_certificate-full-chain.cert.pem
-r-xr-xr-x 1 toddysm toddysm 1931 Apr  8 15:04 iot-edge-device-ca-l5_certificate.cert.pem
-rwxrwxrwx 1 toddysm toddysm 7240 Apr  8 15:04 iot-edge-device-ca-l5_certificate.cert.pfx

The content of the <WORKDIR>/private folder should be the following:

drwxrwxrwx 1 toddysm toddysm 4096 Apr  8 15:04 .
drwxrwxrwx 1 toddysm toddysm 4096 Apr  8 15:04 ..
-r-xr-xr-x 1 toddysm toddysm 3326 Apr  8 14:54 azure-iot-test-only.intermediate.key.pem
-r-xr-xr-x 1 toddysm toddysm 3326 Apr  8 14:54 azure-iot-test-only.root.ca.key.pem
-r-xr-xr-x 1 toddysm toddysm 3243 Apr  8 15:04 iot-edge-device-ca-l5_certificate.key.pem

We need to upload the relevant certificates to the Layer 5 device. From the <WORKDIR> folder on the workstation issue the following commands:

scp ./certs/azure-iot-test-only.root.ca.cert.pem pi@pi-l5:.
scp ./certs/iot-edge-device-ca-l5_certificate-full-chain.cert.pem pi@pi-l5:.

The above two commands will upload the public key for the root certificate and the certificate chain to the Layer 5 device. The following command will upload the device’s private key:

scp ./private/iot-edge-device-ca-l5_certificate.key.pem pi@pi-l5:.

Connect to the Layer 5 device with ssh pi@pi-l5 and install the root CA using:

sudo cp ~/azure-iot-test-only.root.ca.cert.pem /usr/local/share/ca-certificates/azure-iot-test-only.root.ca.cert.pem.crt
sudo update-ca-certificates

The response from this command should be:

Updating certificates in /etc/ssl/certs...
1 added, 0 removed; done.
Running hooks in /etc/ca-certificates/update.d...
done.

To verify that the cert is installed, you can use:

ls /etc/ssl/certs/ | grep azure

We should also move the device certificate chain and the private key to the /var/secrets folder:

sudo mkdir /var/secrets
sudo mkdir /var/secrets/aziot
sudo mv iot-edge-device-ca-l5_certificate-full-chain.cert.pem /var/secrets/aziot/
sudo mv iot-edge-device-ca-l5_certificate.key.pem /var/secrets/aziot/

Configuring the IoT Edge Runtime

Installation and configuration of the IoT Edge runtime is described in the official Microsoft documentation (see Install or uninstall Azure IoT Edge for Linux), but here are the values you should set for the Layer 5 device configuration when editing the /etc/aziot/config.toml file:

hostname = "10.16.7.4"

trust_bundle_cert = "file:///etc/ssl/certs/azure-iot-test-only.root.ca.cert.pem.pem"

[provisioning]
source = "manual"
connection_string = "HostName=tsm-ioth-eus-test.azure-devices.net;DeviceId=L5-edge-pi;SharedAccessKey=s0m#VeRYcRYpT1c$tR1nG"

[agent.config]
image = "mcr.microsoft.com/azureiotedge-agent:1.2.0-rc4"
[edge_ca]
cert = "file:///var/secrets/aziot/iot-edge-device-ca-l5_certificate-full-chain.cert.pem"
pk = "file:///var/secrets/aziot/iot-edge-device-ca-l5_certificate.key.pem"

Apply the IoT Edge configuration with sudo iotedge config apply and check it with sudo iotedge check. At this point, the IoT Edge runtime should be running on the Layer 5 device. You can check the status with sudo iotedge system status.
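
If something doesn’t look right, the runtime’s own tooling helps with troubleshooting. For example:

# List the modules currently running on the device
sudo iotedge list

# Show the IoT Edge system logs
sudo iotedge system logs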

Deploying Modules on the Layer 5 Device

The last thing we need to do is to deploy the required modules that will support the lower-layer device. In addition to the standard IoT Edge modules $edgeAgent and $edgeHub, we also need to deploy a registry module. The registry module is intended to serve the container images for the Layer 4 device. You should also deploy the API proxy module to enable a single endpoint for all services deployed on the IoT Edge device. Here are several Gists that you can use for deploying the modules on the Layer 5 device:
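
If you prefer the command line over the portal for the deployment, the same manifest can be applied with the Azure CLI. A sketch, assuming you saved one of the Gists above as a local file (the file name here is made up):

# Apply a deployment manifest, saved from one of the Gists, to the Layer 5 device
az iot edge set-modules --device-id L5-edge-pi --hub-name tsm-ioth-eus-test --content ./layer5-deployment.json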

Testing the Registry Module

Before setting up the next layer of the hierarchical IoT Edge infrastructure, we need to make sure that the Layer 5 registry module is working properly. Without it, you will not be able to set up the next layer. From your laptop, you should be able to pull Docker images from the registry module. Depending on how you have deployed the registry module (using one of the Gists above), the pull commands should look something like this:

docker login -u <registry_username> -p <registry_password> 10.16.7.4:<port_if_any>
docker pull 10.16.7.4:<port_if_any>/azureiotedge-simulated-temperature-sensor:1.0

Now, we have set up the top layer (Layer 5) for our nested IoT Edge infrastructure. In the next post, I will describe the steps to set up the first nested layer of the hierarchical infrastructure – Layer 4.

Image by falco from Pixabay.