For a while, we’ve been exploring the idea of using OCI annotations to track the lifecycle of container images. The problem we are trying to solve is as follows. Container images are immutable and cannot be dynamically patched like virtual machines. To apply the latest updates to a containerized application, teams must produce a new image with the patches. Once the new image is produced, the old one should be considered outdated and vulnerable, and all workloads using the old image should be redeployed with the new one.

The problem is how to mark the old image as outdated. Why? Because teams may have pinned their deployments to a digest or an immutable tag, and we want them to move to the patched version. We also want to create policies that outdated images should not be deployed. Finally, we want an automated way to point the teams to the latest patched image. Unfortunately, using digests and tags prevents us from achieving those goals.

Container Image Lifecycle Example

Here is a concrete example. I will use semantic versioning for the tags to make the example more relatable.

  • First Revision: Build the application image ghcr.io/toddysm/flasksample:1.0
    1.0 in this example is a rolling tag. To differentiate between images, I may want to use an immutable tag; for example, 1.0-20230707 using the YYYYMMDD format for the date. This image has a digest sha256:1234567890.
  • Second Revision: New vulnerabilities are found in ghcr.io/toddysm/flasksample:1.0, so I rebuild the image and tag it using the same 1.0 rolling tag.
    I also tag it with an immutable tag like 1.0-20230710. This image has a different digest sha256:5647382910.

At this moment, I have two images for the same application from the same lineage 1.0. Here is the relation between tags and digests:

  • Tags 1.0 and 1.0-20230710 point to digest sha256:5647382910
  • Tag 1.0-20230707 points to digest sha256:1234567890

If I have pinned my deployment to the tag 1.0-20230707 or the digest sha256:1234567890, I do not know whether the image is still fresh or has a newer patched version. I could try interpreting the tags, but that logic would be specific to my tagging scheme. For example, my tagging scheme 1.0-YYYYMMDD differs from Python’s, NodeJS’, Alpine’s, or Ubuntu’s scheme (well, Ubuntu’s is very similar :)). Also, obtaining the tag from the image digest is not possible.
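To make the point concrete, here is a minimal shell sketch of what interpreting my own tagging scheme could look like. The function name parse_dated_tag is hypothetical, and the logic assumes tags of the form <version>-<YYYYMMDD>. Anything else (a rolling tag like 1.0, or a tag like 3.10-bookworm) is rejected, which is exactly why this approach does not generalize.

```shell
# Hypothetical helper: split a tag of the form <version>-<YYYYMMDD>
# (the scheme used in this post) into its version and revision parts.
# Prints "<version> <revision>", or returns 1 if the tag does not match.
parse_dated_tag() {
  tag="$1"
  version="${tag%-*}"     # everything before the last dash
  revision="${tag##*-}"   # everything after the last dash
  # A rolling tag such as "1.0" has no dash, so the expansion leaves it unchanged.
  if [ "$version" = "$tag" ]; then
    return 1
  fi
  # The revision part must be exactly eight digits.
  case "$revision" in
    [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]) ;;
    *) return 1 ;;
  esac
  printf '%s %s\n' "$version" "$revision"
}
```

For example, parse_dated_tag 1.0-20230707 prints 1.0 20230707, while parse_dated_tag 1.0 returns a non-zero exit code.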

The idea is to use OCI annotations to add additional information to the images. This information can help us communicate the deprecation of images and preserve vital information lost when retagging. We are not the only ones thinking in this direction – the folks from Ubuntu also want to use annotations to deprecate images, although their goal is a bit different.

OCI Annotations for Image Lifecycle

OCI annotations are key-value pairs that you can add to the manifest of any OCI artifact, including container images. The problem is that annotations cannot be changed once the manifest is created. Hence, if you add the OCI annotations to the image manifest, you cannot update them anymore. The workaround to that is to use the OCI referrer capability and add a new artifact with annotations that is linked to the image. In essence, whenever you need to update an annotation, you must push a new artifact with the full set of annotations and link it to the image. The consumer of the annotations needs to list all referrer artifacts with the “annotation” type and take the latest one.

The other question that comes to mind is: “What annotations will help you track the image lifecycle?” OCI already specifies some pre-defined keys that you can leverage. There are three important ones that will help you manage the lifecycle:

  • org.opencontainers.image.created can be used to specify the date at which the image was created. For example, when the image is built.
  • org.opencontainers.image.version can be used to specify the version of the software. Think of this as the lineage of the software (i.e. Python 3.10 or Ubuntu Jammy).
  • org.opencontainers.image.revision can be used to specify the patch version of the software. For example, Python 3.10.12 or Ubuntu Jammy 20230624.

One thing that OCI does not specify is an annotation for the end-of-life of the image. For that, you can use a custom annotation like vnd.myorganization.image.end-of-life.

How to Use Annotations for Image Lifecycle?

Let’s look at how the above annotations can be used to manage the lifecycle of a series of images. I will use concrete dates for the example.

First Revision of the Image: Build the application image ghcr.io/toddysm/flasksample:1.0. As part of the build process, add the following annotations to the image:

{
    "org.opencontainers.image.created" : "2023-07-07T00:00:00-08:00",
    "org.opencontainers.image.version" : "1.0",
    "org.opencontainers.image.revision" : "20230707"
}

Second Revision of the Image: Vulnerabilities are discovered in the ghcr.io/toddysm/flasksample:1.0 image. Rebuild the image with the fixes and add the following annotations to it:

{
    "org.opencontainers.image.created" : "2023-07-10T00:00:00-08:00",
    "org.opencontainers.image.version" : "1.0",
    "org.opencontainers.image.revision" : "20230710"
}

Find the previous image in the registry ghcr.io/toddysm/flasksample:1.0-20230707 and update the annotations to the following:

{
    "org.opencontainers.image.created" : "2023-07-07T00:00:00-08:00",
    "org.opencontainers.image.version" : "1.0",
    "org.opencontainers.image.revision" : "20230707",
    "vnd.myorganization.image.end-of-life" : "2023-07-10T00:00:00-08:00"
}

With those annotations in place, you can track the lifecycle of each image. But not only that! You can always determine the latest and most up-to-date image in the lineage by just pulling the 1.0 tag (which is available in the org.opencontainers.image.version annotation for every image in the lineage). Both the mutable and the immutable tags are preserved with the image. New annotations can always be added to the image by adding “empty” referrer artifacts with just the annotations. Also, annotations for the images are available even if the image is pulled by its digest.

Here are a couple of scenarios that you can implement with this information.

  • Block deployment of images that are end-of-life
    You can implement a policy to block the deployment of images that are end-of-life on your Kubernetes clusters. Such a policy can be implemented in admission controllers like Kyverno or Gatekeeper.
  • Suggest updated image in action items in vulnerability reports
    Current vulnerability reports for container images are hardly actionable because they do not provide an update path for the reported images. Development teams are not interested in how many vulnerabilities are discovered but in the quickest way to fix those.
  • Automate the process for rebuilding dependent images
    Tools like Dependabot can use the image lifecycle information to create pull requests for dependent images. This can speed up the process of fixing vulnerabilities and improving the vulnerability posture for the application.
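The first scenario boils down to a simple comparison. Below is a minimal shell sketch of that comparison, not a Kyverno or Gatekeeper policy. The function name is_end_of_life is hypothetical, the end-of-life value is assumed to be an RFC 3339 timestamp that was already extracted from the annotations, and the script assumes GNU date for parsing it.

```shell
# Hypothetical check: succeeds (exit 0) when the end-of-life timestamp is in the past.
# $1: value of the vnd.myorganization.image.end-of-life annotation (RFC 3339),
#     or empty / "null" when the annotation is absent.
# $2: optional "now" as epoch seconds (defaults to the current time).
# Assumption: GNU date, which accepts RFC 3339 strings through -d.
is_end_of_life() {
  eol="$1"
  now="${2:-$(date +%s)}"
  # No annotation means the image has not been marked end-of-life.
  if [ -z "$eol" ] || [ "$eol" = "null" ]; then
    return 1
  fi
  # Exit code 2 signals a value that could not be parsed as a date.
  eol_epoch="$(date -d "$eol" +%s 2>/dev/null)" || return 2
  [ "$now" -ge "$eol_epoch" ]
}
```

A deployment gate could call is_end_of_life with the annotation value (for instance, read with jq -r, which prints null for a missing key) and refuse to deploy when the function succeeds. How to treat exit code 2 is a policy decision; failing closed is probably the safer default.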

All that sounds great, but the problem is the tooling support. While OCI specifies how you can store artifacts in registries and defines some standard annotations, very few, if any, tools allow you to easily achieve the above experience. I took it upon myself to give it a try!

Implementing Image Lifecycle Annotations with Existing Tools

Note: The steps below use Docker buildx, ORAS, and GitHub Container Registry (GHCR). They reflect the state of the tools as of July 10th, 2023, and may (or most certainly will) change. Other tools like regctl can be used instead of ORAS. As always, I will use my cssc-pipeline repository for storing any code for this blog post.

First, I will set some environment variables to avoid retyping and make the commands easier to follow.

export TEMP_LOCATION=temp
export IMAGE_VERSION=1.0
export FIRST_REVISION=20230707
export SECOND_REVISION=20230710
export REGISTRY=ghcr.io/toddysm/cssc-pipeline
export REPOSITORY=flasksample
mkdir -p $TEMP_LOCATION

Building the First Revision of the Image

The first step is to build the first revision of the image. Docker buildx can build the image, but the default option is to save the image in Docker’s proprietary format, which doesn’t allow the use of annotations. To use annotations, you must use the OCI exporter and save the image as a tarball. Here is the command that will allow you to do that:

docker buildx build . -f Dockerfile \
  -t ${REGISTRY}/${REPOSITORY}:${IMAGE_VERSION} \
  -o "type=oci,dest=${TEMP_LOCATION}/${REPOSITORY}-${IMAGE_VERSION}-${FIRST_REVISION}.tar,annotation.org.opencontainers.image.created=20230707T00:00-08:00,annotation.org.opencontainers.image.version=${IMAGE_VERSION},annotation.org.opencontainers.image.revision=${FIRST_REVISION}" \
  --metadata-file ${TEMP_LOCATION}/${REPOSITORY}-${IMAGE_VERSION}-${FIRST_REVISION}-metadata.json

The command above creates an image in OCI format and saves it as a tarball. We can use the generated metadata file to obtain the image’s digest.
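One detail worth noting: the OCI image specification defines org.opencontainers.image.created as an RFC 3339 date-time, while the command above hard-codes a compact value that does not appear to follow that format. As a small sketch, the timestamp could be generated instead of typed by hand. The function name rfc3339_now is hypothetical, and it deliberately uses UTC to keep the value unambiguous.

```shell
# Hypothetical helper: print the current time in UTC as an RFC 3339 timestamp,
# in the form YYYY-MM-DDTHH:MM:SSZ.
rfc3339_now() {
  date -u +%Y-%m-%dT%H:%M:%SZ
}
```

The output could then be substituted into the exporter options, for example annotation.org.opencontainers.image.created=$(rfc3339_now). The resulting value contains no commas, so it should not interfere with the comma-separated -o argument.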

export FIRST_REVISION_DIGEST=`cat ${TEMP_LOCATION}/${REPOSITORY}-${IMAGE_VERSION}-${FIRST_REVISION}-metadata.json \
  | jq -r '."containerimage.descriptor".digest'`
echo $FIRST_REVISION_DIGEST

In my case, the digest is sha256:1446094f076dcbc2b7e7943ae3806bb44003ee9e6c94efd3208b8f04159aa8c0.

Next, I will use the following ORAS command to push the OCI image to the GHCR registry:

oras cp --from-oci-layout ${TEMP_LOCATION}/${REPOSITORY}-${IMAGE_VERSION}-${FIRST_REVISION}.tar:${IMAGE_VERSION} \
  ${REGISTRY}/${REPOSITORY}:${IMAGE_VERSION}

I can verify that the annotations are set on the image by pulling the manifest and checking the annotations field:

oras manifest fetch ${REGISTRY}/${REPOSITORY}:${IMAGE_VERSION} \
  | jq .annotations

At this point, I have the first revision of the image built and published to GHCR under ghcr.io/toddysm/cssc-pipeline/flasksample:1.0.

Building the Second Revision of the Image

A few days later, if vulnerabilities are discovered in the image, I need to update the image with the latest patches. As part of the build process, I also need to update the annotations of the previous image revision.

The first thing I need to do is to obtain the digest of the first revision. Because the first revision is still tagged with 1.0, I can quickly get the digest using the following command:

export OLD_IMAGE_DIGEST=`oras manifest fetch --descriptor ${REGISTRY}/${REPOSITORY}:${IMAGE_VERSION} \
  | jq .digest | tr -d '"'`
echo $OLD_IMAGE_DIGEST

That command returns the same digest as before: sha256:1446094f076dcbc2b7e7943ae3806bb44003ee9e6c94efd3208b8f04159aa8c0. Now, I have a unique reference to the first revision of the image. I can build the second revision of the image using the same commands as before.

# Build the second revision of the container image with annotations...
docker buildx build . -f Dockerfile \
  -t ${REGISTRY}/${REPOSITORY}:${IMAGE_VERSION} \
  -o "type=oci,dest=${TEMP_LOCATION}/${REPOSITORY}-${IMAGE_VERSION}-${SECOND_REVISION}.tar,annotation.org.opencontainers.image.created=20230710T00:00-08:00,annotation.org.opencontainers.image.version=${IMAGE_VERSION},annotation.org.opencontainers.image.revision=${SECOND_REVISION}" \
  --metadata-file ${TEMP_LOCATION}/${REPOSITORY}-${IMAGE_VERSION}-${SECOND_REVISION}-metadata.json

# Get the digest for the second revision...
export SECOND_REVISION_DIGEST=`cat ${TEMP_LOCATION}/${REPOSITORY}-${IMAGE_VERSION}-${SECOND_REVISION}-metadata.json \
  | jq -r '."containerimage.descriptor".digest'`
echo $SECOND_REVISION_DIGEST

# Use ORAS to push the second revision to the registry...
oras cp --from-oci-layout ${TEMP_LOCATION}/${REPOSITORY}-${IMAGE_VERSION}-${SECOND_REVISION}.tar:${IMAGE_VERSION} \
  ${REGISTRY}/${REPOSITORY}:${IMAGE_VERSION}

# Use ORAS to verify the annotations are set on the image...
oras manifest fetch ${REGISTRY}/${REPOSITORY}:${IMAGE_VERSION} \
  | jq .annotations

You can check that the second revision digest is different from the first revision using the following commands:

export IMAGE_DIGEST=`oras manifest fetch --descriptor ${REGISTRY}/${REPOSITORY}:${IMAGE_VERSION} \
  | jq .digest | tr -d '"'`
echo $IMAGE_DIGEST

For me, the digest of the second revision (or the most up-to-date image) is sha256:4ee61e3d9d28fe15cffc33854a2b851e2c87929a99f0c71bfc7a689ad372894d.
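In a script, I would not compare the digests by eye. Here is a minimal sketch of an automated check; the function name digest_changed is hypothetical. It also guards against jq returning null or an empty string when a manifest could not be fetched.

```shell
# Hypothetical helper: succeed only if both values look like sha256 digests
# and the rebuilt image's digest differs from the previous one.
# Returns 2 when either value is malformed, 1 when the digests are equal.
digest_changed() {
  old="$1"
  new="$2"
  for d in "$old" "$new"; do
    # "sha256:" followed by exactly 64 lowercase hex characters.
    printf '%s\n' "$d" | grep -Eq '^sha256:[0-9a-f]{64}$' || return 2
  done
  [ "$old" != "$new" ]
}
```

With the variables from this post, the call would be digest_changed "$OLD_IMAGE_DIGEST" "$IMAGE_DIGEST".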

Updating the Lifecycle Annotations for the First Revision of the Image

This is the most important step of the process – I need to go back and update the lifecycle annotations for the first revision of the image and mark it as end-of-life. This is a bit trickier because the manifest of the original image cannot be modified. It is immutable! I need to create a referrer artifact and store the lifecycle annotations in the manifest of this referrer artifact. However, the referrer artifact should be empty (well, you can put a cat picture there, but it is irrelevant 🙂). ORAS already supports pushing an empty artifact. In the future, OCI-compliant registries will support empty layers for artifacts too. Here are the steps for that.

First, I will fetch the annotations for the first revision and update them with the end-of-life annotation.

oras manifest fetch ${REGISTRY}/${REPOSITORY}@${OLD_IMAGE_DIGEST} \
  | jq .annotations \
  | jq '. += {"vnd.myorganization.image.end-of-life":"20230710T00:00-08:00"}' \
  | jq '{"$manifest":.}' \
  > ${TEMP_LOCATION}/annotations.json

Note that ORAS uses a special JSON schema for annotation files. Hence, I needed to convert the annotations that I retrieved from the image manifest to a new JSON object expected by ORAS. Here is what the resulting file looks like:

jq . ${TEMP_LOCATION}/annotations.json                                                                         

{
  "$manifest": {
    "org.opencontainers.image.created": "20230707T00:00-08:00",
    "org.opencontainers.image.revision": "20230707",
    "org.opencontainers.image.version": "1.0",
    "vnd.myorganization.image.end-of-life": "20230710T00:00-08:00"
  }
}

Now, I need to push the empty artifact and refer to the first revision of the image. To do that, I will also need to use an artifact type (or mediaType in OCI language) to find my lifecycle annotations later on easily. There is no standard mediaType for that, so I have to invent my own – I will use application/vnd.myorganization.image.lifecycle.metadata. Here is the ORAS command that you can use to update the annotations:

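# Create the empty placeholder file that is attached below
# (assumed step; this file is not created anywhere else in the post)
touch ${TEMP_LOCATION}/empty.layer

oras attach --artifact-type application/vnd.myorganization.image.lifecycle.metadata \
  --annotation-file ${TEMP_LOCATION}/annotations.json \
  ${REGISTRY}/${REPOSITORY}@${OLD_IMAGE_DIGEST} \
  ${TEMP_LOCATION}/empty.layer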

OK, I am done with setting the lifecycle annotations for the images.

Fetching the Lifecycle Annotations for Each Image

I can fetch the annotations for each image by simply using the following commands:

oras manifest fetch ${REGISTRY}/${REPOSITORY}@${OLD_IMAGE_DIGEST} | jq .annotations

oras manifest fetch ${REGISTRY}/${REPOSITORY}@${IMAGE_DIGEST} | jq .annotations

There is a problem, though! Those commands fetch the annotations that are set in the image manifest. For the outdated image (i.e. $OLD_IMAGE_DIGEST), the command will not return the end-of-life annotation. To get that annotation, I will need to fetch the referrer with the particular type application/vnd.myorganization.image.lifecycle.metadata. Here is how to do that.

First, I need to get the digest of the referrer artifact.

export ANNOTATIONS_ARTIFACT_DIGEST=`oras discover --artifact-type "application/vnd.myorganization.image.lifecycle.metadata" \
  ${REGISTRY}/${REPOSITORY}@${OLD_IMAGE_DIGEST} -o json \
  | jq '.manifests[0].digest' \
  | tr -d '"'`
echo $ANNOTATIONS_ARTIFACT_DIGEST

And then retrieve the annotations set in the referrer’s artifact manifest.

oras manifest fetch ${REGISTRY}/${REPOSITORY}@${ANNOTATIONS_ARTIFACT_DIGEST} | jq .annotations

As implementation logic, I would always check whether the image has a referrer lifecycle artifact. If it does, I would ignore the lifecycle annotations in the image manifest and use the ones from the referrer.
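That precedence rule can be sketched in a few lines of shell. The function name effective_annotations is hypothetical. It assumes both annotation sets were already fetched as JSON text (for example with jq .annotations, which prints null when the field is missing) and simply picks which one to trust.

```shell
# Hypothetical helper: choose which annotation set a client should use.
# $1: annotations JSON taken from the image manifest
# $2: annotations JSON taken from the lifecycle referrer,
#     or empty / "null" if no such referrer exists
effective_annotations() {
  manifest_annotations="$1"
  referrer_annotations="$2"
  if [ -n "$referrer_annotations" ] && [ "$referrer_annotations" != "null" ]; then
    # A lifecycle referrer exists, so it takes precedence.
    printf '%s\n' "$referrer_annotations"
  else
    # Fall back to what was baked into the image manifest at build time.
    printf '%s\n' "$manifest_annotations"
  fi
}
```

This works without merging because each referrer artifact carries the full set of annotations, not only the changed ones.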

Closing Thoughts

Lifecycle annotations enable interesting scenarios for securing container supply chains and improving containerized applications’ vulnerability posture. The tooling can undoubtedly be improved, though – I had to do a lot of JSON conversions to get it working. The lack of a standard annotation for end-of-life and of a standard mediaType for the referrer artifact makes the above solution very proprietary.

One issue that can arise is if multiple application/vnd.myorganization.image.lifecycle.metadata referrers are created. OCI doesn’t specify how to retrieve the latest artifact from registries by type. If multiple lifecycle annotations artifacts are pushed for the image, the client must pull and inspect each. This logic can be quite complex and can impact the performance on the client’s side.
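One possible way to keep that client logic manageable is to give every lifecycle artifact its own timestamp and sort on it. The sketch below is only an illustration under assumptions: the function name latest_referrer is hypothetical, and so is the idea of a dedicated annotation (say, vnd.myorganization.image.lifecycle.updated) holding the time the referrer was pushed. The org.opencontainers.image.created value copied in this post describes the image, not the referrer, so it cannot serve that purpose.

```shell
# Hypothetical helper: read "<timestamp> <digest>" lines on stdin and print
# the digest that has the most recent timestamp.
# Assumption: every timestamp is RFC 3339 in UTC (ending in Z), so plain
# lexicographic order matches chronological order.
latest_referrer() {
  LC_ALL=C sort | tail -n 1 | cut -d ' ' -f 2
}
```

The client would still need one manifest fetch per referrer to build the input lines unless the registry returns annotations in the referrers listing, so this reduces the complexity of the logic but not necessarily the number of requests.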

An improvement that can be made to the process above is to sign each artifact (the image and the lifecycle annotations artifact). This will ensure that the annotations are trustable and not tampered with. Of course, an attacker can always remove the referrer artifact from the registry and leave the client with the impression that the image is still fresh.

Those are all food for thought and good topics for future posts. Here is also a video of the whole experience described above.

[UPDATE: 2023-03-26] When I wrote this post, the expectation was that OCI would release version 1.1 of the specification with the artifact manifest included. This release was supposed to happen by the end of January 2023 or mid-February 2023. Unfortunately, the OCI 1.1 Image Spec PR 999 put a hold on that and, as of today, the spec is not released. Although I promised to have a Part 2, due to the changes in the spec, continuing the investigation in the original direction may not be fruitful or helpful to anyone. Most of the functionality described below has been removed from many registries, and the steps and the information may be incorrect. The concepts are still relevant, but their actual implementation may not be as described in this post. Consider the information applicable only between Jan 5th 2023 and Jan 24th 2023 – the date the above PR was submitted. There will be no other updates to this post or Part 2 of the series. Instead of Part 2, folks may find the Registry & client support for Image Manifest type artifacts issue relevant to what they are looking for.

If you are deep into containers and software supply chain security, you may have heard of OCI referrers API and OCI artifacts. If not, but you are interested in the containers’ secure supply chain topic, this post will give you enough details to start exploring new registry capabilities that can significantly improve your software supply chain architecture.

This will be a two-part series. In the first part, I will examine the differences between OCI 1.0 and OCI 1.1 and their support across registries. In the second part, I will look at more advanced scenarios like deep hierarchies, deleting artifacts, and migrating content between registries with different support.

But before we start…

What is OCI?

The Open Container Initiative (OCI) is the governance organization responsible for creating open industry standards for container formats and runtimes. OCI develops and maintains three essential specifications:

  1. The OCI Image Format Specification defines the structure and the layout of an image or artifact. If you are interested in reading more about the OCI image layout, I recommend the No More Additional Network Requests – Enter: OCI Image Layout post from @developerguy. It will give you a good background on how the image is structured. I will mainly discuss the OCI Artifact Manifest in this post.
  2. The OCI Distribution Specification defines the APIs that registries should implement to enable the distribution of artifacts. The OCI Referrers API is part of this specification and will be discussed in this post.
  3. The OCI Runtime Specification specifies the configuration, the execution environment, and the lifecycle of a container.  I will not discuss the runtime specification in this post.

One additional note. You may have heard of the term OCI reference types in the past. This was the name of the working group (WG) responsible for driving the changes in the image format and distribution specification. The prototype implementation of reference types was first implemented in ORAS. Its usefulness was the reason it was brought to the attention of the OCI group and resulted in the new changes.

Disclaimer: One last thing I have to mention is that, at the time of this writing, the OCI specifications (OCI 1.1) that support the new artifact manifest changes and the referrers API are in release candidate 2 (RC.2). The release of the OCI 1.1 specifications is planned for February 2023. Because of this, keep in mind that not many registries support the new artifact manifest and the referrers API yet. This post aims to describe the scenarios they enable and discuss backward compatibility with registries that support the current OCI 1.0 specifications. I will also test several registries and point out their current capabilities.

What Scenarios Do OCI Artifact Manifest and Referrers API Enable?

As always, I would like to start with the scenarios and the benefits of using those new capabilities. As part of the ongoing secure software supply chain efforts, every vendor must produce metadata in addition to the actual software. Vendors need to add human and machine-readable metadata describing the software, whether this is a binary executable or a container image. The most common metadata discussed nowadays is software bills of materials (SBOMs) and signatures. SBOMs list the packages and binaries used in the individual piece of software (aka the software “ingredients”). The signature is intended to testify about the authenticity of the software and prevent tampering with the bits.

In the past, container registries were intended to store only container images. With the introduction of OCI artifacts, container registries can store other artifacts like SBOMs, signatures, plain text files, and even videos. The OCI referrers API goes even further and allows you to establish relationships between artifacts. This is a compelling functionality that allows you to create structures like this:

+ Container Image
    - Signature of the Container Image
    + SBOM for the Container Image
        - Signature of the SBOM
    + Vulnerability Report for the Container Image
        - Signature of the Vulnerability Report
    + Additional Container Image metadata
        - Signature of the additional metadata
    - ...

Now, the container registry is not just a storage place for images but generic artifact storage that can also define relations between the artifacts. As you may have noticed from the trend in the industry, registries are no longer referred to as container registries but as artifact registries.

There are many benefits that the new capabilities offer in addition to storing various artifacts:

  • Relevant artifacts can be stored and managed together with the subject (or primary) artifact.
    Querying and visualizing the related artifacts is much easier than storing them unrelated. This can result not only in more manageable implementations but also in better performance.
  • Relevant artifacts are easily discoverable.
    Pulling an image from a registry may require additional artifacts for verification. An example is signature verification before allowing deployment. Using the OCI referrers API to get an image’s signature will be a trivial and standardized operation.
  • Relevant artifacts can be copied together between registries.
    Content promotion between registries is a common scenario in container supply chains. Now, the image can be promoted to the target registry with all relevant artifacts instead of making many calls to the registry to discover them before promotion.

Because the capabilities are still new, how to standardize the implementations is still in discussion. You can look at my request for guidance for OCI artifacts for more variations of the above scenario and the possible implementations.

For this post, though, I will concentrate on a straightforward scenario using SBOMs. I want to attach three different SBOMs to an image and test with a few major registries to understand the current capabilities. I will build the following content structure:

+ Container image
  artifactType: "application/vnd.docker.container.image.v1+json"
    - SPDX SBOM in JSON format
      artifactType: "application/spdx+json"
    - SPDX SBOM in TEXT format
      artifactType: "text/spdx"
    - CycloneDX SBOM in JSON format
      artifactType: "application/vnd.cyclonedx+json"

I also chose the following registries to test with:

You may not be familiar with the Zot and the ORAS registries listed above, but they are Open Source registries that you can run locally. Those registries stay on top of new OCI capabilities and are among the first to implement them, which makes them a good option for testing new OCI capabilities.

Now, let’s dive into the registry capabilities.

Creating the Artifacts

All artifacts and results can be found in my container secure supply chain playground repository on GitHub. I have created the usual flasksample image and generated the SBOMs using Syft. Here are all the commands for that:

# Build and push the image
docker build -t toddysm/flasksample:oci1.1-tests .
docker login -u toddysm
docker push toddysm/flasksample:oci1.1-tests

# Generate the SBOM in various formats
syft packages toddysm/flasksample:oci1.1-tests -o spdx-json > oci1.1-tests.spdx.json
syft packages toddysm/flasksample:oci1.1-tests -o spdx > oci1.1-tests.spdx
syft packages toddysm/flasksample:oci1.1-tests -o cyclonedx-json > oci1.1-tests.cyclonedx.json

I will use the above image and the generated SBOMs to push to various registries and test their behavior. Note that ORAS CLI can handle registries that support the new OCI 1.1 specifications and registries that support only OCI 1.0 specifications. ORAS CLI automatically converts the manifest to the most appropriate manifest based on the registry support.

Referring to Artifacts in Registries with OCI 1.0 Support

Docker Hub recently announced support for OCI Artifacts. Note, though, that this is support for OCI 1.0. Here are the commands to push the SBOMs to Docker Hub and reference the image as a subject:

oras attach --artifact-type "application/spdx+json" --annotation "producer=syft 0.63.0" docker.io/toddysm/flasksample:oci1.1-tests ./oci1.1-tests.spdx.json
# Command response
Uploading e6011f4dd3fa oci1.1-tests.spdx.json
Uploaded  e6011f4dd3fa oci1.1-tests.spdx.json
Attached to docker.io/toddysm/flasksample@sha256:b89e2098603bead4f07e318e1a4e11b4a4ef1f3614725c88b3fcdd469d55c0e0
Digest: sha256:0a1dd8fcdef54eb489aaa99978e19cffd7f6ae11595322ab5af694913da177d4

oras attach --artifact-type "text/spdx" --annotation "producer=syft 0.63.0" docker.io/toddysm/flasksample:oci1.1-tests ./oci1.1-tests.spdx
# Command response
Uploading d9c2135fe4b9 oci1.1-tests.spdx
Uploaded  d9c2135fe4b9 oci1.1-tests.spdx
Error: DELETE "https://registry-1.docker.io/v2/toddysm/flasksample/manifests/sha256:16a58d1ed78402935d61e524f5609087334b164861618373d7b96a7b7c612f1a": response status code 405: unsupported: The operation is unsupported.

oras attach --artifact-type "application/vnd.cyclonedx+json" --annotation "producer=syft 0.63.0" docker.io/toddysm/flasksample:oci1.1-tests ./oci1.1-tests.cyclonedx.json
# Command response
Uploading c0ddc2a5ea78 oci1.1-tests.cyclonedx.json
Uploaded  c0ddc2a5ea78 oci1.1-tests.cyclonedx.json
Error: DELETE "https://registry-1.docker.io/v2/toddysm/flasksample/manifests/sha256:37ebfdebe499bcec8e5a5ce04ae4526d3e560c199c16a85a97f80a91fbf1d2c3": response status code 405: unsupported: The operation is unsupported.

Checking Docker Hub, I can see that the image digest is sha256:b89e2098603bead4f07e318e1a4e11b4a4ef1f3614725c88b3fcdd469d55c0e0 as returned by the ORAS CLI above.

I expected to see another artifact with sha256:0a1dd8fcdef54eb489aaa99978e19cffd7f6ae11595322ab5af694913da177d4 (again returned by the ORAS CLI above). However, such an artifact is not shown in the Docker Hub UI. There is another artifact tagged with the digest of the image.

However, the digest of that artifact (sha256:c8c7d53f0e1ed5553a815c7b5ccf40c09801f7636a3c64940eafeb7bfab728cd) is not the one from the ORAS CLI output.

Of course, the question in my mind is: “What is the digest that ORAS CLI returned above?” The sha256:0a1dd8fcdef54eb489aaa99978e19cffd7f6ae11595322ab5af694913da177d4 one. Using ORAS CLI or crane, I can explore the various manifests.

What Manifests Are Created When Referring Between Artifacts in OCI 1.0 Registries?

The oras discover command helps visualize the hierarchy of artifacts that reference a subject.

oras discover docker.io/toddysm/flasksample:oci1.1-tests -o tree                      
docker.io/toddysm/flasksample:oci1.1-tests
├── application/spdx+json
│   └── sha256:0a1dd8fcdef54eb489aaa99978e19cffd7f6ae11595322ab5af694913da177d4
├── text/spdx
│   └── sha256:6f6c9260247ad876626f742508550665ad20c75ac7e4469782d18e47d40cac67
└── application/vnd.cyclonedx+json
    └── sha256:047054894cbe7c9e57532f4e01d03f631e92c3aec48b4a06485296aee1374b3b

According to the output above, I should be able to see four artifacts. Also, as you can see, the digest sha256:0a1dd8fcdef54eb489aaa99978e19cffd7f6ae11595322ab5af694913da177d4 is the one for the first SBOM I attached to the image. To understand what is happening, let’s look at the different manifests. I will use the oras manifest command to pull the manifests of all artifacts by referencing them by digest:

# Pull the manifest for the image
oras manifest fetch docker.io/toddysm/flasksample@sha256:b89e2098603bead4f07e318e1a4e11b4a4ef1f3614725c88b3fcdd469d55c0e0 > manifest-sha256-b89e2098603bead4f07e318e1a4e11b4a4ef1f3614725c88b3fcdd469d55c0e0.json

# Pull the manifest for the SPDX SBOM in JSON format
oras manifest fetch docker.io/toddysm/flasksample@sha256:0a1dd8fcdef54eb489aaa99978e19cffd7f6ae11595322ab5af694913da177d4 > manifest-sha256-0a1dd8fcdef54eb489aaa99978e19cffd7f6ae11595322ab5af694913da177d4.json

# Pull the manifest for the SPDX SBOM in TEXT format
oras manifest fetch docker.io/toddysm/flasksample@sha256:6f6c9260247ad876626f742508550665ad20c75ac7e4469782d18e47d40cac67 > manifest-sha256-6f6c9260247ad876626f742508550665ad20c75ac7e4469782d18e47d40cac67.json

# Pull the manifest for the CycloneDX SBOM in JSON format
oras manifest fetch docker.io/toddysm/flasksample@sha256:047054894cbe7c9e57532f4e01d03f631e92c3aec48b4a06485296aee1374b3b > manifest-sha256-047054894cbe7c9e57532f4e01d03f631e92c3aec48b4a06485296aee1374b3b.json

# Pull the manifest of the artifact tagged with the image digest
oras manifest fetch docker.io/toddysm/flasksample@sha256:c8c7d53f0e1ed5553a815c7b5ccf40c09801f7636a3c64940eafeb7bfab728cd > manifest-sha256-c8c7d53f0e1ed5553a815c7b5ccf40c09801f7636a3c64940eafeb7bfab728cd.json

All manifests are available in the dockerhub folder in my container secure supply chain playground repository on GitHub. The image manifest is self-explanatory and I will not dig into it. The other four are more interesting. Opening the manifest for the SPDX SBOM in JSON format, I can see that it is an artifact manifest "mediaType": "application/vnd.oci.artifact.manifest.v1+json" of type "artifactType": "application/spdx+json". It has a blob annotated with the name of the file I pushed. It also has a subject field referring to the image. The manifest for the SPDX SBOM in TEXT format and the CycloneDX SBOM in JSON format have the same structure. The hierarchy represented by the oras discover command above shows exactly those manifests. I believe the output of oras discover could be improved to show also the image digest for completeness:

oras discover docker.io/toddysm/flasksample:oci1.1-tests -o tree                      
docker.io/toddysm/flasksample:oci1.1-tests
│   └── sha256:b89e2098603bead4f07e318e1a4e11b4a4ef1f3614725c88b3fcdd469d55c0e0 
├── application/spdx+json
│   └── sha256:0a1dd8fcdef54eb489aaa99978e19cffd7f6ae11595322ab5af694913da177d4
├── text/spdx
│   └── sha256:6f6c9260247ad876626f742508550665ad20c75ac7e4469782d18e47d40cac67
└── application/vnd.cyclonedx+json
    └── sha256:047054894cbe7c9e57532f4e01d03f631e92c3aec48b4a06485296aee1374b3b

The question remains how the manifest tagged with the image digest plays a role here. Looking at it, I can see that it is an index manifest "mediaType": "application/vnd.oci.image.index.v1+json" that lists the three SBOM artifacts I pushed. Remember, this index manifest is tagged with the image digest. This is similar to the structure Sigstore creates that I described in Implementing Containers’ Secure Supply Chain with Sigstore Part 2 – The Magic Behind. Here is a visual of how the manifests are related:

The SBOM artifacts are not visible in the Docker Hub UI because they are not tagged, and Docker Hub has no UI to show untagged artifacts. A few questions remain:

  • What happens if I delete the image?
  • What happens if I delete the index manifest?
  • Can I create deeper hierarchical structures in registries that support OCI 1.0?
  • What happens when I copy related artifacts from OCI 1.0 registry to OCI 1.1 registry?

I will come back to those in the second part of the series. Before that, I would like to examine how registries with OCI 1.1 support storing the manifests for the referred artifacts.
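The index manifest tagged with the image digest, described above, can be located without knowing its own digest. Here is a minimal sketch, assuming the tag is the image digest with the colon replaced by a dash (the pattern the OCI 1.1 referrers tag schema describes for registries without referrers support). The function name referrers_fallback_tag is hypothetical and only for illustration.

```shell
#!/bin/sh
# Derive the tag of the fallback index manifest from an image digest.
# Assumption: the tag has the form "<algorithm>-<hex>", i.e. the digest
# with ':' replaced by '-'.
referrers_fallback_tag() {
  digest="$1"
  algorithm="${digest%%:*}"   # text before the first ':'
  encoded="${digest#*:}"      # text after the first ':'
  printf '%s-%s\n' "$algorithm" "$encoded"
}

# Digest of the flasksample image used in this post
referrers_fallback_tag "sha256:b89e2098603bead4f07e318e1a4e11b4a4ef1f3614725c88b3fcdd469d55c0e0"
```

With that tag, the index manifest could be fetched by tag instead of by digest, for example with oras manifest fetch docker.io/toddysm/flasksample:<tag>.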

Referring to Artifacts in Registries with OCI 1.1 Support

Azure Container Registry (ACR) just announced support for OCI 1.1. It is in Public Preview and supports the OCI 1.1 RC spec at the moment of this writing. After retagging the image, the commands for pushing the SBOMs are similar to the ones used for Docker Hub.

# Re-tag and push the image
docker image tag toddysm/flasksample:oci1.1-tests tsmacrwcusocitest.azurecr.io/flasksample:oci1.1-tests
docker push tsmacrwcusocitest.azurecr.io/flasksample:oci1.1-tests

oras attach --artifact-type "application/spdx+json" --annotation "producer=syft 0.63.0" tsmacrwcusocitest.azurecr.io/flasksample:oci1.1-tests ./oci1.1-tests.spdx.json
# Command response
Uploading e6011f4dd3fa oci1.1-tests.spdx.json
Uploaded  e6011f4dd3fa oci1.1-tests.spdx.json
Attached to tsmacrwcusocitest.azurecr.io/flasksample@sha256:b89e2098603bead4f07e318e1a4e11b4a4ef1f3614725c88b3fcdd469d55c0e0
Digest: sha256:71e90130cb912fbcff6556c0395878a8e7a0c7244eb8e8ee9001e84f9cba804a

oras attach --artifact-type "text/spdx" --annotation "producer=syft 0.63.0" tsmacrwcusocitest.azurecr.io/flasksample:oci1.1-tests ./oci1.1-tests.spdx
# Command response
Uploading d9c2135fe4b9 oci1.1-tests.spdx
Uploaded  d9c2135fe4b9 oci1.1-tests.spdx
Attached to tsmacrwcusocitest.azurecr.io/flasksample@sha256:b89e2098603bead4f07e318e1a4e11b4a4ef1f3614725c88b3fcdd469d55c0e0
Digest: sha256:0fbd0e611ec9fe620b72ebe130da680de9402e1e241b30c2aa4515610ed2d766

oras attach --artifact-type "application/vnd.cyclonedx+json" --annotation "producer=syft 0.63.0" tsmacrwcusocitest.azurecr.io/flasksample:oci1.1-tests ./oci1.1-tests.cyclonedx.json
# Command response
Uploading c0ddc2a5ea78 oci1.1-tests.cyclonedx.json
Uploaded  c0ddc2a5ea78 oci1.1-tests.cyclonedx.json
Attached to tsmacrwcusocitest.azurecr.io/flasksample@sha256:b89e2098603bead4f07e318e1a4e11b4a4ef1f3614725c88b3fcdd469d55c0e0
Digest: sha256:198e405344b5fafd6127821970eb4a84129ae729402e8c2c71fc1bb80abf0954

Azure portal does not show any additional artifacts and manifests, as shown in this screenshot:

This is a bit confusing, as I would at least expect to see a few more manifests. The distribution specification does not define functionality for listing untagged manifests; figuring out those dependencies without additional information will be hard. One noticeable thing is that no additional index manifest is tagged with the image digest.

The oras discover command returns the following tree:

oras discover tsmacrwcusocitest.azurecr.io/flasksample:oci1.1-tests -o tree
tsmacrwcusocitest.azurecr.io/flasksample:oci1.1-tests
├── application/vnd.cyclonedx+json
│   └── sha256:198e405344b5fafd6127821970eb4a84129ae729402e8c2c71fc1bb80abf0954
├── text/spdx
│   └── sha256:0fbd0e611ec9fe620b72ebe130da680de9402e1e241b30c2aa4515610ed2d766
└── application/spdx+json
    └── sha256:71e90130cb912fbcff6556c0395878a8e7a0c7244eb8e8ee9001e84f9cba804a

This is the same structure I saw above when using the command on the Docker Hub image. Pulling the manifests reveals that they are precisely the same as the ones from Docker Hub.

# Pull the manifest for the image
oras manifest fetch tsmacrwcusocitest.azurecr.io/flasksample@sha256:b89e2098603bead4f07e318e1a4e11b4a4ef1f3614725c88b3fcdd469d55c0e0 > manifest-sha256-b89e2098603bead4f07e318e1a4e11b4a4ef1f3614725c88b3fcdd469d55c0e0.json

# Pull the manifest for the SPDX SBOM in JSON format
oras manifest fetch tsmacrwcusocitest.azurecr.io/flasksample@sha256:71e90130cb912fbcff6556c0395878a8e7a0c7244eb8e8ee9001e84f9cba804a > manifest-sha-71e90130cb912fbcff6556c0395878a8e7a0c7244eb8e8ee9001e84f9cba804a.json

# Pull the manifest for the SPDX SBOM in TEXT format
oras manifest fetch tsmacrwcusocitest.azurecr.io/flasksample@sha256:0fbd0e611ec9fe620b72ebe130da680de9402e1e241b30c2aa4515610ed2d766 > manifest-sha-0fbd0e611ec9fe620b72ebe130da680de9402e1e241b30c2aa4515610ed2d766.json

# Pull the manifest for the CycloneDX SBOM in JSON format
oras manifest fetch tsmacrwcusocitest.azurecr.io/flasksample@sha256:198e405344b5fafd6127821970eb4a84129ae729402e8c2c71fc1bb80abf0954 > manifest-sha256-198e405344b5fafd6127821970eb4a84129ae729402e8c2c71fc1bb80abf0954.json

All manifests are available in the acr folder in my container secure supply chain playground repository on GitHub.

Luckily, ACR has CLI commands to list the manifests. Those commands call ACR’s proprietary APIs to gather the information. There are two ACR CLI commands I can use to list the manifests for a repository: acr manifest list and acr manifest list-metadata. At the time of this writing, acr manifest list had a bug and couldn’t list the OCI artifact manifests. acr manifest list-metadata worked fine, and I could list all manifests in the repository. The output from the acr manifest list-metadata command is available here. From the output, I can see that only four manifests were created. There is no index manifest that points to the three artifacts. Here is a visual of how the manifests are related in an OCI 1.1 compliant registry:

To summarize the differences between the OCI 1.0 and OCI 1.1 referrers’ implementation:

  • In OCI 1.0 compliant registries, you will see an additional index manifest that is tagged with the image digest
  • In OCI 1.0 compliant registries, the index manifest lists the artifacts related to the image
  • In OCI 1.0 compliant registries, the artifact manifests still refer to the image using the subject field
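
In registries with OCI 1.1 support, no index tag is needed because a client can ask the registry directly for the artifacts that refer to an image. Here is a minimal sketch of how such a request URL could be composed, assuming the referrers endpoint path from the OCI distribution specification 1.1 (/v2/<name>/referrers/<digest>). The function name referrers_url and the TOKEN variable are hypothetical, and individual registries may require different authentication or respond differently.

```shell
#!/bin/sh
# Build the referrers API URL for a subject (image) digest.
# Assumption: the registry exposes the OCI 1.1 referrers endpoint
#   GET /v2/<name>/referrers/<digest>
referrers_url() {
  registry="$1"
  repository="$2"
  digest="$3"
  printf 'https://%s/v2/%s/referrers/%s\n' "$registry" "$repository" "$digest"
}

# URL for the image pushed to ACR in this post
referrers_url "tsmacrwcusocitest.azurecr.io" "flasksample" \
  "sha256:b89e2098603bead4f07e318e1a4e11b4a4ef1f3614725c88b3fcdd469d55c0e0"

# A client could then query it, for example (TOKEN is a placeholder):
#   curl -s -H "Authorization: Bearer $TOKEN" "$(referrers_url <registry> <repo> <digest>)"
```

The response is expected to be an image index listing the referring manifests, which is likely close to what oras discover renders as a tree.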

I will look at how this impacts the content in your registry in the second part of this series.

Referrers Support Across Registries

Here is a table that shows the current (as of Jan 5th, 2023) support in registries.

The manifests and the debug logs are available in the corresponding registry folders in the cssc-pipeline repository on GitHub. You can refer to those for details.

Note: The investigation is done using the ORAS tool – the only one I am aware of that can create references between artifacts at the time of this writing. It may be possible to craft manifests manually and push them to the registries where ORAS fails.

Learnings

In addition to the above, I learned a few more things while experimenting with different registries.

  • As far as I know, OCI does not specify an API to list untagged manifests in a registry. This can be a problem because the attached artifacts do not have tags but only digests. I am pretty sure I ended up with some orphaned artifacts in the registries that do not support artifact referrers. Unfortunately, I cannot be sure due to the lack of such an API.
  • Registries are inconsistent in their responses when the capabilities are not supported. In my opinion, there is a lack of feedback on what capabilities each registry supports, which makes it hard for the clients. An easy way to check the capabilities of a registry would be beneficial.

In the next post of the series, I will go over more advanced scenarios like promotion between registries and building deeper hierarchies.


In the last post of the series about Sigstore, I will look at the most exciting part of the implementation – ephemeral keys, or what the Sigstore team calls keyless signing. The post will go over the second and third scenarios I outlined in Implementing Containers’ Secure Supply Chain with Sigstore Part 1 – Signing with Existing Keys and go deeper into the experience of validating artifacts and moving artifacts between registries.

Using Sigstore to Sign with Ephemeral Keys

Using Cosign to sign with ephemeral keys is still an experimental feature and will be released in v1.14.0 (see the following PR). Signing with ephemeral keys is relatively easy.

$ COSIGN_EXPERIMENTAL=1 cosign sign 562077019569.dkr.ecr.us-west-2.amazonaws.com/flasksample:v1
Generating ephemeral keys...
Retrieving signed certificate...
 Note that there may be personally identifiable information associated with this signed artifact.
 This may include the email address associated with the account with which you authenticate.
 This information will be used for signing this artifact and will be stored in public transparency logs and cannot be removed later.
 By typing 'y', you attest that you grant (or have permission to grant) and agree to have this information stored permanently in transparency logs.
Are you sure you want to continue? (y/[N]): y
Your browser will now be opened to:
https://oauth2.sigstore.dev/auth/auth?access_type=online&client_id=sigstore&code_challenge=e16i62r65TuJiklImxYFIr32yEsA74fSlCXYv550DAg&code_challenge_method=S256&nonce=2G9cB5h89SqGwYQG2ey5ODeaxO8&redirect_uri=http%3A%2F%2Flocalhost%3A33791%2Fauth%2Fcallback&response_type=code&scope=openid+email&state=2G9cB7iQ7BSXYQdKKe6xGOY2Rk8
Successfully verified SCT...
Warning: Tag used in reference to identify the image. Consider supplying the digest for immutability.
"562077019569.dkr.ecr.us-west-2.amazonaws.com/flasksample" appears to be a private repository, please confirm uploading to the transparency log at "https://rekor.sigstore.dev" [Y/N]: y
tlog entry created with index: 5133131
Pushing signature to: 562077019569.dkr.ecr.us-west-2.amazonaws.com/flasksample

You are sent to authenticate using OpenID Connect (OIDC) via the browser. I used my GitHub account to authenticate.

Once authenticated, you are redirected back to localhost, where Cosign reads the code query string parameter from the URL and verifies the authentication.

Here is what the redirect URL looks like.

http://localhost:43219/auth/callback?code=z6dghpnzujzxn6xmfltyl6esa&state=2G9dbwwf9zCutX3mNevKWVd87wS

I have also pushed v2 and v3 of the image to the registry and signed them using the approach above. Here is the new state in my registry.

| Artifact Tag | Artifact Type | Artifact Digest |
| --- | --- | --- |
| v1 | Image | sha256:9bd049b6b470118cc6a02d58595b86107407c9e288c0d556ce342ea8acbafdf4 |
| sha256-9bd049b6b470118cc6a02d58595b86107407c9e288c0d556ce342ea8acbafdf4.sig | Signature | sha256:483f2a30b765c3f7c48fcc93a7a6eb86051b590b78029a59b5c2d00e97281241 |
| v2 | Image | sha256:d4d59b7e1eb7c55b0811c3dfd3571ab386afbe6d46dfcf83e06343e04ae888cb |
| sha256-d4d59b7e1eb7c55b0811c3dfd3571ab386afbe6d46dfcf83e06343e04ae888cb.sig | Signature | sha256:8c43d1944b4d0c3f0d7d6505ff4d8c93971ebf38fc60157264f957e3532d8fd7 |
| v3 | Image | sha256:2e19bd9d9fb13c356c64c02c574241c978199bfa75fd0f46b62748f59fb84f0a |
| sha256-2e19bd9d9fb13c356c64c02c574241c978199bfa75fd0f46b62748f59fb84f0a.sig | Signature | sha256:cc2a674776dfe5f3e55f497080e7284a5bd14485cbdcf956ba3cf2b2eebc915f |

If you look at the console output, you will also see that one of the lines mentions tlog. This is the index in the Rekor transparency log where the signature’s receipt is stored. For the three signatures that I created, the indexes are:

5133131 for v1
5133528 for v2
and 5133614 for v3

That is it! I have signed my images with ephemeral keys, and I have the tlog entries that correspond to the signatures. It is a fast and easy experience.
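
Since the log index is needed later to look up the entry in Rekor, it can be handy to capture it automatically rather than copy it from the console. Here is a minimal sketch that pulls the index out of already captured Cosign output. The function name extract_tlog_index is hypothetical, and the sketch assumes the message format shown in the console output above; Cosign may write these messages to stderr, so the capture step itself is left out.

```shell
#!/bin/sh
# Read captured Cosign console output on stdin and print the Rekor log index.
# Assumption: the line looks like "tlog entry created with index: <number>".
extract_tlog_index() {
  sed -n 's/.*tlog entry created with index: *\([0-9][0-9]*\).*/\1/p'
}

# Example using two lines from the console output above
printf '%s\n' \
  'tlog entry created with index: 5133131' \
  'Pushing signature to: 562077019569.dkr.ecr.us-west-2.amazonaws.com/flasksample' \
  | extract_tlog_index
```

The printed number can then be passed to rekor-cli get --log-index.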

Verifying Images Signed With Ephemeral Keys

Verifying the images signed with ephemeral keys is built into the Cosign CLI.

$ COSIGN_EXPERIMENTAL=1 cosign verify 562077019569.dkr.ecr.us-west-2.amazonaws.com/flasksample:v1 | jq . > flasksample-v1-ephemeral-verification.json
Verification for 562077019569.dkr.ecr.us-west-2.amazonaws.com/flasksample:v1 --
The following checks were performed on each of these signatures:
- The cosign claims were validated
- Existence of the claims in the transparency log was verified offline
- Any certificates were verified against the Fulcio roots.

The outputs from the verification of flasksample:v1, flasksample:v2, and flasksample:v3 are available on GitHub. A few things to note about the output from the verification:

  • The output JSON contains the logIndex as well as the logID, which I assumed I could use to search for the receipts in Rekor. I have some confusion about the logID purpose, but I will go into that a little later!
  • There is a body field that I assume is the actual signature. This JSON field is not yet documented, and its purpose is hard to infer from such a generic name.
  • The type field seems to be a free text field. I would expect it to be something more structured and the values to come from a list of possible and, most importantly, standardized types.

Search and Explore Rekor

The goal of my second scenario – Sign Container Images with Ephemeral Keys from Fulcio is not only to sign images with ephemeral keys but also to invalidate one of the signed artifacts. Unfortunately, documentation and the help output from the commands are scarce. Also, searching on Google how to invalidate a signature in Rekor yields no results. I decided to start exploring the Rekor logs to see if that may help.

There aren’t many commands that you can use in Rekor. The four things you can do are: get records; search by email, SHA, or artifact; upload an entry or artifact; and verify an entry or artifact. Using the information from the outputs in the previous section, I can get the entries for the three images I signed using the log indexes.

$ rekor-cli get --log-index 5133131 > flasksample-v1-ephemeral-logentry.json
$ rekor-cli get --log-index 5133528 > flasksample-v2-ephemeral-logentry.json
$ rekor-cli get --log-index 5133614 > flasksample-v3-ephemeral-logentry.json

The outputs from the above commands for flasksample:v1, flasksample:v2, and flasksample:v3 are available on GitHub.

I first noted that the log entries are not returned in JSON format by the Rekor CLI. This is different from what Cosign returns and is a bit inconsistent. Second, the log entries output by the Rekor CLI are not the same as the verification outputs returned by Cosign. Cosign verification output provides different information than the Rekor log entry. This raises the question: “How does Cosign get this information?” First, though, let’s see what else Rekor can give me.

I can use Rekor search to find all the log entries that I created. This will include the ones for the three images above and, theoretically, everything else I signed.

$ rekor-cli search --email toddysm_dev1@outlook.com
Found matching entries (listed by UUID):
24296fb24b8ad77aaf485c1d70f4ab76518483d5c7b822cf7b0c59e5aef0e032fb5ff4148d936353
24296fb24b8ad77a3f43ac62c8c7bab7c95951d898f2909855d949ca728ffd3426db12ff55390847
24296fb24b8ad77ac2334dfe2759c88459eb450a739f08f6a16f5fd275431fb42c693974af3d5576
24296fb24b8ad77a8f14877c718e228e315c14f3416dfffa8d5d6ef87ecc4f02f6e7ce5b1d5b4e95
24296fb24b8ad77a6828c6f9141b8ad38a3dca4787ab096dca59d0ba68ff881d6019f10cc346b660
24296fb24b8ad77ad54d6e9bb140780477d8beaf9d0134a45cf2ded6d64e4f0d687e5f30e0bb8c65
24296fb24b8ad77a888dc5890ac4f99fc863d3b39d067db651bf3324674b85a62e3be85066776310
24296fb24b8ad77a47fae5af8718673a2ef951aaf8042277a69e808f8f59b598d804757edab6a294
24296fb24b8ad77a7155046f33fdc71ce4e291388ef621d3b945e563cb29c2e3cd6f14b9ba1b3227
24296fb24b8ad77a5fc1952295b69ca8d6f59a0a7cbfbd30163c3a3c3a294c218f9e00c79652d476

Note that the result lists UUIDs that are different from the logID properties in the verification output JSON. You can get log entries using the UUID or the logIndex but not using the logID. The UUIDs are not present in the Cosign output mentioned in the previous section, while the logID is. However, it is unclear what the logID can be used for and why the UUID is not included in the Cosign output.
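
One detail that may help when comparing identifiers: all ten results share the prefix 24296fb24b8ad77a. As far as I understand Rekor's sharded log layout (an assumption that this post does not confirm), an 80-character identifier is a 16-character tree ID followed by a 64-character entry UUID. Here is a minimal sketch that splits an identifier under that assumption; the function name split_entry_id is hypothetical.

```shell
#!/bin/sh
# Split a Rekor search result into its two assumed parts.
# Assumption: 80 hex characters = 16-character tree ID + 64-character UUID.
split_entry_id() {
  entry_id="$1"
  if [ "${#entry_id}" -ne 80 ]; then
    echo "expected 80 characters, got ${#entry_id}" >&2
    return 1
  fi
  tree_id=$(printf '%s' "$entry_id" | cut -c1-16)
  uuid=$(printf '%s' "$entry_id" | cut -c17-80)
  printf 'treeID=%s\nuuid=%s\n' "$tree_id" "$uuid"
}

# One of the identifiers returned by the search above
split_entry_id "24296fb24b8ad77a8f14877c718e228e315c14f3416dfffa8d5d6ef87ecc4f02f6e7ce5b1d5b4e95"
```

If the assumption holds, the shared prefix identifies the log shard rather than the individual entry, which would explain why it differs from the logID in the Cosign output.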

The Rekor search command supposedly allows you to search by artifact and SHA. However, it is not documented what form those need to take. Using the image name or the image SHA yields no results.

$ rekor-cli search --artifact 562077019569.dkr.ecr.us-west-2.amazonaws.com/flasksample
Error: invalid argument "562077019569.dkr.ecr.us-west-2.amazonaws.com/flasksample" for "--artifact" flag: Key: '' Error:Field validation for '' failed on the 'url' tag
$ rekor-cli search --sha sha256:9bd049b6b470118cc6a02d58595b86107407c9e288c0d556ce342ea8acbafdf4
no matching entries found
$ rekor-cli search --sha 9bd049b6b470118cc6a02d58595b86107407c9e288c0d556ce342ea8acbafdf4
no matching entries found

I think the above are the core search scenarios for container images (and other artifacts), but it seems they are either not implemented or not documented. Neither the Rekor GitHub repository, the Rekor public documentation, nor the Rekor Swagger has any more details on the search. I filed an issue for Rekor to ask how the artifacts search works.

Coming back to the main goal of invalidating a signed artifact, I couldn’t find any documentation on how to do that. The only apparent options to invalidate the artifacts are either uploading something to Rekor or removing the signature from Rekor. I looked at all options to upload entries or artifacts to Rekor, but the documentation mainly describes how to sign and upload entries using other types like SSH, X509, etc. It does seem to me that there is no capability in Rekor to say: “This artifact is not valid anymore”.

I thought that looking at how Rekor verifies signatures may help me understand the approach.

Verifying Signatures Using Rekor CLI

I decided to explore how the signatures are verified and reverse engineer the process to understand if an artifact signature can be invalidated. Rekor CLI has a verify command. My assumption was that Rekor’s verify command worked the same as the Cosign verify command. Unfortunately, that is not the case.

$ rekor-cli verify --artifact 562077019569.dkr.ecr.us-west-2.amazonaws.com/flasksample:v1
Error: invalid argument "562077019569.dkr.ecr.us-west-2.amazonaws.com/flasksample:v1" for "--artifact" flag: Key: '' Error:Field validation for '' failed on the 'url' tag
$ rekor-cli verify --entry 24296fb24b8ad77a8f14877c718e228e315c14f3416dfffa8d5d6ef87ecc4f02f6e7ce5b1d5b4e95
Error: invalid argument "24296fb24b8ad77a8f14877c718e228e315c14f3416dfffa8d5d6ef87ecc4f02f6e7ce5b1d5b4e95" for "--entry" flag: Key: '' Error:Field validation for '' failed on the 'url' tag

Unfortunately, due to a lack of documentation and examples, I wasn’t able to figure out how this worked without browsing the code. While that kind of digging is always an option, I would expect an easier experience as an end user.

I was made aware of the following blog post, though. It describes how to handle account compromise. To put it in context, if my GitHub account is compromised, this blog post describes the steps I need to take to invalidate the artifacts. I do have two problems with this proposal:

  1. As you remember, in my scenario, I wanted to invalidate only the flasksample:v2 artifact, and not all artifacts signed with my account. If I follow the proposal in the blog post, I will invalidate everything signed with my GitHub account, which may result in outages.
  2. The proposal relies on the consumer of artifacts to constantly monitor the news for what is valid and what is not; which GitHub account is compromised and which one is not. This is unrealistic and puts too much manual burden on the consumer of artifacts. In an ideal scenario, I would expect the technology to solve this with a proactive way to notify the users if something is wrong rather than expect them to learn reactively.

At this point in time, I will call this scenario incomplete. Yes, I am able to sign with ephemeral keys, but this doesn’t seem unique in this situation. The ease around the key generation is what they seem to be calling attention to, and it does make signing much less intimidating to new users, but I could still generate a new SSH or GPG key every time I need to sign something. Trusting Fulcio’s root does not automatically increase my security – I would even argue the opposite. Making it easier for everybody to sign does not increase security, either. Let’s Encrypt already proved that. While Let’s Encrypt made an enormous contribution to our privacy and helped secure every small business site, the ease and accessibility with which it works mean that every malicious site now also has a certificate. The lock in the address bar is no longer a sign of security. We are all excited about the benefits, but I bet very few of us are also excited for this to help the bad guys. We need to think beyond the simple signing and ensure that the whole end-to-end experience is secure.

I will move to the last scenario now.

Promoting Sigstore Signed Images Between Registries

In the last scenario I wanted to test the promotion of images between registries. Let’s create a v4 of the image and sign it using an ephemeral key. Here are the commands with the omitted output.

$ docker build -t 562077019569.dkr.ecr.us-west-2.amazonaws.com/flasksample:v4 .
$ docker push 562077019569.dkr.ecr.us-west-2.amazonaws.com/flasksample:v4
$ COSIGN_EXPERIMENTAL=1 cosign sign 562077019569.dkr.ecr.us-west-2.amazonaws.com/flasksample:v4

The Rekor log index for the signature is 5253114. I can use Crane to copy the image and the signature from AWS ECR into Azure ACR.

$ crane copy 562077019569.dkr.ecr.us-west-2.amazonaws.com/flasksample:v4 tsmacrtestcssc.azurecr.io/flasksample:v4
$ crane copy 562077019569.dkr.ecr.us-west-2.amazonaws.com/flasksample:sha256-aa2690ed4a407ac8152d24017eb6955b01cbb0fc44afe170dadedc30da80640a.sig tsmacrtestcssc.azurecr.io/flasksample:sha256-aa2690ed4a407ac8152d24017eb6955b01cbb0fc44afe170dadedc30da80640a.sig
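
The signature tag in the second command does not have to be typed by hand. Here is a minimal sketch that derives it from the image digest, assuming the sha256-<hex>.sig naming visible in the commands above. The function name cosign_sig_tag is hypothetical; Cosign also has a triangulate command that reports where a signature is stored, which may be the more robust option. The sketch only prints the copy command instead of running it.

```shell
#!/bin/sh
# Derive the tag under which Cosign stores the signature for an image digest.
# Assumption: the tag is the digest with ':' replaced by '-' plus ".sig".
cosign_sig_tag() {
  printf '%s.sig\n' "$(printf '%s' "$1" | tr ':' '-')"
}

SRC="562077019569.dkr.ecr.us-west-2.amazonaws.com/flasksample"
DST="tsmacrtestcssc.azurecr.io/flasksample"
DIGEST="sha256:aa2690ed4a407ac8152d24017eb6955b01cbb0fc44afe170dadedc30da80640a"

SIG_TAG=$(cosign_sig_tag "$DIGEST")

# Print the copy command for the signature (dry run)
echo "crane copy $SRC:$SIG_TAG $DST:$SIG_TAG"
```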

Also, let’s validate the ephemeral key signature using the image in Azure ACR.

$ COSIGN_EXPERIMENTAL=1 cosign verify tsmacrtestcssc.azurecr.io/flasksample:v4 | jq .
Verification for tsmacrtestcssc.azurecr.io/flasksample:v4 --
The following checks were performed on each of these signatures:
 - The cosign claims were validated
 - Existence of the claims in the transparency log was verified offline
 - Any certificates were verified against the Fulcio roots.

Next, I will sign the image with a key stored in Azure Key Vault and verify the signature.

$ cosign sign --key azurekms://tsm-kv-usw3-tst-cssc.vault.azure.net/sigstore-azure-test-key-ec tsmacrtestcssc.azurecr.io/flasksample:v4
Warning: Tag used in reference to identify the image. Consider supplying the digest for immutability.
Pushing signature to: tsmacrtestcssc.azurecr.io/flasksample
$ cosign verify --key azurekms://tsm-kv-usw3-tst-cssc.vault.azure.net/sigstore-azure-test-key-ec tsmacrtestcssc.azurecr.io/flasksample:v4
Verification for tsmacrtestcssc.azurecr.io/flasksample:v4 --
The following checks were performed on each of these signatures:
 - The cosign claims were validated
 - The signatures were verified against the specified public key
[{"critical":{"identity":{"docker-reference":"tsmacrtestcssc.azurecr.io/flasksample"},"image":{"docker-manifest-digest":"sha256:aa2690ed4a407ac8152d24017eb6955b01cbb0fc44afe170dadedc30da80640a"},"type":"cosign container image signature"},"optional":null}]

Everything worked as expected. This scenario was very smooth, and I was able to complete it in less than a minute.
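
After a promotion like this, it may be worth confirming that the verified payload refers to the same image digest that was copied from the source registry. Here is a minimal sketch that extracts the docker-manifest-digest value from the compact JSON that cosign verify prints, using only sed. The function name payload_digest is hypothetical; for pretty-printed JSON, a jq filter such as .[].critical.image["docker-manifest-digest"] would likely be more robust.

```shell
#!/bin/sh
# Read a compact cosign verify payload on stdin and print the image digest.
# Assumption: the payload contains "docker-manifest-digest":"sha256:<64 hex>".
payload_digest() {
  sed -n 's/.*"docker-manifest-digest":"\(sha256:[0-9a-f]\{64\}\)".*/\1/p'
}

# Payload taken from the verification output above
PAYLOAD='[{"critical":{"identity":{"docker-reference":"tsmacrtestcssc.azurecr.io/flasksample"},"image":{"docker-manifest-digest":"sha256:aa2690ed4a407ac8152d24017eb6955b01cbb0fc44afe170dadedc30da80640a"},"type":"cosign container image signature"},"optional":null}]'
EXPECTED="sha256:aa2690ed4a407ac8152d24017eb6955b01cbb0fc44afe170dadedc30da80640a"

ACTUAL=$(printf '%s\n' "$PAYLOAD" | payload_digest)

if [ "$ACTUAL" = "$EXPECTED" ]; then
  echo "digest matches the promoted image"
else
  echo "digest mismatch: $ACTUAL" >&2
fi
```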

Summary

So far, I have just scratched the surface of what the Sigstore project could accomplish. While going through the scenarios in these posts, I had a bunch of other thoughts, so I wanted to highlight a few below:

  • Sigstore is built on a good idea to leverage ephemeral keys for signing container images (and other software). However, just the ephemeral keys alone do not provide higher security if there is no better process to invalidate the signed artifacts. With traditional X509 certificates, one can use CRL (Certificate Revocation Lists) or OCSP (Online Certificate Status Protocol) to revoke certificates. Although they are critiqued a lot, the process of invalidating artifacts using ephemeral keys and Sigstore does not seem like an improvement at the moment. I look forward to the improvements in this area as further discussions happen.
  • Sigstore, like nearly all open-source projects, would benefit greatly from better documentation and consistency in the implementation. Inconsistent messages, undocumented features, myriad JSON schemas, multiple identifiers used for different purposes, variable naming conventions in JSONs, and unpredictable output from the command line tools are just a few things that can be improved. I understand that some of the implementation was driven by requirements to work with legacy registries but going forward, that can be simplified by using OCI references. The bigger the project grows, the harder it will become to fix those.
  • The experience that Cosign offers is what makes the project successful. Signing and verifying images using the legacy X.509 and the ephemeral keys is easy. Hiding the complexity behind a simple CLI is a great strategy to get adoption.

I tested Sigstore a year ago and asked myself the question: “How do I solve the SolarWinds exploit with Sigstore?” Unfortunately, Sigstore doesn’t make it easier to solve that problem yet. Bearing in mind my experience above, I would expect a lot of changes in the future as Sigstore matures.

Unfortunately, there is no viable alternative to Sigstore on the market today. Notary v1 (or Docker Content Trust) proved not flexible enough. Notary v2 is still in the works and has yet to show what it can do. However, the lack of alternatives does not automatically mean that we avoid the due diligence required for a security product of such importance.  Sigstore has had a great start, and this series proves to me that we’ve got a lot of work ahead of us as an industry to solve our software supply chain problems.

Today, the secure supply chain for software is top of mind for every CISO and enterprise leader. After the President’s Executive Order (EO), many efforts were spun off to secure the supply chain. One of the most prominent is, of course, Sigstore. I looked at Sigstore more than a year ago and was excited about the idea of ephemeral keys. I thought it might solve some common problems with signing, for example, reducing the blast radius if a signing key is compromised or a signing identity is stolen.

Over the past twelve months, I’ve spent a lot of time working on a secure supply chain for containers at Microsoft and gained a deep knowledge of the use cases and myriad of scenarios. At the same time, Sigstore gained popularity, and more and more companies started using it to secure their container supply chains. I’ve followed the project development and the growth in popularity. In recent weeks, I decided to take another deep look at the technology and evaluate how it will perform against a few core scenarios to secure container images against supply chain attacks.

This will be a three-part series going over the Sigstore experience for signing containers. In the first part, I will look at the experience of signing with existing long-lived keys as well as adding attestations like SBOMs and SLSA provenance documents. In the second part, I will go deeper into the artifacts created during the signing and reverse-engineer their purpose. In the third part, I will look at the signing experience with short-lived keys as well as promoting signatures between registries.

Before that, though, let’s look at some scenarios that I will use to guide my experiment.

Containers’ Supply Chain Scenarios

Every technology implementation (I believe) should start with user scenarios. Signing container images is not a complete scenario but a part of a larger experience. Below are the experiences that I would like to test as part of my experiment. Also, I will do this using the top two cloud vendors – AWS and Azure.

Sign Container Images With Existing Keys Stored in a KMS

In this scenario, I will sign the images with keys that are already stored in my cloud key management systems (AWS KMS or Azure Key Vault). The goal here is to enable enterprises to use existing keys and key management infrastructure for signing. Many enterprises already use this legacy signing scenario for their software, so there is nothing revolutionary here except the additional artifacts.

  1. Build a v1 of a test image
  2. Push the v1 of the test image to a registry
  3. Sign the image with a key stored in a cloud KMS
  4. Generate an SBOM for the container image
  5. Sign the SBOM and push it to the registry
  6. Generate an SLSA provenance attestation
  7. Sign the SLSA provenance attestation and push it to the registry
  8. Pull and validate the SBOM
  9. Pull and validate the SLSA provenance attestation

A note! I will cheat with the SLSA provenance attestations because the SLSA tooling works better in CI/CD pipelines than with manual Docker build commands that I will use for my experiment.

Sign Container Images with Ephemeral Keys from Fulcio

In this scenario, I will test how the signing with ephemeral keys (what Sigstore calls keyless signing) improves the security of the containers’ supply chain. Keyless signing is a bit misleading because keys are still involved in generating the signature. The difference is that the keys are generated on-demand by Fulcio and have a short lifespan (I believe 10 min or so). I will not generate SBOMs and SLSA provenance attestations for this second scenario, but you can assume that this may also be part of it in a real-life application. Here is what I will do:

  1. Build a v1 of a test image
  2. Push the v1 of the test image to a registry
  3. Sign the image with an ephemeral key
  4. Build a v2 of the test image and repeat steps 2 and 3 for it
  5. Build a v3 of the test image and repeat steps 2 and 3 for it
  6. Invalidate the signature for v2 of the test image

The premise of this scenario is to test a temporary exploit of the pipeline. This is what happened with SolarWinds Supply Chain Compromise, and I would like to understand how we might be able to use Sigstore to prevent such an attack in the future or how it could reduce the blast radius. I don’t want to invalidate the signatures for v1 and v3 because this will be similar to the traditional signing approach with long-lived keys.

Acquire OSS Container Image and Re-Sign for Internal Use

This is a common scenario that I’ve heard from many customers. They import images from public registries, verify them, scan them, and then want to re-sign them with internal keys before allowing them for use. So, here is what I will do:

  1. Build an image
  2. Push it to the registry
  3. Sign it with an ephemeral key
  4. Import the image and the signature from one registry (ECR) into another (ACR)
    Those steps will simulate importing an image signed with an ephemeral key from an OSS registry like Docker Hub or GitHub Container Registry.
  5. Sign the image with a key from the cloud KMS
  6. Validate the signature with the cloud KMS certificate

Let’s get started with the experience.

Environment Set Up

To run the commands below, you will need to have AWS and Azure accounts. I have already created container registries and set up asymmetric keys for signing in both cloud vendors. I will not go over the steps for setting those up – you can follow the vendor’s documentation for that. I have also set up AWS and Azure CLIs so I can sign into the registries, run other commands against the registries and retrieve the keys from the command line. Once again, you can follow the vendor’s documentation to do that. Now, let’s go over the steps to set up Sigstore tooling.

Installing Sigstore Tooling

To go over the scenarios above, I will need to install the Cosign and Rekor CLIs. Cosign is used to sign the images and also interacts with Fulcio to obtain the short-lived certificates for the ephemeral signing keys. Rekor is the transparency log that keeps a record of the signatures done by Cosign using ephemeral keys.

When setting up automation for either signing or signature verification, Cosign is the only tool you need to install. If you need to add or retrieve Rekor records that are not related to signing or attestation, you will also need to install the Rekor CLI.

You have several options to install Cosign CLI; however, the only documented option to install Rekor CLI is using Golang or building from source (for which you need Golang). One note: the installation instructions for all Sigstore tools are geared toward Golang developers.

The next thing is that on the Sigstore documentation site, I couldn’t find information on how to verify that the Cosign binaries I installed were the ones that the Sigstore team produced. And the last thing that I noticed after installing the CLIs is the details I got about the binaries. Running cosign version and rekor-cli version gives the following output.

$ cosign version
  ______   ______        _______. __    _______ .__   __.
 /      | /  __  \      /       ||  |  /  _____||  \ |  |
|  ,----'|  |  |  |    |   (----`|  | |  |  __  |   \|  |
|  |     |  |  |  |     \   \    |  | |  | |_ | |  . `  |
|  `----.|  `--'  | .----)   |   |  | |  |__| | |  |\   |
 \______| \______/  |_______/    |__|  \______| |__| \__|
cosign: A tool for Container Signing, Verification and Storage in an OCI registry.

GitVersion:    1.13.0
GitCommit:     6b9820a68e861c91d07b1d0414d150411b60111f
GitTreeState:  "clean"
BuildDate:     2022-10-07T04:37:47Z
GoVersion:     go1.19.2
Compiler:      gc
Platform:      linux/amd64
$ rekor-cli version
  ____    _____   _  __   ___    ____             ____   _       ___
 |  _ \  | ____| | |/ /  / _ \  |  _ \           / ___| | |     |_ _|
 | |_) | |  _|   | ' /  | | | | | |_) |  _____  | |     | |      | |
 |  _ <  | |___  | . \  | |_| | |  _ <  |_____| | |___  | |___   | |
 |_| \_\ |_____| |_|\_\  \___/  |_| \_\          \____| |_____| |___|
rekor-cli: Rekor CLI

GitVersion:    v0.12.2
GitCommit:     unknown
GitTreeState:  unknown
BuildDate:     unknown
GoVersion:     go1.18.2
Compiler:      gc
Platform:      linux/amd64

Cosign CLI provides details about the build of the binary, while Rekor CLI does not. Using the above process to install the binaries may seem insecure, but this seems to be by design, as explained in Sigstore Issue #2300: Verify the binary downloads when installing from .deb (or any other binary release).

Here is the catch, though! I looked at the above experience as a novice user going through the Sigstore documentation. Of course, as with any other technical documentation, this one is incomplete and not up to date with the implementation. There is no documentation on how to verify the Cosign binary, but there is one describing how to verify Rekor binaries. If you go to the Sigstore GitHub organization and specifically to the Cosign and Rekor release pages, you will see that they’ve published the signatures and the SBOMs for both tools. You will also find binaries for Rekor that you can download. So you can verify the signature of the release binaries before installing. Here is what I did for the Rekor CLI version that I had downloaded:

$ COSIGN_EXPERIMENTAL=1 cosign verify-blob \
    --cert https://github.com/sigstore/rekor/releases/download/v0.12.2/rekor-cli-linux-amd64-keyless.pem \
    --signature https://github.com/sigstore/rekor/releases/download/v0.12.2/rekor-cli-linux-amd64-keyless.sig \
    https://github.com/sigstore/rekor/releases/download/v0.12.2/rekor-cli-linux-amd64

tlog entry verified with uuid: 38665ab8dc42600de87ed9374e86c83ac0d7d11f1a3d1eaf709a8ba0d9a7e781 index: 4228293
Verified OK

Verifying the Cosign binary is trickier, though, because you need to have Cosign already installed to verify it. Here is the output if you already have Cosign installed and you want to move to a newer version:

$ COSIGN_EXPERIMENTAL=1 cosign verify-blob \
    --cert https://github.com/sigstore/cosign/releases/download/v1.13.0/cosign-linux-amd64-keyless.pem \
    --signature https://github.com/sigstore/cosign/releases/download/v1.13.0/cosign-linux-amd64-keyless.sig \
    https://github.com/sigstore/cosign/releases/download/v1.13.0/cosign-linux-amd64

tlog entry verified with uuid: 6f1153edcc399b22b016709a218127fc7d5e9fb7071cd4812a9847bf13f65190 index: 4639787
Verified OK

If you are installing Cosign for the first time and downloading the binaries from the release page, you can follow a process similar to the one for verifying Rekor releases. I have submitted an issue to update the Cosign documentation with release verification instructions.
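Because the verification commands for both tools follow the same URL pattern, they are easy to script. Here is a minimal Python sketch that assembles the cosign verify-blob command line for a release asset. The function names are mine, and the assumption that every release publishes matching -keyless.pem and -keyless.sig files next to the binary is based only on the two releases shown above, not on anything Sigstore promises.

```python
import os
import subprocess

# URL pattern taken from the Rekor v0.12.2 and Cosign v1.13.0 commands above.
RELEASE_URL = "https://github.com/sigstore/{tool}/releases/download/{version}/{asset}"


def verify_blob_command(tool: str, version: str, asset: str) -> list:
    """Build the `cosign verify-blob` argv for a Sigstore release asset.

    Assumption: the release publishes <asset>-keyless.pem and
    <asset>-keyless.sig alongside the binary.
    """
    base = RELEASE_URL.format(tool=tool, version=version, asset=asset)
    return [
        "cosign", "verify-blob",
        "--cert", base + "-keyless.pem",
        "--signature", base + "-keyless.sig",
        base,
    ]


def verify_release(tool: str, version: str, asset: str) -> bool:
    """Run the command; needs cosign on PATH and network access."""
    # COSIGN_EXPERIMENTAL=1 is what the Cosign 1.x commands above set.
    env = dict(os.environ, COSIGN_EXPERIMENTAL="1")
    result = subprocess.run(verify_blob_command(tool, version, asset), env=env)
    return result.returncode == 0


if __name__ == "__main__":
    # Only prints the command; call verify_release() to actually run it.
    print(" ".join(verify_blob_command("rekor", "v0.12.2", "rekor-cli-linux-amd64")))
```

Keeping the command construction separate from the execution makes it possible to review the exact command before anything is downloaded or run.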

I would rate the installation experience no worse than any other tool geared toward hardcore engineers.

Let’s get into the scenarios.

Using Cosign to Sign Container Images with a KMS Key

Here are the two images that I will use for the first scenario:

$ docker images
REPOSITORY                                                 TAG       IMAGE ID       CREATED         SIZE
562077019569.dkr.ecr.us-west-2.amazonaws.com/flasksample   v1        b40ba874cb57   2 minutes ago   138MB
tsmacrtestcssc.azurecr.io/flasksample                      v1        b40ba874cb57   2 minutes ago   138MB

Using Cosign With a Key Stored in AWS KMS

Let’s go over the AWS experience first.

# Sign into the registry
$ aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin 562077019569.dkr.ecr.us-west-2.amazonaws.com
Login Succeeded

# And push the image after that
$ docker push 562077019569.dkr.ecr.us-west-2.amazonaws.com/flasksample:v1

Signing the container image with the AWS key was relatively easy. Though, be careful when you omit the host and make sure you add that third slash; otherwise, you will get errors. Here is what I got on the first two attempts, which puzzled me a little.

$ cosign sign --key awskms://61c124fb-bf47-4f95-a805-65dda7cd08ae 562077019569.dkr.ecr.us-west-2.amazonaws.com/flasksample:v1
Error: signing [562077019569.dkr.ecr.us-west-2.amazonaws.com/flasksample:v1]: getting signer: reading key: kms get: kms specification should be in the format awskms://[ENDPOINT]/[ID/ALIAS/ARN] (endpoint optional)
main.go:62: error during command execution: signing [562077019569.dkr.ecr.us-west-2.amazonaws.com/flasksample:v1]: getting signer: reading key: kms get: kms specification should be in the format awskms://[ENDPOINT]/[ID/ALIAS/ARN] (endpoint optional)

$ cosign sign --key awskms://arn:aws:kms:us-west-2:562077019569:key/61c124fb-bf47-4f95-a805-65dda7cd08ae 562077019569.dkr.ecr.us-west-2.amazonaws.com/flasksample:v1
Warning: Tag used in reference to identify the image. Consider supplying the digest for immutability.
Error: signing [562077019569.dkr.ecr.us-west-2.amazonaws.com/flasksample:v1]: recursively signing: signing digest: getting fetching default hash function: getting public key: operation error KMS: GetPublicKey, failed to parse endpoint URL: parse "https://arn:aws:kms:us-west-2:562077019569:key": invalid port ":key" after host
main.go:62: error during command execution: signing [562077019569.dkr.ecr.us-west-2.amazonaws.com/flasksample:v1]: recursively signing: signing digest: getting fetching default hash function: getting public key: operation error KMS: GetPublicKey, failed to parse endpoint URL: parse "https://arn:aws:kms:us-west-2:562077019569:key": invalid port ":key" after host

Of course, when I typed the URIs correctly, the image was signed, and the signature got pushed to the registry.

$ cosign sign --key awskms:///61c124fb-bf47-4f95-a805-65dda7cd08ae 562077019569.dkr.ecr.us-west-2.amazonaws.com/flasksample:v1
Warning: Tag used in reference to identify the image. Consider supplying the digest for immutability.
Pushing signature to: 562077019569.dkr.ecr.us-west-2.amazonaws.com/flasksample

Interestingly, I didn’t get the tag warning when using the Key ID incorrectly. I got it when I used the ARN incorrectly as well as when I used the Key ID correctly. Also, I struggled to interpret the error messages, which made me wonder about the consistency of the implementation, but I will cover more about that in the conclusions.

One nice thing was that I was able to copy the Key ID and the Key ARN and directly paste them into the URI without modification. Unfortunately, this was not the case with Azure Key Vault 🙁 .
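To make the three-slash rule harder to get wrong, the URI can be built programmatically. The sketch below is a hypothetical helper (the name aws_kms_uri is mine) that follows the awskms://[ENDPOINT]/[ID/ALIAS/ARN] format quoted in the Cosign error message above, with the endpoint left empty by default.

```python
def aws_kms_uri(key: str, endpoint: str = "") -> str:
    """Build a Cosign key reference in the form awskms://[ENDPOINT]/[ID/ALIAS/ARN].

    `key` is the Key ID, alias, or ARN copied as-is from AWS.
    """
    if not key:
        raise ValueError("a key ID, alias, or ARN is required")
    # With an empty endpoint the result has three slashes: awskms:///<key>
    return f"awskms://{endpoint}/{key}"


if __name__ == "__main__":
    print(aws_kms_uri("61c124fb-bf47-4f95-a805-65dda7cd08ae"))
    print(aws_kms_uri(
        "arn:aws:kms:us-west-2:562077019569:key/61c124fb-bf47-4f95-a805-65dda7cd08ae"
    ))
```

Both the Key ID and the ARN go into the same position unchanged, which matches the copy-and-paste experience described above.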

Using Cosign to Sign Container Images With Azure Key Vault Key

According to the Cosign documentation, I had to set three environment variables to use keys stored in Azure Key Vault. It looks as if service principal is the only authentication option that Cosign implemented. So, I created one and gave it all the necessary permissions to Key Vault. I’ve also set the required environment variables with the service principal credentials.

As I hinted above, my first attempt to sign with a key stored in Azure Key Vault failed. Unlike the AWS experience, copying the key identifier from the Azure Portal and pasting it into the URI (without the https:// part) won’t do the job.

$ cosign sign --key azurekms://tsm-kv-usw3-tst-cssc.vault.azure.net/keys/sigstore-azure-test-key/91ca3fb133614790a51fc9c04bd96890 tsmacrtestcssc.azurecr.io/flasksample:v1
Error: signing [tsmacrtestcssc.azurecr.io/flasksample:v1]: getting signer: reading key: kms get: kms specification should be in the format azurekms://[VAULT_NAME][VAULT_URL]/[KEY_NAME]
main.go:62: error during command execution: signing [tsmacrtestcssc.azurecr.io/flasksample:v1]: getting signer: reading key: kms get: kms specification should be in the format azurekms://[VAULT_NAME][VAULT_URL]/[KEY_NAME]

If you decipher the help text that you get from the error message: kms specification should be in the format azurekms://[VAULT_NAME][VAULT_URL]/[KEY_NAME], you would assume that there are two ways to construct the URI:

  1. Using the key vault name and the key name like this
    azurekms://tsm-kv-usw3-tst-cssc/sigstore-azure-test-key
    The assumption is that Cosign automatically appends .vault.azure.net at the end.
  2. Using the key vault hostname (not URL or identifier) and the key name like this
    azurekms://tsm-kv-usw3-tst-cssc.vault.azure.net/sigstore-azure-test-key

The first one just hung for minutes and did not complete. I’ve tried it several times, but the behavior was consistent.

$ cosign sign --key azurekms://tsm-kv-usw3-tst-cssc/sigstore-azure-test-key tsmacrtestcssc.azurecr.io/flasksample:v1
Warning: Tag used in reference to identify the image. Consider supplying the digest for immutability.
^C
$

I assume the problem is that it tries to connect to a host named tsm-kv-usw3-tst-cssc, and the connection attempt never timed out. The second option, with the hostname, brought me a step further. It seems that the call to Azure Key Vault was made, and I got the following error:

$ cosign sign --key azurekms://tsm-kv-usw3-tst-cssc.vault.azure.net/sigstore-azure-test-key tsmacrtestcssc.azurecr.io/flasksample:v1
Warning: Tag used in reference to identify the image. Consider supplying the digest for immutability.
Error: signing [tsmacrtestcssc.azurecr.io/flasksample:v1]: recursively signing: signing digest: signing the payload: keyvault.BaseClient#Sign: Failure responding to request: StatusCode=403 -- Original Error: autorest/azure: Service returned an error. Status=403 Code="Forbidden" Message="The user, group or application 'appid=04b07795-xxxx-xxxx-xxxx-02f9e1bf7b46;oid=f4650a81-f57d-4fb3-870c-e84fe859f68a;numgroups=1;iss=https://sts.windows.net/08c1c649-bfdd-439e-8e5b-5ff31c72ce4e/' does not have keys sign permission on key vault 'tsm-kv-usw3-tst-cssc;location=westus3'. For help resolving this issue, please see https://go.microsoft.com/fwlink/?linkid=2125287" InnerError={"code":"ForbiddenByPolicy"}
main.go:62: error during command execution: signing [tsmacrtestcssc.azurecr.io/flasksample:v1]: recursively signing: signing digest: signing the payload: keyvault.BaseClient#Sign: Failure responding to request: StatusCode=403 -- Original Error: autorest/azure: Service returned an error. Status=403 Code="Forbidden" Message="The user, group or application 'appid=04b07795-xxxx-xxxx-xxxx-02f9e1bf7b46;oid=f4650a81-f57d-4fb3-870c-e84fe859f68a;numgroups=1;iss=https://sts.windows.net/08c1c649-bfdd-439e-8e5b-5ff31c72ce4e/' does not have keys sign permission on key vault 'tsm-kv-usw3-tst-cssc;location=westus3'. For help resolving this issue, please see https://go.microsoft.com/fwlink/?linkid=2125287" InnerError={"code":"ForbiddenByPolicy"}

Now, this was a very surprising error. And mainly because the AppId from the error message (04b07795-xxxx-xxxx-xxxx-02f9e1bf7b46) didn’t match the AppId (or Client ID) of the environment variable that I have set as per the Cosign documentation.

$ echo $AZURE_CLIENT_ID
a59eaa16-xxxx-xxxx-xxxx-dca100533b89

Note that I masked parts of the IDs for privacy reasons.

My first assumption was that the AppId from the error message was for my user account, with which I signed in using Azure CLI. This assumption turned out to be true. Not knowing the intended behavior, I filed an issue for the Sigstore team to clarify and document the Azure Key Vault authentication behavior. After restarting the terminal (it seems that restarting is the norm in today’s software products 😉 ), I was able to move another step forward. Now, having only signed in with the service principal credentials, I got the following error:

$ cosign sign --key azurekms://tsm-kv-usw3-tst-cssc.vault.azure.net/sigstore-azure-test-key tsmacrtestcssc.azurecr.io/flasksample:v1
Warning: Tag used in reference to identify the image. Consider supplying the digest for immutability.
Error: signing [tsmacrtestcssc.azurecr.io/flasksample:v1]: recursively signing: signing digest: signing the payload: keyvault.BaseClient#Sign: Failure responding to request: StatusCode=400 -- Original Error: autorest/azure: Service returned an error. Status=400 Code="BadParameter" Message="Key and signing algorithm are incompatible. Key https://tsm-kv-usw3-tst-cssc.vault.azure.net/keys/sigstore-azure-test-key/91ca3fb133614790a51fc9c04bd96890 is of type 'RSA', and algorithm 'ES256' can only be used with a key of type 'EC' or 'EC-HSM'."
main.go:62: error during command execution: signing [tsmacrtestcssc.azurecr.io/flasksample:v1]: recursively signing: signing digest: signing the payload: keyvault.BaseClient#Sign: Failure responding to request: StatusCode=400 -- Original Error: autorest/azure: Service returned an error. Status=400 Code="BadParameter" Message="Key and signing algorithm are incompatible. Key https://tsm-kv-usw3-tst-cssc.vault.azure.net/keys/sigstore-azure-test-key/91ca3fb133614790a51fc9c04bd96890 is of type 'RSA', and algorithm 'ES256' can only be used with a key of type 'EC' or 'EC-HSM'."

Apparently, I have generated an incompatible key! Note that RSA keys are not supported by Cosign, as I documented in the following Sigstore documentation issue. After generating a new key, the signing finally succeeded.

$ cosign sign --key azurekms://tsm-kv-usw3-tst-cssc.vault.azure.net/sigstore-azure-test-key-ec tsmacrtestcssc.azurecr.io/flasksample:v1
Warning: Tag used in reference to identify the image. Consider supplying the digest for immutability.
Pushing signature to: tsmacrtestcssc.azurecr.io/flasksample
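Since the key identifier copied from the Azure Portal cannot be pasted directly, a small conversion helper can save a few failed attempts. The following Python sketch turns a Key Vault key identifier URL into the azurekms://<vault hostname>/<key name> form that worked in my experiment. The function name is hypothetical, and the mapping is only what I observed above with Cosign 1.13.0, not a documented contract.

```python
from urllib.parse import urlparse


def azure_kms_uri(key_identifier: str) -> str:
    """Convert a Key Vault key identifier into azurekms://<vault host>/<key name>.

    The identifier from the Azure Portal looks like
    https://<vault>.vault.azure.net/keys/<key-name>/<version>. The form that
    worked above drops the scheme, the "keys" segment, and the version.
    """
    parsed = urlparse(key_identifier)
    parts = [p for p in parsed.path.split("/") if p]
    if not parsed.hostname or len(parts) < 2 or parts[0] != "keys":
        raise ValueError(f"not a Key Vault key identifier: {key_identifier!r}")
    return f"azurekms://{parsed.hostname}/{parts[1]}"


if __name__ == "__main__":
    print(azure_kms_uri(
        "https://tsm-kv-usw3-tst-cssc.vault.azure.net/keys/"
        "sigstore-azure-test-key/91ca3fb133614790a51fc9c04bd96890"
    ))
```

Note that the helper expects the full identifier including https://, and that it does not check the key type; the key still needs to be an EC key for the signing to succeed.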

OK! I was able to get through the first three steps of Scenario 1: Sign Container Images With Existing Keys from KMS. Next, I will add some other artifacts to the image – aka attestations. I will use only one of the cloud vendors for that because I don’t expect differences in the experience.

Adding SBOM Attestation With Cosign

Using Syft, I can generate an SBOM for the container image that I have built. Then I can use Cosign to sign and push the SBOM to the registry. Keep in mind that you need to be signed into the registry to generate the SBOM. Below are the steps to generate the SBOM (nothing to do with Cosign). The SBOM generated is also available in my GitHub test repo.

# Sign into AWS ECR
$ aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin 562077019569.dkr.ecr.us-west-2.amazonaws.com

# Generate the SBOM
$ syft packages 562077019569.dkr.ecr.us-west-2.amazonaws.com/flasksample:v1 -o spdx-json > flasksample-v1.spdx

Cosign CLI’s help shows the following usage for adding an attestation to an image using an AWS KMS key.

cosign attest --predicate <FILE> --type <TYPE> --key awskms://[ENDPOINT]/[ID/ALIAS/ARN] <IMAGE>

When I was running this test, there was no explanation of what the --type <TYPE> parameter was. I decided just to give it a try.

$ cosign attest --predicate flasksample-v1.spdx --type sbom --key awskms:///arn:aws:kms:us-west-2:562077019569:key/61c124fb-bf47-4f95-a805-65dda7cd08ae 562077019569.dkr.ecr.us-west-2.amazonaws.com/flasksample:v1
Error: signing 562077019569.dkr.ecr.us-west-2.amazonaws.com/flasksample:v1: invalid predicate type: sbom
main.go:62: error during command execution: signing 562077019569.dkr.ecr.us-west-2.amazonaws.com/flasksample:v1: invalid predicate type: sbom

Trying spdx-json as a type also didn’t work. There were a couple of places where the Cosign documentation spoke about custom predicate types, but none of the examples showed how to use the parameter. I decided to give it one last try.

$ cosign attest --predicate flasksample-v1.spdx --type "cosign.sigstore.dev/attestation/v1" --key awskms:///arn:aws:kms:us-west-2:562077019569:key/61c124fb-bf47-4f95-a805-65dda7cd08ae 562077019569.dkr.ecr.us-west-2.amazonaws.com/flasksample:v1
Error: signing 562077019569.dkr.ecr.us-west-2.amazonaws.com/flasksample:v1: invalid predicate type: cosign.sigstore.dev/attestation/v1
main.go:62: error during command execution: signing 562077019569.dkr.ecr.us-west-2.amazonaws.com/flasksample:v1: invalid predicate type: cosign.sigstore.dev/attestation/v1

Obviously, this was not yet documented, and it was not clear what values could be provided for it. Here is the issue asking to clarify the purpose of the --type <TYPE> parameter. From the documentation examples, it seemed that this parameter could be safely omitted. So, I gave it a shot! Running the command without the parameter worked fine and pushed the attestation to the registry.

$ cosign attest --predicate flasksample-v1.spdx --key awskms:///arn:aws:kms:us-west-2:562077019569:key/61c124fb-bf47-4f95-a805-65dda7cd08ae 562077019569.dkr.ecr.us-west-2.amazonaws.com/flasksample:v1
Using payload from: flasksample-v1.spdx

One thing that I noticed with the attestation experience is that it pushed a single artifact with .att at the end of the tag. I will come back to this in the next post. Now, let’s push the SLSA attestation for this image.

Adding SLSA Attestation With Cosign

As I mentioned above, I will cheat with the SLSA attestation because I do all those steps manually and docker build doesn’t generate SLSA provenance. I will use this sample for the SLSA provenance attestation.

$ cosign attest --predicate flasksample-v1.slsa --key awskms:///arn:aws:kms:us-west-2:562077019569:key/61c124fb-bf47-4f95-a805-65dda7cd08ae 562077019569.dkr.ecr.us-west-2.amazonaws.com/flasksample:v1
Using payload from: flasksample-v1.slsa

Cosign did something, as we can see on the console as well as in the registry – the digest of the .att artifact changed.

The question, though, is what exactly happened?

In the next post of the series, I will go into detail about what is happening behind the scenes, where I will look deeper at the artifacts created by Cosign.

Summary

To summarize my experience so far, here is what I think.

  • As I mentioned above, the installation experience for the tools is no worse than any other tool targeted to engineers. Improvements in the documentation would be beneficial for the first-use experience, and I filed a few issues to help with that.
  • Signing with a key stored in AWS was easy and smooth. Unfortunately, the implementation followed the same pattern for Azure Key Vault. I think it would be better to follow the patterns for the specific cloud vendor. There is no expectation that each cloud vendor will follow the same naming, URI, etc. patterns; changing those may result in more errors than benefits for the user.
  • While Cosign hides a lot of the complexities behind the scenes, providing some visibility into what is happening would be good. For example, if you use Cosign CLI to create a key in Azure Key Vault, it will automatically create a key of a type that it supports. That will avoid the issue I encountered with the RSA keys, but it may not be the main scenario used in the enterprise.

Next time, I will spend some time looking at the artifacts created by Cosign and understanding their purpose, as well as how to verify those using Cosign and the keys stored in the KMS.

In my last post, Implementing Quarantine Pattern for Container Images, I wrote about how to implement a quarantine pattern for container images and how to use policies to prevent the deployment of an image that doesn’t meet certain criteria. In that post, I also mentioned that the quarantine flag (not to be confused with the quarantine pattern 🙂) has certain disadvantages. Since then, Steve Lasker has convinced me that the quarantine flag could be useful in certain scenarios. Many of those scenarios are new and will play a role in the containers’ secure supply chain improvements. Before we look at the scenarios, let’s revisit how the quarantine flag works.

What is the Container Image Quarantine Flag?

As you remember from the previous post, the quarantine flag is set on an image at the time the image is pushed to the registry. The expected workflow is shown in the flow diagram below.

The quarantine flag stays set on the image until the Quarantine Processor completes the actions and removes the image from quarantine. We will go into detail about what those actions can be later on in the post. The important thing to remember is that, while in quarantine, the image can be pulled only by the Quarantine Processor. Neither the Publisher, the Consumer, nor any other actor should be able to pull the image from the registry while it is in quarantine. This is achieved through special permissions that are assigned to the Quarantine Processor and that the other actors do not have. Such permissions can be quarantine pull and quarantine push, and they should allow pulling artifacts from and pushing artifacts to the registry while the image is in quarantine.

Inside the registry, you will have a mix of images that are in quarantine and images that are not. The quarantined ones can only be pulled by the Quarantine Processor, while others can be pulled by anybody who has access to the registry.
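The pull rule described above can be expressed in a few lines. Here is a minimal Python sketch of how a registry might decide whether an actor can pull an image. The Image and Actor types and the permission strings are illustrative assumptions of mine; they do not reflect the API of ACR or any other registry.

```python
from dataclasses import dataclass, field


@dataclass
class Image:
    repository: str
    tag: str
    quarantined: bool = True  # the flag is set automatically on push


@dataclass
class Actor:
    name: str
    permissions: set = field(default_factory=set)


def can_pull(actor: Actor, image: Image) -> bool:
    """Quarantined images need the special permission; others need plain pull."""
    required = "quarantine pull" if image.quarantined else "pull"
    return required in actor.permissions


if __name__ == "__main__":
    processor = Actor("quarantine-processor", {"pull", "quarantine pull", "quarantine push"})
    consumer = Actor("developer", {"pull"})
    image = Image("flasksample", "v1")

    print(can_pull(processor, image))  # processor may pull while in quarantine
    print(can_pull(consumer, image))   # consumer may not
    image.quarantined = False          # the processor releases the image
    print(can_pull(consumer, image))   # now everybody with pull access can
```

The sketch also makes the limitation visible: the only way to let another actor pull a quarantined image is to give it the same special permission that the Quarantine Processor has.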

Quarantining images is a capability that needs to be implemented in the container registry. Because this is not a standard capability, very few, if any, container registries implement it. Azure Container Registry (ACR) has a quarantine feature that is in preview. The limitations of the quarantine flag that I explained in the previous post are still valid. Mainly, those are:

  • If you need to have more than one Quarantine Processor, you need to figure out a way to synchronize their operations. The Quarantine Processor who completes the last action should remove the quarantine flag.
  • Using asynchronous processing is hard to manage. The Quarantine Processor manages all the actions and changes the flag. If you have an action that requires asynchronous processing, the Quarantine Processor needs to wait for the action to complete to evaluate the result and change the flag.
  • Last, you should not set the quarantine flag again once you have removed it. If you do that, you may break a lot of functionality and bring down your workloads. The problem is that you do not have the granularity of control over who can and cannot pull the image except for giving them the Quarantine Processor role.

With all that said, though, if you have a single Quarantine Processor, the quarantine flag can be used to prepare the image for use. This can be very helpful in the secure supply chain scenarios for containers, where the CI/CD pipelines do not only push the images to the registries but also produce additional artifacts related to the images. Let’s look at a new build scenario for container images that you may want to implement.

Quarantining Images in the CI/CD Pipeline

The one place where the quarantine flag can prove useful is in the CI/CD pipeline used to produce a compliant image. Let’s assume that for an enterprise, a compliant image is one that is signed, has an SBOM that is also signed, and passed a vulnerability scan with no CRITICAL or HIGH severity vulnerabilities. Here is the example pipeline that you may want to implement.

In this case, the CI/CD agent is the one that plays the Quarantine Processor role and manages the quarantine flag. As you can see, the quarantine flag is automatically set in step 4 when the image is pushed. Steps 5, 6, 7, and 8 are the different actions performed on the image while it is in quarantine. While those actions are not complete, the image should not be pullable by any consumer. For example, some of those actions, like the vulnerability scan, may take a long time to complete. You don’t want a developer to accidentally pull the image before the vulnerability scan is done. If one of those actions fails for any reason, the image should stay in quarantine as non-compliant.
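To illustrate the logic of such a pipeline, here is a minimal Python sketch of a single Quarantine Processor that runs all actions and lifts the quarantine only if every one of them passes. The action names, the image dictionary, and the findings format are hypothetical placeholders; in a real pipeline, the actions would call your signing, SBOM, and scanning tools.

```python
from typing import Callable, Dict, List

BLOCKING_SEVERITIES = {"CRITICAL", "HIGH"}


def scan_passes(findings: List[dict]) -> bool:
    """A scan passes when it has no CRITICAL or HIGH severity findings."""
    return not any(f["severity"] in BLOCKING_SEVERITIES for f in findings)


def process_quarantine(image: dict, actions: Dict[str, Callable[[dict], bool]]) -> Dict[str, bool]:
    """Run every action on the quarantined image and record the results.

    The quarantine flag is removed only if there is at least one action and
    all actions succeed. A failing or crashing action keeps the image in
    quarantine as non-compliant.
    """
    results = {}
    for name, action in actions.items():
        try:
            results[name] = bool(action(image))
        except Exception:
            results[name] = False
    image["quarantined"] = not (bool(results) and all(results.values()))
    return results


if __name__ == "__main__":
    image = {"repository": "flasksample", "tag": "v1", "quarantined": True}
    findings = [{"id": "CVE-0000-0000", "severity": "MEDIUM"}]  # made-up finding
    actions = {
        "sign image": lambda img: True,       # placeholder for the signing step
        "generate SBOM": lambda img: True,    # placeholder for SBOM generation
        "sign SBOM": lambda img: True,        # placeholder for signing the SBOM
        "vulnerability scan": lambda img: scan_passes(findings),
    }
    print(process_quarantine(image, actions))
    print("still quarantined:", image["quarantined"])
```

Because one processor owns both the actions and the flag, there is a single place where the decision to release the image is made.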

Protecting developers from pulling non-compliant images is just one of the scenarios that a quarantine flag can help with. Another one is avoiding triggers for workflows that are known to fail if the image is not compliant.

Using Events to Trigger Image Workflows

Almost every container registry has an eventing mechanism that allows you to trigger workflows based on events in the registry. Typically, you would use the image push event to trigger the deployment of your image for testing or production. In the above case, if your enterprise has a policy for only deploying images with signatures, SBOMs, and vulnerability reports, your deployment will fail if the deployment is triggered right after step 4. The deployment should be triggered after step 9, which will ensure that all the required actions on the image are performed before the deployment starts.

To avoid triggering the deployment, the image push event should be delayed until after step 9. A separate quarantine push event can be emitted in step 4 and used to trigger actions related to the quarantine of the image. A note of caution here, though! As we mentioned previously, synchronizing multiple actors who can act on the quarantine flag can be tricky. If the CI/CD pipeline is your Quarantine Processor, you may feel tempted to use the quarantine push event to trigger some other workflow or long-running action. An example of such an action can be an asynchronous malware scanning and detonation action, which cannot be run as part of the CI/CD pipeline. The things to be aware of are:

  • To be able to pull the image, the malware scanner must also have the Quarantine Processor role assigned. This means that you will have more than one concurrent Quarantine Processor acting on the image.
  • The Quarantine Processor that finishes first will either remove the quarantine flag while the others are still working or need to wait for all other Quarantine Processors to complete. This, of course, adds complexity to managing the concurrency and various race conditions.

I would strongly suggest that you have only one Quarantine Processor and manage all activities from it. Otherwise, you can end up with inconsistent states of the images that do not meet your compliance criteria.

When Should Events be Fired?

We already mentioned in the previous section the various events you may need to implement in the registry:

  • A quarantine push event is used to trigger workflows that are related to images in quarantine.
  • An image push event is the standard event triggered when an image is pushed to the registry.

Here is a flow diagram of how those events should be fired.

This flow offers a logical sequence of events that can be used to trigger relevant workflows. The quarantine workflow should be triggered by the quarantine push event, while all other workflows should be triggered by the image push event.

If you look at the current implementation of the quarantine feature in ACR, you will notice that both events are fired if the registry quarantine is not enabled (note that the feature is in preview, and functionality may change in the future). I find this behavior confusing. The reason, albeit philosophical, is simple – if the registry doesn’t support quarantine, then it should not send quarantine push events. The behavior should be consistent with any other registry that doesn’t have quarantine capability, and only the image push event should be fired.
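The behavior I am arguing for can be summarized in a small decision function. This is a Python sketch of the proposed flow only; the action names "push" and "release" and the function itself are my own illustration and do not describe how ACR currently behaves.

```python
def events_for(action: str, quarantine_enabled: bool) -> list:
    """Return the registry events to emit for an action in the proposed flow."""
    if action == "push":
        # Without quarantine, behave like any other registry and fire
        # only the standard image push event.
        return ["quarantine push"] if quarantine_enabled else ["image push"]
    if action == "release":
        # The image push event is delayed until the quarantine flag is removed.
        if not quarantine_enabled:
            raise ValueError("nothing to release: quarantine is not enabled")
        return ["image push"]
    raise ValueError(f"unknown action: {action!r}")


if __name__ == "__main__":
    print(events_for("push", quarantine_enabled=True))
    print(events_for("release", quarantine_enabled=True))
    print(events_for("push", quarantine_enabled=False))
```

In this flow, no action ever fires both events at once, which keeps the deployment workflows from starting before the image is ready.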

What Data Should the Events Contain?

The consumers of the events should be able to make a decision on how to proceed based on the information in the event. The minimum information that needs to be provided in the event should be:

  • Timestamp
  • Event Type: quarantine or push
  • Repository
  • Image Tag
  • Image SHA
  • Actor

This information will allow the event consumers to subscribe to registry events and properly handle them.
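As an illustration, the minimum event payload from the list above could be modeled like this. It is a Python sketch with field names and workflow names that I made up; it does not follow the event schema of any particular registry, and the digest in the example is a dummy value.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone

EVENT_TYPES = ("quarantine", "push")


@dataclass(frozen=True)
class RegistryEvent:
    timestamp: str   # ISO 8601, UTC
    event_type: str  # "quarantine" or "push"
    repository: str
    tag: str
    digest: str      # the image SHA
    actor: str

    def __post_init__(self) -> None:
        if self.event_type not in EVENT_TYPES:
            raise ValueError(f"unknown event type: {self.event_type!r}")


def workflow_for(event: RegistryEvent) -> str:
    """Pick the workflow to trigger based only on data in the event."""
    if event.event_type == "quarantine":
        return "quarantine-workflow"
    return "deployment-workflow"


if __name__ == "__main__":
    event = RegistryEvent(
        timestamp=datetime.now(timezone.utc).isoformat(),
        event_type="quarantine",
        repository="flasksample",
        tag="v1",
        digest="sha256:" + "0" * 64,  # dummy digest for the example
        actor="ci-agent",
    )
    print(json.dumps(asdict(event), indent=2))
    print(workflow_for(event))
```

The consumer does not need to call back into the registry to decide how to proceed, which is the point of including all of these fields in the event.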

Audit Logging for Quarantined Images

Because we are discussing a secure supply chain for containers, we should also think about traceability. For quarantine-enabled registries, a log message should be added at every point the status of the image is changed. Once again, this is something that needs to be implemented by the registry, and it is not standard behavior. At a minimum, you should log the following information:

  • When the image is put into quarantine (initial push)
    • Timestamp
    • Repository
    • Image Tag
    • Image SHA
    • Actor/Publisher
  • When the image is removed from quarantine (quarantine flag is removed)
    Note: if the image is removed from quarantine, the assumption is that it passed all the quarantine checks.

    • Timestamp
    • Repository
    • Image Tag
    • Image SHA
    • Actor/Quarantine Processor
    • Details
      Details can be free-form or semi-structured data that can be used by other tools in the enterprise.

One question that remains is whether a message should be logged if the quarantine does not pass after all actions are completed by the Quarantine Processor. It would be good to get the complete picture from the registry log and understand why certain images stay in quarantine forever. On the other hand, though, the image doesn’t change its state (it is in quarantine anyway), and the registry needs to provide an API just to log the message. Because the API to remove the quarantine is not a standard OCI registry API, a single API can be provided to both remove the quarantine flag and log the audit message if the quarantine doesn’t pass. The ACR quarantine feature uses the custom ACR API to do both.
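Here is a minimal Python sketch of what such a single API could look like on the registry side: one call that removes the flag when the checks pass and writes an audit record in both cases. The function name, the record fields, and the in-memory log are assumptions for illustration; this is not the ACR API.

```python
from datetime import datetime, timezone


def complete_quarantine(image: dict, actor: str, passed: bool,
                        details: str, audit_log: list) -> dict:
    """Finish quarantine processing for an image and always log the outcome.

    The flag is removed only when the checks passed. A failed quarantine
    leaves the image state unchanged but still produces an audit record.
    """
    if passed:
        image["quarantined"] = False
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "event": "quarantine removed" if passed else "quarantine failed",
        "repository": image["repository"],
        "tag": image["tag"],
        "digest": image["digest"],
        "actor": actor,
        "details": details,  # free-form or semi-structured data
    }
    audit_log.append(record)
    return record


if __name__ == "__main__":
    log = []
    image = {"repository": "flasksample", "tag": "v1",
             "digest": "sha256:" + "0" * 64,  # dummy digest for the example
             "quarantined": True}
    complete_quarantine(image, "ci-agent", False, "vulnerability scan: 1 HIGH", log)
    print(image["quarantined"], log[-1]["event"])
```

With this shape, the registry log answers the question of why an image is still in quarantine without requiring a second, logging-only API.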

Summary

To summarize, if implemented by a registry, the quarantine flag can be useful in preparing the image before allowing its wider use. The quarantine activities on the image should be done by a single Quarantine Processor to avoid concurrency and inconsistencies in the registry. The quarantine flag should be used only during the initial setup of the image before it is released for wider use. Reverting to a quarantine state after the image is published for wider use can be dangerous due to the lack of granularity for actor permissions. Customized policies should continue to be used for images that are published for wider use.

Friendly face on parachute

How often does the following happen to you? You write your client code, you call an API, and receive a 404 Not found response. You start investigating the issue in your code; change a line here or there; spend hours troubleshooting just to find out that the issue is on the server-side, and you can’t do anything about it. Well, welcome to the microservices world! A common mistake I often see developers make is returning an improper response code or passing through the response code from another service.

Let’s see how we can avoid this. But first, a crash course on modern applications implemented with microservices and HTTP status response codes.

How Do Modern Microservices Applications Work?

I will try to avoid going deep into the philosophical reasons why we need microservices and the benefits (or disadvantages) of using them. This is not the point of this post.

We will start with a simple picture.

Microservices Application

As you can see in the picture, we have a User that interacts with the Client Application that calls Microservice #1 to retrieve some information from the server (aka the cloud 🙂). The Client Application may need to call multiple (micro)services to retrieve all the information the User needs. Still, the part we will concentrate on is that Microservice #1 itself can call other services (Microservice #2 in this simple example) on the backend to perform its business logic. In a complex application (especially if not well architected), the chain of service calls may go well beyond two. But let’s stick with two for now. Also, let’s assume that Microservice #1 and Microservice #2 use REST, and their responses use the HTTP response status codes.

A basic call flow can be something like this. I also include the appropriate HTTP status response codes in each step.

  1. The User clicks on a button in the Client Application.
  2. The Client Application makes an HTTP request to Microservice #1.
  3. Microservice #1 needs additional business logic to complete the request and makes an HTTP call to Microservice #2.
  4. Microservice #2 performs the additional business logic and responds to Microservice #1 using a 200 OK response code.
  5. Microservice #1 completes the business logic and responds to the Client Application with a 200 OK response code.
  6. The Client Application performs the action that is attached to the button, and the User is happy.

This is the so-called happy path. Everybody expects the flow to be executed as described above. If everything goes as planned, we don’t need to think anymore and implement the functionality behind the next button. Unfortunately, things often don’t go as planned.

What Can Go Wrong?

Many things! Or at a minimum, the following:

  1. The Client Application fails because of a bug before it even calls Microservice #1.
  2. The Client Application sends invalid input when calling Microservice #1.
  3. Microservice #1 fails before calling Microservice #2.
  4. Microservice #1 sends invalid input when calling Microservice #2.
  5. Microservice #2 fails while performing its business logic.
  6. Microservice #1 fails after calling Microservice #2.
  7. The Client Application fails after Microservice #1 responds.

For those cases (non-happy path? or maybe sad-path? 😉 ) the designers of the HTTP protocol wisely specified two separate sets of response codes:

  • 4xx: Client errors
  • 5xx: Server errors

The guidance for those is quite simple:

  • Client errors should be returned if the client did something wrong. In such cases, the client can change the parameters of the request and fix the issue. The important thing to remember is that the client can fix the issue without any changes on the server-side.
    A typical example is the famous 404 Not found error. If you (the user) mistype the URL path in the browser address bar, the browser (the client application) will request the wrong resource from the server (Microservice #1 in this case). The server (Microservice #1) will respond with a 404 Not found error to the browser (the client application) and the browser will show you an “Oops, we couldn’t find the page” message. Well, in the past the browser just showed you the 404 Not found error but we learned a long time ago that this is not user-friendly (You see where I am going with this, right?).
  • Server errors should be returned if the issue occurred on the server-side and the client (and the user) cannot do anything to fix it.
    A simple example is a wrong connection string in the service configuration (Microservice #1 in our case). If the connection string used to configure Microservice #1 with the endpoint and credentials for Microservice #2 is wrong, the client application and the user cannot do anything to fix it. The most appropriate error to return in this case would be 500 Internal server error.
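The guidance above boils down to asking who can fix the problem. Here is a minimal sketch in Python that maps a status code to the side expected to act on it. The function name `error_side` is my own and is used only for illustration.

```python
def error_side(status_code):
    """Return which side is expected to fix the problem for an HTTP status code."""
    if 400 <= status_code <= 499:
        return "client"  # the caller can change the request and try again
    if 500 <= status_code <= 599:
        return "server"  # only the owner of the service can fix it
    return None  # 1xx, 2xx, and 3xx responses are not errors


for code in (404, 500, 200):
    print(code, error_side(code))
```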

Pretty simple and logical, right? One thing we as engineers often forget, though, is who the client is and who the server is.

So, Who Is the Client and Who Is the Server?

First, the client and server are two system components that interact directly with each other (think, no intermediaries). If we take the picture from above and change the labels of the arrows, it becomes pretty obvious.

Microservices Application Clients and Servers

We have three clients and three servers:

  • The user is a client of the client application, and the client application is a server for the user.
  • The client application is a client of Microservice #1, and Microservice #1 is a server for the client application.
  • Microservice #1 is a client of Microservice #2, and Microservice #2 is a server for Microservice #1.

Having this picture in mind, the engineers implementing each one of the microservices should think about the most appropriate response code for their immediate client using the guidelines above. It is better if we use examples to explain what response codes each service should return in different situations.

What HTTP Response Codes Should Microservices Return?

A few days ago I was discussing the following situation with one of our engineers. Our service, Azure Container Registry (ACR), has a security feature allowing customers to encrypt their container images using customer-managed keys (CMK). For this feature to work, customers need to upload a key to Azure Key Vault (AKV). When the Docker client tries to pull an image, ACR retrieves the key from AKV, decrypts the image, and sends it back to the Docker client. (BTW, I know that ACR and AKV are not microservices 🙂 ) Here is a visual:

Docker pull encrypted image from ACR

In the happy-path scenario, everything works as expected. However, a customer submitted a support request complaining that he is not able to pull his images from ACR. When he tries to pull an image using the Docker client, he receives a 404 Not found error, but when he checks in the Azure Portal, he is able to see the image in the list.

Because the customer couldn’t figure it out by himself, he submitted a support request. The support engineer was also not able to figure out the issue, and had to escalate to the product group. It turned out that the customer deleted the Key Vault and ACR was not able to retrieve the key to decrypt the image. However, the implemented flow looked like this:

  1. Docker client requests an image from ACR.
  2. ACR sees that the image is encrypted and requests the key from the Key Vault.
  3. The Azure Key Vault service looks up the key and figures out that the key (or the whole Key Vault) is missing.
  4. Azure Key Vault returns 404 Not found to ACR for the key ACR tries to access.
  5. ACR passes through the 404 Not found error to the Docker client.
  6. Docker client shows a message to the user that the image cannot be found.

The end result: everybody is confused! Why?

Where Does the Confusion Come From?

The investigation chain goes from left to right: Docker client –> ACR –> AKV. Both the customer and the support engineer concentrated on figuring out why the image is missing in ACR. They were looking only at the Docker client –> ACR part of the chain. The customer’s assumption was that the Docker client is doing something wrong, i.e. requesting the wrong image. This would be the correct assumption because 404 Not found is a client error telling the client that it is requesting something that doesn’t exist. Hence, the customer checked the portal and when he saw the image in the list, he was puzzled. The next assumption was that something is wrong on the ACR side. Here is where the customer decided to submit a support request for somebody to check if the data in ACR is corrupted. The support engineer checked the ACR backend and all the data was in sync.

This is a great example where the wrong HTTP response code can send the whole investigation into a rabbit hole. To avoid that, here is the guidance! Microservices should return response codes that are relevant to the business logic they implement and ones that help the client take appropriate actions. “Well”, you will say: “Isn’t that the whole point of HTTP status response codes?” It is! But for whatever reasons, we continue to break this rule. The key words in the above guidance are “the business logic they implement”, not the business logic of the services they call. (By the way, this is the same with exceptions. You don’t catch generic Exception, you catch SpecificException. You don’t pass through exceptions, you catch them and wrap them in a useful way for the calling code).

Business Logic and Friendly HTTP Response Codes

Think about the business logic of each one of the services above!

One way to decide which HTTP response code to return is to think about the resource your microservice is handling. ACR is the service responsible for handling the container images. The business logic that ACR implements should provide status codes relevant to the “business” of images. Azure Key Vault implements business logic that handles keys, secrets, and certificates (not images). Key Vault should return status codes that are relevant to keys, secrets, and certificates. Azure Key Vault is a downstream service and cannot know what the key is used for; hence, it cannot tell the upstream client (Docker) what the error is. It is the responsibility of ACR to provide the appropriate status code to the upstream client.

Here is how the flow in the above scenario should be implemented:

  1. Docker client requests an image from ACR.
  2. ACR sees that the image is encrypted and requests the key from the Key Vault.
  3. The Azure Key Vault service looks up the key and figures out that the key (or the whole Key Vault) is missing.
  4. Azure Key Vault returns 404 Not found to ACR for the key ACR tries to access.
  5. ACR handles the 404 Not found from Azure Key Vault but wraps it in an error that is relevant to the requested image.
  6. Instead of 404 Not found, ACR returns 500 Internal server error with a message clarifying the issue.
  7. Docker client shows a message to the user that it cannot pull the image because of an issue on the server.
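Steps 4 to 6 of the flow above can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not how ACR is actually implemented: the exception classes, `fetch_key`, `decrypt_image`, and `handle_pull` are all hypothetical names, and the key vault is simulated with a dictionary.

```python
class KeyNotFoundError(Exception):
    """Hypothetical error for a 404 Not found returned by the key service."""


class ImageDecryptionError(Exception):
    """Hypothetical error for an image that exists but cannot be decrypted."""


def fetch_key(key_id, vault):
    # Stand-in for the call to the downstream key service.
    if key_id not in vault:
        raise KeyNotFoundError(key_id)
    return vault[key_id]


def decrypt_image(image, key_id, vault):
    try:
        key = fetch_key(key_id, vault)
    except KeyNotFoundError as err:
        # Wrap the downstream error in one that is about the image.
        raise ImageDecryptionError(
            f"cannot decrypt image '{image}': encryption key is unavailable"
        ) from err
    return f"{image} decrypted with {key}"


def handle_pull(image, key_id, vault):
    """Return (status_code, body) for the immediate client."""
    try:
        return 200, decrypt_image(image, key_id, vault)
    except ImageDecryptionError as err:
        # The client did nothing wrong, so this is a server error.
        return 500, str(err)


print(handle_pull("flasksample:1.0", "image-key", {"image-key": "k3y"}))
print(handle_pull("flasksample:1.0", "image-key", {}))
```

The point of the two layers is that the key-related error never leaves the service. The client only ever sees errors expressed in terms of images.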

The Q&A Approach

Another way to decide what response code to return is to take the Questions-and-Answers approach and build simple IF-THEN logic (a.k.a. a decision tree). Here is how this can work for our example:

  • Docker: Pull image from ACR
    • ACR: Q: Is the image available?
      • A: Yes
        (Note to myself: Requesting the image cannot be a client error anymore.)

        • Q: Is the image encrypted?
          • A: Yes
            • ACR: Request the key from Key Vault
              • AKV: Q: Is the key available?
                • A: Yes
                  • AKV: Return the key to ACR
                • A: No
                  • AKV: Return 404 [key] Not found error
            • ACR: Q: Did I get a key?
              • A: Yes
                • ACR: Decrypt the image
                • ACR: Return 200 OK with the image payload
              • A: No (I got 404 [key] Not found)
                • ACR: I cannot decrypt the image
                  (Note to myself: There is nothing the client did wrong! It is all the server’s fault)
                • ACR: Return 500 Internal server error “I cannot decrypt the image”
          • A: No (image is not encrypted)
            • ACR: Return 200 OK with the image payload
      • A: No (image does not exist)
        • ACR: Return 404 [image] Not found error

Note that the above flow is simplified. For example, in a real implementation, you may need to check if the client is authenticated and authorized to pull the image. Nevertheless, the concept is the same – you will just need to have more Q&As.
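The Q&A tree translates almost one-to-one into code. Here is a minimal sketch in Python, where each question becomes a boolean parameter. The function name `pull_response` and the message texts are my own, and the answers are passed in directly instead of being looked up from real services.

```python
from itertools import product


def pull_response(image_exists, image_encrypted, key_available):
    """Map the answers from the Q&A tree to (status_code, message)."""
    if not image_exists:
        # The client asked for something that does not exist.
        return 404, "image not found"
    if not image_encrypted:
        return 200, "image payload"
    if key_available:
        return 200, "decrypted image payload"
    # The image exists and the client did nothing wrong.
    return 500, "cannot decrypt the image"


# Walk through every combination of answers.
for answers in product((True, False), repeat=3):
    print(answers, pull_response(*answers))
```

Adding authentication and authorization checks would mean more parameters and more branches, but the shape of the function stays the same.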

Summary

As you can see, it is important to be careful what HTTP response codes you return from your microservices. If you return the wrong code, you may end up with more work than you expect. Here are the main points that are worth remembering:

  • Return 4xx errors only if the client can do something to fix the issue. If the client cannot do anything to fix it, 5xx errors are the only appropriate ones.
  • Do not pass through the response codes you receive from downstream services. Handle each response from downstream services and wrap it according to the business logic you are implementing.
  • When implementing your services, think about the resource you are handling in those services. Return HTTP status response codes that are relevant to the resource you are handling.
  • Use the Q&A approach to decide what the appropriate response code is for your service and the resource that is requested by the client.

By using those guidelines, your microservices will become more friendly and easier to troubleshoot.

Featured image by Nick Page on Unsplash

In the last few months, I started seeing more and more customers using Azure Container Registry (or ACR) for storing their Helm charts. However, many of them are confused about how to properly push and use the charts stored in ACR. So, in this post, I will document a few things that need the most clarification. Let’s start with some definitions!

Helm 2 and Helm 3 – what are those?

Before we even start!

Helm 2 is NOT supported and you should not use it! Period! If you need more details just read Helm’s blog post Helm 2 and the Charts Project Are Now Unsupported from Fri, Nov 13, 2020.

A nice date to choose for that announcement 🙂

OK, really – what are Helm 2 and Helm 3? When somebody says Helm 2 or Helm 3, they most often mean the version of the Helm CLI (i.e., Command Line Interface). The easiest way to check what version of the Helm CLI you have is to type:

$ helm version

in your Terminal. If the version is v2.x.x, then you have Helm (CLI) 2; if the version is v3.x.x, then you have Helm (CLI) 3. But, it is not that simple! You should also consider the API version. The API version is the version that you specify at the top of your chart (i.e. Chart.yaml) – you can read more about it in Helm Charts documentation. Here is a table that can come in handy for you:

apiVersion     Helm 2 CLI Support     Helm 3 CLI Support
v1             Yes                    Yes
v2             No                     Yes

What this table tells you is that Helm 2 CLI supports apiVersion v1, while Helm 3 CLI supports apiVersion v1 and v2. You should check the Helm Charts documentation linked above if you need more details about the differences, but the important thing to remember here is that Helm 3 CLI supports old charts, and (once again) there is no reason for you to use Helm 2.
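If you want to check the compatibility in a script, the table can be expressed as a small lookup. Here is a sketch in Python. The names `SUPPORTED_API_VERSIONS`, `cli_major`, and `cli_supports` are my own, and the parsing assumes a version string in the vX.Y.Z form shown by helm version.

```python
# The compatibility table from above: CLI major version -> chart apiVersions.
SUPPORTED_API_VERSIONS = {
    2: {"v1"},
    3: {"v1", "v2"},
}


def cli_major(version_string):
    """Extract the major version from a string such as 'v3.6.2'."""
    return int(version_string.lstrip("v").split(".")[0])


def cli_supports(version_string, api_version):
    """Return True if that Helm CLI version can work with the chart apiVersion."""
    supported = SUPPORTED_API_VERSIONS.get(cli_major(version_string), set())
    return api_version in supported


print(cli_supports("v2.17.0", "v2"))
print(cli_supports("v3.6.2", "v2"))
```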

We’ve cleared (I hope) the confusion around Helm 2 and Helm 3. Let’s see how ACR handles the Helm charts. For each one of those experiences, I will walk you through it step by step.

ACR and Helm 2

Azure Container Registry allows you to store Helm charts using the Helm 2 way (What? I will explain in a bit). However:

Helm 2 is NOT supported and you should not use it!

Now that you have been warned (twice! or was it three times?) let’s see how ACR handles Helm 2 charts. To avoid any ambiguity, I will use Helm CLI v2.17.0 for this exercise. At the time of this writing, it is the last published version of Helm 2.

$ helm version
Client: &version.Version{SemVer:"v2.17.0", GitCommit:"a690bad98af45b015bd3da1a41f6218b1a451dbe", GitTreeState:"clean"}

Initializing Helm and the Repository List

If you have a brand new installation of the Helm 2 CLI, you should initialize Helm and add the ACR to your repository list. You start with:

$ helm init
$ helm repo list
NAME     URL
stable   https://charts.helm.sh/stable
local    http://127.0.0.1:8879/charts

to initialize Helm and see the list of available repositories. Then you can add your ACR to the list by typing:

$ helm repo add --username <acr_username> --password <acr_password> <repo_name> https://<acr_login_server>/helm/v1/repo

For me, this looked like this:

$ helm repo add --username <myacr_username> --password <myacr_password> acrrepo https://tsmacrtestwus2acrhelm.azurecr.io/helm/v1/repo

Here is something very important: you must use the /helm/v1/repo path! If you do not specify the path you will see either 404: Not found error (for example, if you use the root URL without a path) or 403: Forbidden error (if you decide that you want to rename the repo part to something else).

I also need to make a side note here because the authentication can be a bit tricky. The following section applies to both Helm 2 and Helm 3.

Signing In To ACR Using the Helm (any version) CLI

Before you push and pull charts from ACR, you need to sign in. There are a few different options that you can use to sign in to ACR using the CLI:

  • Using ACR Admin user (not recommended)
    If you have ACR Admin user enabled, you can use the Admin user username and password to sign in to ACR by simply specifying the --username and --password parameters for the Helm command.
  • Using a Service Principal (SP)
    If you need to push and pull charts using automation, you have most probably already set up a service principal for that. You can authenticate using the SP credentials by passing the app ID in the --username and the client secret in the --password parameters for the Helm command. Make sure you assign the appropriate role to the service principal to allow access to your registry.
  • Using your own (user) credentials
    This one is the tricky one, and it is described in the ACR docs in the az acr login with --expose-token section of the Authentication overview article. For this one, you must use the Azure CLI to obtain the token. Here are the steps:

    • Use the Azure CLI to sign in to Azure using your own credentials:
      $ az login

      This will pop up a browser window or give you a URL with a special code to use.

    • Next, sign in to ACR using the Azure CLI and add the --expose-token parameter:
      $ az acr login --name <acr_name_or_login_server> --expose-token

      This will sign you into ACR and will print an access token that you can use to sign in with other tools.

    • Last, you can sign in using the Helm CLI by passing a GUID-like string with zeros only (exactly this string 00000000-0000-0000-0000-000000000000) in the --username parameter and the access token in the --password parameter. Here is what the command to add the Helm repository will look like:
      $ helm repo add --username "00000000-0000-0000-0000-000000000000" --password "eyJhbGciOiJSUzI1NiIs[...]24V7wA" <repo_name> https://<acr_login_server>/helm/v1/repo
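The three options differ only in what goes into the --username and --password parameters. Here is a small Python sketch that builds the helm repo add command for each of them. The function `repo_add_command`, its parameters, and the method labels are hypothetical and exist only for this illustration. The sketch only builds the argument list and does not run anything.

```python
# The all-zeros username that is paired with an ACR access token.
TOKEN_USERNAME = "00000000-0000-0000-0000-000000000000"


def repo_add_command(repo_name, login_server, method, identity=None, secret=None):
    """Build the 'helm repo add' argument list for one sign-in method.

    method: "admin", "service_principal", or "token" (labels are my own).
    identity: admin username or service principal app ID (unused for "token").
    secret: admin password, client secret, or access token.
    """
    if method in ("admin", "service_principal"):
        username, password = identity, secret
    elif method == "token":
        username, password = TOKEN_USERNAME, secret
    else:
        raise ValueError(f"unknown sign-in method: {method}")
    return [
        "helm", "repo", "add",
        "--username", username,
        "--password", password,
        repo_name,
        f"https://{login_server}/helm/v1/repo",
    ]


cmd = repo_add_command("acrrepo", "<acr_login_server>", "token", secret="<access_token>")
print(" ".join(cmd))
```

You could pass the resulting list to subprocess.run, but keep in mind that secrets passed as command-line arguments may be visible to other processes on the same machine.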

Creating and Packaging Charts with Helm 2 CLI

Helm 2 doesn’t have an out-of-the-box experience for pushing charts to a remote chart registry. You may assume that the helm-push plugin is the one that does that, but you would be wrong. This plugin only allows you to push charts to Chartmuseum (you can try to use it to push to any repo, but it will fail, which is a topic for another story). Helm’s guidance on how chart repositories should work is described in the documentation (… and this is the Helm 2 way that I mentioned above):

  • According to the Chart Repositories article in Helm documentation, the repository is a simple web server that serves the index.yaml file that points to the chart TAR archives. The TAR archives can be served by the same web server or from other locations like Azure Storage.
  • In Store charts in your chart repository they describe the process to generate the index.yaml file and how to upload the necessary artifacts to static storage to serve them.

Disclaimer: the term Helm 2 way is my own term based on my interpretation of how things work. It allows me to refer to the two different approaches to saving charts. It is not an industry term, nor something that Helm refers to or uses.

I have created a simple chart called helm-test-chart-v2 on my local machine to test the push. Here is the output from the commands:

$ helm create helm-test-chart-v2
Creating helm-test-chart-v2

$ ls -al ./helm-test-chart-v2/
total 28
drwxr-xr-x 4 azurevmuser azurevmuser 4096 Aug 16 16:44 .
drwxr-xr-x 6 azurevmuser azurevmuser 4096 Aug 17 16:29 ..
-rw-r--r-- 1 azurevmuser azurevmuser 342 Aug 16 16:44 .helmignore
-rw-r--r-- 1 azurevmuser azurevmuser 114 Aug 16 16:44 Chart.yaml
drwxr-xr-x 2 azurevmuser azurevmuser 4096 Aug 16 16:44 charts
drwxr-xr-x 3 azurevmuser azurevmuser 4096 Aug 16 16:44 templates
-rw-r--r-- 1 azurevmuser azurevmuser 1519 Aug 16 16:44 values.yaml

$ helm package ./helm-test-chart-v2/
Successfully packaged chart and saved it to: /home/azurevmuser/helm-test-chart-v2-0.1.0.tgz

$ ls -al
total 48
drwxr-xr-x 6 azurevmuser azurevmuser 4096 Aug 17 16:31 .
drwxr-xr-x 3 root root 4096 Aug 14 14:12 ..
-rw------- 1 azurevmuser azurevmuser 780 Aug 15 22:48 .bash_history
-rw-r--r-- 1 azurevmuser azurevmuser 220 Feb 25 2020 .bash_logout
-rw-r--r-- 1 azurevmuser azurevmuser 3771 Feb 25 2020 .bashrc
drwx------ 2 azurevmuser azurevmuser 4096 Aug 14 14:15 .cache
drwxr-xr-x 6 azurevmuser azurevmuser 4096 Aug 15 21:46 .helm
-rw-r--r-- 1 azurevmuser azurevmuser 807 Feb 25 2020 .profile
drwx------ 2 azurevmuser azurevmuser 4096 Aug 14 14:12 .ssh
-rw-r--r-- 1 azurevmuser azurevmuser 0 Aug 14 14:18 .sudo_as_admin_successful
-rw------- 1 azurevmuser azurevmuser 1559 Aug 14 14:26 .viminfo
drwxr-xr-x 4 azurevmuser azurevmuser 4096 Aug 16 16:44 helm-test-chart-v2
-rw-rw-r-- 1 azurevmuser azurevmuser 3269 Aug 17 16:31 helm-test-chart-v2-0.1.0.tgz

Because Helm 2 doesn’t have a push chart functionality, the implementation is left up to the vendors. ACR has provided a proprietary implementation (already deprecated, which is another reason to not use Helm 2) of the push chart functionality that is built into the ACR CLI.

Pushing and Pulling Charts from ACR Using Azure CLI (Helm 2)

Let’s take a look at how you can push Helm 2 charts to ACR using the ACR CLI. First, you need to sign in to Azure, and then to your ACR. Yes, this is correct; you need to use two different commands to sign into the ACR. Here is what this looks like for my ACR registry:

$ az login
To sign in, use a web browser to open the page https://microsoft.com/devicelogin and enter the code AABBCCDDE to authenticate.
[
  {
    "cloudName": "AzureCloud",
    "homeTenantId": "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
    "id": "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
    "isDefault": true,
    "managedByTenants": [],
    "name": "ToddySM Sandbox",
    "state": "Enabled",
    "tenantId": "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
    "user": {
        "name": "toddysm_XXXXXXXX@outlook.com",
        "type": "user"
    }
  }
]
$ az acr login --name tsmacrtestwus2acrhelm.azurecr.io
The login server endpoint suffix '.azurecr.io' is automatically omitted.
You may want to use 'az acr login -n tsmacrtestwus2acrhelm --expose-token' to get an access token, which does not require Docker to be installed.
An error occurred: DOCKER_COMMAND_ERROR
Please verify if Docker client is installed and running.

Well, I do not have Docker running, but this is OK – you don’t need Docker installed for pushing the Helm chart. Though, it may be confusing because it leaves the impression that you may not be signed in to the ACR registry.

We will push the chart that we packaged already. Pushing it is done with the (deprecated) built-in command in ACR CLI. Here is the output:

$ az acr helm push --name tsmacrtestwus2acrhelm.azurecr.io helm-test-chart-v2-0.1.0.tgz
This command is implicitly deprecated because command group 'acr helm' is deprecated and will be removed in a future release. Use 'helm v3' instead.
The login server endpoint suffix '.azurecr.io' is automatically omitted.
{
  "saved": true
}

This seems to be successful, and I have a Helm chart pushed to ACR using the Helm 2 way (i.e. using the proprietary and deprecated ACR CLI implementation). The problem here is that it is hard to verify that the chart is pushed to the ACR registry. If you go to the portal, you will not see the repository that contains the chart. Here is a screenshot of my registry view in the Azure portal after I pushed the chart:

Azure Portal Not Listing Helm 2 Charts

As you can see, the Helm 2 chart repository doesn’t appear in the list of repositories in the Azure portal, and you will not be able to browse the charts in that repository using the Azure portal. However, if you use the Helm command to search for the chart, the result will include the ACR repository. Here is the output from the command in my environment:

$ helm search helm-test-chart-v2
NAME                           CHART VERSION        APP VERSION        DESCRIPTION
acrrepo/helm-test-chart-v2     0.1.0                1.0                A Helm chart for Kubernetes
local/helm-test-chart-v2       0.1.0                1.0                A Helm chart for Kubernetes

Summary of the ACR and Helm 2 Experience

To summarize the ACR and Helm 2 experience, here are the main takeaways:

  • First, you should not use Helm 2 CLI and the proprietary ACR CLI implementation for working with Helm charts!
  • There is no push functionality for charts in the Helm 2 client and each vendor is implementing their own CLI for pushing charts to the remote repositories.
  • When you add ACR repository using the Helm 2 CLI you should use the following URL format https://<acr_login_server>/helm/v1/repo
  • If you push a chart to ACR using the ACR CLI implementation you will not see the chart in Azure Portal. The only way to verify that the chart is pushed to the ACR repository is to use the helm search command.

ACR and Helm 3

Once again, to avoid any ambiguity, I will use Helm CLI v3.6.2 for this exercise. Here is the complete version string:

PS C:> helm version
version.BuildInfo{Version:"v3.6.2", GitCommit:"ee407bdf364942bcb8e8c665f82e15aa28009b71", GitTreeState:"clean", GoVersion:"go1.16.5"}

Yes, I ran this one in a PowerShell terminal 🙂 And, of course, not in the root folder 😉 You can convert the commands to the corresponding Linux commands and prompts.

Let’s start with the basic thing!

Creating and Packaging Charts with Helm 3 CLI

There is absolutely no difference between the Helm 2 and Helm 3 experience for creating and packaging a chart. Here is the output:

PS C:> helm create helm-test-chart-v3
Creating helm-test-chart-v3

PS C:> ls .\helm-test-chart-v3\

    Directory: C:\Users\memladen\Documents\Development\Local\helm-test-chart-v3

Mode         LastWriteTime         Length         Name
----         -------------         ------         ----
d----    8/17/2021 9:42 PM                        charts
d----    8/17/2021 9:42 PM                        templates
-a---    8/17/2021 9:42 PM            349         .helmignore
-a---    8/17/2021 9:42 PM           1154         Chart.yaml
-a---    8/17/2021 9:42 PM           1885         values.yaml

PS C:> helm package .\helm-test-chart-v3\
Successfully packaged chart and saved it to: C:\Users\memladen\Documents\Development\Local\helm-test-chart-v3-0.1.0.tgz

PS C:> ls helm-test-*

    Directory: C:\Users\memladen\Documents\Development\Local

Mode         LastWriteTime         Length         Name
----         -------------         ------         ----
d----    8/17/2021 9:42 PM                        helm-test-chart-v3
-a---    8/17/2021 9:51 PM           3766         helm-test-chart-v3-0.1.0.tgz

From here on, though, things can get confusing! The reason is that you have two separate options to work with charts using Helm 3.

Using Helm 3 to Push and Pull Charts the Helm 2 Way

You can use Helm 3 to push the charts the same way you do that with Helm 2. First, you add the repo:

PS C:> helm repo add --username <myacr_username> --password <myacr_password> acrrepo https://tsmacrtestwus2acrhelm.azurecr.io/helm/v1/repo
"acrrepo" has been added to your repositories

PS C:> helm repo list
NAME         URL
microsoft    https://microsoft.github.io/charts/repo
acrrepo      https://tsmacrtestwus2acrhelm.azurecr.io/helm/v1/repo

Then, you can update the repositories and search for a chart:

PS C:> helm repo update
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "microsoft" chart repository
...Successfully got an update from the "acrrepo" chart repository
Update Complete. ⎈Happy Helming!⎈

PS C:> helm search repo helm-test-chart
NAME                          CHART VERSION     APP VERSION     DESCRIPTION
acrrepo/helm-test-chart-v2    0.1.0             1.0             A Helm chart for Kubernetes

Ha, look at that! I can see the chart that I pushed using the ACR CLI in the ACR and Helm 2 section above – notice the chart name and the version. Also, notice that the Helm 3 search command has a slightly different syntax – it wants you to clarify what you want to search (repo in our case).

I can use the ACR CLI to push the new chart that I just created using the Helm 3 CLI (after signing in to Azure):

PS C:> az acr helm push --name tsmacrtestwus2acrhelm.azurecr.io .\helm-test-chart-v3-0.1.0.tgz
This command is implicitly deprecated because command group 'acr helm' is deprecated and will be removed in a future release. Use 'helm v3' instead.
The login server endpoint suffix '.azurecr.io' is automatically omitted.
{
  "saved": true
}

By doing this, I have pushed the V3 chart to ACR and can pull it from there but, remember, this is the Helm 2 Way and the following are still true:

  • You will not see the chart in Azure Portal.
  • The only way to verify that the chart is pushed to the ACR repository is to use the helm search command.

Here is the result of the search command after updating the repositories:

PS C:> helm repo update
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "microsoft" chart repository
...Successfully got an update from the "acrrepo" chart repository
Update Complete. ⎈Happy Helming!⎈

PS C:> helm search repo helm-test-chart
NAME                           CHART VERSION     APP VERSION     DESCRIPTION
acrrepo/helm-test-chart-v2     0.1.0             1.0             A Helm chart for Kubernetes
acrrepo/helm-test-chart-v3     0.1.0             1.16.0          A Helm chart for Kubernetes

You can see both charts, the one created with Helm 2 and the one created with Helm 3, available. This is understandable though because I pushed both charts the same way – by using the az acr helm command. Remember, though – both charts are stored in ACR using the Helm 2 way.

Using Helm 3 to Push and Pull Charts the OCI Way

Before proceeding, I changed the version property in the Chart.yaml to 0.2.0 to be able to differentiate between the charts I pushed. This is the same chart that I created in the previous section Creating and Packaging Charts with Helm 3 CLI.

You may have noticed that Helm 3 has a new chart command. This command allows you to (from the help text) “push, pull, tag, list, or remove Helm charts”. The subcommands under the chart command are experimental and you need to set the HELM_EXPERIMENTAL_OCI environment variable to be able to use them. Once you do that, you can save the chart. You can save the chart to the local registry cache with or without a registry FQDN (Fully Qualified Domain Name). Here are the commands:

PS C:> $env:HELM_EXPERIMENTAL_OCI=1

PS C:> helm chart save .\helm-test-chart-v3\ helm-test-chart-v3:0.2.0
ref:     helm-test-chart-v3:0.2.0
digest:  b6954fb0a696e1eb7de8ad95c59132157ebc061396230394523fed260293fb19
size:    3.7 KiB
name:    helm-test-chart-v3
version: 0.2.0
0.2.0: saved
PS C:> helm chart save .\helm-test-chart-v3\ tsmacrtestwus2acrhelm.azurecr.io/charts/helm-test-chart-v3:0.2.0
ref:     tsmacrtestwus2acrhelm.azurecr.io/charts/helm-test-chart-v3:0.2.0
digest:  ff5e9aea6d63d7be4bb53eb8fffacf12550a4bb687213a2edb07a21f6938d16e
size:    3.7 KiB
name:    helm-test-chart-v3
version: 0.2.0
0.2.0: saved

If you list the charts using the new chart command, you will see the following:

PS C:> helm chart list
REF                                                                  NAME                   VERSION     DIGEST     SIZE     CREATED
helm-test-chart-v3:0.2.0                                             helm-test-chart-v3     0.2.0       b6954fb    3.7 KiB  About a minute
tsmacrtestwus2acrhelm.azurecr.io/charts/helm-test-chart-v3:0.2.0     helm-test-chart-v3     0.2.0       ff5e9ae    3.7 KiB  About a minute

A few things to note here:

  • Both charts are saved in the local registry cache. Nothing is pushed yet to a remote registry.
  • You see only charts that are saved the OCI way. The charts saved the Helm 2 way are not listed using the helm chart list command.
  • The REF (or reference) for a chart can be truncated and you may not be able to see the full reference.

Let’s do one more thing! Let’s save the same chart with FQDN for the ACR as above but under a different repository. Here are the commands to save and list the charts:

PS C:> helm chart save .\helm-test-chart-v3\ tsmacrtestwus2acrhelm.azurecr.io/my-helm-charts/helm-test-chart-v3:0.2.0
ref: tsmacrtestwus2acrhelm.azurecr.io/my-helm-charts/helm-test-chart-v3:0.2.0
digest: daf106a05ad2fe075851a3ab80f037020565c75c5be06936179b882af1858e6a
size: 3.7 KiB
name: helm-test-chart-v3
version: 0.2.0
0.2.0: saved

PS C:> helm chart list
REF                                                                  NAME                   VERSION     DIGEST     SIZE     CREATED
helm-test-chart-v3:0.2.0                                             helm-test-chart-v3     0.2.0       b6954fb    3.7 KiB  11 minutes
tsmacrtestwus2acrhelm.azurecr.io/charts/helm-test-chart-v3:0.2.0     helm-test-chart-v3     0.2.0       ff5e9ae    3.7 KiB  11 minutes
tsmacrtestwus2acrhelm.azurecr.io/my-helm-charts/helm-test-chart...   helm-test-chart-v3     0.2.0       daf106a    3.7 KiB  About a minute

After doing this, we have three charts in the local registry:

  • helm-test-chart-v3:0.2.0 that is available only locally.
  • tsmacrtestwus2acrhelm.azurecr.io/charts/helm-test-chart-v3:0.2.0 that can be pushed to the remote ACR registry tsmacrtestwus2acrhelm.azurecr.io and saved in the charts repository.
  • and tsmacrtestwus2acrhelm.azurecr.io/my-helm-charts/helm-test-chart-v3:0.2.0 that can be pushed to the remote ACR registry tsmacrtestwus2acrhelm.azurecr.io and saved in the my-helm-charts repository.
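
To make the relation between a chart reference and its destination more explicit, here is a minimal Python sketch that splits a reference into registry, repository, and tag. The name split_chart_ref is hypothetical. The rule that the first path segment counts as a registry only when it contains a dot or a colon, or is localhost, is an assumption borrowed from the common convention for container image references. This is not how the Helm CLI itself parses references.

```python
from typing import NamedTuple, Optional


class ChartRef(NamedTuple):
    registry: Optional[str]  # None when the reference has no registry FQDN
    repository: str
    tag: str


def split_chart_ref(ref: str) -> ChartRef:
    """Split '<registry>/<repository>:<tag>' into its parts (illustrative only)."""
    name, sep, tag = ref.rpartition(":")
    if not sep or "/" in tag:
        raise ValueError(f"reference has no tag: {ref!r}")
    first, slash, rest = name.partition("/")
    # Assumption: the first segment is a registry host only if it looks like one.
    if slash and ("." in first or ":" in first or first == "localhost"):
        return ChartRef(first, rest, tag)
    return ChartRef(None, name, tag)
```

With this sketch, helm-test-chart-v3:0.2.0 has no registry part, which matches the fact that it is available only locally, while the other two references carry the ACR login server and a repository path that starts with charts or my-helm-charts.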

Before we can push the charts to the ACR registry, we need to sign in using the following command:

PS C:> helm registry login tsmacrtestwus2acrhelm.azurecr.io --username <myacr_username> --password <myacr_password>

You can use any of the sign-in methods described in the Signing in to ACR Using the Helm CLI section. Make sure you use your own ACR registry login server.

If we push the two charts that have the ACR FQDN, we will see them appear in the Azure portal UI. Here are the commands:

PS C:> helm chart push tsmacrtestwus2acrhelm.azurecr.io/charts/helm-test-chart-v3:0.2.0
The push refers to repository [tsmacrtestwus2acrhelm.azurecr.io/charts/helm-test-chart-v3]
ref: tsmacrtestwus2acrhelm.azurecr.io/charts/helm-test-chart-v3:0.2.0
digest: daf106a05ad2fe075851a3ab80f037020565c75c5be06936179b882af1858e6a
size: 3.7 KiB
name: helm-test-chart-v3
version: 0.2.0
0.2.0: pushed to remote (1 layer, 3.7 KiB total)

PS C:> helm chart push tsmacrtestwus2acrhelm.azurecr.io/my-helm-charts/helm-test-chart-v3:0.2.0
The push refers to repository [tsmacrtestwus2acrhelm.azurecr.io/my-helm-charts/helm-test-chart-v3]
ref: tsmacrtestwus2acrhelm.azurecr.io/my-helm-charts/helm-test-chart-v3:0.2.0
digest: daf106a05ad2fe075851a3ab80f037020565c75c5be06936179b882af1858e6a
size: 3.7 KiB
name: helm-test-chart-v3
version: 0.2.0
0.2.0: pushed to remote (1 layer, 3.7 KiB total)

And here is the result:

An important thing to note here is that:

  • Helm charts saved to ACR using the OCI way will appear in the Azure portal.

The approach here is a bit different than the Helm 2 way. You don’t need to package the chart into a TAR – saving the chart to the local registry is enough.

We need to do one last thing and we are ready to summarize the experience. Let’s use the helm search command to find our charts (of course using Helm 3). Here is the result of the search:

PS C:> helm search repo helm-test-chart 
NAME                           CHART VERSION     APP VERSION     DESCRIPTION 
acrrepo/helm-test-chart-v2     0.1.0             1.0             A Helm chart for Kubernetes 
acrrepo/helm-test-chart-v3     0.1.0             1.16.0          A Helm chart for Kubernetes

It yields the same result as the one we saw in Using Helm 3 to Push and Pull Charts the Helm 2 Way. The reason is that the helm search command doesn’t work for charts stored the OCI way. This is a limitation that the Helm team is working on fixing; it is documented in the GitHub issue Support for OCI registries in helm search #9983.

Summary of the ACR and Helm 3 Experience

To summarize the ACR and Helm 3 experience, here are the main takeaways:

  • First, you can use the Helm 3 CLI in conjunction with the az acr helm command to push and pull charts the Helm 2 way. Those charts will not appear in the Azure portal.
  • You can also use the Helm 3 CLI to (natively) push charts to ACR the OCI way. Those charts will appear in the Azure portal.
  • OCI features are experimental in the Helm 3 client and certain functionalities like helm search and helm repo do not work for charts saved and pushed the OCI way.

Conclusion

To wrap it up, when working with Helm charts and ACR (as well as other OCI compliant registries), you need to be careful which commands you use. As a general rule, always use the Helm 3 CLI and make a conscious decision whether you want to store the charts as OCI artifacts (the OCI way) or using the legacy Helm approach (the Helm 2 way). This should be a transition period and hopefully, at some point in the future, Helm will improve the support for OCI compliant charts and support the same scenarios that are currently enabled for legacy chart repositories.

Here is a summary table that gives a quick overview of what we described in this post.

Functionality                      Helm 2 CLI (legacy)     Helm 3 (legacy)        Helm 3 (OCI)
helm repo add                      Yes                     Yes                    No
helm search                        Yes                     Yes                    No
helm chart push                    No                      No                     Yes
helm chart list                    No                      No                     Yes
az acr helm push                   Yes                     Yes                    No
Chart appears in Azure portal      No                      No                     Yes
Example chart                      helm-test-chart-v2      helm-test-chart-v3     helm-test-chart-v3
Example chart version              0.1.0                   0.1.0                  0.2.0

With the recent Solorigate incident, a lot of emphasis is being put on determining the origin of the software running in an enterprise. For Docker container images, this means embedding in the image the Dockerfile that the image was built from. However, tracking down the origin of software is not trivial. For closed-source software, we blindly trust the vendors and, if we are lucky, we may get a signed piece of code. For open-source software, we rarely check the SHA checksum and never even think of verifying what source code the binary was produced from. In talks with customers, I quite often hear them asking how they can verify what sources a container image is built from. They want to attribute each image with metadata that links to the Dockerfile used to build the image, as well as the Git commit and the developer who triggered the build.

There are many articles that discuss this problem. Here are two recent examples. Richard Lander from the Microsoft .NET team writes in his blog post Staying safe with .NET containers about the pedigree and provenance of the software we run and how to think about it. Josh Hendrick in his post Embedding source code version information in Docker images offers one solution to the problem.

Josh Hendrick’s proposal is in the direction I would go, but one problem I have with it is that it requires special handling in the application that runs in the container to obtain this information. I would prefer to have this information readily available without the need to run the container image. Docker images and the Open Container Initiative already have specified ways to do that without adding special files to your image. In this post, I will outline another way you can embed this information into your images and easily retrieve it without any changes to your application.

Using Docker Image Labels

The Docker image specification already has built-in functionality for adding labels to an image. Labels are intended to be set at build time. They also show up when inspecting the image using docker image inspect, which makes them the right choice for recording the Dockerfile and the other build origin details. One more argument in their favor is that labels are part of the image itself and thus immutable: they are stored in the image configuration, so if you change a label in an image, the resulting image SHA will change.

To demonstrate how labels can be used to embed the Dockerfile and other origin information into the Docker image, I have published a dynamic labels sample on GitHub. The sample uses a base Python image and implements a simple functionality to print the container’s environment variables. Let’s walk through it step by step.

The Dockerfile is quite simple.

FROM python:slim
ARG IMAGE_COMMITTER
ARG IMAGE_DOCKERFILE
ARG IMAGE_COMMIT_SHA
LABEL "build.user"=${IMAGE_COMMITTER}
LABEL "build.sha"=${IMAGE_COMMIT_SHA}
LABEL "build.dockerfile"=${IMAGE_DOCKERFILE}
ADD ./samples/dynamic-labels/source /
CMD ["python", "/show_environment.py"]

Lines 2-4 define the build arguments that need to be set during the build of the image. Lines 5-7 set the three labels build.user, build.sha, and build.dockerfile that we want to embed in the image. build.dockerfile is the URL to the Dockerfile in the GitHub repository, while build.sha is the Git commit that triggers the build. If you build the image locally with some dummy build arguments, you will see that new layers are created for each of the lines 5-7.

toddysm@MacBook-Pro ~ % docker build -t test --build-arg IMAGE_COMMITTER=toddysm --build-arg IMAGE_DOCKERFILE=https://test.com --build-arg IMAGE_COMMIT_SHA=12345 -f ./samples/dynamic-labels/Dockerfile .
Sending build context to Docker daemon  376.3kB
Step 1/9 : FROM python:slim
 ---> 8c84baace4b3
Step 2/9 : ARG IMAGE_COMMITTER
 ---> Running in 71ad05f20d20
Removing intermediate container 71ad05f20d20
 ---> fe56c62b9903
Step 3/9 : ARG IMAGE_DOCKERFILE
 ---> Running in fe468c44e9fc
Removing intermediate container fe468c44e9fc
 ---> b776dca57bd7
Step 4/9 : ARG IMAGE_COMMIT_SHA
 ---> Running in 849a82225c31
Removing intermediate container 849a82225c31
 ---> 3a4c6c23a699
Step 5/9 : LABEL "build.user"=${IMAGE_COMMITTER}
 ---> Running in fd4bfb8d5b5b
Removing intermediate container fd4bfb8d5b5b
 ---> 2e9be17c48ff
Step 6/9 : LABEL "build.sha"=${IMAGE_COMMIT_SHA}
 ---> Running in 892323d73495
Removing intermediate container 892323d73495
 ---> b7bc6559629d
Step 7/9 : LABEL "build.dockerfile"=${IMAGE_DOCKERFILE}
 ---> Running in 98687b8dd9fb
Removing intermediate container 98687b8dd9fb
 ---> 35e97d273cbc
Step 8/9 : ADD ./samples/dynamic-labels/source /
 ---> 9e71859892b1
Step 9/9 : CMD ["python", "/show_environment.py"]
 ---> Running in 366b1b6c3bea
Removing intermediate container 366b1b6c3bea
 ---> e7cb39a21c2a
Successfully built e7cb39a21c2a
Successfully tagged test:latest

You can inspect the image and see the labels by issuing the command docker image inspect --format='{{json .Config.Labels}}' <imagename>.

toddysm@MacBook-Pro ~ % docker image inspect --format='{{json .Config.Labels}}' test | jq
{
  "build.dockerfile":"https://test.com",
  "build.sha":"12345",
  "build.user":"toddysm"
}
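
If you want to read the labels from a script instead of the terminal, the same inspect command can be wrapped in a few lines of Python. This is a minimal sketch: the function names parse_labels and read_labels are hypothetical, and read_labels assumes that the docker CLI is installed and the image exists locally. Note that the inspect command prints null for an image that has no labels, so the sketch maps that case to an empty dictionary.

```python
import json
import subprocess
from typing import Dict


def parse_labels(raw: str) -> Dict[str, str]:
    """Parse the JSON printed by the inspect command; 'null' means no labels."""
    labels = json.loads(raw)
    return labels if labels is not None else {}


def read_labels(image: str) -> Dict[str, str]:
    """Run docker image inspect for `image` and return its labels.

    Assumes a local docker CLI; raises CalledProcessError if the command fails.
    """
    result = subprocess.run(
        ["docker", "image", "inspect", "--format", "{{json .Config.Labels}}", image],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_labels(result.stdout)
```

For the test image built above, read_labels("test") would return a dictionary with the build.user, build.sha, and build.dockerfile keys.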

Now, let’s automate the process with the help of GitHub Actions. I have created one GitHub Action to build and push the image to DockerHub and another to build and push to Azure Container Registry (ACR). Both actions are similar in the steps they use. The first two steps are the same for both actions. They will build the URL to the Dockerfile using the corresponding GitHub Actions variables:

- name: 'Set environment variable for Dockerfile URL for push'
  if: ${{ github.event_name == 'push' }}
  run: echo "DOCKERFILE_URL=${GITHUB_SERVER_URL}/${GITHUB_REPOSITORY}/blob/${GITHUB_REF#refs/*/}/samples/dynamic-labels/Dockerfile" >> $GITHUB_ENV

- name: 'Set environment variable for Dockerfile URL for pull request'
  if: ${{ github.event_name == 'pull_request' }}
  run: echo "DOCKERFILE_URL=${GITHUB_SERVER_URL}/${GITHUB_REPOSITORY}/blob/${GITHUB_BASE_REF#refs/*/}/samples/dynamic-labels/Dockerfile" >> $GITHUB_ENV
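
The shell expansion ${GITHUB_REF#refs/*/} in the steps above removes the shortest leading match of refs/*/ from the ref, so that refs/heads/development becomes development. To illustrate how the URL is assembled, here is a minimal Python sketch of the same logic. The function names strip_ref_prefix and dockerfile_url are hypothetical, and the regular expression is my approximation of the shell pattern, not something taken from the workflow itself.

```python
import re


def strip_ref_prefix(ref: str) -> str:
    """Mimic ${VAR#refs/*/}: drop the shortest leading 'refs/<kind>/' if present."""
    return re.sub(r"^refs/[^/]*/", "", ref, count=1)


def dockerfile_url(
    server_url: str,
    repository: str,
    ref: str,
    path: str = "samples/dynamic-labels/Dockerfile",
) -> str:
    """Build the browsable URL of the Dockerfile for a given branch or tag ref."""
    return f"{server_url}/{repository}/blob/{strip_ref_prefix(ref)}/{path}"
```

A value without the refs/ prefix, such as the base branch name that GITHUB_BASE_REF carries for pull requests, passes through unchanged.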

Then, there will be specific steps to sign into DockerHub or Azure. After that, the build steps are the ones where the labels are set. Here, for example, is the build step that uses buildx and automatically pushes the image to DockerHub:

- name: Build and push
  id: docker_build
  uses: docker/build-push-action@v2
  with:
    context: ./
    file: ./samples/dynamic-labels/Dockerfile
    push: true
    tags: ${{ secrets.DOCKER_HUB_REPONAME }}:build-${{ github.run_number }}
    build-args: |
      IMAGE_COMMITTER=${{ github.actor }}
      IMAGE_DOCKERFILE=${{ env.DOCKERFILE_URL }}
      IMAGE_COMMIT_SHA=${{ github.sha }}

The build step for building the image and pushing to Azure Container Registry uses the traditional docker build approach:

- name: Build and push
  id: docker_build
  uses: azure/docker-login@v1
  with:
    login-server: ${{ secrets.ACR_REGISTRY_LOGIN_SERVER }}
    username: ${{ secrets.ACR_REGISTRY_USERNAME }}
    password: ${{ secrets.ACR_REGISTRY_PASSWORD }}
- run: |
    docker build -f ./samples/dynamic-labels/Dockerfile -t ${{ secrets.ACR_REGISTRY_LOGIN_SERVER }}/${{ secrets.ACR_REPOSITORY_NAME }}:build-${{ github.run_number }} --build-arg IMAGE_COMMITTER=${{ github.actor }} --build-arg IMAGE_DOCKERFILE=${{ env.DOCKERFILE_URL }} --build-arg IMAGE_COMMIT_SHA=${{ github.sha }} .
    docker push ${{ secrets.ACR_REGISTRY_LOGIN_SERVER }}/${{ secrets.ACR_REPOSITORY_NAME }}:build-${{ github.run_number }}

After the actions complete, the images are available in DockerHub and Azure Container Registry. Here is what the image looks like in DockerHub:

Docker container image with labels

If you scroll down a little, you will see the labels that appear in the list of layers:

The URL points you to the Dockerfile that was used to create the image, while the commit SHA identifies the latest change in the project at the time the image was built. If you pull the image locally, you can also see the labels using the docker image inspect command:

toddysm@MacBook-Pro ~ % docker pull toddysm/tmstests:build-36
build-36: Pulling from toddysm/tmstests
45b42c59be33: Already exists
8cd3485318db: Already exists
2f564129f025: Pull complete
cf1573f5a21e: Pull complete
ceec8aed2dab: Pull complete
78b1088f77a0: Pull complete
Digest: sha256:7862c2a31970916fd50d3ab38de0dad74a180374d41625f014341c90c4b55758
Status: Downloaded newer image for toddysm/tmstests:build-36
docker.io/toddysm/tmstests:build-36
toddysm@MacBook-Pro ~ % docker image inspect --format='{{json .Config.Labels}}' toddysm/tmstests:build-36
{
  "build.dockerfile":"https://github.com/CrimsonPinnacle/container-image-inspector/blob/development/samples/dynamic-labels/Dockerfile",
  "build.sha":"e80e6ef86f86a11d6a73aea8d8c41700c4d3d7c5",
  "build.user":"toddysm"
}

To summarize, the benefit of using labels for embedding the Dockerfile and other origin information into the container images is that those are considered immutable layers of the image. Thus, they cannot be changed without changing the image.

Who is Using Docker Image Labels?

Unfortunately, labels are not widely used if at all 🙁 Checking several popular images from DockerHub yields the following results:

toddysm@MacBook-Pro ~ % docker image inspect --format='{{json .Config.Labels}}' busybox | jq
null
toddysm@MacBook-Pro ~ % docker image inspect --format='{{json .Config.Labels}}' alpine | jq 
null
toddysm@MacBook-Pro ~ % docker image inspect --format='{{json .Config.Labels}}' ubuntu | jq 
null

Tracking down the sources from which the Alpine image is built would require much higher effort.
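
A check for the origin labels is easy to automate. The following is a minimal sketch that reports which of the three labels used in this post are missing from an image. The function name missing_origin_labels is hypothetical, and the label keys are the ones from my sample Dockerfile, not a standard that other images follow.

```python
from typing import Dict, List, Optional

# Label keys used by the sample Dockerfile in this post (not a standard).
ORIGIN_LABELS = ("build.user", "build.sha", "build.dockerfile")


def missing_origin_labels(labels: Optional[Dict[str, str]]) -> List[str]:
    """Return the origin label keys that are absent or empty.

    `labels` is the parsed .Config.Labels value; None stands for the
    null that docker image inspect prints when an image has no labels.
    """
    present = labels or {}
    return [key for key in ORIGIN_LABELS if not present.get(key)]
```

For images like busybox, alpine, and ubuntu above, where the inspect command prints null, the function reports all three labels as missing.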

What is Next for Checking Docker Image Origins?

There are a few community initiatives that will play a role in determining the origin of container images.

  • Notary V2 will allow images to be signed. Having the origin information embedded into the image and adding an official signature to the image will increase the confidence in the legitimacy of the image.
  • The OCI manifest specification allows artifacts (i.e. images) to be annotated with arbitrary metadata. Unfortunately, Docker doesn’t support those annotations yet. Hopefully, in the future, Docker will add support for arbitrary metadata that can be included in the image manifest.
  • An implementation of a metadata service (see the metadata service draft from Steve Lasker) as part of the registry will enable additional capabilities to provide origin information for the images.

Summary

While image metadata is great to annotate images with useful information and enable search and querying capabilities, the metadata is kept outside of the image layers and can mutate over time. Verifying the authenticity of the metadata and keeping a history of the changes will be a harder problem to solve. Docker already provides a way to embed the Dockerfile and other image origin information as immutable layers of the image itself. Using dynamically populated Docker image labels, developers can right now provide origin information and increase the supply chain confidence for their images.