Implementing Quarantine Pattern for Container Images
One important step in securing the supply chain for containers is preventing the use of “bad” images. I intentionally use the word “bad” here. For one enterprise, “bad” may mean “vulnerable”; for another, it may mean containing software with an unapproved license; for a third, it may be an image with a questionable signature; possibilities are many. Also, “bad” images may be OK to run in one environment (for example, the local machine of a developer for bug investigation) but not in another (for example, the production cluster). Lastly, the use of “bad” images needs to be verified in many phases of the supply chain – before release for internal use, before build, before deployment, and so on. The decision of whether a container image is “bad” cannot be made in advance and depends on the consumer of the image. One common way to prevent the use of “bad” images is the so-called quarantine pattern. The quarantine pattern prevents an image from being used unless certain conditions are met.
Let’s look at a few scenarios!
Scenarios That Will Benefit from a Quarantine Pattern
Pulling images from public registries and using them in your builds or deployments is risky. Such public images may contain vulnerabilities or malware. Using them as base images or deploying them to your production clusters bypasses any possible security checks, compromising your containers’ supply chain. For that reason, many enterprises ingest the images from a public registry into an internal registry where they can perform additional checks like vulnerability or malware scans. In the future, they may sign the images with an internal certificate, generate a Software Bill of Materials (SBOM), add provenance data, or something else. Once those checks are done (or additional data about the image is generated), the image is released for internal use. The public images stay in the quarantined registry before they are made available for use in the internal registry with “golden” (or “blessed” 🙂) images.
Another scenario is where an image is used as a base image to build a new application image. Let’s say that two development teams use the debian:bullseye-20220228 as a base image for their applications. The first application uses libc-bin, while the second one doesn’t. libc-bin in that image has several critical and high severity vulnerabilities. The first team may not want to allow the use of the debian:bullseye-20220228 as a base image for their engineers, while the second one may be OK with it because the libc-bin vulnerabilities may not impact their application. You need to selectively allow the image to be used in the second team’s CI/CD pipeline but not in the first.
In the deployment scenario, teams may be OK deploying images with the developers’ signatures in the DEV environments, while the PROD ones should only accept images signed with the enterprise keys.
As you can see, deciding whether an image should be allowed for use or not is not a binary decision, and it depends on the intention of its use. In all scenarios above, an image has to be “quarantined” and restricted for certain use but allowed for another.
Options for Implementing the Quarantine Pattern
So, what are the options to implement the quarantine pattern for container images?
Using a Quarantine Flag and RBAC for Controlling the Access to an Image
This is the most basic but least flexible way to implement the quarantine pattern. Here is how it works!
When the image is pushed to the registry, the image is immediately quarantined, i.e. the quarantine flag on the image is set to TRUE. A separate role like QuarantineReader is created in the registry and assigned to the actor or system allowed to perform tasks on the image while in quarantine. This role allows the actor or system to pull the image to perform the needed tasks. It also allows changing the quarantine flag from TRUE to FALSE when the task is completed.
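To make the mechanics concrete, here is a minimal sketch of the flag-plus-role model in Python. It is an in-memory toy, not the API of any real registry: the Registry class, its methods, and the actor names are assumptions made up for illustration; only the quarantine flag and the QuarantineReader role come from the description above.

```python
from dataclasses import dataclass, field


@dataclass
class Image:
    reference: str
    quarantine: bool = True  # set to TRUE as soon as the image is pushed


@dataclass
class Registry:
    """Hypothetical in-memory registry; class and method names are illustrative."""
    images: dict = field(default_factory=dict)  # reference -> Image
    roles: dict = field(default_factory=dict)   # actor -> set of role names

    def push(self, reference: str) -> None:
        # Every pushed image starts its life in quarantine.
        self.images[reference] = Image(reference)

    def _is_quarantine_reader(self, actor: str) -> bool:
        return "QuarantineReader" in self.roles.get(actor, set())

    def pull(self, actor: str, reference: str) -> Image:
        image = self.images[reference]
        # Quarantined images are visible only to QuarantineReader actors.
        if image.quarantine and not self._is_quarantine_reader(actor):
            raise PermissionError(f"{reference} is quarantined")
        return image

    def release(self, actor: str, reference: str) -> None:
        # Only a QuarantineReader may change the flag from TRUE to FALSE.
        if not self._is_quarantine_reader(actor):
            raise PermissionError(f"{actor} may not change the quarantine flag")
        self.images[reference].quarantine = False
```

Note that the decision is all or nothing: once any QuarantineReader calls release, every consumer can pull the image.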
The problem with this approach becomes obvious in the scenarios above. Take, for example, the ingestion of public images scenario. In this scenario, you have more than one actor that needs to modify the quarantine flag: the vulnerability scanner, the malware scanner, the signer, etc., before the images are released for internal use. All those tasks are done outside the registry, and some of them may run on a schedule or take a long time to complete (vulnerability and malware scans, for example). All those systems need to be assigned the QuarantineReader role and allowed to flip the flag when done. The problem, though, is that you need to synchronize between those services and change the quarantine flag from TRUE to FALSE only after all the tasks are completed.
Managing concurrency between tasks is a non-trivial job. It complicates the implementation logic for the registry clients because they need to interact with each other or with an external system that synchronizes all tasks and keeps track of their state. The alternative is to implement this logic in the registry itself, which I would not recommend.
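To give a feel for the extra moving part this introduces, below is a rough Python sketch of such an external tracker. The QuarantineCoordinator class and the task names are hypothetical; the point is only that something has to remember which tasks have finished for each image and signal when the flag may finally be flipped.

```python
import threading


class QuarantineCoordinator:
    """Hypothetical external tracker for the tasks that must finish before release."""

    def __init__(self, required_tasks):
        self._required = frozenset(required_tasks)
        self._completed = {}  # image reference -> set of finished task names
        self._lock = threading.Lock()  # tasks may report from different threads

    def report_done(self, reference: str, task: str) -> bool:
        """Record a finished task.

        Returns True once every required task has reported for the image,
        i.e. when the quarantine flag may be changed from TRUE to FALSE.
        """
        if task not in self._required:
            raise ValueError(f"unknown task: {task}")
        with self._lock:
            done = self._completed.setdefault(reference, set())
            done.add(task)
            return done == self._required
```

Even this toy version hardcodes the list of required tasks, and a real one would also need persistence, retries, and handling for tasks that fail or never report.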
One additional issue with this approach is its extensibility. What if you need to add one more task to the list of things that you want to do on the image before it is allowed for use? You need to crack open the code and implement the hooks to the new system.
Lastly, some of the scenarios above are not possible at all. If you need to restrict access to the image to one team and not another, the only way to do it is to assign the QuarantineReader role to the first team. This is not optimal, though, because the role is meant to be assigned only to systems that perform tasks to take the image out of quarantine, not to consumers that use the image for other purposes. Also, if you want to make decisions based on the content of vulnerability reports or SBOMs, this quarantine flag approach is not applicable at all.
Using Declarative Policy Approach
A more flexible approach is to use a declarative policy. The registry can be used to store all necessary information about the image, including vulnerability and malware reports, SBOMs, provenance information, and so on. Well, soon, registries will be able to do that 🙂 If your registry supports ORAS reference types, you can start saving those artifacts right now. In the future, thanks to the Reference Types OCI Working Group, every OCI-compliant registry should be able to do the same. How does that work?
When the image is initially pushed to the registry, no other artifacts are attached to it. Each individual system that needs to perform a task on the image can run on its own schedule. Once it completes the task, it pushes a reference type artifact to the registry with the subject of the image in question. Every time the image is pulled from the registry, the policy evaluates if the required reference artifacts are available; if not, the image is not allowed for use. You can define different policies for different situations as long as the policy engine understands the artifact types. Not only that, but you can even make decisions on the content of the artifacts as long as the policy engine is intelligent enough to interpret those artifacts.
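The evaluation step described above can be sketched in a few lines of Python. Everything here is an assumption for illustration: the artifact type strings are made up, the Image class is an in-memory stand-in for the registry's list of reference artifacts, and a real setup would delegate the check to a policy engine rather than a helper function.

```python
from dataclasses import dataclass, field

# Hypothetical artifact type identifiers; real values depend on the producing tools.
VULN_REPORT = "example/vulnerability-report"
MALWARE_REPORT = "example/malware-report"
ENTERPRISE_SIGNATURE = "example/enterprise-signature"


@dataclass
class Image:
    reference: str
    # Types of the reference artifacts whose subject is this image.
    attached_artifact_types: set = field(default_factory=set)

    def attach(self, artifact_type: str) -> None:
        # Each system pushes its artifact on its own schedule.
        self.attached_artifact_types.add(artifact_type)


# Declarative policies: the artifact types each environment requires.
POLICIES = {
    "dev": {VULN_REPORT},
    "prod": {VULN_REPORT, MALWARE_REPORT, ENTERPRISE_SIGNATURE},
}


def is_allowed(image: Image, required_artifact_types) -> bool:
    """Allow the image only if every required artifact type is attached."""
    return set(required_artifact_types) <= image.attached_artifact_types
```

No coordination between the scanners and signers is needed here: an image simply becomes usable for a given environment as soon as the last required artifact shows up.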
Using the declarative policy approach, the same image will be allowed for use by clients with different requirements. Extending this is as simple as implementing a new policy, which in most cases doesn’t require coding.
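As an illustration of a content-based decision, here is a small Python sketch for the two-team base image scenario from earlier. The report format, the team names, and the policy fields are all hypothetical, and the findings are sample data rather than real scanner output.

```python
# Illustrative findings only; a real report would come from a vulnerability scanner.
SAMPLE_REPORT = [
    {"package": "libc-bin", "severity": "CRITICAL"},
    {"package": "libc-bin", "severity": "HIGH"},
]

# Hypothetical per-team policies expressed as data, not code.
TEAM_POLICIES = {
    "team-one": {
        "packages_in_use": {"libc-bin"},
        "blocked_severities": {"CRITICAL", "HIGH"},
    },
    "team-two": {
        "packages_in_use": set(),  # this team's application does not use libc-bin
        "blocked_severities": {"CRITICAL", "HIGH"},
    },
}


def is_allowed_for(team: str, vulnerability_report: list) -> bool:
    """Deny the image if a blocked-severity finding affects a package the team uses."""
    policy = TEAM_POLICIES[team]
    return not any(
        finding["package"] in policy["packages_in_use"]
        and finding["severity"] in policy["blocked_severities"]
        for finding in vulnerability_report
    )
```

With the sample report, the same image is rejected for team-one and accepted for team-two, and onboarding a third team only means adding another entry to the policy data.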
Where Should the Policy Engine be Implemented?
Of course, the question that gets raised is where the policy engine should be implemented – as part of the registry or outside of it. I think registries are intended to store information, not to make policy decisions. Think of a registry as yet another storage system – it has access control implemented, but the only business logic it holds is how to manage the data. Besides that, there are already many policy engines available, and different systems are already integrated with them. OPA is the one that immediately comes to mind, and it is flexible enough to enable this functionality relatively easily. Adding one more engine as part of the registry will just increase the overhead of managing policies.
Summary
To summarise, using a declarative policy-based approach to control who should and shouldn’t be able to pull an artifact from the registry is more flexible and extensible. Adding capabilities to the policy engines to understand the artifact types and act on those will allow enterprises to develop custom controls tailored to their own needs. In the future, when policy engines can understand the content of each artifact, those policies will be able to evaluate SBOMs, vulnerability reports, and other content. This will open new opportunities to define fine-grained controls for the use of registry artifacts.