VM at the Speed of Cloud: Cloud Native Vulnerability Management When the Estate Won't Stay Still
- Christopher Clarkson
- Mar 24
- 12 min read
Episode 7 of the CAXA Technologies Security Operations Series
If a container lives for 60 seconds and your scanner runs on a schedule, you do not have a cloud VM programme. You have a cloud visibility gap with a reporting cadence attached to it.
That framing sounds extreme until you look at the data. Sysdig’s 2025 Cloud-Native Security and Usage Report found that 60% of containers now live for 60 seconds or less. In 2019, half of containers lasted at least five minutes. The trend is clear: workloads are getting more ephemeral, not less, and the gap between container lifespan and scan interval is widening in the wrong direction.
Most VM programmes were not designed for this. They were designed for a world where the estate is a list of known hosts that persist, can be scheduled for scanning, and can be patched in place once a vulnerability is found. That model still works for long-lived virtual machines and physical servers. It does not work for containers, auto-scaling groups, serverless functions, or any infrastructure provisioned on-demand from code. The estate changes faster than the scanner can follow it.
This episode is about what Cloud Native Vulnerability Management looks like when the asset model changes underneath it: how detection needs to shift to earlier in the lifecycle, where ownership sits when developers provision infrastructure, and what happens to your remediation SLA framework when the thing you are supposed to be patching might not exist by the time you have finished triaging it.
Why the Traditional VM Scanning Model Fails for Cloud-Native Infrastructure
Traditional vulnerability scanning is built on a straightforward premise: you maintain an inventory of your assets, you run a credentialed scan against them on a schedule, and you triage and remediate what the scanner finds. The scan frequency determines your detection window. A weekly scan gives you a seven-day detection window. A daily scan gives you a one-day window. The shorter the window, the faster you find new vulnerabilities, and the more accurately your open-findings list reflects the current state of the estate.
This model has one structural dependency: the assets need to still be there when the scanner arrives.
For a server estate, that dependency holds. Servers have maintenance windows, change processes, and typically live for months or years. Even EC2 instances in a traditional lift-and-shift migration tend to be long-lived by cloud standards. The periodic scanning model works because the thing being scanned is still there.
For containerised workloads, that dependency fails completely. A container destroyed before the scanner reaches it is not a gap that closes on the next scan cycle; it is permanently invisible to the programme. And if 60% of containers live for under a minute, the scanner does not need to be running weekly to miss them. It can run every 15 minutes and still see almost nothing of a typical containerised estate.
The deeper problem is that scan coverage, the metric many programmes report to demonstrate VM effectiveness, becomes meaningless in this environment. You can show 95% coverage against your registered asset inventory while genuinely scanning 0% of your containerised estate, because the containerised estate never appeared in the inventory in the first place. NIST CSF 2.0’s Identify function (ID.AM-01 and ID.AM-02) requires continuous asset inventory, not a periodic one, precisely because static discovery cannot keep pace with dynamic infrastructure. Most implementations have not caught up with that requirement.
Scanning the wrong layer
The failure mode I see most often is not that organisations are scanning infrequently. It is that they are scanning the wrong layer. A team can have strong scan coverage across their cloud infrastructure (EC2 instances, managed databases, Kubernetes nodes) while having zero visibility into the containerised workloads running on top of it. The infrastructure and the workloads are different layers, and a scanner that reaches the node does not automatically see inside the containers running on that node.
Compound this with the dynamics of modern cloud architecture: auto-scaling groups that spin instances up and down based on load; serverless functions that execute on demand and have no persistent presence; blue-green deployments that replace running containers wholesale rather than patching them; and infrastructure provisioned entirely through Terraform or Helm charts that creates and destroys resources faster than any manual inventory process can track.
The asset inventory stops being a list and starts being a stream. The vulnerability programme has to treat it accordingly, which means the tool choice and the detection model both need to change.
Detection at the right point in the lifecycle
The answer is not to scan faster. It is to move detection to the point in the lifecycle where the artefact is stable enough to scan and where a finding can actually result in a fix.
For containerised workloads, that point is the image: the package of code, runtime, and dependencies that gets built once and deployed many times. A container is ephemeral. The image it was launched from is not. Scan the image, and you have scanned every container that will ever run from it.
The CNCF Cloud Native Security Whitepaper (v2, May 2022) frames this as a lifecycle model: Develop, Distribute, Deploy, Runtime. Each stage has different security tooling appropriate to it. For VM purposes, the relevant stages are Distribute (image-layer scanning before the image is pushed or deployed) and Runtime (continuous posture assessment of what is actually running, via the cloud control plane rather than agents inside containers).
Image-layer scanning (Trivy and Grype are the common open-source tools; AWS ECR Enhanced Scanning and Google Artifact Registry have this built in) works by parsing the layers of a container image and comparing installed packages against CVE databases. The trade-off worth understanding is reachability: image scanners report vulnerabilities in every package present in the image, regardless of whether the application actually calls that code. At scale, this produces a volume of findings that has never been filtered for reachability, and developers quickly learn to route around gates that block them over vulnerabilities in libraries the application never uses. Episode 6’s prioritisation model (CVSS + EPSS + KEV) still applies here, but the first filter is whether the vulnerable component is reachable by the running application.
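As a minimal sketch of what that looks like as a pipeline step, the following assumes the Trivy CLI is installed and the image is pullable; the image reference is purely illustrative. It scans the image, parses the JSON report, and collects the high and critical findings that would then go through the reachability and prioritisation filters described above.

```python
import json
import subprocess

def scan_image(image_ref: str) -> list[dict]:
    """Run Trivy against a container image and return HIGH/CRITICAL findings."""
    result = subprocess.run(
        ["trivy", "image", "--format", "json", "--severity", "HIGH,CRITICAL", image_ref],
        capture_output=True, text=True, check=True,
    )
    report = json.loads(result.stdout)

    findings = []
    # Trivy groups results per target: OS packages, language lockfiles, etc.
    for target in report.get("Results", []):
        for vuln in target.get("Vulnerabilities", []) or []:
            findings.append({
                "cve": vuln["VulnerabilityID"],
                "package": vuln["PkgName"],
                "severity": vuln["Severity"],
                "fixed_version": vuln.get("FixedVersion"),  # absent when no fix exists yet
            })
    return findings

if __name__ == "__main__":
    # Hypothetical image reference, for illustration only
    for finding in scan_image("ghcr.io/example-org/payments-api:1.4.2"):
        print(finding)
```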
Build-time scanning is necessary but not sufficient on its own. An image that passes scanning today becomes vulnerable tomorrow when a new CVE is disclosed against a package it contains. The image does not change; the CVE database does. The answer is continuous registry rescanning: cloud-native registries (AWS ECR, Google Artifact Registry) can re-evaluate stored images against updated CVE databases without requiring a new build; GitHub Container Registry requires a pipeline step (a scheduled Trivy or Grype workflow) to achieve the equivalent. An image that passed clean six months ago is re-assessed when a new high-severity CVE lands. This closes the point-in-time gap without requiring every image to be rebuilt on a schedule.
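For a registry without built-in re-evaluation, a scheduled job along these lines is one way to approximate continuous rescanning. The file names and the way deployed images are tracked are assumptions standing in for whatever inventory you already hold; the point is the delta: re-scan stored images against today’s CVE database and surface only what is new since the last pass.

```python
import json
import subprocess
from pathlib import Path

# Assumed bookkeeping files: one image reference per line, and a baseline of
# CVE IDs already seen (and presumably triaged) per image.
DEPLOYED_IMAGES = Path("deployed_images.txt")
BASELINE = Path("known_findings.json")

def current_cves(image_ref: str) -> set[str]:
    """Re-evaluate a stored image against today's CVE database with Trivy."""
    out = subprocess.run(
        ["trivy", "image", "--format", "json", "--severity", "HIGH,CRITICAL", image_ref],
        capture_output=True, text=True, check=True,
    ).stdout
    report = json.loads(out)
    return {
        vuln["VulnerabilityID"]
        for target in report.get("Results", [])
        for vuln in target.get("Vulnerabilities", []) or []
    }

def rescan() -> dict[str, list[str]]:
    """Return CVEs that are new since the last pass, keyed by image."""
    known = json.loads(BASELINE.read_text()) if BASELINE.exists() else {}
    new_findings: dict[str, list[str]] = {}
    for image in DEPLOYED_IMAGES.read_text().splitlines():
        found = current_cves(image)
        delta = found - set(known.get(image, []))
        if delta:
            new_findings[image] = sorted(delta)  # candidates for triage and routing
        known[image] = sorted(found)
    BASELINE.write_text(json.dumps(known, indent=2))
    return new_findings

if __name__ == "__main__":
    print(rescan())
```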
CSPM and CNAPP tools (Wiz, Orca, and others in this class) take a different approach that addresses the ephemeral scanning problem directly. Rather than deploying agents to running workloads, they connect to the cloud provider API and read the cloud control plane: what images are deployed, where, with what configuration. Because they are reading inventory from the cloud API rather than scanning running containers, they see workloads that a traditional scanner would miss entirely. The scanner does not need to reach a container that lives 60 seconds; it already knows what image that container was launched from because it read the deployment record from the cloud provider.
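A rough illustration of that control-plane read, using ECS as the example and boto3 as the client. A real CSPM covers every compute service and account; the principle is the same either way: the deployment record outlives the container.

```python
import boto3

def deployed_images_from_control_plane() -> dict[str, set[str]]:
    """Build an inventory of images currently running on ECS by reading the
    cloud control plane, without touching any running container.

    A minimal sketch: EKS, Lambda, and other compute would need equivalent
    API reads in a real implementation.
    """
    ecs = boto3.client("ecs")
    inventory: dict[str, set[str]] = {}

    for cluster in ecs.list_clusters()["clusterArns"]:
        paginator = ecs.get_paginator("list_tasks")
        for page in paginator.paginate(cluster=cluster, desiredStatus="RUNNING"):
            if not page["taskArns"]:
                continue
            tasks = ecs.describe_tasks(cluster=cluster, tasks=page["taskArns"])["tasks"]
            for task in tasks:
                for container in task.get("containers", []):
                    # The image reference survives even if this task is gone in
                    # sixty seconds: it points back at the registry artefact.
                    image = container.get("imageDigest") or container.get("image")
                    inventory.setdefault(cluster, set()).add(image)
    return inventory

if __name__ == "__main__":
    for cluster, images in deployed_images_from_control_plane().items():
        print(cluster, len(images), "distinct images")
```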
The practical limitation of agentless CSPM is the flip side of its strength: reading the cloud inventory tells you what the workload contains, not what it is doing at runtime. For most VM purposes, knowing what packages are deployed and whether those packages carry known vulnerabilities is sufficient. Where you need runtime telemetry (whether a vulnerability is being actively exploited, or whether a suspicious process is running inside a container) you need a runtime sensor, which reintroduces some of the complexity you were avoiding. Tools that layer runtime awareness on top of an agentless inventory model (connecting runtime behaviour to the image-level findings) represent the architecture worth understanding if you need both coverage and runtime context.
IaC scanning (Checkov, tfsec) sits adjacent to all of this. It analyses Terraform, Helm charts, and Kubernetes manifests for security misconfigurations before the infrastructure is provisioned. This is primarily a posture and configuration category rather than a CVE/software vulnerability category, but it belongs in this conversation because IaC is where the configuration choices live that determine whether vulnerabilities in your application layer can be exploited. Whether a container runs as root, whether network policies restrict lateral movement, whether secrets are hardcoded in manifests: these are not software vulnerabilities, but they are the conditions that transform a medium-severity CVE into a critical exposure. The right place to catch them is before apply, not after.
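A sketch of a pre-apply gate built on Checkov’s JSON output. The fail-on-anything behaviour is deliberately naive; a real gate would filter to an agreed policy set or severity threshold rather than blocking on every failed check.

```python
import json
import subprocess
import sys

def iac_gate(iac_dir: str) -> int:
    """Run Checkov against a Terraform/Kubernetes directory and return a
    non-zero exit code if any check fails, so the pipeline stops before apply."""
    proc = subprocess.run(
        ["checkov", "-d", iac_dir, "-o", "json"],
        capture_output=True, text=True,
    )
    report = json.loads(proc.stdout)
    # Checkov emits one report per framework (terraform, kubernetes, ...);
    # normalise so a single-framework run is handled the same way.
    reports = report if isinstance(report, list) else [report]

    failed = []
    for r in reports:
        failed.extend(r.get("results", {}).get("failed_checks", []))

    for check in failed:
        print(f'{check["check_id"]}: {check["check_name"]} '
              f'({check["file_path"]} -> {check["resource"]})')
    return 1 if failed else 0

if __name__ == "__main__":
    sys.exit(iac_gate(sys.argv[1] if len(sys.argv) > 1 else "."))
```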
Serverless functions introduce a distinct pattern alongside containers. A Lambda function, or its equivalent in GCP Cloud Functions or Azure Functions, never has a persistent runtime to scan: it executes in response to an event and terminates. The vulnerability surface is what you bundle into the deployment package: the dependencies included in the zip archive or container image that the function runs from. The underlying runtime (the Python, Node.js, or Java environment the cloud provider supplies) is managed by the provider. CVEs in the runtime are their problem. CVEs in your bundled libraries are yours.
For functions deployed as container images, the detection model is identical to standard container scanning: image-layer scanning at build time, continuous registry rescanning as new CVEs are disclosed. For functions deployed as zip packages, the equivalent is a filesystem scan against the deployment directory before the package is assembled; Trivy supports this without modification. AWS Inspector v2 added native Lambda function scanning in 2022: it pulls the deployment package and evaluates bundled dependencies against CVE databases directly, without requiring pipeline integration. For organisations already running Inspector, enabling Lambda scanning adds serverless coverage without additional tooling.
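For the zip-package path, the pre-packaging filesystem scan can be as small as this. The staged directory path is an assumption; point it at wherever your build installs the function’s bundled dependencies. Trivy’s --exit-code flag turns it into a build gate without any JSON parsing.

```python
import subprocess

def scan_lambda_package_dir(package_dir: str) -> None:
    """Gate a serverless build on a filesystem scan of the staged deployment
    directory, run before the zip is assembled."""
    # --exit-code 1 makes Trivy fail when findings at or above the given
    # severities are present, which check=True then surfaces as an exception.
    subprocess.run(
        ["trivy", "fs", "--severity", "HIGH,CRITICAL", "--exit-code", "1", package_dir],
        check=True,
    )

if __name__ == "__main__":
    # Hypothetical staging directory, e.g. populated by `pip install -t`
    scan_lambda_package_dir("build/lambda_package")
```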
Lambda Layers warrant specific attention. Layers are shared dependency packages referenced by multiple functions. A vulnerable library in a Layer propagates to every function that uses it: the same dynamic as a base image CVE propagating to every container derived from it. Without an inventory of which functions reference which Layers, and a process for rescanning Layers as new CVEs are disclosed, high-severity findings can be silently present across dozens of functions with no single team aware of the blast radius. CSPM tooling that reads Lambda configuration from the cloud control plane gives you the Layer-to-function mapping needed to assess exposure when a CVE lands in a shared dependency.
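A minimal sketch of that mapping, read straight from the Lambda control plane with boto3. It answers the blast-radius question (which functions inherit a CVE in a shared Layer) but leaves the rescanning of Layer contents to whichever scanner you already run.

```python
from collections import defaultdict

import boto3

def layer_to_function_map() -> dict[str, list[str]]:
    """Map each Lambda Layer version ARN to the functions that reference it,
    read from the cloud control plane rather than from running invocations."""
    lam = boto3.client("lambda")
    mapping: dict[str, list[str]] = defaultdict(list)

    paginator = lam.get_paginator("list_functions")
    for page in paginator.paginate():
        for fn in page["Functions"]:
            for layer in fn.get("Layers", []):
                mapping[layer["Arn"]].append(fn["FunctionName"])
    return dict(mapping)

if __name__ == "__main__":
    for layer_arn, functions in layer_to_function_map().items():
        print(f"{layer_arn}: {len(functions)} functions -> {functions}")
```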
The ownership model breaks too
The asset management pillar Episode 3 established assumes an asset has a discoverable owner. In cloud-native infrastructure, ownership of a vulnerability is rarely clean.
When a developer writes a Dockerfile that builds from ubuntu:22.04, who owns a high-severity CVE that lands in the base image? Depending on how the organisation is structured, the answer might be the development team (they referenced the base image), the platform engineering team (they maintain the approved base image catalogue), or security (they set the policy on base image currency and review cycles). Without explicit ownership assignment, the answer in practice is usually nobody: each team believes it belongs to one of the others, and the finding ages out unresolved.
In organisations that have this working well, the split looks like this: the application team owns vulnerabilities in the application layer (packages they explicitly depend on, code they wrote); the platform or security engineering function owns the base image standard (which base images are approved, what version pinning policy applies, how frequently they are updated); and the operations or SRE function owns the infrastructure configuration layer (IaC, cluster configuration, network policies). Security’s role in this model is not to own all the findings, but to own the triage, prioritisation, and SLA framework that makes each owner’s responsibilities explicit and measurable.
This matters for how you configure your scanning and reporting pipeline. If every container image finding routes to “the developer” and every Kubernetes configuration finding routes to “the ops team”, you will get low remediation rates and high noise complaints from both. If findings route to the team that controls the relevant artefact, with context about severity and SLA expectation, you get a workable programme. The routing logic is not a tooling problem; it is a programme design decision that has to be made before the tools are configured.
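The routing decision itself can be expressed very simply once the ownership split exists. The team names and layer labels below are assumptions; in practice the layer usually comes from the scanner output (for example Trivy’s OS-package versus language-package targets, or an IaC report) plus a comparison against the approved base image.

```python
from dataclasses import dataclass

# Illustrative queue names; the mapping is the programme design decision,
# the code is trivial once that decision exists.
OWNERS = {
    "base_image": "platform-engineering",
    "app_dependency": "owning-service-team",
    "iac_config": "sre-operations",
}

@dataclass
class Finding:
    identifier: str   # CVE ID or policy check ID
    artefact: str     # image reference, package path, or IaC file
    layer: str        # one of the OWNERS keys
    severity: str

def route(finding: Finding) -> str:
    """Return the queue a finding should land in, based on which team
    controls the artefact rather than who happened to deploy the workload."""
    return OWNERS.get(finding.layer, "security-triage")  # unknown layers go to triage

if __name__ == "__main__":
    # Placeholder CVE ID and image reference, for illustration only
    f = Finding("CVE-XXXX-YYYY", "ghcr.io/example-org/payments-api:1.4.2",
                "base_image", "HIGH")
    print(route(f))  # -> platform-engineering
```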
A VM programme that lacks this split also creates a specific failure mode around base image vulnerabilities. Because base images are inherited by every application image derived from them, a single unpatched base image can generate the same CVE across dozens or hundreds of application images. Without a team that owns the base image standard, every application team sees the same finding and assumes another team is addressing it at the base level. Nobody is. I have seen vulnerability backlogs dominated by a small number of base image CVEs that had been open for six months because the ownership question had never been resolved.
What happens to your SLAs
Episode 6’s prioritisation model produces a tiered output that feeds directly into remediation SLAs. A programme running that model typically sets its fastest SLA at 24 hours for the highest-priority findings on internet-facing systems, and seven days for the tier below. Those SLAs were built on an assumption worth making explicit: the vulnerable asset is running, persistent, and patchable in place.
For ephemeral workloads, that assumption does not hold. Patching a running container is meaningless: the patch does not persist, and the next deployment launches from the same unpatched image. The remediation unit is the image, not the running container. Fixing a container vulnerability means updating the Dockerfile or the base image dependency, rebuilding, pushing to the registry, and redeploying. That is the correct remediation workflow. SSHing into a running container and applying a patch at the OS level is invisible to the VM programme, does not survive the next deployment, and produces a false sense of closure.
The SLA clock needs to reflect this. For ephemeral workloads, it cannot start when the vulnerability is found on a running instance, because that instance may not exist by the time remediation is scheduled. It should start when the vulnerable image is identified: in the build pipeline, in the registry, or via CSPM inventory correlation. It should close when a clean image is built and successfully deployed to production.
Episode 5’s MTTR measurement needs the same adjustment. MTTR for containerised workloads is the time from image-vulnerability-detected to clean-image-deployed, not from vulnerability-on-host to patch-applied. Measuring it the traditional way produces one of two outcomes: meaningless numbers (recording a patch on a container that was destroyed before the patch was applied) or misleading ones (closing findings because the container was destroyed, not because the image was fixed). Neither tells you whether the programme is working.
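Expressed as a calculation, the adjustment is small but the inputs change: the clock opens on the image finding and closes on the clean deployment. The records below are illustrative; in practice detected_at comes from the scanner or registry rescan, and resolved_at from the deployment event that shipped a clean image (or clean serverless package) to production.

```python
from datetime import datetime
from statistics import mean

# Illustrative records only
records = [
    {"image": "payments-api", "detected_at": "2025-02-03T09:15:00+00:00",
     "resolved_at": "2025-02-05T16:40:00+00:00"},
    {"image": "checkout-web", "detected_at": "2025-02-10T11:00:00+00:00",
     "resolved_at": None},  # still open: excluded from MTTR, counted as open
]

def image_mttr_days(rows: list[dict]) -> float | None:
    """MTTR measured from image-vulnerability-detected to clean-image-deployed."""
    durations = []
    for r in rows:
        if r["resolved_at"] is None:
            continue
        opened = datetime.fromisoformat(r["detected_at"])
        closed = datetime.fromisoformat(r["resolved_at"])
        durations.append((closed - opened).total_seconds() / 86400)
    return round(mean(durations), 2) if durations else None

print(image_mttr_days(records))  # -> 2.31
```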
For serverless functions, the remediation unit is the deployment package, not the function invocation. Fixing a vulnerability means updating the dependency in the package, rebuilding, and redeploying the function. The SLA clock starts when the vulnerable package is identified (in the build artefact, in Inspector findings, or via CSPM inventory) and closes when a clean deployment reaches production. Closing a finding because a function invocation terminated is not remediation.
None of this invalidates the Episode 6 prioritisation model. EPSS, the CISA KEV catalogue, and business context still tell you which vulnerabilities matter most and in what order to address them. What changes is the remediation action and the unit of measurement. The prioritisation logic is the same; the workflow it drives is different.
Cloud Native Vulnerability Management: What the Programme Looks Like in Practice
A VM programme designed for cloud-native infrastructure has these properties. Asset discovery is continuous and driven from the cloud provider API, not from periodic scans against a static inventory. Detection happens at the artefact layer before workloads are deployed (container images, serverless deployment packages, and IaC) and continuously against stored artefacts as new vulnerabilities are disclosed. Ownership is explicit: application teams, platform engineering, and operations each own the layer they control, with security owning the triage, prioritisation, and SLA framework that connects them. SLAs are measured against the lifecycle of the artefact (the image) rather than the lifecycle of the instance.
None of this requires replacing your existing scanning approach for long-lived infrastructure. EC2 instances, managed databases, and Kubernetes nodes are still well served by traditional credentialed scanning. The change is additive: cloud-native workloads need detection at earlier points in the lifecycle, and those points require different tooling and different ownership models running alongside what you already have.
The scan coverage percentage you report today means something specific for persistent infrastructure. For containerised workloads, that number tells you almost nothing. The metric that matters for a cloud-native estate is the percentage of deployed images that have been assessed: at build time, via continuous registry rescanning, or via CSPM inventory correlation. If you cannot answer that question with confidence, you do not yet have visibility into your containerised estate, regardless of what the coverage dashboard shows.
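As a sketch, that metric is a straightforward set intersection once both inventories exist. The digests below are placeholders; in practice `deployed` comes from a control-plane read like the ECS sketch earlier and `assessed` from the scanner’s results store, both keyed by image digest so tags pointing at different builds are not conflated.

```python
def assessed_image_coverage(deployed: set[str], assessed: set[str]) -> float:
    """Percentage of currently deployed images that have a scan result,
    whether from build-time scanning, registry rescanning, or CSPM inventory."""
    if not deployed:
        return 100.0
    return round(100 * len(deployed & assessed) / len(deployed), 1)

# Placeholder digests for illustration
deployed = {"sha256:aaa...", "sha256:bbb...", "sha256:ccc...", "sha256:ddd..."}
assessed = {"sha256:aaa...", "sha256:ccc..."}
print(assessed_image_coverage(deployed, assessed))  # -> 50.0
```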
Three things worth checking before the next sprint:
First, establish what percentage of your running containers are covered by image-level scanning, not host-level scanning of the node they run on. In my experience, below 80% is typically where the risk-to-effort case for image scanning is most clear-cut. Start there: it is the highest-leverage change available to a programme that currently has none.
Second, identify who owns the approved base image list in your organisation. If no single team owns it, that is where your highest-volume, lowest-triage-effort findings are accumulating. Assign ownership before tuning anything else.
Third, revisit any SLA reporting that measures remediation on containerised or serverless workloads as time-from-finding-on-host to patch-applied. That measurement is producing numbers, but they are not telling you what you think they are. For containers, redefine the clock to image-vulnerability-detected to clean-image-deployed.
For serverless, the clock starts when the vulnerable package is identified and closes when a clean deployment reaches production. Redefine both and you will immediately have a clearer picture of where the programme is working and where it is not.
