
Conversation

@mzihlmann mzihlmann commented Sep 6, 2025

Copied from coder/kaniko#35
Fixes GoogleContainerTools/kaniko#3431

Description

After enabling image streaming, we experienced issues in GCP with pods failing to start with the error message

0s (x11 over 2m11s)   Warning   Failed      Pod/<podname>   (combined from similar events): Error: failed to create containerd container: failed to mount /var/lib/containerd/tmpmounts/containerd-mount3540351974: too many levels of symbolic links

See https://cloud.google.com/kubernetes-engine/docs/troubleshooting/known-issues#image-screaming-too-many-links

This error was caused by duplicated layers in the image, specifically empty layers of the form:

```Dockerfile
RUN mkdir /app
WORKDIR /app       # -> empty layer
RUN echo "blubb"   # -> empty layer
```
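To see why these layers collide: an "empty" layer is a tar archive with no entries, and two independently produced empty layers compress to byte-identical blobs with the same content digest. A minimal Python sketch of that effect (illustrative only, not kaniko code; real builders gzip with Go's compressor, so the exact digest differs, but the collision is the same):

```python
import gzip
import hashlib
import io
import tarfile

def empty_layer_digest() -> str:
    # An "empty" layer is a tar archive recording no filesystem changes.
    tar_buf = io.BytesIO()
    with tarfile.open(fileobj=tar_buf, mode="w"):
        pass  # no entries added
    # mtime=0 makes the gzip stream reproducible across calls.
    gz_buf = io.BytesIO()
    with gzip.GzipFile(fileobj=gz_buf, mode="wb", mtime=0) as gz:
        gz.write(tar_buf.getvalue())
    return "sha256:" + hashlib.sha256(gz_buf.getvalue()).hexdigest()

# Both WORKDIR /app and RUN echo "blubb" change nothing on disk, so each
# would produce this same blob -- identical digests repeated in the manifest.
assert empty_layer_digest() == empty_layer_digest()
```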

Prior to v1.25.0, our kaniko implementation targeted legacy Docker, which meant we avoided emitting empty layers altogether, so we were able to work around the above issue simply by kaniko not emitting those layers. Starting in v1.25.0, however, our target implementation was switched to buildkit (#81). This means empty layers are now emitted, and there is no way for the user to deactivate this behaviour.

I'm not sure this is the best solution, as there is a very slim chance of a hash collision and therefore confusion of layers. I have not yet investigated how buildkit handles this situation, but in our deployment this approach resolves the issue at hand.
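The implementation itself is not shown in this description, but the deduplication idea can be sketched roughly like this (the `(digest, blob)` pair shape is hypothetical, for illustration only, and is not kaniko's actual data model):

```python
def dedup_layers(layers):
    """Keep only the first occurrence of each layer digest.

    `layers` is a list of (digest, blob) pairs -- a hypothetical shape
    for illustration, not kaniko's real API.
    """
    seen = set()
    deduped = []
    for digest, blob in layers:
        if digest in seen:
            continue  # byte-identical blob already in the image; reuse it
        seen.add(digest)
        deduped.append((digest, blob))
    return deduped

layers = [
    ("sha256:aaa", b"x"),
    ("sha256:empty", b""),
    ("sha256:empty", b""),  # duplicate empty layer, dropped below
    ("sha256:bbb", b"y"),
]
assert [d for d, _ in dedup_layers(layers)] == [
    "sha256:aaa", "sha256:empty", "sha256:bbb",
]
```

Dropping by digest is what opens the (very slim) collision window mentioned above: two genuinely different blobs hashing to the same digest would be conflated.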

Submitter Checklist

These are the criteria that every PR should meet; please check them off as you review them:

  • Includes unit tests
  • Adds integration tests if needed.

See the contribution guide for more details.

Reviewer Notes

  • The code flow looks good.
  • Unit tests and/or integration tests added.

Release Notes

Describe any changes here so the maintainer can include them in the release notes, or delete this block.

Examples of user facing changes:
- kaniko adds a new flag `--registry-repo` to override registry


mzihlmann commented Sep 6, 2025

Note that Google's version of kaniko was inconsistent in its handling of empty layers in RUN statements: it would not emit empty layers during the initial build, but it would do so during a rebuild from cache (#19). So if you're using Google's or Chainguard's version, you might hit this issue only on cache rebuilds.

@mzihlmann

In our case, we experienced this issue when building this specific Dockerfile:

```Dockerfile
FROM nvcr.io/nvidia/tritonserver:23.11-py3@sha256:9348c4eccc576fd88cbbf81ab93875d22a3487f63572e547e001e06401aeef39

RUN mkdir /opt/project
WORKDIR /opt/project/
```

The reason is that the tritonserver base image already contains an empty layer:

[screenshot: layer listing for the tritonserver image; note the 5f70... shasum]

We hit the same digest here too:

[screenshot: layer listing for our built image, showing the same 5f70... digest]
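To check whether an image is affected, one can count the layer digests in its manifest. A small sketch over manifest JSON (the manifest would come from your registry, e.g. via a manifest-fetching tool; the sample below is fabricated):

```python
import json
from collections import Counter

def duplicate_layer_digests(manifest_json: str) -> list:
    # Count each layer digest in an OCI/Docker image manifest and
    # return those that appear more than once.
    manifest = json.loads(manifest_json)
    counts = Counter(layer["digest"] for layer in manifest["layers"])
    return [digest for digest, n in counts.items() if n > 1]

# Fabricated manifest fragment with one repeated (empty) layer digest.
sample = json.dumps({"layers": [
    {"digest": "sha256:aaa"},
    {"digest": "sha256:bbb"},
    {"digest": "sha256:aaa"},
]})
print(duplicate_layer_digests(sample))  # ['sha256:aaa']
```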


mzihlmann commented Sep 6, 2025

The only thing that remains to test is whether buildkit runs into the same issue or whether it already has a workaround on its end.

Images produced with buildkit do not run into this issue; it would be interesting to see how they handle this situation. I guess it might be somewhere around here: https://github.com/moby/buildkit/blob/c04c1ce80cf73d3e6326971ab17f1a9608229c39/exporter/containerimage/writer.go#L424

@mzihlmann mzihlmann marked this pull request as ready for review September 6, 2025 14:07
@mzihlmann mzihlmann merged commit 8e4f613 into main Sep 6, 2025
14 checks passed
@mzihlmann mzihlmann deleted the coder-35 branch September 8, 2025 07:47


Development

Successfully merging this pull request may close these issues.

Kaniko fails to push images with duplicate layers with identical diff IDs but differing blobs
