Kubernetes image distribution challenge - thoughts and tests
Intro
Imagine the following scenario:
- a fleet of huge (20+ nodes each) Kubernetes clusters
- a multi-cloud environment
- 200+ pods that are continuously restarted as part of continuous delivery practices
- container images of at least 500 MB each (very common, for example, with ML frameworks)
- most container images share 80% of the layers
Now, without any action, regardless of the cloud provider, each Kubernetes worker node has its own standalone container image store on its filesystem.
This means that an image layer (or even a full image) that was already downloaded by one worker node needs to be downloaded from the source again by every other worker node that needs it.
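As a quick way to see this per-node cache in practice (the node name below is a placeholder), each node already reports the images it holds locally in its status:
# list the images already present in one node's local store
kubectl get node <node-name> -o jsonpath='{range .status.images[*]}{.names[*]}{"\n"}{end}'
# repeat for another node: each one maintains its own, independent copy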
The pitfalls that I see here are:
- unnecessary download traffic from each Kubernetes worker node, even when an image has already been downloaded by at least one other worker node; if the source of the container images sits outside the cloud environment, this can create network challenges (rate limiting, bandwidth, NAT issues, ...);
- slower startup of new replicas on worker nodes that do not have their "private" local cache initialized yet, for example during autoscaling;
- no fault tolerance in case the source registry becomes unavailable.
Overall, this is a huge challenge that most companies are already dealing with (or are about to deal with), since container images are the fundamental building block of every Kubernetes system.
So the challenge is clear: optimize the distribution of container images across a fleet of (huge) Kubernetes clusters in order to address the pitfalls above.
We also have to say that such an effort must solve a real company problem; not every organization has these aspects to address. A company's core business usually plays a crucial role here.
For example, a temporary outage of the source container registry may have no impact at all on one company, while for another it could mean unsatisfied users and, consequently, lost money (or even lost market trust).
Sad spoiler: I didn't find an out-of-the-box solution that resolved all the pitfalls above in a way that also satisfied the requirements I had in mind, which are:
- the solution should not be invasive and should work without any rework on existing clusters (regardless of the workload type)
- the solution should not conflict with Kubernetes' native caching behavior and pull policy (see the sketch right after this list)
- the solution should integrate with the container runtime (e.g. containerd)
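To make the second requirement concrete, here is a minimal sketch of the native behavior that must keep working (the image name is just an example): with imagePullPolicy set to IfNotPresent the kubelet reuses the node-local cache, while Always re-checks the registry on every container start.
# generate a pod manifest with the pull policy that relies on the node-local cache
kubectl run cache-demo --image=nginx:1.25 --image-pull-policy=IfNotPresent --dry-run=client -o yaml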
Anyway, I explored the main solutions available today, tested them, and wrote down some final thoughts.
Keep reading to discover more.
Where I started
I started my reading with the amazing (and well documented) journey of Enix.
Then I explored Dragonfly and Uber Kraken, which also integrate well with Harbor, the most popular open source registry choice (at least to date).
Enix's journey is a very well documented, step-by-step account that makes total sense given their clear objective.
kube-image-keeper
While I totally agree with that journey, I'm not entirely satisfied with the implemented solution, which is kube-image-keeper (a.k.a. kuik, pronounced /kwɪk/, like "quick"). The idea behind it is really interesting, but I'm not a fan of the mutating webhook that rewrites image URLs on the fly, adding a localhost:{port}/ prefix.
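For illustration only (pod name, registry, and port are placeholders), the effect of such a rewrite can be observed on a mutated pod spec:
# inspect the image references actually set on the mutated pod
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[*].image}'
# a reference such as ghcr.io/acme/app:1.0 would now appear as localhost:<port>/ghcr.io/acme/app:1.0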
Some technologies (for example Istio) rely on mutating webhooks by design, and this could lead to conflicts or other challenges.
A fascinating feature of kube-image-keeper is its object storage support; read here to dig deeper into it.
Dragonfly vs Uber Kraken
Searching the web, I came across the two most popular solutions: Dragonfly and Uber Kraken.
A Dragonfly cluster has one or a few "supernodes" that coordinate the transfer of every 4MB chunk of data in the cluster.
While the supernode would be able to make optimal decisions, the throughput of the whole cluster is limited by the processing power of one or a few hosts, and the performance would degrade linearly as either blob size or cluster size increases.
Kraken's tracker only helps orchestrate the connection graph and leaves the negotiation of the actual data transfer to individual peers, so Kraken can scale better with large blobs.
On the other hand, Uber Kraken's last release was in May 2020.
Dragonfly's latest release was in October 2023.
I then decided to do some extended tests using Dragonfly!
Dragonfly
Dragonfly is an open source P2P-based file distribution and image acceleration system. It is hosted by the Cloud Native Computing Foundation (CNCF) as an Incubating Level Project.
The first thing to say is that Dragonfly is as powerful as it is complex. Its documentation needs to be read very carefully before trying to picture this system in an existing environment.
I suppose the main complexity comes from the fact that Dragonfly was not initially born for Kubernetes, but to distribute any kind of artifact/blob regardless of the technology stack. I really felt this while reading the official documentation.
Anyway, Dragonfly almost did the job I had in mind.
It is nicely integrated with containerd, which I really liked because containerd is the main container runtime across almost all Kubernetes platform providers.
This integration consists of editing the containerd configuration on every node (through a Kubernetes DaemonSet) in order to enable the Dragonfly P2P system only for specific registry URLs (e.g. ghcr.io, quay.io, etc.), roughly as sketched below. Dragonfly also manages the containerd restart on all worker nodes.
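As a rough sketch of what ends up on the nodes (the exact keys, proxy port, and mechanism depend on the Dragonfly version and chart configuration, so treat this as an assumption rather than the literal output), the injected configuration is essentially a registry mirror entry pointing the selected registries at the local dfdaemon proxy:
# hypothetical excerpt of /etc/containerd/config.toml after the DaemonSet patch
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."ghcr.io"]
  endpoint = ["http://127.0.0.1:65001", "https://ghcr.io"]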
Well, this is not something to take lightly (it needs to be monitored), for two reasons:
- on cloud providers, containerd usually runs on managed worker nodes, which means we are editing a managed component and this needs to be validated by the cloud provider's specialists (for example, to avoid losing official support);
- containerd is the core daemon; without it, Kubernetes is literally broken.
Unfortunately, I was not able to configure object storage as the external image cache, and even though it is somewhat documented inside Dragonfly's Helm chart, I started thinking it might not be integrated at all.
In conclusion, I found Dragonfly to be the closest solution, despite its complexity.
Follow the next part of the blog post to easily install it (I personally tested it on a minikube cluster).
How to install Dragonfly
As I said before, I mainly tried Dragonfly on a minikube cluster, so basically starting with:
minikube start --container-runtime containerd
I first installed the Prometheus stack, then proceeded with Dragonfly.
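For reference, the Prometheus stack can be installed with the usual community chart (the release name and namespace below are my own choices):
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack --namespace monitoring --create-namespace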
The Dragonfly Helm chart is well documented on Artifact Hub.
dragonfile-helm-values.yaml
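Dragonfly itself can then be installed from its chart repository, passing the values file above (chart and namespace names follow the Helm chart's own documentation):
helm repo add dragonfly https://dragonflyoss.github.io/helm-charts/
helm repo update
helm install --create-namespace --namespace dragonfly-system dragonfly dragonfly/dragonfly -f dragonfile-helm-values.yaml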
To check whether Dragonfly is working as expected:
# find pods
kubectl -n dragonfly-system get pod -l component=dfdaemon
# find logs
pod_name=dfdaemon-xxxxx
kubectl -n dragonfly-system exec -it ${pod_name} -- grep "peer task done" /var/log/dragonfly/daemon/core.log
Example output:
{
"level":"info",
"ts":"2022-09-07 12:04:26.485",
"caller":"peer/peertask_conductor.go:1500",
"msg":"peer task done, cost: 1ms",
"peer":"10.140.2.175-5184-1eab18b6-bead-4b9f-b055-6c1120a30a33",
"task":"b423e11ddb7ab19a3c2c4c98e5ab3b1699a597e974c737bb4004edeef6016ed2",
"component":"PeerTask"
}
Some useful links
Final thoughts
As you may have guessed, Dragonfly was the solution closest to the problem described at the beginning, and it also respects the initial requirements.
Still, it did not fully convince me, because of its centralized manager component and because the community is not very active to date.
Going back to the Enix journey: for organizations that really know what they want in terms of technology and have high standards to maintain, it probably makes sense to build a tailored caching system.
The positive side is that there are other projects we can take inspiration from without totally reinventing the wheel.
May the cache be with you.