The current system has its own network failure modes, but in a deploy process you can confine them all to the moment of a new container deployment. Perhaps you try to deploy a new container and it fails because the network is slow or broken; rollback is simple there. Spreading network issues out over the container's lifetime makes debugging much harder.
The current system is simple and resilient but clearly not fast. Trading that simplicity for speed plus more complex failure modes, in such a widely deployed technology, is hardly a clear win.
The de-duplication seems like a neat win, however.
In practice these systems typically fetch data over a local, highly available network and aggressively cache anything that gets read. If that network path becomes unavailable, it usually indicates a much larger infrastructure issue since many other parts of the system rely on the same storage or registry endpoints.
So while it does introduce a different failure mode, in most production environments it ends up being a low practical risk compared to the startup latency improvements.
For us and our customers, the trade-off is worth it.
- Ubuntu base: ~29 MB compressed
- PyTorch + CUDA: 7–13 GB
- NVIDIA NGC: 4.5+ GB compressed
The easy solution that worked for us was to bake all of these into a single base container and require all production containers built within the company to use that base. We then preloaded this base container onto our cloud VM disk images, so pulling a model container only needed to download the comparatively tiny layers for model code, weights, etc. As a side benefit, this kept all production containers up to date: we regularly updated the base container, which triggered automatic rebuilds of all derived containers.

Where it starts to get harder is when you have multiple base stacks (different CUDA versions, frameworks, etc.) or when you need to update them frequently. You end up with lots of slightly different multi-GB bases.
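A minimal sketch of that pattern, assuming a hypothetical internal registry and base image name (these are illustrative, not the actual setup described above):

```dockerfile
# Derived production image. The base (CUDA + PyTorch + common deps) is
# already preloaded on the VM disk image, so a pull only fetches the
# small model-specific layers below.
FROM registry.internal/ml-base:2024-06

# Only these layers travel over the network at deploy time.
COPY model_code/ /app/
COPY weights/ /app/weights/
CMD ["python", "/app/serve.py"]
```

Rebuilding every derived image whenever `ml-base` is retagged is what keeps the fleet current, at the cost of a large rebuild fan-out per base update.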
Chunked images keep the benefit you mentioned (we still cache heavily on the nodes) but the caching happens at a finer granularity. That makes it much more tolerant to small differences between images and to frequent updates, since unchanged chunks can still be reused.
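The finer-grained reuse can be sketched with content-addressed chunks. This is a toy with fixed-size chunks (real systems usually chunk at KB–MB granularity, often content-defined); the "images" here are just byte strings:

```python
import hashlib

CHUNK_SIZE = 4  # tiny on purpose, for illustration only

def chunk_ids(blob: bytes, size: int = CHUNK_SIZE) -> list[str]:
    """Split a blob into fixed-size chunks and return their content hashes."""
    return [hashlib.sha256(blob[i:i + size]).hexdigest()
            for i in range(0, len(blob), size)]

# Two "images" that differ only in a small suffix (e.g. model code changed).
image_a = b"CUDA....TORCH...APPv1"
image_b = b"CUDA....TORCH...APPv2"

a, b = chunk_ids(image_a), chunk_ids(image_b)
reused = len(set(a) & set(b))
print(f"{reused}/{len(b)} chunks of image_b already cached")  # 5/6
```

With layer-level caching, the single changed byte would invalidate the whole layer; with chunk-level caching, only the chunk containing the change has to be fetched again.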
Somehow, they don't hit upon the solution other organizations use: having software running all the time.
I suppose if you have a lousy economic model where the cost of running your software is a large percentage of your overall costs, that's a problem. I can only advise them to move to a model where they provide more value for their clients.
This is successful for CPU workloads (AWS Lambda), but AI model images are ~50x the size.
Even the smaller NVIDIA images (like nvidia/cuda:13.1.1-cudnn-runtime-ubuntu24.04) are about 2 GB before adding any Python deps, and that is a problem.
If you split the image into chunks and pull them on demand, your container will start much faster.
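Back-of-the-envelope arithmetic for why on-demand pulls help, assuming made-up but plausible numbers (image size, link speed, and the fraction of the image actually read before the process is up are all assumptions here):

```python
# Toy model: with lazy chunk fetching, time-to-start scales with the
# bytes actually read at startup, not with total image size.
IMAGE_SIZE_GB = 13.0         # e.g. a large PyTorch + CUDA image (assumed)
STARTUP_READ_FRACTION = 0.1  # assumed: ~10% of files touched before ready
BANDWIDTH_GBPS = 1.25        # 10 Gbit/s link, expressed in GB/s

full_pull = IMAGE_SIZE_GB / BANDWIDTH_GBPS
lazy_start = (IMAGE_SIZE_GB * STARTUP_READ_FRACTION) / BANDWIDTH_GBPS
print(f"full pull: {full_pull:.1f}s, lazy start: {lazy_start:.1f}s")
```

The remaining chunks can then stream in (or be fetched on fault) after the container is already serving.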
The main annoyance, imho, with gzip here is that it was already slow even when the format was new (unless you have Intel QAT and bothered to patch and recompile support for it into all the Go binaries that handle these layers, which you do not).
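For context, a quick stdlib sketch of the gzip round-trip, with the structural reason it stays slow noted in the comments (the payload here is arbitrary test data):

```python
import gzip
import os

# DEFLATE (gzip's compression format) decompression is inherently serial:
# each block can reference the previous 32 KB window, so a single layer's
# decompression can't be spread across cores the way chunked or
# frame-based formats (e.g. zstd with multiple frames) allow.
payload = os.urandom(1 << 20) + b"\x00" * (1 << 20)  # 2 MiB of mixed data
compressed = gzip.compress(payload, compresslevel=6)
restored = gzip.decompress(compressed)
assert restored == payload
```

The incompressible random half dominates, but the zero-filled half still shrinks, so the compressed size lands well under the original; the point is that however fast the link, that decompress step runs on one core.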
But it was the new shiny at the time, so if you told people that, they would just plug their ears with their fingers.