The current system has its own network failure modes, but in a deploy process you can confine them all to the moment of a new container deployment. Perhaps you try to deploy a new container and it fails because the network is slow or broken; rollback is simple there. Spreading network issues out over the container's lifetime makes debugging much harder.
The current system is simple and resilient but clearly not fast. Trading that simplicity for speed plus more complex failure modes, in such a widely deployed technology, is hardly a clear win.
The de-duplication seems like a neat win, however.
In practice these systems typically fetch data over a local, highly available network and aggressively cache anything that gets read. If that network path becomes unavailable, it usually indicates a much larger infrastructure issue since many other parts of the system rely on the same storage or registry endpoints.
So while it does introduce a different failure mode, in most production environments it ends up being a low practical risk compared to the startup latency improvements.
For us and our customers, the trade-off is worth it.
- Ubuntu base: ~29 MB compressed
- PyTorch + CUDA: 7–13 GB
- NVIDIA NGC: 4.5+ GB compressed
The easy solution that worked for us was to bake all of these into a single base container and require all production containers built within the company to use that base. We then preloaded this base container onto our cloud VM disk images, so pulling a model container only needed to download the comparatively tiny layers for model code, weights, etc. As a side benefit, this kept all production containers up to date: we regularly updated the base container, which triggered automatic rebuilds of all derived containers.

Where it starts to get harder is when you have multiple base stacks (different CUDA versions, frameworks, etc.) or when you need to update them frequently. You end up with lots of slightly different multi-GB bases.
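A minimal sketch of that pattern, assuming a hypothetical internal registry and base image name (these are illustrative, not the actual setup described above):

```dockerfile
# Derived production image. The base (CUDA + PyTorch + common deps) is
# already preloaded on the VM disk image, so a pull only fetches the
# small model-specific layers below.
FROM registry.internal/ml-base:2024-06

# Only these layers travel over the network at deploy time.
COPY model_code/ /app/
COPY weights/ /app/weights/
CMD ["python", "/app/serve.py"]
```

Rebuilding every derived image whenever `ml-base` is retagged is what keeps the fleet current, at the cost of a large rebuild fan-out per base update.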
Chunked images keep the benefit you mentioned (we still cache heavily on the nodes) but the caching happens at a finer granularity. That makes it much more tolerant to small differences between images and to frequent updates, since unchanged chunks can still be reused.
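The finer-grained reuse can be sketched with content-addressed chunks. This is a toy with fixed-size chunks (real systems usually chunk at KB–MB granularity, often content-defined); the "images" here are just byte strings:

```python
import hashlib

CHUNK_SIZE = 4  # tiny on purpose, for illustration only

def chunk_ids(blob: bytes, size: int = CHUNK_SIZE) -> list[str]:
    """Split a blob into fixed-size chunks and return their content hashes."""
    return [hashlib.sha256(blob[i:i + size]).hexdigest()
            for i in range(0, len(blob), size)]

# Two "images" that differ only in a small suffix (e.g. model code changed).
image_a = b"CUDA....TORCH...APPv1"
image_b = b"CUDA....TORCH...APPv2"

a, b = chunk_ids(image_a), chunk_ids(image_b)
reused = len(set(a) & set(b))
print(f"{reused}/{len(b)} chunks of image_b already cached")  # 5/6
```

With layer-level caching, the single changed byte would invalidate the whole layer; with chunk-level caching, only the chunk containing the change has to be fetched again.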
Somehow, they don't hit upon the solution other organizations use: having software running all the time.
I suppose if you have a lousy economic model where the cost of running your software is a large percentage of your overall costs, that's a problem. I can only advise them to move to a model where they provide more value for their clients.
This is successful for CPU workloads (AWS Lambda), but AI model images are ~50x the size.
Even the smaller NVIDIA images (like nvidia/cuda:13.1.1-cudnn-runtime-ubuntu24.04) are about 2 GB before adding any Python deps, and that is a problem.
If you split the image into chunks and pull them on demand, your container will start much faster.
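Back-of-the-envelope arithmetic for why on-demand pulls help, assuming made-up but plausible numbers (image size, link speed, and the fraction of the image actually read before the process is up are all assumptions here):

```python
# Toy model: with lazy chunk fetching, time-to-start scales with the
# bytes actually read at startup, not with total image size.
IMAGE_SIZE_GB = 13.0         # e.g. a large PyTorch + CUDA image (assumed)
STARTUP_READ_FRACTION = 0.1  # assumed: ~10% of files touched before ready
BANDWIDTH_GBPS = 1.25        # 10 Gbit/s link, expressed in GB/s

full_pull = IMAGE_SIZE_GB / BANDWIDTH_GBPS
lazy_start = (IMAGE_SIZE_GB * STARTUP_READ_FRACTION) / BANDWIDTH_GBPS
print(f"full pull: {full_pull:.1f}s, lazy start: {lazy_start:.1f}s")
```

The remaining chunks can then stream in (or be fetched on fault) after the container is already serving.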
The main annoyance, imho, with gzip here is that it was already slow even when the format was new (unless you have Intel QAT and bothered to patch and recompile support for it into all the Go binaries that handle these layers, which you do not).
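For context, a quick stdlib sketch of the gzip round-trip, with the structural reason it stays slow noted in the comments (the payload here is arbitrary test data):

```python
import gzip
import os

# DEFLATE (gzip's compression format) decompression is inherently serial:
# each block can reference the previous 32 KB window, so a single layer's
# decompression can't be spread across cores the way chunked or
# frame-based formats (e.g. zstd with multiple frames) allow.
payload = os.urandom(1 << 20) + b"\x00" * (1 << 20)  # 2 MiB of mixed data
compressed = gzip.compress(payload, compresslevel=6)
restored = gzip.decompress(compressed)
assert restored == payload
```

The incompressible random half dominates, but the zero-filled half still shrinks, so the compressed size lands well under the original; the point is that however fast the link, that decompress step runs on one core.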
But it was the new shiny at the time, so if you told people that, they would just plug their ears with their fingers.