> Key technique: selective expert streaming via direct I/O. Only ~10 of 512 experts per layer are loaded from SSD per token (~1.8GB of I/O per token at 1.4 GB/s effective bandwidth). Non-expert weights (~5GB) are pinned in DRAM. An LRU expert cache provides a 44%+ hit rate.
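The LRU expert cache mentioned above can be sketched in a few lines. This is a minimal illustration, not the project's actual implementation: the `ExpertCache` class, its `loader` callback, and the string weights are all hypothetical stand-ins (a real version would read expert tensors from SSD with direct I/O).

```python
from collections import OrderedDict

class ExpertCache:
    """Minimal LRU cache for MoE expert weights (illustrative sketch only)."""

    def __init__(self, capacity, loader):
        self.capacity = capacity
        self.loader = loader          # stand-in for a direct-I/O read from SSD
        self.cache = OrderedDict()    # (layer, expert_id) -> weights
        self.hits = 0
        self.misses = 0

    def get(self, layer, expert_id):
        key = (layer, expert_id)
        if key in self.cache:
            self.hits += 1
            self.cache.move_to_end(key)       # mark most-recently-used
            return self.cache[key]
        self.misses += 1
        weights = self.loader(layer, expert_id)
        self.cache[key] = weights
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)    # evict least-recently-used
        return weights

# Toy usage: a fake loader stands in for the SSD read path.
cache = ExpertCache(capacity=2, loader=lambda l, e: f"weights[{l}:{e}]")
cache.get(0, 7)  # miss: streamed from "SSD"
cache.get(0, 7)  # hit: served from DRAM cache
cache.get(0, 3)  # miss
cache.get(0, 9)  # miss: cache full, expert (0, 7) evicted
```

The reported 44%+ hit rate means roughly half of the routed experts per token skip the SSD entirely, which is where the effective bandwidth figure comes from.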
It's apparently using ideas from: https://arxiv.org/abs/2312.11514
> This paper tackles the challenge of efficiently running LLMs that exceed the available DRAM capacity by storing the model parameters in flash memory, but bringing them on demand to DRAM. Our method involves constructing an inference cost model that takes into account the characteristics of flash memory, guiding us to optimize in two critical areas: reducing the volume of data transferred from flash and reading data in larger, more contiguous chunks.
Running a 397B model on consumer hardware is genuinely impressive for a proof of concept. A year ago this wasn't a thing. But I keep wondering whether a well-quantized 70B that fits entirely in RAM would just be faster in practice: no I/O bottleneck, consistent throughput, a smaller model but one that's actually usable.
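A back-of-envelope comparison makes the tradeoff concrete. The streaming figures (1.8 GB/token, 1.4 GB/s) come from the post above; the 70B numbers are my assumptions: ~35 GB of weights at 4-bit quantization and ~60 GB/s effective DRAM bandwidth on a consumer desktop, both rough guesses.

```python
# Figures from the post: per-token SSD traffic and effective bandwidth.
io_per_token_gb = 1.8
ssd_bw_gbps = 1.4
stream_latency = io_per_token_gb / ssd_bw_gbps   # I/O-bound floor, s/token

# Assumptions (not from the post): 70B at 4-bit ~ 35 GB of weights,
# ~60 GB/s effective DRAM bandwidth; decode is memory-bandwidth-bound.
model_size_gb = 35.0
dram_bw_gbps = 60.0
dram_latency = model_size_gb / dram_bw_gbps      # memory-bound floor, s/token

print(f"streamed 397B: >= {stream_latency:.2f} s/token")
print(f"in-RAM 70B:    >= {dram_latency:.2f} s/token")
```

Under these assumptions the in-RAM 70B has a per-token floor of roughly half the streamed model's, before even counting the streamed model's compute, so the practical question is whether 397B's quality gap justifies the latency.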