The short answer: the streaming works, but public MoE models don't cooperate.
The long version:
*What works well:* DirectStorage uses DMA to transfer weights from the NVMe SSD to the GPU via D3D12 staging buffers, skipping the OS page cache that standard I/O relies on. I built a C++ DLL (MSVC) that handles the DirectStorage + D3D12 + CUDA interop, with Go bindings loaded via syscall (no CGO), integrated into Ollama's Backend.Load(). Staging is double-buffered, with D3D12 fences imported as CUDA external semaphores. On codestral (12.6 GB, 57 layers) it loads 4.1x faster than stock Ollama, and the advantage grows with model size, since larger models are less likely to fit (or stay warm) in the page cache that standard I/O leans on.
Note: the weights still need VRAM and RAM — DirectStorage changes the transfer path, not where the weights end up. The win is that DMA doesn't depend on the OS cache being warm.
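For the curious, this is roughly what one request on that path looks like. It's a hedged sketch, not the DLL's actual code; the queue, file, staging buffer, and fence are assumed to already exist, and error handling is dropped.

```cpp
// Illustrative sketch of one DirectStorage request: DMA a chunk of weights
// from the file into a D3D12 staging buffer, then signal a fence the CUDA
// side can wait on. Names here are assumptions, not the project's own.
#include <cstdint>
#include <d3d12.h>
#include <dstorage.h>

void EnqueueChunk(IDStorageQueue* queue, IDStorageFile* file,
                  ID3D12Resource* staging, ID3D12Fence* fence,
                  uint64_t fileOffset, uint32_t bytes, uint64_t fenceValue)
{
    DSTORAGE_REQUEST req = {};
    req.Options.SourceType      = DSTORAGE_REQUEST_SOURCE_FILE;
    req.Options.DestinationType = DSTORAGE_REQUEST_DESTINATION_BUFFER;
    req.Source.File.Source      = file;
    req.Source.File.Offset      = fileOffset;
    req.Source.File.Size        = bytes;
    req.Destination.Buffer.Resource = staging;
    req.Destination.Buffer.Offset   = 0;
    req.Destination.Buffer.Size     = bytes;
    req.UncompressedSize            = bytes;   // no GPU decompression

    queue->EnqueueRequest(&req);
    // The fence value marks this chunk done; with two staging buffers you
    // alternate between them so the next read overlaps the previous copy-out.
    queue->EnqueueSignal(fence, fenceValue);
    queue->Submit();
}
```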
*The MoE work:* Built full expert streaming — CUDA VMM for sparse-resident pools, lazy physical allocation, on-demand SSD→GPU streaming during Forward(), one-token-lag exact routing (use token t's expert indices to prefetch for t+1), LRU eviction. Ran qwen3:30b (128 experts/layer, 8 active) on 40GB RAM + 8GB VRAM. Pipeline sustains ~1.9 GB/s.
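To make the VMM piece concrete, the pattern is roughly the sketch below (CUDA driver API). The names are mine, and context setup, error checks, and the actual SSD-to-GPU copy are omitted.

```cpp
// Sparse-resident expert pool: reserve virtual address space for every expert
// in a layer, but back only resident experts with physical VRAM on demand.
#include <cuda.h>
#include <vector>

struct ExpertPool {
    CUdeviceptr base = 0;             // VA range covering all experts of one layer
    size_t expertBytes = 0;           // granularity-aligned size of one expert
    std::vector<CUmemGenericAllocationHandle> phys;  // 0 = not resident
};

ExpertPool ReservePool(int device, size_t expertSize, int numExperts) {
    CUmemAllocationProp prop = {};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = device;

    size_t gran = 0;
    cuMemGetAllocationGranularity(&gran, &prop, CU_MEM_ALLOC_GRANULARITY_MINIMUM);

    ExpertPool pool;
    pool.expertBytes = ((expertSize + gran - 1) / gran) * gran;
    pool.phys.resize(numExperts, 0);
    // Reserve VA only: costs no VRAM until an expert is mapped in.
    cuMemAddressReserve(&pool.base, pool.expertBytes * numExperts, 0, 0, 0);
    return pool;
}

// Called when the router (or the one-token-lag prefetcher) needs an expert
// that is not yet resident.
void MapExpert(ExpertPool& pool, int device, int expert) {
    if (pool.phys[expert]) return;    // already resident
    CUmemAllocationProp prop = {};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = device;

    CUdeviceptr addr = pool.base + (CUdeviceptr)expert * pool.expertBytes;
    cuMemCreate(&pool.phys[expert], pool.expertBytes, &prop, 0);
    cuMemMap(addr, pool.expertBytes, 0, pool.phys[expert], 0);

    CUmemAccessDesc access = {};
    access.location = prop.location;
    access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    cuMemSetAccess(addr, pool.expertBytes, &access, 1);
    // ...then stream this expert's weights from SSD into `addr`.
}

// LRU eviction is the inverse: cuMemUnmap(addr, bytes); cuMemRelease(handle).
```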
*Where it breaks:* Both models tested (gpt-oss:20b, qwen3:30b) are temporally dense. Over ~50 tokens, every expert gets touched. Reducing cache capacity by 25% causes >1000 faults/token. The temporal working set equals the full model.
The hardest bugs were: (1) Windows DLL search order differences between EXE and DLL contexts causing E_NOTIMPL, (2) D3D12 picking Intel iGPU while CUDA was on NVIDIA dGPU (LUID matching fixed it), (3) D3D12 fence completion not establishing memory visibility for CUDA — had to import the fence as a CUDA external semaphore.
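For reference, the fix for (3) ended up looking like the standard D3D12/CUDA interop pattern. A hedged sketch, assuming the device, fence, and stream already exist, with error checks dropped:

```cpp
// Share the D3D12 fence with CUDA as an external semaphore, then wait on it
// in-stream before any kernel reads the staging buffer.
#include <windows.h>
#include <d3d12.h>
#include <cstdint>
#include <cuda_runtime.h>

cudaExternalSemaphore_t ImportFence(ID3D12Device* device, ID3D12Fence* fence) {
    HANDLE shared = nullptr;
    device->CreateSharedHandle(fence, nullptr, GENERIC_ALL, nullptr, &shared);

    cudaExternalSemaphoreHandleDesc desc = {};
    desc.type = cudaExternalSemaphoreHandleTypeD3D12Fence;
    desc.handle.win32.handle = shared;

    cudaExternalSemaphore_t extSem = nullptr;
    cudaImportExternalSemaphore(&extSem, &desc);
    return extSem;
}

// Waiting on the imported fence in the CUDA stream orders the DMA write
// before any kernel that reads the buffer; CPU-side polling of
// GetCompletedValue() did not give that guarantee in practice.
void WaitOnFence(cudaExternalSemaphore_t extSem, uint64_t value, cudaStream_t stream) {
    cudaExternalSemaphoreWaitParams wait = {};
    wait.params.fence.value = value;
    cudaWaitExternalSemaphoresAsync(&extSem, &wait, 1, stream);
}
```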
The evaluation harness (max_resident_per_layer, faulted_experts_per_token) is probably the most useful piece — it can immediately tell you if a new MoE model is temporally sparse enough for small-VRAM inference. If anyone knows of MoE models trained with temporal locality objectives, I'd love to test them.
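If you want to run the same check on another model, the metrics boil down to something like the sketch below. This is my simplified illustration here, not the harness code; it only tracks the unbounded working set from a routing trace.

```cpp
// Replay a routing trace (per token, per layer: the activated expert IDs)
// and track the growing working set plus first-touch faults.
#include <algorithm>
#include <cstdio>
#include <set>
#include <vector>

using Token = std::vector<std::set<int>>;   // token -> layer -> active experts

void Evaluate(const std::vector<Token>& trace, int numLayers) {
    std::vector<std::set<int>> resident(numLayers);  // experts seen so far, per layer
    size_t maxResident = 0, faults = 0;

    for (const Token& tok : trace) {
        for (int layer = 0; layer < numLayers; ++layer) {
            for (int e : tok[layer]) {
                // faulted_experts_per_token: needed but not yet resident
                // (unbounded here; bound the set and evict LRU to model a cache)
                if (resident[layer].insert(e).second) ++faults;
            }
            // max_resident_per_layer: peak per-layer working-set size
            maxResident = std::max(maxResident, resident[layer].size());
        }
    }
    std::printf("max_resident_per_layer=%zu  faulted_experts_per_token=%.1f\n",
                maxResident, trace.empty() ? 0.0 : (double)faults / trace.size());
}
```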
Repos:

- https://github.com/kibbyd/llm_upper (research & docs)
- https://github.com/kibbyd/llm_upper_ollama (Ollama fork)
- Full writeup: https://github.com/kibbyd/llm_upper/blob/main/PROJECT_RECORD...