I built Krasis to run large models (80-235B+ params) on a single consumer GPU by streaming expert weights through VRAM rather than splitting layers between GPU and CPU. It uses different optimisation strategies for prefill vs decode.
On a single 5090 (PCIe 4.0, Q4): Qwen3.5-122B gets ~2,900 tok/s prefill, ~28 tok/s decode. Qwen3-235B gets ~2,100 tok/s prefill, ~9.3 tok/s decode. Even a 5080 16GB can run Qwen3-Coder-Next 80B at 1,800 tok/s prefill.
Written in Rust with Python orchestration; it auto-quantizes from BF16 safetensors and exposes an OpenAI-compatible API. Open source, free to use.
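Since the API is OpenAI-compatible, any standard client should work. A minimal sketch using only the stdlib, assuming the server listens on localhost port 8000 with the usual `/v1/chat/completions` route (the port, path, and model name here are assumptions, not Krasis defaults):

```python
import json
import urllib.request

# Build a standard OpenAI-style chat completion request.
payload = {
    "model": "qwen3-235b",  # assumed model identifier; use whatever Krasis reports
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": False,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",  # assumed host/port
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# With the server running, send the request and print the reply:
# resp = urllib.request.urlopen(req)
# print(json.load(resp)["choices"][0]["message"]["content"])
print(req.full_url)
```

Any library that speaks the OpenAI chat-completions schema (the official `openai` Python client, LangChain, etc.) can point at the same base URL.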