I've been working on SnapLLM for a while now and wanted to share it with the community.

The problem:
If you run local models, you know the pain. You load Llama 3, chat with it, then want to try Gemma or Qwen. That means unloading the current model, waiting 30-60 seconds for the new one to load, and repeating this cycle every single time. It breaks your flow and wastes a ton of time.
What SnapLLM does:
It keeps multiple models hot in memory and switches between them in under 1 millisecond (benchmarked at ~0.02ms). Load your models once, then snap between them instantly. No more waiting.

How it works:
- Built on top of llama.cpp and stable-diffusion.cpp
- Uses a vPID (Virtual Processing-In-Disk) architecture for instant context switching
- Three-tier memory management: GPU VRAM (hot), CPU RAM (warm), SSD (cold)
- KV cache persistence so you don't lose context
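To make that flow concrete, here is a rough sketch of loading two models once and keeping both hot. The /models/load route, its payload, and the port are my assumptions for illustration, not documented SnapLLM endpoints; check the repo for the real API.

```python
import requests

BASE = "http://localhost:8080"  # assumed default port

# Hypothetical load endpoint; the real route and payload shape may differ.
for name, path in [
    ("llama-3-8b", "./models/llama-3-8b.Q5_K_M.gguf"),
    ("gemma-3-4b", "./models/gemma-3-4b.Q5_K_M.gguf"),
]:
    resp = requests.post(f"{BASE}/models/load", json={"name": name, "path": path})
    resp.raise_for_status()

# After this, each request just names the model it wants. Because both stay hot,
# moving between them is the ~0.02ms snap rather than a 30-60 second reload.
```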
What it supports:
- Text LLMs: Llama, Qwen, Gemma, Mistral, DeepSeek, Phi, Unsloth AI models, and anything in GGUF format
- Vision models: Gemma 3 + mmproj, Qwen-VL + mmproj, LLaVA
- Image generation: Stable Diffusion 1.5, SDXL, SD3, FLUX via stable-diffusion.cpp
- OpenAI/Anthropic-compatible API so you can plug it into your existing tools (example below)
- Desktop UI, CLI, and REST API
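Since the API is OpenAI-compatible, existing clients should only need a new base URL. Here's a minimal sketch with the official openai Python client; the port, path, and model name are assumptions, not values taken from the SnapLLM docs.

```python
from openai import OpenAI

# Point a standard OpenAI client at the local SnapLLM server.
# Base URL and model name below are illustrative assumptions.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="gemma-3-4b",  # whichever GGUF model you have loaded
    messages=[{"role": "user", "content": "Give me a one-line summary of KV cache persistence."}],
)
print(resp.choices[0].message.content)
```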
Model switch time between any of these: 0.02ms.

Getting started is simple:
1. Clone the repo and build from source
2. Download GGUF models from Hugging Face (e.g., gemma-3-4b Q5_K_M) - see the sketch below
3. Start the server locally
4. Load models through the Desktop UI or API and point to your model folder
5. Start chatting and switching
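For step 2, the huggingface_hub package can pull a single GGUF file straight into your model folder. The repo and file names here are placeholders; swap in whichever model and quantization you actually want.

```python
from huggingface_hub import hf_hub_download

# Download one GGUF quantization into a local model folder.
# repo_id and filename are placeholders; use the repo for the model you want.
path = hf_hub_download(
    repo_id="some-org/gemma-3-4b-GGUF",
    filename="gemma-3-4b.Q5_K_M.gguf",
    local_dir="./models",
)
print("Saved to", path)
```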
NVIDIA CUDA is fully supported for GPU acceleration. CPU-only mode works too.
With SLMs getting better every month, being able to quickly switch between specialized small models for different tasks is becoming more practical than running one large model for everything. Load a coding model, a medical model, and a general chat model side by side and switch based on what you need.
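As a sketch of what per-task switching could look like from the client side (reusing the assumed OpenAI-compatible endpoint from above, with made-up model names), you just change the model field per request and let SnapLLM handle the snap.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

# Hypothetical mapping from task type to an already-loaded model.
MODELS = {"code": "qwen2.5-coder-7b", "medical": "med-llm-8b", "chat": "llama-3-8b"}

def ask(task: str, prompt: str) -> str:
    # Changing the model between requests is where the sub-millisecond switch pays off.
    resp = client.chat.completions.create(
        model=MODELS[task],
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

print(ask("code", "Write a Python function that reverses a string."))
print(ask("chat", "Explain why small specialized models can beat one large model."))
```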
Ideal Use Cases:
- Multi-domain applications (medical + legal + general)
- Interactive chat with context switching
- Document QA with repeated queries
- On-premise edge deployment
- Edge devices like drones and autonomous vehicles
- Multi-agent workflows
Demo Videos:
- SnapLLM Desktop App Demo (Vimeo): https://vimeo.com/1157629276
- SnapLLM Server and API Demo (Vimeo): https://vimeo.com/1157624031
The server demo walks through starting the server locally after cloning the repo, downloading models from Hugging Face, and loading them through the UI.
Links:
- GitHub: https://github.com/snapllm/snapllm
- arXiv paper: https://arxiv.org/submit/7238142/view
If you find it useful, please star the repository - it helps others discover SnapLLM.
MIT licensed. PRs and feedback welcome. If you have questions about the architecture or run into issues, drop them here or open a GitHub issue.