vLLM-MLX brings native GPU acceleration on Apple silicon using Apple's MLX framework, with:
• OpenAI-compatible API (drop-in replacement)
• Multimodal: Text, Images, Video, Audio in one server (see the sketch after this list)
• Continuous batching for concurrent users (3.4x speedup)
• TTS in 10+ languages (Kokoro, Chatterbox)
• MCP tool calling support
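
Since the multimodal path speaks the same API, here's a rough sketch of what an image request could look like, assuming vllm-mlx follows the standard OpenAI vision message format. The model name, port, and image URL below are placeholders, not tested values:

```python
from openai import OpenAI

# Port 8000 is an assumption carried over from upstream vLLM's default.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Placeholder vision model; substitute whichever multimodal model the server loads.
response = client.chat.completions.create(
    model="mlx-community/Qwen2-VL-2B-Instruct-4bit",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in one sentence."},
            {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)
```
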
Performance on M4 Max:
- Llama-3.2-1B-4bit: 464 tok/s
- Qwen3-0.6B: 402 tok/s
- Whisper STT: 197x real-time
Quick start:
pip install -e .   # from inside a clone of the repo
vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit

Works with the standard OpenAI SDK.
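As a minimal client sketch (again assuming the default localhost:8000 port; any placeholder API key works if the server doesn't enforce one):

```python
from openai import OpenAI

# Point the standard OpenAI client at the local vllm-mlx server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="mlx-community/Llama-3.2-3B-Instruct-4bit",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)
```

Happy to answer questions!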