| user: | waybarrios |
| created: | Apr 11, 2024 |
| karma: | 2 |
| about: | Hey HN! I built vLLM-MLX because vLLM falls back to CPU-only mode on macOS, which is painfully slow on Apple Silicon machines. vLLM-MLX brings native GPU acceleration using Apple's MLX framework.
Quick start:
pip install -e .
vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit

Works with the standard OpenAI SDK. Happy to answer questions! |