2 points by waybarrios 3 hours ago | 1 comment
  • waybarrios 3 hours ago
    Hey HN! I built vLLM-MLX, a vLLM-like serving framework for macOS, because vLLM itself is painfully slow on Apple Silicon machines.

    vLLM-MLX brings native GPU acceleration using Apple's MLX framework, with:

      • OpenAI-compatible API (drop-in replacement)
      • Multimodal: Text, Images, Video, Audio in one server
      • Continuous batching for concurrent users (3.4x speedup)
      • TTS in 10+ languages (Kokoro, Chatterbox)
      • MCP tool calling support
    
      Performance on M4 Max:
      - Llama-3.2-1B-4bit: 464 tok/s
      - Qwen3-0.6B: 402 tok/s
      - Whisper STT: 197x real-time
    
    Quick start:

      git clone https://github.com/waybarrios/vllm-mlx && cd vllm-mlx
      pip install -e .
      vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit

    Works with standard OpenAI SDK. Happy to answer questions!
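    Since the server exposes an OpenAI-compatible API, a request can be built with nothing but the standard library. This is a minimal sketch: the port (8000) and the /v1/chat/completions route are assumptions based on typical OpenAI-compatible servers, so check the repo README for the actual defaults.

    ```python
    import json
    import urllib.request

    BASE_URL = "http://localhost:8000/v1"  # assumed default address of the vllm-mlx server

    def build_chat_request(model: str, prompt: str) -> urllib.request.Request:
        """Build an OpenAI-style chat completion request."""
        payload = {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        }
        return urllib.request.Request(
            f"{BASE_URL}/chat/completions",
            data=json.dumps(payload).encode(),
            headers={"Content-Type": "application/json"},
        )

    req = build_chat_request(
        "mlx-community/Llama-3.2-3B-Instruct-4bit",
        "Hello from Apple Silicon!",
    )
    # With the server running, send it like so:
    # with urllib.request.urlopen(req) as resp:
    #     print(json.load(resp)["choices"][0]["message"]["content"])
    ```

    The official OpenAI SDK should work the same way by pointing its base_url at the local server instead of api.openai.com.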

    GitHub: https://github.com/waybarrios/vllm-mlx