I've been working on a fork of vllm-mlx (an OpenAI-compatible LLM server for Apple Silicon) to make it actually usable for coding agents. The upstream project is great, but it was missing production-grade tool calling, reasoning separation, and good multi-turn performance.
What I added (37 commits):
- Tool calling that works — streaming + non-streaming, supports the MiniMax and Hermes/Qwen3 formats. 4/4 accuracy on structured function-calling benchmarks (request sketch after this list).
- Reasoning separation — MiniMax-M2.5 mixes reasoning into its output with no tags. Built a heuristic parser that cleanly separates reasoning from content: 0% leak rate, down from 60% with the generic parser (client-side sketch below).
- Prompt cache for SimpleEngine — persistent KV cache across requests. On a 33K-token coding-agent context, TTFT drops from 28s to 0.3s on a cache hit. This is the single biggest improvement for multi-turn use (timing sketch below).
- 1500+ tests — parsers, engine, server, tool calling. The upstream had minimal test coverage.
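Here's roughly what the tool-calling path looks like from a client, using the standard OpenAI Python SDK against the quick-start server below. The get_weather tool and its schema are made-up placeholders; the model id and port match the quick start.

from openai import OpenAI

# Dummy api_key: the local server doesn't check it.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Hypothetical example tool; any JSON-schema function definition works the same way.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="lmstudio-community/Qwen3-Coder-Next-MLX-6bit",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,
)

# With the hermes parser, tool calls come back as structured objects,
# not raw text you have to regex out of the completion.
for call in resp.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)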
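For the reasoning separation, here's a sketch of the client side, assuming the fork exposes the separated reasoning via the reasoning_content field that OpenAI-compatible reasoning servers commonly use — the field name is my assumption, so check the repo for the actual one.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="minimax-m2.5-4bit",  # placeholder id; use whatever id the server reports
    messages=[{"role": "user", "content": "Plan a refactor of this module."}],
)
msg = resp.choices[0].message
# Assumed field name: the point of the heuristic parser is that reasoning
# lands here instead of leaking into the user-facing content.
print("reasoning:", getattr(msg, "reasoning_content", None))
print("answer:", msg.content)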
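The prompt cache needs no client-side changes: agent loops resend the same growing message prefix every turn, which is exactly what a prefix KV cache can reuse. A rough way to watch it kick in is to time first-token latency on two identical streamed requests; long_context below is a stand-in for a real 33K-token agent context.

import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
long_context = "...(tens of thousands of tokens of repo context)..."  # placeholder

def ttft() -> float:
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model="lmstudio-community/Qwen3-Coder-Next-MLX-6bit",
        messages=[{"role": "system", "content": long_context},
                  {"role": "user", "content": "Summarize the open TODOs."}],
        stream=True,
    )
    next(iter(stream))  # block until the first streamed chunk arrives
    elapsed = time.perf_counter() - start
    stream.close()  # only the first token matters for this measurement
    return elapsed

print(f"cold: {ttft():.2f}s")  # pays the full prefill
print(f"warm: {ttft():.2f}s")  # prefix comes out of the persistent KV cache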
Benchmarks (Mac Studio M3 Ultra, 256GB):
Qwen3-Coder-Next-6bit (80B MoE, 3B active):
- Decode: 65 tok/s
- Prefill: 1090-1440 tok/s
- TTFT (cache hit, 33K context): 0.3s
MiniMax-M2.5-4bit (229B MoE):
- Decode: 33-38 tok/s
- Deep reasoning with tool calling
I built this to run OpenClaw locally on my Mac instead of paying for cloud APIs. Qwen3-Coder-Next at 65 tok/s with tool calling is genuinely usable — not a toy demo.
Quick start:
pip install git+https://github.com/raullenchai/vllm-mlx.git
python -m vllm_mlx.server \
    --model lmstudio-community/Qwen3-Coder-Next-MLX-6bit \
    --tool-call-parser hermes --port 8000
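Once it's running, any OpenAI-compatible client can talk to it. A quick smoke test, assuming the server exposes the standard /v1/models listing like most OpenAI-compatible servers:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
print([m.id for m in client.models.list()])  # should include the model above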
GitHub: https://github.com/raullenchai/vllm-mlx