3 points by raullen · 3 hours ago · 1 comment
    I've been working on a fork of vllm-mlx (OpenAI-compatible LLM server for Apple Silicon) to make it actually usable for coding agents. The upstream project is great but was missing production-grade tool calling, reasoning separation, and multi-turn performance.

      What I added (37 commits):
    
      - Tool calling that works — streaming + non-streaming, supports MiniMax and Hermes/Qwen3 formats. 4/4 accuracy on structured function calling benchmarks.
  - Reasoning separation — MiniMax-M2.5 mixes reasoning into its output with no tags. Built a heuristic parser that cleanly separates reasoning from content (0% leak rate, down from 60% with the generic parser).
  - Prompt cache for SimpleEngine — a persistent KV cache across requests. On 33K-token coding-agent contexts, TTFT drops from 28s to 0.3s on a cache hit. This is the single biggest improvement for multi-turn use.
      - 1500+ tests — parsers, engine, server, tool calling. The upstream had minimal test coverage.
    
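The prompt-cache win comes from prefix reuse: successive coding-agent turns share almost their entire token prefix, so on a cache hit only the new suffix needs prefill. A minimal toy sketch of the idea (not the actual vllm-mlx implementation; names here are illustrative):

```python
def common_prefix_len(a, b):
    # Length of the shared token prefix between two sequences.
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

class PromptCache:
    """Toy model of persistent KV-cache reuse across requests."""
    def __init__(self):
        self.tokens = []  # tokens whose KV states are "cached"

    def prefill_cost(self, prompt_tokens):
        # Tokens that must actually be prefilled this turn;
        # the shared prefix reuses cached KV states for free.
        reused = common_prefix_len(self.tokens, prompt_tokens)
        self.tokens = list(prompt_tokens)
        return len(prompt_tokens) - reused

cache = PromptCache()
turn1 = list(range(33_000))        # 33K-token coding-agent context
turn2 = turn1 + list(range(100))   # same context plus a new message
print(cache.prefill_cost(turn1))   # 33000 — cold start, full prefill
print(cache.prefill_cost(turn2))   # 100 — cache hit, only new tokens
```

That 33000-to-100 prefill ratio is why TTFT collapses from tens of seconds to sub-second on multi-turn requests.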
      Benchmarks (Mac Studio M3 Ultra, 256GB):
    
      Qwen3-Coder-Next-6bit (80B MoE, 3B active):
      - Decode: 65 tok/s
      - Prefill: 1090-1440 tok/s
      - TTFT (cache hit, 33K context): 0.3s
    
      MiniMax-M2.5-4bit (229B MoE):
      - Decode: 33-38 tok/s
      - Deep reasoning with tool calling
    
      I built this to run OpenClaw locally on my Mac instead of paying for cloud APIs. Qwen3-Coder-Next at 65 tok/s with tool calling is genuinely usable — not a toy demo.
    
      Quick start:
    
      pip install git+https://github.com/raullenchai/vllm-mlx.git
      python -m vllm_mlx.server \
        --model lmstudio-community/Qwen3-Coder-Next-MLX-6bit \
        --tool-call-parser hermes --port 8000
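Since the server speaks the OpenAI Chat Completions API, tool calling uses the standard `tools` request shape; the hermes parser maps the model's output back into structured `tool_calls`. A sketch of a client request (the `get_weather` tool is a made-up example; model name and port are from the quick start above):

```python
import json

# OpenAI-style tool schema for a hypothetical get_weather function.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

payload = {
    "model": "lmstudio-community/Qwen3-Coder-Next-MLX-6bit",
    "messages": [{"role": "user", "content": "Weather in Tokyo?"}],
    "tools": tools,
    "stream": False,
}
print(json.dumps(payload, indent=2))

# To send it against the running server:
#   from openai import OpenAI
#   client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
#   resp = client.chat.completions.create(**payload)
#   print(resp.choices[0].message.tool_calls)
```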
    
      GitHub: https://github.com/raullenchai/vllm-mlx