1 pointby czmilo19 days ago1 comment

czmilo19 days ago
Z.AI's GLM-4.7-Flash is a 30B parameter MoE model with only 3B active parameters per token, achieving 59.2% on SWE-bench Verified (vs 22% for Qwen3-30B, 34% for GPT-OSS-20B). Runs efficiently on consumer hardware: 24GB GPUs (RTX 3090/4090) or Mac M-series chips at 60-80+ tokens/second with 4-bit quantization.
The guide covers architecture details (MoE design, MLA attention for 200K context), benchmark comparisons, local deployment via vLLM/MLX/Ollama, API pricing ($0.07/$0.40 per 1M tokens), and real-world user feedback. Community reports highlight strong performance in UI generation and tool calling, though reasoning lags behind specialized models.
Open weights available on Hugging Face. Free API tier or completely offline deployment.