Codex is notably higher quality but also has me waiting forever. Hopefully these small models get better and better, not just at benchmarks.
And while it usually leads to higher-quality output, sometimes it doesn't, and I'm left with BS AI slop that would have taken Opus just a couple of minutes to generate anyway.
This user has also done a bunch of good quants:
https://unsloth.ai/docs/basics/unsloth-dynamic-2.0-ggufs
The size reduction isn't much until you get down to very small quants.
The flash model in this thread is more than 10x smaller (30B).
https://huggingface.co/models?other=base_model:quantized:zai...
Probably; here's the llama.cpp issue to follow: https://github.com/ggml-org/llama.cpp/issues/18931
Also notice that this is the "-Flash" version. They were previously at 4.5-Flash (they skipped 4.6-Flash). It's supposed to be the Haiku equivalent; their coding-plan docs even say this model is meant to be used as `ANTHROPIC_DEFAULT_HAIKU_MODEL`.
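For anyone who wants to try that wiring in Claude Code, it's just environment variables; a rough sketch, where the base URL and model id are my assumptions from memory of z.ai's docs, so double-check them there:

export ANTHROPIC_BASE_URL=https://api.z.ai/api/anthropic    # assumed z.ai Anthropic-compatible endpoint
export ANTHROPIC_AUTH_TOKEN=<your z.ai API key>
export ANTHROPIC_DEFAULT_HAIKU_MODEL=glm-4.7-flash          # assumed model id for the Flash tier
claude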
They also charge full price for the same cached tokens on every request/response, so I burned through $4 on one relatively simple coding task. It would've cost under $0.50 with GPT-5.2-Codex or any other model that supports caching (Opus and maybe Sonnet aside), and it would've been much faster.
1 million tokens per minute, 24 million tokens per day. BUT: cached tokens count at full weight, so with 100,000 tokens of context you're capped at about ten requests per minute and burn through the daily quota in roughly 240 requests.
People talk about these models like they're "catching up"; they don't see that they're just trailers hitched to a truck, being pulled along.
This is a terrible "test" of model quality. All these models fail when your UI is out of distribution; Codex gets close but still fails.
And yet, in terms of coding performance (at least as measured by SWE-Bench Verified), it seems to be roughly on par with o3/GPT-5 mini, which would be pretty impressive if it translated to real-world usage, for something you can realistically run at home.
We’ve launched GLM-4.7-Flash, a lightweight and efficient model designed as the free-tier version of GLM-4.7, delivering strong performance across coding, reasoning, and generative tasks with low latency and high throughput.
The update brings competitive coding capabilities at its scale, offering best-in-class general abilities in writing, translation, long-form content, role play, and aesthetic outputs for high-frequency and real-time use cases.
https://docs.z.ai/release-notes/new-released

GLM 4.7 is good enough to be a daily driver, but it does frustrate me at times with poor instruction following.
This seems pretty darn good for a 30B model. That's significantly better than the full Qwen3-Coder 480B model at 55.4.
In my experience, small-tier models are good for simple tasks like translation and trivia answering, but useless for anything more complex. The 70B class and above is where models really start to shine.
I suppose Flash is merely a distillation of that. Filed under mildly interesting for now.
Also, according to the gpt-oss model card, the 20b scores 60.7 on SWE-Bench Verified (GLM claims it got 34) and the 120b scores 62.7, vs. the 59.7 GLM reports.
ssh admin.hotaisle.app
Yes, this should be made easier; ideally you'd just get a VM with everything pre-installed. Working on that.
It took me quite some time to figure out the magic combination of versions and commits, and to build each dependency successfully, to get this running on an MI325x.
Here is the magic (assuming a 4x)...
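# Launch the nightly ROCm vLLM container (privileged, host networking, KFD/DRI GPU devices passed through)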
docker run -it --rm \
--pull=always \
--ipc=host \
--network=host \
--privileged \
--cap-add=CAP_SYS_ADMIN \
--device=/dev/kfd \
--device=/dev/dri \
--device=/dev/mem \
--group-add render \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
-v /home/hotaisle:/mnt/data \
-v /root/.cache:/mnt/model \
rocm/vllm-dev:nightly
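# Inside the container: redirect the default cache to the host-mounted volume so downloaded weights persist across runs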
mv /root/.cache /root/.cache.foo
ln -s /mnt/model /root/.cache
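# Serve the FP8 checkpoint across all 4 GPUs, with expert parallelism and MTP speculative decoding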
VLLM_ROCM_USE_AITER=1 vllm serve zai-org/GLM-4.7-FP8 \
--tensor-parallel-size 4 \
--kv-cache-dtype fp8 \
--quantization fp8 \
--enable-auto-tool-choice \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--load-format fastsafetensors \
--enable-expert-parallel \
--allowed-local-media-path / \
--speculative-config.method mtp \
--speculative-config.num_speculative_tokens 1 \
--mm-encoder-tp-mode data
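Once it's up, a quick smoke test against vLLM's OpenAI-compatible API (it listens on port 8000 unless you pass --port) looks something like this:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "zai-org/GLM-4.7-FP8",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64
      }'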