I use Qwen 3.6 27B, the dense version of this model which is slightly better.
I don't agree that it's close at all. Maybe for some small, easy tasks, but not for working on real codebases. It's amazing for something I can run at home, but the difference between it and Opus or GPT-5.5 is huge.
Personally I am using `--no-context-shift` and feeding context back in on failure at the harness level.
I have 2x 1080 Ti + 1x Titan V running the full 262,144-token context with `-sm tensor` at 62.04 t/s, which isn't so bad.
I also have a single 3090 running unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q4_K_XL at 41.89 t/s, though with only 130k context. If you have a modular programming style, both work pretty well.
But play with YaRN if you really need it.
[0]https://qwen.readthedocs.io/en/v3.0/run_locally/llama.cpp.ht...
Here's my setup:
llama-server \
--port 9999 \
--model /MODELS/LLMs/Qwen3.6-27B-UD-Q4_K_XL.gguf \
--ctx-size 128000 \
--threads 12 \
--flash-attn on \
--device CUDA0 \
--jinja \
--gpu-layers 52 \
--mmproj /MODELS/LLMs/Qwen3.6-27B-mmproj-F16.gguf \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--temp 0.6 --top-k 20 --top-p 0.95 --min-p 0.0 --repeat-penalty 1.0 --presence-penalty 0.0 \
--spec-type draft-mtp --spec-draft-n-max 2
(I'm not filling 100% of the VRAM, as I have other stuff I need it for.)
Yeah, if you are using the CPU it may slow down quickly.
This may be a bit big and overcomplicated. On this host I am running an AMD Ryzen 7 5700G so that I can use the APU and dedicate the 3090 to the model.
podman run --device nvidia.com/gpu=all -d -v llama_qwen3.6mpt:/root/.cache -p 8080:8080 local/llama.cpp:full-cuda --server \
-hf unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q4_K_XL \
-ngl 99 \
--ctx-size 131072 \
--no-mmproj-offload \
--no-context-shift \
--kv-unified \
--spec-type draft-mtp \
--spec-draft-n-max 6 \
--spec-draft-p-min 0.75 \
-fa on --jinja --no-mmap \
--cache-ram -1 \
--no-warmup -np 1 \
-n 32768 \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--temp 0.6 \
--min-p 0.00 \
--top-k 20 \
--top-p 0.95 \
--presence-penalty 0.0 \
--repeat-penalty 1.05 \
--fit off \
--reasoning on \
--chat-template-kwargs '{"preserve_thinking":true}' \
--prio 3 \
--poll 100 \
--port 8080 \
--host 0.0.0.0
I am just building the container with: podman build -t local/llama.cpp:full-cuda --target full -f .devops/cuda.Dockerfile .
And here are the logs from a 'make me a flappy bird program in python' webui prompt:
prompt eval time = 105.86 ms / 19 tokens ( 5.57 ms per token, 179.47 tokens per second)
eval time = 100549.41 ms / 4608 tokens ( 21.82 ms per token, 45.83 tokens per second)
total time = 100655.28 ms / 4627 tokens
draft acceptance rate = 0.47215 ( 3408 accepted / 7218 generated)
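For anyone comparing runs, the headline figures can be recomputed from the raw counts in that log. This is just a small sketch of the arithmetic; the helper names are mine, not anything from llama.cpp, and the numbers are copied from the log lines above.

```python
def tokens_per_second(tokens: int, elapsed_ms: float) -> float:
    """Throughput from a token count and a wall-clock time in milliseconds."""
    return tokens / (elapsed_ms / 1000.0)


def acceptance_rate(accepted: int, generated: int) -> float:
    """Fraction of drafted (speculative) tokens that were accepted."""
    return accepted / generated


# Values copied from the log above.
eval_tps = tokens_per_second(4608, 100549.41)
ms_per_token = 100549.41 / 4608
accept = acceptance_rate(3408, 7218)

print(f"eval: {eval_tps:.2f} t/s ({ms_per_token:.2f} ms per token)")
print(f"draft acceptance: {accept:.5f}")
```

Both results match what the server printed (45.83 t/s and 0.47215), so the same two helpers can be pointed at counts from other runs when comparing settings.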
I am down to ~25.54 t/s with a 95% full context.
I think that was all about some earlier crashes.
podman run --device nvidia.com/gpu=all -d -v llama_qwen3.6mpt:/root/.cache -p 8080:8080 local/llama.cpp:full-cuda --server \
-hf unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q4_K_XL \
-ngl 99 \
--ctx-size 128000 \
--no-mmproj-offload \
--no-context-shift \
--kv-unified \
--spec-type draft-mtp \
--spec-draft-n-max 2 \
--spec-draft-p-min 0.75 \
-fa on --jinja --no-mmap \
--cache-ram -1 \
--no-warmup -np 1 \
-n 32768 \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--temp 0.6 \
--min-p 0.00 \
--top-k 20 \
--top-p 0.95 \
--presence-penalty 0.0 \
--repeat-penalty 1.05 \
--fit off \
--reasoning on \
--chat-template-kwargs '{"preserve_thinking":true}' \
--port 8080 \
--host 0.0.0.0
But you can increase your context window for the same VRAM by quantizing the KV cache with FP8 (double the context) or TurboQuant (more than double)[1].
0: https://medium.com/@leannetan/extending-context-length-with-...
1: https://docs.vllm.ai/en/latest/features/quantization/quantiz...
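A rough sketch of why halving the bytes per cached element roughly doubles the context that fits in a fixed VRAM budget. It assumes a plain attention KV cache, where each token stores one K and one V vector per layer. The layer count, KV-head count, head dimension and 8 GiB budget below are placeholders I made up, not the real Qwen 3.6 27B architecture, and a model with non-standard attention layers would cache less than this. The q8_0 figure uses llama.cpp's block layout of 34 bytes per 32 values.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_elem: float) -> float:
    """Bytes of KV cache per token: one K and one V vector per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def max_context(budget_bytes: int, per_token_bytes: float) -> int:
    """How many tokens of KV cache fit in the given budget."""
    return int(budget_bytes // per_token_bytes)


# HYPOTHETICAL architecture and budget, for illustration only.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 48, 8, 128
BUDGET = 8 * 1024**3  # 8 GiB reserved for the KV cache

BYTES_PER_ELEM = {
    "f16": 2.0,
    "fp8": 1.0,
    "q8_0": 34 / 32,  # 32 int8 values plus one fp16 scale per block
}

contexts = {
    name: max_context(BUDGET, kv_bytes_per_token(N_LAYERS, N_KV_HEADS, HEAD_DIM, b))
    for name, b in BYTES_PER_ELEM.items()
}

for name, ctx in contexts.items():
    print(f"{name}: ~{ctx} tokens")
```

With these made-up numbers the fp8 cache holds about twice as many tokens as f16, and q8_0 lands slightly below fp8 because of the per-block scale.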
You don't pick just one model to "work on real codebases". You use a very advanced model to plan, and a not-very-advanced, cheaper, faster model to execute planned tasks. This saves money and speeds up work. This is the guidance from Anthropic & OpenAI.
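A minimal sketch of that split, assuming each model is just a function from prompt to text. The function names, the prompt wording and the stub models are all hypothetical; in practice the planner would wrap a call to a frontier API and the executor a call to a local server.

```python
from typing import Callable, List

Model = Callable[[str], str]  # prompt in, completion out


def plan_then_execute(task: str, planner: Model, executor: Model) -> List[str]:
    """Ask the expensive model for a step list, then run each step on the cheap one."""
    plan = planner(f"Break this task into short, independent steps, one per line:\n{task}")
    steps = [line.strip() for line in plan.splitlines() if line.strip()]
    return [executor(f"Task: {task}\nStep: {step}") for step in steps]


# Stub models so the sketch runs without any network access.
def fake_planner(prompt: str) -> str:
    return "write parser\n\nadd tests\n"


def fake_executor(prompt: str) -> str:
    step = prompt.splitlines()[-1].removeprefix("Step: ")
    return f"done: {step}"


results = plan_then_execute("build a CSV importer", fake_planner, fake_executor)
print(results)
```

The planner is called once per task and the executor once per step, which is where the cost and latency savings would come from.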
It works really well for me, at least for Python and JavaScript, with swival.dev as a harness.
But jokes aside, I love the fast iteration. These are most probably finetunes on the 3.5 architecture again that appear better in internal testing, which is still very nice to see. Putting more and more pressure on the bigger labs to perform better is always a good thing.
There are companies or labs like DeepSeek that produce fewer but larger and more innovative models, and so seem to be more research oriented.
Then there are companies like z.ai (GLM), Minimax, and Qwen, which focus more on commercializing the AI and so produce far more versions, but with far fewer improvements between them (usually fine-tunes).
Commercial providers like Anthropic probably do the same thing, maybe even without labeling it as a different version if the model is similar enough.
Maybe nothing released to the public. I don't know that all of their models are public. I think all they really care about is that they aren't relying on one or two cloud providers for a critical piece of their infrastructure.
Ideally I would love to see a leaderboard with relatively objective ranking criteria that 1. lets you filter by open weight / locally runnable, 2. filter by date of release (nothing older than x), and 3. is agnostic to hardware requirements. I just want to know what the best model is. Let me worry about how I will afford to run it.
I love the llmfit project for seeing what will run on your hardware, but it would be nice to know what I'm missing out on by not having better hardware, which is why objective hardware-agnostic ratings would be helpful.
Any open benchmark has a very short life, since it will be pulled in and DPO / RL trained quickly for benchmaxxing purposes. So, you'll need a private test to have a hope of something fair. (These also get leaked over time, btw, so even then there's a window of usability).
These are expensive to run.
Now consider that there might be 15-20 viable quants for a given open model release; someone would have to want to pay for these private evals to be run on them. Even then, a good read through unsloth's commits and blog posts will remind you that there's quite a lot of engineering work to be done to get model inference working properly, even for models released by frontier or near-frontier labs. So, you'd want to make sure that you have a replicable 'best engineered' deployment to evaluate, or at least one that's closest to your hardware and fits the bill.
Upshot: it's much faster to download and try out a model, and possibly cheaper too. Well, cheaper since Hugging Face is paying the bandwidth bills.
This is a very typical manager question that I suppose many people have who fail to see the simple truth: There is no "best" model. There are only best models for certain use-cases. Sometimes you'll find these in custom community leaderboards on platforms like huggingface, but for most business applications you'll probably have to come up with your own benchmark. Most common benchmarks are pretty worthless by now because all the usual ones are being gamed hard by model providers, to the point that there are now sometimes drastic differences between models that perform very similarly on common benchmarks.
It is a day or weekend's worth of work, but I think this is the best way to determine whether the model fits your needs specifically. This is what led me to finding out that Qwen 27B outperformed even Kimi on those tasks, and that Opus tries gaslighting me when I give it a spec for something that has been proven possible but has no published solution online. All other models gave their best shot at solving it; Opus just said it's not possible (even when I gave it the finished product that obviously works).
Especially for small models (but also big ones), I think the only way to know if a model will improve your workflow is this: personal benchmarks, expanded over time, run in private.
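A bare-bones sketch of what such a private benchmark could look like: a list of your own prompts, each paired with a checker function, scored per model. Everything here is illustrative; the two tasks and the stub "models" are made up, and a real version would replace the stubs with functions that call whatever endpoint you run.

```python
from typing import Callable, Dict, List, Tuple

Model = Callable[[str], str]                 # prompt in, completion out
Task = Tuple[str, Callable[[str], bool]]     # (prompt, checker for the output)


def run_benchmark(models: Dict[str, Model], tasks: List[Task]) -> Dict[str, float]:
    """Return the fraction of tasks each model passes."""
    scores: Dict[str, float] = {}
    for name, ask in models.items():
        passed = sum(1 for prompt, check in tasks if check(ask(prompt)))
        scores[name] = passed / len(tasks)
    return scores


# Made-up tasks; keep the real ones private and grow the list over time.
TASKS: List[Task] = [
    ("What is 17 * 23? Answer with the number only.",
     lambda out: out.strip() == "391"),
    ("Reverse the string 'abc'. Answer with the string only.",
     lambda out: out.strip() == "cba"),
]

# Stub models so the sketch runs offline.
MODELS: Dict[str, Model] = {
    "always-391": lambda prompt: "391",
    "echo": lambda prompt: prompt,
}

scores = run_benchmark(MODELS, TASKS)
print(scores)
```

Checker functions keep the scoring objective and repeatable, which matters more than the size of the task list when the goal is comparing models for your own work.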
https://huggingface.co/froggeric/Qwen-Fixed-Chat-Templates
Maybe this will help you?
OpenCode is pretty good too
> Qwen3.7 Preview lands on Arena !
> Here come Qwen3.7-Max-Preview & Qwen3.7-Plus-Preview. Alibaba now #6 lab in Text, #5 in Vision.
> Can't wait to release Qwen3.7 series models! Stay tuned! @arena
Also, a big caveat in using Qwen models has always been their speech patterns. I do wonder how Google made the Gemma lineup so good at this.
Let's hope Alibaba continues to open source its models
Qwen 3.5/3.6 are far better at vision. Even the 9B model beats Gemma 4 31B in my use case. They describe the scene more accurately and they focus on the important elements like a human would.
Gemma 4 frequently misses important elements, doesn't understand what things are, and is very coy even if you ask for lots of detail. You have to give it hints ("hey, what's that round thing on the left?") to get half-decent answers.
(Yes I did set the min-tokens correctly. I also tested bf16 and Q8 to make sure it wasn't a quant issue.)
It's unfortunate because Gemma 4 is so so so much better at natural language interactions.
Can you give an example? And/or is there a benchmark specifically for this?
At least for now. Worried the Chinese team will change their mind once they have parity
Right now they want to prevent the US labs from gaining any sort of self-reinforcing oligopoly on the space, and to let the ecosystem in China flourish.
That will all die sooner or later.
¹: I think I read this a couple of times, but I'm not sure if it is correct to begin with. Can this be substantiated based on annual financial reporting or other published business metrics by OpenAI, Anthropic et al.?