Quantizing that low degrades the model's output, making it less capable overall and more prone to losing track of what's in its context.
Here's what I'd do with 96GB of RAM: run Qwen 3.6 35b-a3b at Q8 for coding/agentic tasks. You'll get around 70 tokens/sec of generation, prefill is lightning fast in comparison, and you'll get a lot of work done. Qwen 3.6 27b is out now too; I'm getting 17 tok/sec of generation with it, with a slower prefill.
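If you go the MLX route, here's a minimal sketch using mlx-lm. The repo id is a placeholder, not the exact model I named; point it at whichever 8-bit conversion you actually download:

```python
# Minimal mlx-lm sketch for running a Qwen MoE model on Apple Silicon.
# NOTE: the model id below is a placeholder -- swap in the Q8/8-bit
# conversion you actually pulled.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen3-30B-A3B-8bit")  # placeholder id

prompt = "Write a Python function that parses an ISO 8601 timestamp."
text = generate(model, tokenizer, prompt=prompt, max_tokens=512, verbose=True)
print(text)
```

`verbose=True` prints prompt and generation throughput, so you can check the tokens/sec numbers on your own hardware instead of taking mine on faith.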
The upshot is that you'll still have roughly 26-36GB of RAM left for your workstation and development loads. Running Qwen 3.6 35b or 27b at Q8 quantization, the model at 128k context uses about 40GB of RAM; my OS and application load uses 20-30GB most of the time, for a total of 60-70GB. That leaves plenty of room in memory for you to work _and_ run inference.
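The back-of-envelope math, using the numbers above (rough estimates from my own usage, not measurements of your setup):

```python
# Memory budget for a 96GB machine, using the rough numbers from the post.
TOTAL_RAM_GB = 96
MODEL_PLUS_CONTEXT_GB = 40       # Q8 weights + 128k context (estimate)

for os_and_apps_gb in (20, 30):  # typical desktop load range (estimate)
    used = MODEL_PLUS_CONTEXT_GB + os_and_apps_gb
    free = TOTAL_RAM_GB - used
    print(f"apps={os_and_apps_gb}GB -> used={used}GB, free={free}GB")
# apps=20GB -> used=60GB, free=36GB
# apps=30GB -> used=70GB, free=26GB
```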
You _may_ end up getting Deepseek 4 Flash running, but only at a lower quantization like Q2 or Q3, which makes it noticeably dumber in comparison. And you may not have enough memory left over for any appreciable amount of context. Today's reasoning models need ample context to generate good answers, doubly so for agentic/coding tasks.
The M3 Ultra's GPU is a bit on the weak side for large-scale inference, so you'll spend most of your time waiting on prompt prefill in coding/agent workflows.
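To put that in perspective, here's a rough time-to-first-token estimate; both numbers below are assumptions for illustration, not M3 Ultra benchmarks:

```python
# Rough time-to-first-token: prompt tokens / prefill speed.
# Both numbers are illustrative assumptions -- measure your own machine.
PROMPT_TOKENS = 30_000      # a typical agentic coding context (assumed)
PREFILL_TOK_PER_S = 300     # assumed prefill throughput (assumed)

print(f"~{PROMPT_TOKENS / PREFILL_TOK_PER_S:.0f}s before the first token")
# ~100s before the first token
```

Agentic workflows keep growing and resending that context, so a slow prefill can hurt far more than the raw generation speed suggests.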
Have you tried any other models with this M3 Ultra?
Apple's GPUs are just not very fast for inference. I'd stick to the smaller 7b-18b parameter range, or MoE models like Qwen, if you want usable inference speed.