2 points by namegulf 7 hours ago | 2 comments
  • josefcub 25 minutes ago
    I've got 256GB of RAM on a Mac Studio M3 Ultra. The other posters are right: the M3 Ultra's prefill is super slow with really large models, 3-5 minutes while it digests the new additions to its context before it continues generating. On my high-RAM machine I _can_ run 400b-500b models at Q2, and up to about 750b models at Q1, but the wait isn't the worst part.
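
    For a rough sense of why those quants are the ceiling, here's a back-of-the-envelope sketch in Python; the bits-per-weight figures are my assumed rough averages for llama.cpp-style quants, not exact numbers:

        # Approximate weight footprint for a quantized model.
        BPW = {"Q8": 8.5, "Q4": 4.5, "Q2": 2.6, "Q1": 1.6}  # bits/weight, assumed averages

        def model_gb(params_billion, quant):
            """Approximate quantized weight size in GB."""
            return params_billion * 1e9 * BPW[quant] / 8 / 1e9

        for params, quant in [(500, "Q2"), (750, "Q1")]:
            print(f"{params}B at {quant}: ~{model_gb(params, quant):.0f}GB")
        # ~162GB and ~150GB respectively: both fit in 256GB with room
        # left for KV cache, which is why Q1/Q2 is the ceiling on this box.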

    Quants that low degrade the output, making the model less capable overall and more prone to forgetting things.

    Here's what I'd do with 96GB of RAM: run Qwen 3.6 35b-a3b at Q8 for coding/agentic tasks. You'll get around 70 tokens per second generated, the prefill is lightning fast in comparison, and you'll get a lot of work done. Qwen 3.6 27b is out now too, and I'm getting 17 tok/sec generation with a slower prefill.
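
    If you want to sanity-check the tokens-per-second numbers on your own setup, here's a minimal sketch assuming a local OpenAI-compatible server (llama.cpp's llama-server, Ollama, etc.); the port, path, and model name are placeholders to adjust for your server:

        # Time one completion against a local OpenAI-compatible endpoint.
        import time
        import requests

        t0 = time.time()
        resp = requests.post(
            "http://localhost:8080/v1/chat/completions",  # assumed local server
            json={
                "model": "qwen",  # placeholder id; server-dependent
                "messages": [{"role": "user", "content": "Write a binary search in Python."}],
                "max_tokens": 512,
            },
        ).json()
        elapsed = time.time() - t0
        out = resp["usage"]["completion_tokens"]
        # Elapsed time includes prefill, so pure generation speed is a bit higher.
        print(f"{out} tokens in {elapsed:.1f}s -> {out / elapsed:.0f} tok/s")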

    The upshot is that you'll still have 20-40GB of RAM left for your workstation and development loads. Running Qwen 3.6 35b or 27b at Q8 quantization, the model with 128k of context uses about 40GB of RAM; my OS and application load uses 20-30GB most of the time, for a total of 60-70GB. That's plenty of room in memory for you to work _and_ run inference.
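
    Spelled out, the budget arithmetic looks like this (numbers taken straight from above, so treat them as estimates):

        # 96GB budget check using the rough numbers above.
        total_gb = 96
        model_gb = 40                  # ~Q8 model weights + 128k context
        for os_gb in (20, 30):         # typical OS + application load
            print(f"OS at {os_gb}GB -> headroom {total_gb - model_gb - os_gb}GB")
        # 26-36GB left over, i.e. squarely inside the 20-40GB range cited above.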

    You _may_ end up getting Deepseek 4 Flash running, but it'll be at a lower quantization like Q2 or Q3, making it kind of dumb in comparison. And you may not have enough memory left over for any appreciable amount of context. Today's reasoning models need context headroom to generate good answers, doubly so for agentic/coding tasks.

  • bigyabai 7 hours ago
    It might run the smaller flash version, but 96GB is not enough for the trillion-parameter model.

    The M3 Ultra's GPU is a bit on the weak side for large-scale inference, so you'll be waiting on token prefill for most coding/agent workflows.

    • namegulf 5 hours ago
      They have a 512GB RAM option, but it's pricey.

      Have you tried any other models with this M3 Ultra?

      • bigyabai 5 hours ago
        The 512GB model would have to use a lobotomized quant like Q2 or Q1, and you would still be waiting 3-5 minutes to process context lengths in the 32,000-64,000 token range.
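
        Inverting those numbers gives a rough sense of the implied prompt-processing speed (just arithmetic on the figures above):

            # Implied prefill throughput for 3-5 min over 32k-64k tokens.
            for ctx in (32_000, 64_000):
                for minutes in (3, 5):
                    print(f"{ctx} tok / {minutes} min = {ctx / (minutes * 60):.0f} tok/s")
            # Roughly 107-356 tok/s of prefill; a discrete GPU typically
            # processes prompts at thousands of tokens per second.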

        Apple's GPUs are just not very fast for inference. I'd stick to smaller models in the 7b-18b parameter range, or MoE models like Qwen, if you want usable inference speed.

        • namegulf 4 hours ago
          Looks like that's a good idea for now. Yeah, 3-5 mins isn't practical.

          Any thoughts on the M5?

          They may soon be releasing an M5 Mac Studio/Mini.

        • namegulf 4 hours ago
          Is the NVIDIA DGX Spark a good option?

          $4,699.00

          But it looks like we may also need an NVIDIA AI Enterprise - DGX Spark license.