14 points by cafkafk 4 days ago | 3 comments
  • DiabloD3 a day ago
    I suspect this person didn't read the README or the `--help` at all.

    For example, `--cpu-moe` keeps the MoE layers off the GPU. That drops performance by about a quarter, but only the dense and important layers stay on the GPU, so you can run MoE models bigger than VRAM almost for free and also free up room in VRAM for more KV cache. It does nothing on CPU-only.

    `--no-kv-offload` also does nothing here: it makes it not offload KV cache to VRAM... he doesn't have a GPU to offload to, and this is the default there.

    Again, `-sm` is only for multi-GPU. No GPUs here.

    `--mla-use` is for models that use DeepSeek's Multi-Head Latent Attention. Gemma 4 is not one of them.

    `--merge-up-gate-experts` reduces matrix math complexity around ffn_up and ffn_gate tensors; CPUs do not have tensor units and this is unlikely to actually help.

    MTP is also never faster on CPU-only, and this is documented. ngram-mod, however, may help, which it doesn't look like he tried.

    This whole screed also reads like it was written by AI.

  • usernamed7 a day ago
    > I am telling you the count because the count is the point.

    > The honest caveat, because it matters:

    > This one I got right in the original, and now I have the number to back it.

    Thanks Claude.

    • dwroberts a day ago
      What makes it particularly bad, too, is its habit of saying

      “The X was Y”

      for a non-trivial concept X, without any previous attempt to introduce or explain what it is. It reads like it’s intentionally trying to bamboozle the reader.

    • mmmpetrichor a day ago
      AI didn't read. (AIDR?)
  • cafkafk 4 days ago
    Hi HN. Follow-up to the Xeon post from a couple of weeks ago. A lot of people came away from that one with a 25-flag command and no real idea which flags materially did anything, and the honest truth is neither did I fully. So I went back and measured it, one flag off at a time, 174 server restarts, with the engine log as the source of truth instead of my own assumptions.
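
    For anyone who wants to try the same one-flag-off-at-a-time approach, here is a minimal sketch of how the run list could be generated. It is not the harness used for these measurements: the function name `one_off_ablations` and the flags `--flag-a`, `--flag-b`, `--flag-c` are hypothetical placeholders, not real ik_llama.cpp options, and it only builds argument lists rather than starting a server or timing anything.

```python
from typing import Dict, List, Optional


def one_off_ablations(base_flags: Dict[str, Optional[str]]) -> Dict[str, List[str]]:
    """Build one argument list per run: the full baseline, plus one run
    per flag with only that flag removed.

    base_flags maps a flag name to its value, or to None for a bare switch.
    """

    def render(flags: Dict[str, Optional[str]]) -> List[str]:
        # Flatten {"--x": "1", "--y": None} into ["--x", "1", "--y"].
        argv: List[str] = []
        for name, value in flags.items():
            argv.append(name)
            if value is not None:
                argv.append(value)
        return argv

    runs = {"baseline": render(base_flags)}
    for dropped in base_flags:
        remaining = {k: v for k, v in base_flags.items() if k != dropped}
        runs[f"without {dropped}"] = render(remaining)
    return runs


if __name__ == "__main__":
    # Placeholder flags for illustration only (hypothetical names).
    example = {"--flag-a": None, "--flag-b": "8", "--flag-c": None}
    for run_name, argv in one_off_ablations(example).items():
        print(run_name, argv)
```

    Each run differs from the baseline by exactly one flag, so any change in throughput can be attributed to that flag alone. It does not catch interactions between flags, which would need pairwise runs.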

    This is also an attempt to start answering the "cool, but do you have numbers for this" question, which is harder than one would assume!

    Two things I'll flag up front, because they're corrections to my own post: --spec-autotune, which I'd called the way to squeeze the most out of speculation, is actually the worst speculation setting I tested. Ouch.

    And --mla-use, which I presented as active, isn't even wired for Gemma 4 and gets silently ignored.

    More broadly, most of the config does nothing for a typical setup if you're not "holding it right". The flags that genuinely carry it (flash attention, the physical-core thread count, a fixed draft length) are a much shorter list than the command suggests, and the drafter turns out to be a win for code but a loss for summarization.

    And to be clear: the numbers are specific to this box, and none of it runs without the ik_llama.cpp fork I link. I'm still not an ML engineer, so if I've gotten something wrong I'd honestly rather hear it. That's the best kind of reply.

    The box is now back to being busy as a Nix cache, so answers are best effort.

    (There's a... fifth(?!) post coming on benchmarking quantization, where the single fastest config I measured turned out to be pure garbage. But making it not garbage was useful. No spoilers!)