23 points by jxmorris12 4 hours ago | 3 comments
  • umairnadeem123 5 minutes ago
    Good writeup, but I'm curious about tail latency under mixed prompts. If one request has a huge context and another is tiny, do you bucket by expected decode length, or just run FIFO with continuous refill? (Rough sketch of what I mean below.)
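
    By "FIFO with continuous refill" I mean something like this toy sketch; the Request type and the in-place decode step are stand-ins I made up, not anything from the article:

      from collections import deque
      from dataclasses import dataclass

      @dataclass
      class Request:
          rid: int
          remaining_tokens: int  # decode steps left for this request

      class ContinuousBatcher:
          def __init__(self, max_batch_size: int):
              self.max_batch_size = max_batch_size
              self.queue: deque[Request] = deque()  # FIFO arrival order
              self.running: list[Request] = []

          def step(self) -> None:
              # Refill freed slots from the queue before every decode step,
              # instead of waiting for the whole batch to drain.
              while self.queue and len(self.running) < self.max_batch_size:
                  self.running.append(self.queue.popleft())
              for r in self.running:       # one decode step per running request
                  r.remaining_tokens -= 1  # stands in for the actual forward pass
              self.running = [r for r in self.running if r.remaining_tokens > 0]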

    Also, did you test any fairness knobs? I've seen p95 improve while a few tenants get starved unless there's some aging policy.
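
    The aging policy I have in mind, sketched with made-up names and a made-up weight (it's a knob, not a measured value):

      import time

      AGING_WEIGHT = 0.05  # priority gained per second spent waiting (tunable)

      def effective_priority(base_priority: float, enqueue_time: float,
                             now: float | None = None) -> float:
          # Waiting requests gain priority over time, so a tenant stuck behind
          # long-context work eventually wins even under a throughput-first policy.
          now = time.monotonic() if now is None else now
          return base_priority + AGING_WEIGHT * (now - enqueue_time)

      # Pick the next request by aged priority instead of pure FIFO:
      # next_req = max(queue, key=lambda r: effective_priority(r.base, r.enqueued))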

  • charcircuit3 hours ago
    This article doesn't explain what happens when multiple prompts need different experts. Does it try to schedule the maximum number of experts into memory so it can run the maximum number of prompts at once? Scheduling gets complicated quickly, and there are trade-offs around fairness: which prompts get processed at which times.
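
    One way to frame that trade-off, as a speculative sketch (in a real MoE, routing is per-token and only known at run time, so treat the per-request expert sets here as a simplification; none of this is the article's scheduler):

      def schedule_batch(pending: list[tuple[int, set[int]]],
                         expert_budget: int) -> list[int]:
          """pending: (request_id, experts_needed) pairs in FIFO order.
          Returns the request ids to run together this step."""
          loaded: set[int] = set()
          batch: list[int] = []
          for rid, experts in pending:
              if len(loaded | experts) <= expert_budget:
                  loaded |= experts  # greedily admit while experts fit in memory
                  batch.append(rid)
              # else: skipping keeps throughput high but can starve this
              # request, which is exactly the fairness problem above.
          return batch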
  • asteroidburger3 hours ago
    How long until “first principles” is a meme like “considered harmful”? Or are we there already?