Good writeup, but I'm curious about tail latency under mixed prompts. If one request has a huge context and another is tiny, do you bucket by expected decode length, or just FIFO with continuous refill?
Also, did you test fairness knobs? I've seen p95 improve while a few tenants get starved unless there's some aging policy.
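By aging policy I mean roughly this kind of thing (a minimal sketch, all names and the scoring formula are mine, not from your post): shortest-expected-decode-first, but with waiting time discounting the score so old requests eventually win.

```python
def effective_score(request, now, aging_rate=0.1):
    # Lower score = scheduled sooner. Base score favors short expected
    # decodes (SJF-style), but time spent waiting subtracts from the
    # score so a long request can't be starved indefinitely.
    wait = now - request["arrival"]
    return request["expected_decode_len"] - aging_rate * wait

def pick_next(queue, now, aging_rate=0.1):
    # Choose the queued request with the lowest effective score.
    return min(queue, key=lambda r: effective_score(r, now, aging_rate))
```

With `aging_rate=0` this degenerates to pure shortest-job-first (great p95, starvation risk); cranking it up converges toward FIFO. Curious where on that spectrum you landed.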