There are a number of properties of SLURM and other batch systems that make them far more convenient for users. A SLURM cluster typically has a large distributed filesystem mounted on every node, accessible with normal UNIX tools (GFS and Colossus aren't mountable filesystems, and the standard UNIX tools don't really work against them natively). That makes reading output logs much easier (tail -f, less, etc.), but the per-byte cost of such a filesystem climbs steeply as the node count grows.
I have extensive experience with both systems. My experience with SLURM is that the system is highly predictable and slowly changing, and I can get my work done. On Google's systems, by contrast, something was always breaking, and I had to focus on making my jobs resilient to that noise. On the other hand, the Google system scaled far larger, for cheaper.
When I joined Google I asked Jeff Dean how PageRank was computed, and he said it was an iterative MapReduce. I had assumed it was some sort of tightly coupled supercomputer-style job, but at the scale of the web, with its link structure, MR made much more sense.
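To make "iterative MapReduce" concrete, here's a toy reconstruction (mine, not Google's actual code): each iteration is one map phase that emits rank shares along out-links, and one reduce phase that sums shares per page. The graph, damping factor usage, and function names are all illustrative assumptions.

```python
from collections import defaultdict

DAMPING = 0.85  # standard PageRank damping factor

def map_phase(ranks, links):
    # map: each page emits rank/outdegree to every page it links to
    for page, rank in ranks.items():
        for dest in links[page]:
            yield dest, rank / len(links[page])

def reduce_phase(pairs, pages):
    # reduce: sum incoming shares per page, then apply damping
    sums = defaultdict(float)
    for dest, share in pairs:
        sums[dest] += share
    return {p: (1 - DAMPING) / len(pages) + DAMPING * sums[p] for p in pages}

# tiny three-page web (hypothetical): a -> b, b -> {a, c}, c -> a
links = {"a": ["b"], "b": ["a", "c"], "c": ["a"]}
ranks = {p: 1 / len(links) for p in links}
for _ in range(20):  # run MR rounds until ranks settle
    ranks = reduce_phase(map_phase(ranks, links), links)
```

The point is that each round is an embarrassingly parallel shuffle-and-sum rather than a tightly coupled computation, which is why it fit the MR model.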
"gang scheduling" according to the official docs: https://slurm.schedmd.com/gang_scheduling.html
-- maybe I've been reading the docs wrong for the last decade of using SLURM.
I'm using the term in its broader CS sense: all-or-nothing co-scheduling of related processes across multiple processors [1]. That's the definition used across the K8s ecosystem: Volcano [2], Kueue [3], and the K8s Coscheduling plugin all define gang scheduling as "all or nothing" allocation.
I still stand by the original claim:
Slurm allocates multi-node jobs atomically, while vanilla K8s doesn't: its default scheduler places pods as resources become available, which can lead to partial allocations and deadlocks for distributed training. It's just a terminology clash. Thanks for the comment anyway.
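A minimal sketch of the failure mode (not real Slurm or K8s code, just the allocation logic): two 3-node jobs competing for a 4-node cluster. Placing pods one at a time can leave both jobs holding a partial allocation forever, while all-or-nothing placement always lets one job run.

```python
def incremental_place(free, jobs):
    """Hand each job nodes one at a time as they free up (K8s-default style)."""
    held = {j: 0 for j in jobs}
    while free > 0:
        for j in jobs:
            if free > 0 and held[j] < jobs[j]:
                held[j] += 1
                free -= 1
    # a job can only run once it holds its full allocation
    running = [j for j in jobs if held[j] == jobs[j]]
    return running, held

def gang_place(free, jobs):
    """All-or-nothing: a job gets nodes only if its full request fits."""
    running, held = [], {}
    for j in jobs:
        if jobs[j] <= free:
            held[j] = jobs[j]
            free -= jobs[j]
            running.append(j)
    return running, held

jobs = {"trainA": 3, "trainB": 3}  # each wants 3 of the 4 nodes
print(incremental_place(4, jobs))  # ([], {'trainA': 2, 'trainB': 2}) -- deadlock
print(gang_place(4, jobs))         # (['trainA'], {'trainA': 3}) -- B waits, A runs
```

In the incremental case both jobs hold 2 nodes, neither can start, and neither releases anything: that's the deadlock. The gang version admits trainA whole and makes trainB wait its turn.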
[1] https://en.wikipedia.org/wiki/Gang_scheduling [2] https://volcano.sh/en/docs/plugins/ [3] https://www.coreweave.com/blog/kueue-a-kubernetes-native-sys...