There are a number of properties of SLURM and other batch systems that make them far more convenient for users. A SLURM cluster typically has a large distributed filesystem mounted on every node, accessible with normal UNIX tools (GFS and Colossus aren't mountable filesystems, and the standard UNIX tools don't really work against them natively). That makes reading output logs much easier (tail -f, less, etc.), but the per-byte cost of such a filesystem climbs steeply as the node count grows.
I have extensive experience with both systems. My experience with SLURM is that the system is highly predictable and slowly changing, and I can get my work done. On Google's systems, by contrast, something was always breaking, and I had to focus on making my jobs resilient to that noise. On the other hand, the Google system scaled far larger, for cheaper.
When I joined Google I asked Jeff Dean how PageRank was computed, and he said it was an iterative MapReduce. I had assumed it was some sort of tightly coupled supercomputer-style job, but at the scale of the web, with its link structure, MR made much more sense.
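To make "iterative MapReduce" concrete, here's a toy reconstruction (mine, not Google's actual code): each iteration is one map phase that emits rank shares along out-links, and one reduce phase that sums shares per page. The graph, damping factor usage, and function names are all illustrative assumptions.

```python
from collections import defaultdict

DAMPING = 0.85  # standard PageRank damping factor

def map_phase(ranks, links):
    # map: each page emits rank/outdegree to every page it links to
    for page, rank in ranks.items():
        for dest in links[page]:
            yield dest, rank / len(links[page])

def reduce_phase(pairs, pages):
    # reduce: sum incoming shares per page, then apply damping
    sums = defaultdict(float)
    for dest, share in pairs:
        sums[dest] += share
    return {p: (1 - DAMPING) / len(pages) + DAMPING * sums[p] for p in pages}

# tiny three-page web (hypothetical): a -> b, b -> {a, c}, c -> a
links = {"a": ["b"], "b": ["a", "c"], "c": ["a"]}
ranks = {p: 1 / len(links) for p in links}
for _ in range(20):  # run MR rounds until ranks settle
    ranks = reduce_phase(map_phase(ranks, links), links)
```

The point is that each round is an embarrassingly parallel shuffle-and-sum rather than a tightly coupled computation, which is why it fit the MR model.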
"gang scheduling" according to the official docs: https://slurm.schedmd.com/gang_scheduling.html
-- maybe I've been reading the docs wrong for the last decade of using SLURM.
I'm using the term in its broader CS sense: all-or-nothing co-scheduling of related processes across multiple processors [1]. That's the definition used across the K8s ecosystem: Volcano [2], Kueue [3], and the K8s Coscheduling plugin all define gang scheduling as "all or nothing" allocation.
I still stand by the original claim:
Slurm allocates multi-node jobs atomically, while vanilla K8s doesn't: its default scheduler places pods as resources become available, which can lead to partial allocations and deadlocks for distributed training. It's just a terminology clash. Thanks for the comment anyway.
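A minimal sketch of the failure mode (not real Slurm or K8s code, just the allocation logic): two 3-node jobs competing for a 4-node cluster. Placing pods one at a time can leave both jobs holding a partial allocation forever, while all-or-nothing placement always lets one job run.

```python
def incremental_place(free, jobs):
    """Hand each job nodes one at a time as they free up (K8s-default style)."""
    held = {j: 0 for j in jobs}
    while free > 0:
        for j in jobs:
            if free > 0 and held[j] < jobs[j]:
                held[j] += 1
                free -= 1
    # a job can only run once it holds its full allocation
    running = [j for j in jobs if held[j] == jobs[j]]
    return running, held

def gang_place(free, jobs):
    """All-or-nothing: a job gets nodes only if its full request fits."""
    running, held = [], {}
    for j in jobs:
        if jobs[j] <= free:
            held[j] = jobs[j]
            free -= jobs[j]
            running.append(j)
    return running, held

jobs = {"trainA": 3, "trainB": 3}  # each wants 3 of the 4 nodes
print(incremental_place(4, jobs))  # ([], {'trainA': 2, 'trainB': 2}) -- deadlock
print(gang_place(4, jobs))         # (['trainA'], {'trainA': 3}) -- B waits, A runs
```

In the incremental case both jobs hold 2 nodes, neither can start, and neither releases anything: that's the deadlock. The gang version admits trainA whole and makes trainB wait its turn.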
[1] https://en.wikipedia.org/wiki/Gang_scheduling [2] https://volcano.sh/en/docs/plugins/ [3] https://www.coreweave.com/blog/kueue-a-kubernetes-native-sys...