Nextflow and Snakemake are the two most widely used options in bioinformatics these days, with WDL trailing behind.
I really wish Nextflow were based on Scala rather than Groovy, but so it goes.
There is a draft up for DSL3 that adds static types to channels, which I'm very excited about: https://github.com/nf-core/fetchngs/pull/309
I have a dislike of Nextflow because it submits tens of thousands of separate jobs to our HPC scheduler, which causes a number of issues, though they've now added support for array jobs, which should hopefully solve that.
We opted to implement all of this in Go in SciPipe, where we get similar basic dataflow/flow-based functionality to Nextflow using Go's native concurrency primitives. The Go syntax probably puts off some biologists who have written at most a bit of Python before, though, and Go doesn't let us customize the API and hide as much of the plumbing behind nice syntax as Groovy does.
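For a flavour of what that looks like, here's a minimal two-step workflow in the spirit of SciPipe's hello-world example (adapted from memory, so details may be slightly off):

```go
package main

import sp "github.com/scipipe/scipipe"

func main() {
	// A workflow running at most 4 tasks concurrently
	wf := sp.NewWorkflow("hello_world", 4)

	// Each process wraps a shell command; {o:...} and {i:...} are
	// placeholders for the process's out- and in-ports
	hello := wf.NewProc("hello", "echo 'Hello ' > {o:out}")
	hello.SetOut("out", "hello.txt")

	world := wf.NewProc("world", "echo $(cat {i:in}) World > {o:out}")
	// Name the output based on the input file's name
	world.SetOut("out", "{i:in|%.txt}_world.txt")

	// Connect the processes' ports (channels under the hood)
	world.In("in").From(hello.Out("out"))

	wf.Run()
}
```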
In this regard, Groovy with the GPars library for concurrency doesn't seem like a particularly bad choice. There weren't that many options at the time, either.
The downside has been tooling, though, such as editor intelligence and debugging support, although parts of that are finally improving now with the Nextflow language server.
Today, one could probably implement something similar with Python's asyncio and queues for the channel semantics. There is even the Crystal language, which offers Go-like concurrency in a much more script-like language (see a comparison of Go and Crystal concurrency syntax at [1]), but Crystal would of course be an even more fringe language than Groovy.
[1] https://livesys.se/posts/crystal-concurrency-easier-syntax-t...
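For reference, the Go side of that comparison boils down to goroutines communicating over channels, which is also the model the queue/channel semantics above would emulate -- a minimal sketch (the sequences are made up):

```go
package main

import "fmt"

func main() {
	// A channel plays the same role as a Nextflow channel or an
	// asyncio.Queue: producers send values in, consumers take them out
	seqs := make(chan string)

	// Producer goroutine: emits a few items, then closes the channel
	go func() {
		defer close(seqs)
		for _, s := range []string{"MKTAYIAK", "GATTACA", "MENDEL"} {
			seqs <- s
		}
	}()

	// Consumer: ranges over the channel until it is closed
	for s := range seqs {
		fmt.Println("processing", s)
	}
}
```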
For industrial purposes, I've started to approach these pipelines as a special case of feature extraction and so I'm reusing our ML infrastructure as much as possible.
Why did you rule out Nextflow or Snakemake? I believe they both work with k8s clusters.
Argo doesn’t look great from my standpoint as a workflow author.
YAML does leave a lot to be desired, but it also forces a degree of simplicity in architecting the pipeline, because doing otherwise is too cumbersome. I really liked WDL as a language when I used to use it -- it seemed to have a nice balance of readability and simplicity. I believe Dyno created a Python SDK for the Argo YAML syntax, and I need to look into that more.
For example, Snakemake makes it very difficult, if not impossible, to create pipelines that deviate from a DAG architecture. In cases where you need loops, conditionals, and so on, Nextflow is a better option.
One thing that I didn't like about Nextflow is that all processes have to run under either Apptainer or Docker; you can't mix and match Docker/Apptainer per process the way you can with Snakemake rules.
For example, assume you have:

1. A process that "generates" protein sequences
2. A collection of processes that perform computationally intensive downstream analyses
3. A filter that decides, based on some calculation and a threshold, whether the output from process (1) should move on to process (2).
Furthermore, assume you'd like process (1) to continue generating new candidates continuously and independently until N candidates pass the filter for downstream processing.
That's not something you can do easily with Snakemake, since it generates the DAG before computation starts. Sure, you can create some hack, or use checkpoints that force Snakemake to re-evaluate the DAG and so on, and maybe pass --keep-going so that one failing branch doesn't take the other processes down with it, but with Nextflow you just set up a few channels as queues and connect them to processes, which is much easier.
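To make the dataflow concrete, here's a rough sketch of that generate → filter → downstream pattern using plain Go channels (the scoring function, threshold, and N are made up for illustration); the Nextflow channel version has the same shape:

```go
package main

import "fmt"

// score stands in for the expensive calculation behind the filter (3)
func score(seq string) float64 { return float64(len(seq)) }

func main() {
	const n = 5           // stop once n candidates have passed the filter
	const threshold = 3.0 // made-up cutoff

	candidates := make(chan string)
	passed := make(chan string)
	done := make(chan struct{})

	// Process (1): keeps generating new candidates until told to stop
	go func() {
		defer close(candidates)
		for i := 0; ; i++ {
			select {
			case candidates <- fmt.Sprintf("seq-%d", i):
			case <-done:
				return
			}
		}
	}()

	// The filter (3): forwards only candidates that clear the threshold
	go func() {
		defer close(passed)
		for c := range candidates {
			if score(c) >= threshold {
				passed <- c
			}
		}
	}()

	// Downstream processing (2): consume until n have passed, then stop (1)
	for i := 0; i < n; i++ {
		fmt.Println("downstream processing:", <-passed)
	}
	close(done)
}
```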
That said, I do find Snakemake easier to prototype with, and it also has plenty of production features (containers, cloud execution, etc.). For many use cases, the two are functionally equivalent.
Technically, Snakemake can do it all, but in practice Nextflow seems to scale up a bit better.
That said, if you don't need the UI for scientists, I'd stick with Snakemake.