Metaflow: Build, Manage and Deploy AI/ML Systems(github.com)

123 pointsby plokker7 months ago8 comments

vtuulos7 months ago
I don't know if it's a coincidence but we just released a major new feature in Metaflow a few days ago - composing flows with custom decorators: https://docs.metaflow.org/metaflow/composing-flows/introduct...
A big deal is that they get packaged automatically for remote execution. And you can attach them on the command line without touching code, which makes it easy to build pipelines with pluggable functionality - think e.g. switching an LLM provider on the fly.
If you haven't looked into Metaflow recently, configuration management is another big feature that was contributed by the team at Netflix: https://netflixtechblog.com/introducing-configurable-metaflo...
Many folks love the new native support for uv too: https://docs.metaflow.org/scaling/dependencies/uv
I'm happy to answer any questions here
- theOGognf7 months ago
  Is it common to see Metaflow used alongside MLflow if a team wants to track experiment data?
  - vtuulos7 months ago
    Metaflow tracks all artifacts and allows you to build dashboards with them, so there’s no need to use MLFlow per se. There’s a Metaflow integration in Weights and Biases, CometML etc, if you want pretty off-the-shelf dashboards
awgl7 months ago
I've used Metaflow for the past 4 years or so on different ML teams. It's really great!
Straightforward for data/ML scientists to pick up, familiar python class API for defining DAGs, and simplifies scaling out parallel jobs on AWS Batch (or k8s). The UI is pretty nice. Been happy to see the active development on it too.
Currently using it at our small biotech startup to run thousands of protein engineering computations (including models like RFDiffusion, ProteinMPNN, boltz, AlphaFold, ESM, etc.).
Data engineering focused DAG tools like Airflow are awkward for doing these kinds of ML computations, where we don't need the complexity of schedules, etc. Metaflow, imho, is also a step up from orchestration tools that were born out of bioinformatics groups, like Snakemake or Nextflow.
Just a satisfied customer of Metaflow here. thx
- Bukhmanizer7 months ago
  If you’ve tried, has it been clunky to run non-python based workflows? I.e if you want to run bedtools or diamond without having to run a bunch of subprocess.run commands?
  - awgl7 months ago
    Right, for most of our workflows, we stay in python land, which is great and seamless with Metaflow being in python. But yes, there are occasions that we have to make a system call to run an old R script or even a compiled C++ executable :shrug: (Metaflow does have some native R support tho) I have not had to use the specific tools you called out, bedtools or diamond.
    Most of the time this not a blocking problem since each step in a flow is mapped to a Docker image and/or your choice of EC2 instance (e.g. one step on a GPU, another on a memory optimized instance). You can have one step use an image with all of your python-based ML stuff, and another step have a different image with compiled exectuables that are triggered by a system call. If needed, outputs from such a system call would then need to be persisted in a database/S3 or read back into the python flow for persistence. So, it is not as seamless as a flow in all python, but it can work "good enough".
anentropic7 months ago
I've been curious about this project for a while...
If you squint a bit it's sort of like an Airflow that can run on AWS Step Functions.
Step Functions sort of gives you fully serverless orchestration, which feels like a thing that should exist. But the process for authoring them is very cumbersome - they are crying out for a nice language level library i.e. for Python something that creates steps via decorator syntax.
And it looks like Metaflow basically provides that (as well as for other backends).
The main thing holding me back is lack of ecosystem. A big chunk of what I want to run on an orchestrator are things like dbt and dlt jobs, both of which have strong integrations for both Airflow and Dagster. Whereas Metaflow feels like not really on the radar, not widely used.
Possibly I have got the wrong end of the stick a bit because Metaflow also provides an Airflow backend, which I sort of wonder in that case why bother with Metaflow?
- vtuulos7 months ago
  Metaflow was started to address the needs of ML/AI projects whereas Airflow and Dagster started in data engineering.
  Consequently, a major part of Metaflow focuses on facilitating easy and efficient access to (large scale) compute - including dependency management - and local experimentation, which is out of scope for Airflow and Dagster.
  Metaflow has basic support for dbt and companies use it increasingly to power data engineering as AI is eating the world, but if you just need an orchestrator for ETL pipelines, Dagster is a great choice
  If you are curious to hear how companies navigate the question of Airflow vs Metaflow, see e.g this recent talk by Flexport https://youtu.be/e92eXfvaxU0
- kot-behemoth7 months ago
  A while ago I saw a promising Clojure project stepwise [0] which sounds pretty close to what you're describing. It not only allows you to define steps in code, but also implements cool stuff like ability to write conditions, error statuses and resources in a much-less verbose EDN instead of JSON. It also supports code reloading and offloading large payloads to S3.
  Here's a nice article with code examples implementing a simple pipeline: https://www.quantisan.com/orchestrating-pizza-making-a-tutor....
  [0]: https://github.com/Motiva-AI/stepwise
  - spieden7 months ago
    Wow cool, a project I created got a mention on HN. :D
- coredog647 months ago
  A few years back, the Step Functions team was soliciting input, and the Python thing was something that came up as a suggestion. It's hard, yes, but it should be possible to "Starlark" this and tell users that if you stick to this syntax, you can write Python and compile it down to native StepFunction syntax.
  Having said that, they have slightly improved the StepFunctions by adopting JSONata syntax.
  - anentropic7 months ago
    I don't think it should need Starlark or a restricted syntax.
    You just want some Python code that builds up a representation of the state machine, e.g. via decorating functions the same way that Celery, Dask, Airflow, Dagster et al have done for years.
    Then you have some other command to take that representation and generate the actual Step Functions JSON from it (and then deploy it etc).
    But the missing piece is that those other tools also explicitly give you a Python execution environment, so the function you're decorating is usually the 'task' function you want to run remotely.
    Whereas Step Functions doesn't provide compute itself, it mostly just gives you a way to execute AWS API calls. But the non control flow tasks in my Step Functions end up mostly being Lambda invoke steps to run my Python code.
    I'm currently authoring Step Functions via CDK. It is clunky AF.
    What it needs is some moderately opinionated layer on top.
    Someone at AWS did have a bit of an attempt here: https://aws-step-functions-data-science-sdk.readthedocs.io/e... but I'd really like to see something that goes further and smooths away a lot of the finickety JSON input arg/response wrangling. Also the local testing story (for Step Functions generally) is pretty meh.
    vtuulos7 months ago
    If you are ok with executing your SFN steps on AWS Batch, Metaflow should do the job well. It's pretty inhuman to interact with SFN directly.
    One feature that's in our roadmap is the ability to define DAG fully programmatically, maybe through configs, so you will be able to have a custom representation -> SFN JSON, just using Metaflow as a compiler
nxobject7 months ago
As a fun historical sidebar and an illustration that there are no new names in tech these days, Metaflow was also the name of the company that first introduced out-of-order speculative execution of CISC architectures using micro-ops. [1]
[1] https://en.wikipedia.org/wiki/Metaflow_Technologies
lazarus017 months ago
I went to the GitHub page. The descriptions of the service seem redundant to what cloud providers offer today. I looked at the documentation and it lacks concrete examples for implementation flows.
Seems like something new to learn, an added layer on top of existing workflows, with no obvious benefit.
- vibecodemaster7 months ago
  > redundant to what cloud providers offer today
  It may look redundant on the surface, but those cloud services are infrastructure primitives (compute, storage, orchestration). Metaflow sits one layer higher, giving you a data/model centric API that orchestrates and versions the entire workflow (code, data, environment, and lineage) while delegating the low‑level plumbing to whatever cloud provider you choose. That higher‑level abstraction is what lets the same Python flow run untouched on a laptop today and a K8s GPU cluster tomorrow.
  > Adds an extra layer to learn
  I would argue that it removes layers: you write plain Python functions, tag them as steps, and Metaflow handles scheduling, data movement, retry logic, versioning, and caching. You no longer glue together five different SDKs (batch + orchestration + storage + secrets + lineage).
  > lacks concrete examples for implementation flows
  there are examples in the tutorials: https://docs.outerbounds.com/intro-tutorial-season-3-overvie...
  > with no obvious benefit
  There are benefits, but perhaps they're not immediately obvious:
  1) Separation of what vs. how: declare the workflow once; toggle @resources(cpu=4,gpu=1) to move from dev to a GPU cluster—no YAML rewrites.
  2) Reproducibility & lineage: every run immutably stores code, data hashes, and parameters so you can reproduce any past model or report with flow resume --run-id.
  3) Built‑in data artifacts: pass or version GB‑scale objects between steps without manually wiring S3 paths or serialization logic.
- manojlds7 months ago
  It's an old project from before the current AI buzz and I rejected this when I looked at it few years back as well with similar reasons.
  My opinion about Netflix OSS has been pretty low as well.
- datadrivenangel7 months ago
  All the cloud providers have some hosted / custom version of an AI/ML deployment and training system. Good enough to use, janky enough to probably not meet all your needs if you're serious.
  - lazarus017 months ago
    I use google cloud for ML. AWS has a similar offering.
    I find google is purpose built for ml and provides tons of resources with excellent documentation.
    AWS feels like driving a double decker bus, very big and clunky, compared to google, which is a luxury sedan, that is quite comfortable to take you where you’re going.
ShamblingMound7 months ago
Have been looking for an orchestrator for AI workflows including agentic workflows and this seemed to be the most promising (open source, free, can self-host, and supports dynamic workflows).
But have not seen anyone talk about it in that context. What do people use for AI workflow orchestration (aside from langchain)?
- vtuulos7 months ago
  Stay tuned! We have some cool new features coming soon to support agentic workloads (teaser: https://github.com/Netflix/metaflow/pull/2473)
  If you are curious, join the Metaflow Slack at http://slack.outerbounds.co and start a thread on #ask-metaflow
LaserToy7 months ago
Cloudkitchens use them as well: https://techblog.cloudkitchens.com/p/ml-infrastructure-doesn...
They call it a DREAM stack (Daft, Ray Engine or Ray and Poetry, Argo and Metaflow)
- vibecodemaster7 months ago
  There's actually a lot of companies using Metaflow, big and small: https://outerbounds.com/stories
apwell237 months ago
Netflix used to release so much good opensource software a decade ago. Now it seems to have fallen out of developer mindshare. Seems like the odd one out in FAANG in terms of tech and AI.