I would like to add some anecdata to this.
When I was a PhD student, I already had 12 years of using and administering Linuxes as my personal OS, and I'd already had my share of package manager and dependency woes.
But managing Python, PyTorch, and CUDA dependencies was relatively new to me. Sometimes I'd lose an evening here or there to something silly. But I had one week especially dominated by these woes, to the point where I'd have dreams about package management problems at the terminal.
They were mundane dreams but I'd chalk them up as nightmares. The worst was having the pleasant dream where those problems went away forever, only to wake up to realize that was not the case.
> When I was a PhD student, I already had 12 years of using and administering Linuxes as my personal OS, and I'd already had my share of package manager and dependency woes.
I'm in a very similar boat (just defended a few months ago). More than once I had installed PyTorch into a new environment and subsequently spent hours trying to figure out why things suddenly weren't working. Turns out, PyTorch had just uploaded a bad wheel.
Weirdly I feel like CUDA has become easier yet Python has become worse. It's all package management. Honestly, I find myself wanting to use package managers less and less because of Python. Of course `pip install` doesn't work, and that is probably a good thing. But the result of this is that any time you install a package it adds the module as a system module, which I thought was the whole thing we were trying to avoid. So what? Do I edit every package build now so that it runs a uv venv? If I do that, then this seems to just get more complicated as I have to keep better track of my environments. I'd rather be dealing with environment modules than that. I'd rather things be wrapped up in a systemd service or nspawn than that!
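To make the complaint concrete, the per-project dance looks roughly like this (a sketch assuming uv; `torch` here is just a stand-in for whatever the package actually needs):

  # create an isolated environment inside the project directory
  uv venv .venv
  # activate it for this shell session (Linux/macOS)
  source .venv/bin/activate
  # install into the venv instead of the system site-packages
  uv pip install torch

Multiply that by every project, plus remembering which environment goes with which checkout, and "simple" stops feeling like the right word.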
I mean I just did an update and upgrade and I had 13 Python packages and 193 Haskell modules, out of 351 packages! This shit is getting insane.
People keep telling me to keep things simple, but I don't think any of this is simple. It really looks like a lot of complexity created by a lot of things being simplified. I mean, isn't every big problem created out of a bunch of little problems? That's how we solve big problems -- break them down into small problems -- right? Did we forget the little things matter? If you don't think they do, did you question whether this comment was written by an LLM because I used a fucking em dash? Seems like you latched onto something small. It's hard to know when the little things matter and when they don't; often we just don't realize the little things are part of the big things.
But the point is more that, for me, this is a somewhat rare instance where I think using the term "nightmare" in the title is justified.
> Public index servers SHOULD NOT allow the use of direct references in uploaded distributions. Direct references are intended as a tool for software integrators rather than publishers.
This means that PyPI will not accept your project metadata as you currently have it configured. See https://github.com/pypi/warehouse/issues/7136 for more details.
cpu = [
"torch @ <https://download.pytorch.org/whl/cpu/torch-2.7.1%2Bcpu-cp312-cp312-manylinux_2_28_x86_64.whl> ; python_version == '3.12'",
"torch @ <https://download.pytorch.org/whl/cpu/torch-2.7.1%2Bcpu-cp313-cp313-manylinux_2_28_x86_64.whl> ; python_version == '3.13'",
]
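For what it's worth, the usual way around that restriction is to keep plain version requirements in the metadata and point the installer at PyTorch's own index instead, something like (a sketch of the standard PyTorch install command, not necessarily what the author settled on):

  # pull the CPU-only build from PyTorch's package index rather than PyPI
  pip install torch --index-url https://download.pytorch.org/whl/cpu

The catch is that this pushes the index choice onto whoever runs `pip install`, which is exactly the kind of thing the extras in the snippet above were trying to hide.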
:-/ It reminds me of Microsoft calling their thing "cross platform" because it works on several copies of Windows.
In all seriousness, I get the impression that pytorch is such a monster PITA to manage because it cares so much about the target hardware. It'd be like a blog post saying "I solved the assembly language nightmare".
If you do not care about performance and would rather have portability, use an alternative like tinygrad that does not optimize for every accelerator under the sun.
This need for hardware-specific optimization is also why the assembly language analogy is a little imprecise. Nobody expects one binary to run on every CPU or GPU with peak efficiency, unless you are talking about something like Redbean which gets surprisingly far (the creator actually worked on the TensorFlow team and addressed similar cross-platform problems).
So maybe the blog post you're looking for is https://justine.lol/redbean2/.
Or, looked at a different way, Torch has to work this way because Python packaging has too narrow an understanding of platforms, one that treats many materially different platforms as the same platform.
That is: there's nothing stopping the author from building on the approach he shares to also include Windows/FreeBSD/NetBSD/whatever.
It's his project (FileChat), and I would guess he uses Linux. It's natural that he'd solve this problem for the platforms he uses, and for which wheels are readily available.
So, you're doubling down on OP's misnomer of "cross platform means whatever platforms I use", eh?
You should be specific about which distributions you have in mind.
In the comparative table, they claim that conda doesn't support:
* lock file: which is false, you can freeze your environment (see the sketch after this list)
* task runner: I don't need my package manager to be a task runner
* project management: You can do 1 env per project? I don't see the problem here...
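To spell out the lock-file point, the freezing I have in mind is just this (a rough sketch; both commands are stock conda, though exact reproducibility depends on the channels staying put):

  # full environment spec, including pip-installed packages
  conda env export > environment.yml
  # exact package URLs; recreate later with `conda create --name myenv --file spec-file.txt`
  conda list --explicit > spec-file.txt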
So no, please, just use conda/mamba and conda-forge.
`pip install torchruntime`
`torchruntime install torch`
It figures out the correct torch to install on the user's PC, factoring in the OS (Win, Linux, Mac), the GPU vendor (NVIDIA, AMD, Intel) and the GPU model (especially for ROCm, whose configuration varies per generation and ROCm version).
And it tries to support quite a number of older GPUs as well, which are pinned to older versions of torch.
It's used by a few cross-platform torch-based consumer apps, running on quite a number of consumer installations.
This ends up wasting space and slowing down installation :(
Speaking of PyTorch and CUDA, I wish the Vulkan backend would become stable, but that seems like a far-off dream...
https://docs.pytorch.org/executorch/stable/backends-vulkan.h...
It doesn't solve how you package your wheels specifically; that problem is still pushed onto your downstream users because of boneheaded packaging decisions by PyTorch themselves. But as the consumer, Pixi softens the blow. The conda-forge builds of PyTorch are also a bit more sane.
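In case it helps, the Pixi flow I mean is roughly this (a sketch from memory, so treat the exact commands and defaults as assumptions; `my-project` is just a placeholder):

  # create a project with a pixi.toml manifest
  pixi init my-project
  cd my-project
  # add PyTorch from the configured conda channels (conda-forge by default); this resolves and writes the lock file
  pixi add pytorch

The per-project lock file is what softens the blow: downstream users get the same resolved environment without having to care which wheel index PyTorch happens to publish to.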
That's why people will go to stupid lengths to convert models from PyTorch / TensorFlow with onnxtools / coremltools to avoid touching the model / weights themselves.
The only one that escaped this is llama.cpp; weirdly, despite the difficulty of model conversion with ggml, people seem to do it anyway.
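For anyone who hasn't done it, the conversion people put up with looks roughly like this (a sketch; the script and binary names have been renamed across llama.cpp versions, so treat them as assumptions):

  # convert a Hugging Face checkpoint directory to GGUF
  python convert_hf_to_gguf.py /path/to/hf-model --outfile model-f16.gguf
  # optionally quantize to shrink it for local inference
  ./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M

Clunky, but once it's done the result runs from a single self-contained binary plus a weights file, which is exactly the property the PyTorch stack makes so hard to get.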