190 points by kevmo314 3 days ago | 15 comments
  • kkielhofner 14 hours ago
    This is very interesting, but many of the motivations listed are far better served by alternative approaches.

    For "remote" model training there is NCCL + Deepspeed/FSDP/etc. For remote inferencing there are solutions like Triton Inference Server[0] that can do very high-performance hosting of any model for inference. For LLMs specifically there are nearly countless implementations.

    That said, the ability to use this for testing is interesting, but I wonder about GPU contention, and as others have noted, the performance of such a solution will be terrible even with a relatively high-speed interconnect (100/400Gb Ethernet, etc.).

    NCCL has been optimized to support DMA directly between network interfaces and GPUs, which is of course considerably faster than solutions like this. Triton can also make use of shared memory, mmap, NCCL, MPI, etc., which is one of the many tricks it uses for very performant inference - even across multiple chassis over another network layer.

    [0] - https://github.com/triton-inference-server/server
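
    For concreteness, the NCCL-based "remote" training path looks roughly like this - a rough sketch of a minimal PyTorch DDP script launched with torchrun on each node; the model, sizes, and hostnames are placeholders, not anything scuda-specific:

      # launch on every node with e.g.:
      #   torchrun --nnodes=2 --nproc_per_node=8 \
      #     --rdzv_backend=c10d --rdzv_endpoint=host0:29500 train.py
      import os
      import torch
      import torch.distributed as dist
      from torch.nn.parallel import DistributedDataParallel as DDP

      dist.init_process_group(backend="nccl")     # NCCL moves gradients GPU-to-GPU/NIC (GPUDirect RDMA where available)
      local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
      torch.cuda.set_device(local_rank)

      model = DDP(torch.nn.Linear(1024, 1024).cuda(local_rank), device_ids=[local_rank])
      x = torch.randn(32, 1024, device=f"cuda:{local_rank}")
      loss = model(x).sum()
      loss.backward()                             # gradient all-reduce happens here, over NCCL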

    • theossuary 13 hours ago
      I don't think NCCL + Deepspeed/FSDP are really an alternative to Scuda, as they all require the models in question to be designed for distributed training. They also require a lot of support in the libraries being used.

      This has been a struggle for data scientists for a while now. I haven't seen a good solution to allow a data scientist to work locally, but utilize GPUs remotely, without basically just developing remotely (through a VM or Jupyter), or submitting remote jobs (through SLURM or a library-specific Kubernetes integration). Scuda is an interesting step towards a better solution for utilizing remote GPUs easily across a wide range of libraries, not just PyTorch and TensorFlow.
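
      For context, the "submit remote jobs" status quo looks roughly like this - a rough sketch using submitit against a SLURM cluster; the partition name and resources are made up:

        import submitit

        def train(lr: float):
            import torch  # imported on the worker, next to the GPU
            return torch.cuda.get_device_name(0), lr

        executor = submitit.AutoExecutor(folder="slurm_logs")
        executor.update_parameters(slurm_partition="gpu", gpus_per_node=1, timeout_min=60)
        job = executor.submit(train, 3e-4)
        print(job.result())  # blocks locally until the remote job finishes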

      • kkielhofner 3 hours ago
        > I don't think NCCL + Deepspeed/FSDP are really an alternative to Scuda

        I put "remote" in quotes because they're not direct equivalents but from a practical standpoint it's the alternate current approach.

        > they all require the models in question to be designed for distributed training. They also require a lot of support in the libraries being used.

        IME this has changed quite a bit. Between improved support for torch FSDP and Deepspeed, and especially HF Accelerate wrapping both for transformer models, it's been a while since I've had to put much (if any) work in.

        That said, if you're running random training scripts it likely won't "just work", but with larger models becoming more common I see a lot more torchrun, accelerate, deepspeed, etc. in READMEs and code.
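
        As a rough illustration (not the author's code; the model and sizes are placeholders), an Accelerate training loop barely changes whether FSDP or DeepSpeed is selected via accelerate config:

          # run with: accelerate launch train.py  (FSDP/DeepSpeed chosen in "accelerate config")
          import torch
          from accelerate import Accelerator

          accelerator = Accelerator()
          model = torch.nn.Linear(1024, 1024)
          optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
          model, optimizer = accelerator.prepare(model, optimizer)  # wraps with FSDP/DeepSpeed as configured

          for _ in range(10):
              x = torch.randn(32, 1024, device=accelerator.device)
              loss = model(x).sum()
              accelerator.backward(loss)  # handles the sharded/distributed backward
              optimizer.step()
              optimizer.zero_grad()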

        > This has been a struggle for data scientists for a while now. I haven't seen a good solution to allow a data scientist to work locally, but utilize GPUs remotely, without basically just developing remotely (through a VM or Jupyter)

        Remotely, as in over the internet? 400Gb Ethernet is already too slow vs PCIe 5.0 x16 (forget SXM). A 10Gb internet connection is another 40x slower (plus latency impacts).

        Remote development over the internet with scuda would be completely, uselessly slow.
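
        Back-of-the-envelope numbers, one direction, ignoring latency and protocol overhead (the PCIe figure is approximate):

          pcie5_x16 = 64e9       # ~64 GB/s per direction
          eth_400g = 400e9 / 8   # 50 GB/s
          inet_10g = 10e9 / 8    # 1.25 GB/s

          print(pcie5_x16 / eth_400g)  # ~1.3x - 400GbE already trails the PCIe slot
          print(eth_400g / inet_10g)   # 40x  - a 10Gb link vs 400GbE
          print(pcie5_x16 / inet_10g)  # ~51x - vs where the GPU actually sits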

      • seattleeng 9 hours ago
        Why is working locally important?
        • theossuary 7 hours ago
          Working locally still matters, and this is from someone who normally works in tmux/nvim. For vision and 3D ML work, being able to quickly open a visualizer window is imperative to understanding what's going on. For Gaussian Splatting, point cloud work, SLAM, etc. you have to have access to a desktop environment to see visualizations; they very rarely work well remotely (even if they have some Jupyter support).

          Working remotely when you have to use a desktop environment is painful, no matter the technology. The best I've come up with is tmux/vim plus Sunshine/Moonlight, but even then I'd rather just have access to everything locally.

  • some1else 17 hours ago
    You might have a problem using CUDA as part of the name, since Nvidia has it trademarked. Maybe you can switch to Scuba if they give you trouble; that sounds like a good name for the tool.
    • n3storm 11 hours ago
      Buda may be a better name.
    • teeray 12 hours ago
      We need to do for CUDA what was done for Jell-o and Kleenex.
  • dschuetz a day ago
    More like "virtual cuda only gpu" over IP.
  • AkashKaStudio 17 hours ago
    Would this let an Nvidia card be accessible on Apple Silicon over TB4 for training in an eGPU caddy? I'd happily relegate my desktop to HTPC/gaming duties.
  • ghxst a day ago
    This looks more like CUDA over IP, or am I missing something?
  • gpuhacker a day ago
    As this mentions some prior art but not rCUDA (https://en.m.wikipedia.org/wiki/RCUDA), I'm a bit confused about what makes scuda different.
    • kevmo314 16 hours ago
      I've updated the README! rCUDA is indeed an inspiration; in fact, it inspired scuda's name too :)
  • saurik 21 hours ago
    Reminds me of this, from a couple months ago.

    https://news.ycombinator.com/item?id=41203475

    • friedtofu 10 hours ago
      I was going to post a reference to the same thing! I tested it, and I'm not sure whether it was just being hugged to death when I used it, but the network performance was incredibly poor.

      As a user I find having something you can self-host really neat, but what I really want is something more like

      https://github.com/city96/ComfyUI_NetDist + OP's project mashed together.

      Say I'm almost able to execute a workflow that would normally require ~16GB of VRAM. I have an Nvidia 3060 12GB running headless with PRIME, executing the workflow via the CLI.

      Right now, I'd probably just have to run the workflow in a Paperspace (or any other cloud compute) container, or borrow the power of a local Apple M1 when using the second repository I mentioned.

      I wish I had something that could lend me extra resources and temporarily act as either the host GPU or a secondary one, depending on the memory needed, only when I need it (if that makes sense).

  • ranger_danger a day ago
    This appears to only support CUDA on nvidia. I'm curious why they didn't just expose /dev/nvidia-uvm as a socket and forward that over the network instead of hooking hundreds of functions (maybe it's not that simple and I just don't know).
    • monocasa a day ago
      You can't mmap a socket, and mmap is core to how /dev/nvidia-uvm works.
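
      A quick way to see the first half of that - a generic Linux/Python illustration, nothing scuda- or nvidia-uvm-specific:

        import mmap
        import socket

        s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        try:
            mmap.mmap(s.fileno(), 4096)
        except OSError as e:
            # sockets have no mmap handler, so the kernel rejects the mapping
            print("cannot mmap a socket:", e)
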
      • afr0ck 20 hours ago
        Well, it's not impossible. It's just software after all. You can mmap a remote device file, but you need OS support to do the magical paging for you, probably some sort of page ownership tracking protocol like in HMM [1], but outside a coherence domain.

        I was once working on CXL [2] and memory ownership tracking in the Linux kernel and wanted to play with Nvidia GPUs, but I hit a wall when I realised that a lot of the functionality runs on the GSP or in the firmware blob, with very little to no documentation. I ended up generally not liking Nvidia's system software stack and gave up on the project. The UVM subsystem in the open kernel driver is a bit of an exception, but a lot of the control path is still handled and controlled from closed-source CUDA libraries in userspace.

        tl;dr: it's very hard to do systems hacking with Nvidia GPUs.

        [1] https://www.kernel.org/doc/html/v5.0/vm/hmm.html
        [2] https://en.wikipedia.org/wiki/Compute_Express_Link

        • monocasa 19 hours ago
          Yeah, the Nvidia stuff isn't really made to be hacked on.

          I'd check out the AMD side, since you can at least have a fully open source GPU stack to play with, and they make a modicum of effort to document their GPUs.

      • majke 21 hours ago
        This is the first time I've heard about /dev/nvidia-uvm. Is there any documentation on how the Nvidia API works? In particular, how strong is the multi-tenancy story? Can two users use one GPU and expect reasonable security?

        Last time I checked, the GPU did offer some kind of memory isolation, but only for their datacenter cards, not consumer cards.

        • monocasa 19 hours ago
          There aren't a lot of docs on how it works. It used to be entirely in the closed-source driver; now it's mainly a thin bridge to the closed-source firmware blob.

          But yes, for more than a decade now, even with consumer cards, separate user processes have had separate hardware-enforced contexts. This is as true for consumer cards as it is for datacenter cards. It's core to how something like WebGL works without exposing everything else being rendered on your desktop to the public Internet. There have been bugs, but per-process hardware isolation with a GPU-local MMU has been table stakes for a modern GPU for nearly twenty years.

          What datacenter GPUs expose in addition to that is multiple virtual GPUs, sort of like SR-IOV, where a single GPU can be exposed to multiple CPU kernels running in virtual machines.

      • gorkish 14 hours ago
        Granted, it requires additional support from your NICs/switches, but it is probably straightforward to remote nvidia-uvm with an RDMA server.
      • XorNot a day ago
        Which seems weird to me: if we're going to have device files, it's super annoying that they actually don't really act like files.

        Like we really should just have enough RDMA support in the kernel to let that work.

        • monocasa 19 hours ago
          At its core, this device file is responsible for managing a GPU-local address space and sharing memory securely with that address space, in order to have a place to write command buffers and data that the GPU can see. It doesn't really make sense without a heavy memory-mapping component.

          A Plan 9-like model where it's basically just a standard file would massively cut into GPU performance.

        • gorkish 14 hours ago
          I agree with you that making RDMA a more accessible commodity technology is very important for "the future of compute". Properly configuring something like RoCEv2 or Infiniband is expensive and difficult. These technologies need to be made more robust in order to be able to run on commodity networks.
  • gchamonlive 16 hours ago
    I have a laptop with a serviceable GPU but only 16GB of RAM, and another with a low-tier GPU but 32GB of RAM. I'm wondering: would it be too slow to use the latter as the control plane and delegate inference to the former using something like ComfyUI to run text-to-image models?
    • friedtofu 32 minutes ago
      I referenced this already, but definitely check out https://github.com/city96/ComfyUI_NetDist?tab=readme-ov-file...

      I guess that depends on what you mean by "too slow". What card is the low-tier GPU? An Nvidia Tesla? I've always been under the assumption that when running two cards in parallel, the faster card will almost always slow down to the speed of the card with the most memory, though the only reference I have is from using Nvidia SLI with two 8800s almost a decade ago.

      I could also be completely and utterly wrong; I'd love to hear some clarification from anyone in or around the field of GPU architecture, though :)

  • Technetium 15 hours ago
    It would be nice to have a description added.
  • rtghrhtr 14 hours ago
    Everyone hates nvidia but treats ATI as an afterthought. Another completely useless tool to throw on the pile.
    • dahart 14 hours ago
      > Everyone hates nvidia but treats ATI as an afterthought.

      Hehe, do you mean AMD?

    • gorkish 14 hours ago
      ATI? afterthought, indeed
  • elintknower 10 hours ago
    Curious if this could be simplified to provide NVENC over IP?
  • kbumsik 19 hours ago
    I have heard that NVSwitch is used for GPU-to-GPU interconnect over a network.

    How is it different?

    • nsteel 18 hours ago
      Isn't this GPU-to-CPU? And really slow. And only CUDA. And over IP. And implemented in software. I think it's really very different.
    • thelastparadise 18 hours ago
      Orders of magnitude slower.
  • meowzor 18 hours ago
    nice