First, we started cloning VMs using userfaultfd, which allows us to bypass the disk and let children read memory directly from parent VMs [1].
We also moved to saving memory snapshots compressed. To keep VM boots fast, we need to decompress on the fly as VMs read from the snapshot, so we chunk snapshots into 4KB-8KB pieces that are zstd-compressed [2].
Happy to answer any questions here!
[1]: https://codesandbox.io/blog/cloning-microvms-using-userfault...
[2]: https://codesandbox.io/blog/how-we-scale-our-microvm-infrast...
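For a rough idea of how chunked on-demand decompression works, here's a sketch (my illustration, not CodeSandbox's actual format; stdlib zlib stands in for zstd, and the 4KB chunk size is an assumption):

```python
import zlib

CHUNK_SIZE = 4096  # assumed chunk size; the comment above says real chunks are 4-8 KB

def compress_snapshot(memory: bytes) -> list[bytes]:
    """Compress a memory snapshot as independent chunks so any single
    chunk can be decompressed on demand without touching the rest."""
    return [
        zlib.compress(memory[off:off + CHUNK_SIZE])
        for off in range(0, len(memory), CHUNK_SIZE)
    ]

def read_page(chunks: list[bytes], offset: int, length: int) -> bytes:
    """Serve a read at `offset` by decompressing only the chunks it spans,
    mimicking on-the-fly decompression as the VM faults in snapshot pages."""
    first = offset // CHUNK_SIZE
    last = (offset + length - 1) // CHUNK_SIZE
    data = b"".join(zlib.decompress(chunks[i]) for i in range(first, last + 1))
    start = offset - first * CHUNK_SIZE
    return data[start:start + length]
```

Because each chunk compresses independently, a single page fault costs one small decompression instead of streaming through the whole snapshot; zstd (optionally with a shared dictionary) would do much better than this zlib stand-in.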
I guess it would/could be nice to have something that moves Kubernetes pods around rather than killing them and starting new ones.
Virtual Machine builder scripts like Packer could do this… but don’t.
It’s a choice, not an inherent technology limitation.
For example, an SSL library might have pre-calculated the random nonce for the next incoming SSL connection.
If you clone the VM containing a process using that library, now both child VMs will use the same nonce. Some crypto is 100% broken open if a nonce is reused.
It's important to refresh entropy immediately after clone. Still, there can be code that didn't assume it could be cloned (even though there's always been `fork`, of course). Because of this, we don't live clone across workspaces for unlisted/private sandboxes and limit the use case to dev envs where no secrets are stored.
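To make the danger concrete, here's a toy demo of why nonce reuse is catastrophic (my example with a fake SHA-256 counter-mode stream cipher, not any real SSL library; real ciphers like AES-GCM or ChaCha20 fail the same way when a (key, nonce) pair repeats):

```python
import hashlib

def keystream(key: bytes, nonce: bytes, n: int) -> bytes:
    """Toy stream cipher keystream (SHA-256 in counter mode), for demo only."""
    out = b""
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:n]

def encrypt(key: bytes, nonce: bytes, plaintext: bytes) -> bytes:
    ks = keystream(key, nonce, len(plaintext))
    return bytes(p ^ k for p, k in zip(plaintext, ks))

# Two cloned VMs inherit the same pre-computed nonce...
key, nonce = b"shared-session-key", b"precomputed-nonce"
c1 = encrypt(key, nonce, b"message from parent VM..")
c2 = encrypt(key, nonce, b"message from child VM...")

# ...so an eavesdropper who XORs the two ciphertexts recovers the XOR
# of the two plaintexts, with no knowledge of the key at all.
leak = bytes(a ^ b for a, b in zip(c1, c2))
```

Same nonce means same keystream, so the key cancels out of the XOR; that's why refreshing entropy immediately after clone matters.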
The proposed workflow involves cloning your dev environment and sharing it with the internet.
At most places, that’s equivalent to publishing your production keys, or at least github credentials.
Even for open source projects where confidentiality doesn’t matter, there are issues like using cargo/npm/etc keys to launch supply chain attacks.
Your nonce attack is harder to pull off, but more devastating if the attacker can man in the middle things like dependency downloads.
I've always wanted to test this out for fun; by now 15 years have gone by and I've never gotten to it...
https://www.vmware.com/products/cloud-infrastructure/vsphere...
Live migration had some very cool demos. They would have an intensive workload running, such as a game, cause a crash, and the VM would resume with zero buffering.
There are caveats on the network side, though, as packets targeting the old address need to be re-routed until all connections target the new machine.
Isn't this still a concern even if you're not pre-calculating way ahead of time? If you generate it when needed, it could still catch you at the wrong time (e.g. right before encryption, but right after nonce generation)
The problem with pre-computing a bunch and keeping them in memory is that brand-new connections made post-cloning would use the same list of nonces.
If an attacker sees valid nonces on a VM, and knows of another VM sharing the same nonces, then your crypto on both* VMs becomes vulnerable to replay attacks.
*read: all
I ask because watching cloud providers like AWS slowly reinvent mainframes just seems like the painful way around.
When AWS was the hot new thing in town, a server was coming in at 12/24 threads.
A modern AMD machine tops out at 700+ threads and 400Gb QSFP interconnects. Go back to 2000 and the dotcom boom, and that's a whole mid-sized company in a 2U rack.
Finding single applications that can leverage all that horsepower is going to be a challenge... and that's before you layer in lift for redundancy.
Strip away all the bloat, all the fine examples of Conway's law that organizations drag around (or inherit from other orgs), and compute is at a place where it's effectively free... with the real limits/costs being power and data (and these are driven by density).
https://en.wikipedia.org/wiki/Kerrighed https://sourceforge.net/projects/kerrighed/
BSD has had jails for a long time, which let you achieve isolation on a system in this manner, or at least close to it.
They are also missing an ergonomic tool like dockerfiles. The following file, plus a cli tool for “run N copies on my M machines” should be enough to run bsd in prod, and it is not:
“FROM openbsd:latest ; CMD pkg -i apache ; echo "apache=enabled" >> /etc/rc.defaults ; COPY public_html /var/www/ ; CMD init”
I don’t think writing the tooling would be that difficult, but it was missing the last time I looked.
If there's any difference now versus the past, it's that pretty much every point on the wheel is readily available today. If you want a more "rudimentary OS", you don't need to wait for the next turning of the wheel; it's here now. Need full VMs? Still a practical technology. Containers enough? Actively in development and use. Mix & match? Any sensible combination can be done now. And so on.
We are shown a person who quits the server, and then the server stops and restarts (that 2-second clone of a VM).
But what if I have a service like, let's say, a normal Minecraft server such as Hypixel? They can't accept a 2-second delay. Maybe we would have to use proxies in that case.
I am genuinely interested by this tech.
Currently, I am much in favour of tinykvm and its snapshotting because it's even lighter than firecracker (I think). I really like the dev behind tinykvm as well.
That is indeed what I would love to read the most! Because no matter what you do, it gets complex. If you tear down the network stack of the "old" VM, applications (like Minecraft) might head into unstable territory when the listener socket disappears, and the "new" VM has to go through the entire DHCP flow, which can easily take a second or more. And if you just do the equivalent of S3 sleep (suspend to RAM), the first "new" VM will have everything working as expected, but any further VM spawned from the template will run into duplicate IP/MAC address usage.
We clone a running VM in 2 seconds - https://news.ycombinator.com/item?id=38651805 - Dec 2023 (10 comments)
Let's say we have 2 Linux machines. Identical hardware, identical libs.
I'd like to run a simple program on one machine, and then during mid-calculation, would like to transfer the running program to the other machine.
Is this doable?
For a very simple program with limited I/O, it's not too hard, especially if you don't mind a significant pause to move. The difficulty comes when you have FDs to migrate and need to reduce the pausing. If the program keeps FDs to the filesystem, or loads/stores to the filesystem periodically, you'd need to do a filesystem migration too... If you need to keep FDs for network sockets, you've got to transfer those somehow.
If it's just stdin/out/err, you could probably do the migration in userspace, with some difficulty if you need to keep the pid constant (but maybe you don't need that either).
Minimal pausing involves letting the program run on the initial machine while you copy memory, setting pages to read-only so you can catch writes, and only pausing the program once the copy is substantially finished. Then you pause execution on the initial machine. Even if there's a significant amount of modified pages left to copy when you pause, you can still start execution on the new machine, as long as the modified pages are marked unavailable; if you background-copy them before they're used, great... if not, you have to block until the modified data comes through.
Probably you do this on two nearby machines with fast networking, and the program doesn't have a lot of writes all over memory, so the pause should be short.
And if you have any shared memory segments, semaphores, or message queues, you have to drag along a bunch of other processes.
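The pre-copy loop described above can be sketched in a few lines (a toy simulation, my code: page size, round limit, and the fake workload are all made up, and the write-protect trap is abstracted into the workload reporting which pages it dirtied):

```python
import random

PAGE_COUNT = 256

def migrate(source: dict, run_workload, max_rounds: int = 5) -> dict:
    """Copy `source` to a destination while the workload keeps mutating it."""
    dest = {}
    dirty = set(source)                 # round 0: every page is "dirty"
    for _ in range(max_rounds):
        if len(dirty) <= 8:             # dirty set small enough: stop and copy
            break
        to_copy, dirty = dirty, set()
        for page in to_copy:            # copy while the workload still runs
            dest[page] = source[page]
        dirty |= run_workload(source)   # pages written during this round
    for page in dirty:                  # brief pause: final stop-and-copy
        dest[page] = source[page]
    return dest

def workload(mem) -> set:
    """Fake guest: rewrites a few random pages and reports them as dirty."""
    touched = {random.randrange(PAGE_COUNT) for _ in range(4)}
    for p in touched:
        mem[p] += 1
    return touched

src = {p: 0 for p in range(PAGE_COUNT)}
dst = migrate(src, workload)
```

Each round only re-copies what was dirtied since the last round, so as long as the write rate is lower than the copy rate, the final pause shrinks to a small remainder.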
1. Halt the process (SIGSTOP comes to mind)
2. Create a copy of the running program and /proc/$pid - which will also include memory and mmap details
3. Transfer everything to the other machine
4. Load the memory, somehow spawn a new process with the info from /proc/$pid we saved, and mmap the loaded memory into it
5. Continue the process on the new machine (SIGCONT)
Let me admit that I do not have the slightest clue how to achieve step 4. I wonder if a systemd namespace could make things easier.
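For what it's worth, steps 1-3 (the dump side) really are this accessible; here's a Linux-only sketch reading a buffer straight out of an address space via /proc (my illustration, done against the current process to avoid needing ptrace permission; for another pid you'd SIGSTOP it first and read /proc/$pid/mem instead, and step 4, the restore, remains the hard part):

```python
import ctypes

# Some in-memory state we'd want to carry to the other machine.
buf = ctypes.create_string_buffer(b"state worth migrating")
addr = ctypes.addressof(buf)

# /proc/self/mem exposes the process's own address space as a file:
# seek to a mapped virtual address and read the bytes living there.
with open("/proc/self/mem", "rb") as mem:
    mem.seek(addr)
    snapshot = mem.read(len(buf.value))

assert snapshot == b"state worth migrating"
```

The mappings themselves come from /proc/$pid/maps, which tells you which address ranges are valid to read and with what permissions.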
I was completely blown away when I first experienced it. My code running in a VM never even noticed any downtime. All the network connections are preserved and so on.
Instead of RAM, the program's state is saved in a DB, and the execution environment resumes in its previous state when restarted
On a bit of a tangent/rant: this kind of writing is slowly going away, taken over by LLM slop (and I'm a huge fan of LLMs, just not of the people who write those kinds of articles). I was recently looking for real-world benchmarks for vllm/sglang deployments of DeepSeek3 on an 8x 96GB pod, to see if the model fits into that amount of RAM with KV cache and context length, what numbers do people get, etc.
Of the ~20 articles that Google surfaced across various attempts at keywords, none were what I was looking for. The excerpts seemed promising, and some even offered tables & stuff related to ds3 and RAM usage, but all were LLM crap. All were written in that simple style (intro, bla bla, conclusion), and some even had RAM requirements that made no sense (running a model trained in FP8 in 16-bit, something no one would do, etc.)
"running a model trained in FP8 in 16bit, something noone would do, etc"
I did that because on the RTX 3090 - which can be a good bang per buck for inference - the FP8 support is nerfed at the driver level. So a kernel that upscales FP8 to FP16 inside SRAM, then does the matmul, then downscales to FP8 again can bring massive performance benefits on those consumer cards.
BTW, you can run a good DeepSeek3 quant on a single H200.
I know AWQ should run, and be pretty snappy & efficient w/ the new MLA added, but wanted to check if fp8 fits as well, because from a simple napkin math it seems pretty tight (might only work for bs1, ctx_len <8k which would probably not be suited for coding tasks).
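The napkin math, for anyone curious (rough numbers; the ~671B parameter count is DeepSeek-V3's published size, everything else is back-of-envelope and ignores activation memory and framework overhead):

```python
# FP8 DeepSeek-V3 on an 8x 96GB pod, back-of-envelope.
params_b = 671              # ~671B parameters
weights_gb = params_b * 1   # FP8: ~1 byte per parameter -> ~671 GB

pod_gb = 8 * 96             # 768 GB total across the pod
headroom_gb = pod_gb - weights_gb  # ~97 GB left for KV cache + activations
```

~97 GB of headroom split across 8 GPUs is why it looks tight: enough for small batches and short contexts, but probably not for the long contexts coding workloads want.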
By who, exactly? Citations needed
VMs have a full OS that needs to be maintained (patched, upgraded when EOL, etc.).
Hypervisors traditionally cost a metric crapton of money per core. Yes, Proxmox is pretty good, but it's the exception, not the norm. They're also relatively slow in spinning up new VMs (kind of by definition, it takes a lot of time to emulate a full blown replica of hardware vs just starting a process in a cgroup/jail).
And most of all, VMs are just solving the wrong problem. You don't care about emulating hardware, you care about running some workload. Maybe it needs specific hardware or a virtual version of it, but more likely than not, it's a regular batch processor or API that can happily run in a container with almost none of the overhead of a full VM.
Modern VMs don't emulate hardware. When a VM has a hard drive or a network device, there's no sophisticated code to trick the VM into believing this is real hardware. Virtio drivers are about the VM writing data to a memory area and assuming it's written to disk / sent to the network (because in the background the hypervisor reads the same memory area and does the job).
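Loosely, the virtio idea looks like this (a toy sketch, my code: real virtqueues have descriptor tables, avail/used rings, and interrupt suppression, all collapsed here into one shared queue):

```python
from collections import deque

shared_ring = deque()   # stands in for a virtqueue in shared memory
backend_disk = []       # what the hypervisor actually writes to

def guest_write(data: bytes):
    """Guest side: 'writing to disk' is just placing a buffer in the ring;
    no fake hardware registers are poked."""
    shared_ring.append(data)

def hypervisor_poll():
    """Host side: drain the same memory area and perform the real I/O."""
    while shared_ring:
        backend_disk.append(shared_ring.popleft())

guest_write(b"block 0")
guest_write(b"block 1")
hypervisor_poll()
```

Both sides just agree on a memory layout; that's why virtio devices are so much cheaper than emulating a real NIC or disk controller.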
They provide pretend hardware which isn't really necessary.
by whom?
I tend to loathe firecracker posts because they're all just thinly veiled ads for Amazon services.
Firecracker is not included in the standard Linux KVM/QEMU duo and has sparse documentation. You cannot deploy a firecracker image like a traditional VM; in fact, there are no tools to assist in creating a firecracker VM, and the filesystem for the VM must be ext4.
TL;DR: this is all fun stuff if you're 200% cloud, but most companies run a ton of on-prem VMs as well.