First, we started cloning VMs using userfaultfd, which allows us to bypass the disk and let children read memory directly from parent VMs [1].
We also moved to saving memory snapshots compressed. To keep VM boots fast, we need to decompress on the fly as VMs read from the snapshot, so we chunk snapshots into 4KB-8KB pieces that are zstd-compressed [2].
Happy to answer any questions here!
[1]: https://codesandbox.io/blog/cloning-microvms-using-userfault...
[2]: https://codesandbox.io/blog/how-we-scale-our-microvm-infrast...
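For a rough idea of how chunked on-demand decompression works, here's a sketch (my illustration, not CodeSandbox's actual format; stdlib zlib stands in for zstd, and the 4KB chunk size is an assumption):

```python
import zlib

CHUNK_SIZE = 4096  # assumed chunk size; the comment above says real chunks are 4-8 KB

def compress_snapshot(memory: bytes) -> list[bytes]:
    """Compress a memory snapshot as independent chunks so any single
    chunk can be decompressed on demand without touching the rest."""
    return [
        zlib.compress(memory[off:off + CHUNK_SIZE])
        for off in range(0, len(memory), CHUNK_SIZE)
    ]

def read_page(chunks: list[bytes], offset: int, length: int) -> bytes:
    """Serve a read at `offset` by decompressing only the chunks it spans,
    mimicking on-the-fly decompression as the VM faults in snapshot pages."""
    first = offset // CHUNK_SIZE
    last = (offset + length - 1) // CHUNK_SIZE
    data = b"".join(zlib.decompress(chunks[i]) for i in range(first, last + 1))
    start = offset - first * CHUNK_SIZE
    return data[start:start + length]
```

Because each chunk compresses independently, a single page fault costs one small decompression instead of streaming through the whole snapshot; zstd (optionally with a shared dictionary) would do much better than this zlib stand-in.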
I guess it would/could be nice to have something that moves Kubernetes pods around rather than killing them and starting new ones.
Virtual Machine builder scripts like Packer could do this… but don’t.
It’s a choice, not an inherent technology limitation.
For example, an SSL library might have pre-calculated the random nonce for the next incoming SSL connection.
If you clone the VM containing a process using that library, now both child VMs will use the same nonce. Some crypto is 100% broken open if a nonce is reused.
It's important to refresh entropy immediately after clone. Still, there can be code that didn't assume it could be cloned (even though there's always been `fork`, of course). Because of this, we don't live clone across workspaces for unlisted/private sandboxes and limit the use case to dev envs where no secrets are stored.
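To make the danger concrete, here's a toy demo of why nonce reuse is catastrophic (my example with a fake SHA-256 counter-mode stream cipher, not any real SSL library; real ciphers like AES-GCM or ChaCha20 fail the same way when a (key, nonce) pair repeats):

```python
import hashlib

def keystream(key: bytes, nonce: bytes, n: int) -> bytes:
    """Toy stream cipher keystream (SHA-256 in counter mode), for demo only."""
    out = b""
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:n]

def encrypt(key: bytes, nonce: bytes, plaintext: bytes) -> bytes:
    ks = keystream(key, nonce, len(plaintext))
    return bytes(p ^ k for p, k in zip(plaintext, ks))

# Two cloned VMs inherit the same pre-computed nonce...
key, nonce = b"shared-session-key", b"precomputed-nonce"
c1 = encrypt(key, nonce, b"message from parent VM..")
c2 = encrypt(key, nonce, b"message from child VM...")

# ...so an eavesdropper who XORs the two ciphertexts recovers the XOR
# of the two plaintexts, with no knowledge of the key at all.
leak = bytes(a ^ b for a, b in zip(c1, c2))
```

Same nonce means same keystream, so the key cancels out of the XOR; that's why refreshing entropy immediately after clone matters.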
The proposed workflow involves cloning your dev environment and sharing it with the internet.
At most places, that’s equivalent to publishing your production keys, or at least github credentials.
Even for open source projects where confidentiality doesn’t matter, there are issues like using cargo/npm/etc keys to launch supply chain attacks.
Your nonce attack is harder to pull off, but more devastating if the attacker can man in the middle things like dependency downloads.
I've always wanted to test this out for fun; by now 15 years have gone by and I've never gotten to it...
https://www.vmware.com/products/cloud-infrastructure/vsphere...
Live migration had some very cool demos. They would have an intensive workload running, such as a game, cause a crash, and the VM would resume with zero buffering.
There are caveats on the network side, though, as packets targeting the old address need to be re-routed until all connections target the new machine.
Isn't this still a concern even if you're not pre-calculating way ahead of time? If you generate it when needed, it could still catch you at the wrong time (e.g. right before encryption, but right after nonce generation)
The problem with pre-computing a bunch and keeping them in memory is that brand-new connections made post-cloning would use the same list of nonces.
If an attacker sees valid nonces on a VM, and knows of another VM sharing the same nonces, then your crypto on both* VMs becomes vulnerable to replay attacks.
*read: all
I ask because watching cloud providers like AWS slowly reinvent mainframes just seems like the painful way around.
When AWS was the hot new thing in town, a server was coming in at 12/24 threads.
A modern AMD machine tops out at 700+ threads and 400Gb QSFP interconnects. Go back to 2000 and the dotcom boom, and that's a whole mid-sized company in a 2U rack.
Finding single applications that can leverage all that horsepower is going to be a challenge... and that's before you layer in lift for redundancy.
Strip away all the bloat, all the fine examples of Conway's law that organizations drag around (or inherit from other orgs), and compute is at a place where it's effectively free... with the real limits/costs being power and data (and these are driven by density).
https://en.wikipedia.org/wiki/Kerrighed https://sourceforge.net/projects/kerrighed/
BSD has had jails for a long time, which let you achieve isolation on a system in this manner, or at least close to it.
They are also missing an ergonomic tool like dockerfiles. The following file, plus a cli tool for “run N copies on my M machines” should be enough to run bsd in prod, and it is not:
“FROM openbsd:latest ; CMD pkg -i apache ; echo "apache=enabled" >> /etc/rc.defaults ; COPY public_html /var/www/ ; CMD init”
I don’t think writing the tooling would be that difficult, but it was missing the last time I looked.
If there's any difference now versus the past, it's that pretty much every point on the wheel is readily available today. If you want a more "rudimentary OS", you don't need to wait for the next turning of the wheel; it's here now. Need full VMs? Still a practical technology. Containers enough? Actively in development and use. Mix & match? Any sensible combination can be done now. And so on.
We are shown a person who quits the server, and then the server stops and restarts (that 2-second clone of a VM).
But what if I have a service like, let's say, a normal Minecraft server such as Hypixel? They can't accept a 2-second delay. Maybe we would have to use proxies in that case.
I am genuinely interested by this tech.
Currently, I am much in favour of tinykvm and its snapshotting because it's even lighter than firecracker (I think). I really like the dev behind tinykvm as well.
That is indeed what I would love to read the most! Because no matter what you do, it gets complex. If you tear down the network stack of the "old" VM, applications (like Minecraft) might head into unstable territory when the listener socket disappears, and the "new" VM has to go through the entire DHCP flow, which can easily take a second or more. And if you just do the equivalent of S3 sleep (suspend to RAM), the first "new" VM will have everything working as expected, but any further VM spawned from the template will run into duplicate IP/MAC address usage.
We clone a running VM in 2 seconds - https://news.ycombinator.com/item?id=38651805 - Dec 2023 (10 comments)
Let's say we have 2 Linux machines. Identical hardware, identical libs.
I'd like to run a simple program on one machine, and then during mid-calculation, would like to transfer the running program to the other machine.
Is this doable?
For a very simple program with limited I/O, it's not too hard, especially if you don't mind a significant pause to move. The difficulty comes when you have FDs to migrate and need to reduce the pausing. If the program keeps FDs to the filesystem, or loads/stores to the filesystem periodically, you'd need to do a filesystem migration too... If you need to keep FDs for network sockets, you've got to transfer those somehow.
If it's just stdin/out/err, you could probably do the migration in userspace, with some difficulty if you need to keep the pid constant (but maybe you don't need that either).
Minimal pausing involves letting the program run on the initial machine while you copy memory, setting pages to read-only so you can catch writes, and only pausing the program once the copy is substantially finished. Then you pause execution on the initial machine. Even if there's a significant amount of modified pages left to copy when you pause, you can still start execution on the new machine, as long as the modified pages are marked unavailable; if you background-copy them before they're used, great... if not, you have to block until the modified data comes through.
Probably you do this on two nearby machines with fast networking, and the program doesn't have a lot of writes all over memory, so the pause should be short.
And if you have any shared memory segments, semaphores, or message queues, you have to drag along a bunch of other processes.
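The pre-copy loop described above can be sketched in a few lines (a toy simulation, my code: page size, round limit, and the fake workload are all made up, and the write-protect trap is abstracted into the workload reporting which pages it dirtied):

```python
import random

PAGE_COUNT = 256

def migrate(source: dict, run_workload, max_rounds: int = 5) -> dict:
    """Copy `source` to a destination while the workload keeps mutating it."""
    dest = {}
    dirty = set(source)                 # round 0: every page is "dirty"
    for _ in range(max_rounds):
        if len(dirty) <= 8:             # dirty set small enough: stop and copy
            break
        to_copy, dirty = dirty, set()
        for page in to_copy:            # copy while the workload still runs
            dest[page] = source[page]
        dirty |= run_workload(source)   # pages written during this round
    for page in dirty:                  # brief pause: final stop-and-copy
        dest[page] = source[page]
    return dest

def workload(mem) -> set:
    """Fake guest: rewrites a few random pages and reports them as dirty."""
    touched = {random.randrange(PAGE_COUNT) for _ in range(4)}
    for p in touched:
        mem[p] += 1
    return touched

src = {p: 0 for p in range(PAGE_COUNT)}
dst = migrate(src, workload)
```

Each round only re-copies what was dirtied since the last round, so as long as the write rate is lower than the copy rate, the final pause shrinks to a small remainder.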
1. Halt the process (SIGSTOP comes to mind)
2. Create a copy of the running program and /proc/$pid - which will also include memory and mmap details
3. Transfer everything to the other machine
4. Load the memory, somehow spawn a new process with the info from /proc/$pid we saved, and mmap the loaded memory into it
5. Continue the process on the new machine (SIGCONT)
Let me admit that I do not have the slightest clue how to achieve step 4. I wonder if a systemd namespace could make things easier.
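For what it's worth, steps 1-3 (the dump side) really are this accessible; here's a Linux-only sketch reading a buffer straight out of an address space via /proc (my illustration, done against the current process to avoid needing ptrace permission; for another pid you'd SIGSTOP it first and read /proc/$pid/mem instead, and step 4, the restore, remains the hard part):

```python
import ctypes

# Some in-memory state we'd want to carry to the other machine.
buf = ctypes.create_string_buffer(b"state worth migrating")
addr = ctypes.addressof(buf)

# /proc/self/mem exposes the process's own address space as a file:
# seek to a mapped virtual address and read the bytes living there.
with open("/proc/self/mem", "rb") as mem:
    mem.seek(addr)
    snapshot = mem.read(len(buf.value))

assert snapshot == b"state worth migrating"
```

The mappings themselves come from /proc/$pid/maps, which tells you which address ranges are valid to read and with what permissions.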
I was completely blown away when I first experienced it. My code running in a VM never even noticed any downtime. All the network connections are preserved and so on.
Instead of RAM, the program's state is saved in a DB, and the execution environment resumes in its previous state when restarted
On a bit of a tangent/rant: this kind of writing is slowly going away, taken over by LLM slop (and I'm a huge fan of LLMs, just not of the people who write those kinds of articles). I was recently looking for real-world benchmarks for vllm/sglang deployments of DeepSeek3 on an 8x 96GB pod, to see if the model fits into that amount of RAM with KV cache and context length, what numbers do people get, etc.
Of the ~20 articles that Google surfaced across various attempts at keywords, none were what I was looking for. The excerpts seemed promising, and some even offered tables & stuff related to ds3 and RAM usage, but all were LLM crap. All were written in that simple style (intro, bla bla, conclusion), and some even had RAM requirements that made no sense (running a model trained in FP8 in 16-bit, something no one would do, etc.)
"running a model trained in FP8 in 16bit, something noone would do, etc"
I did that because on the RTX 3090 - which can be a good bang per buck for inference - the FP8 support is nerfed at the driver level. So a kernel that upscales FP8 to FP16 inside SRAM, then does the matmul, then downscales to FP8 again can bring massive performance benefits on those consumer cards.
BTW, you can run a good DeepSeek3 quant on a single H200.
I know AWQ should run, and be pretty snappy & efficient w/ the new MLA added, but wanted to check if fp8 fits as well, because from a simple napkin math it seems pretty tight (might only work for bs1, ctx_len <8k which would probably not be suited for coding tasks).
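The napkin math, for anyone curious (rough numbers; the ~671B parameter count is DeepSeek-V3's published size, everything else is back-of-envelope and ignores activation memory and framework overhead):

```python
# FP8 DeepSeek-V3 on an 8x 96GB pod, back-of-envelope.
params_b = 671              # ~671B parameters
weights_gb = params_b * 1   # FP8: ~1 byte per parameter -> ~671 GB

pod_gb = 8 * 96             # 768 GB total across the pod
headroom_gb = pod_gb - weights_gb  # ~97 GB left for KV cache + activations
```

~97 GB of headroom split across 8 GPUs is why it looks tight: enough for small batches and short contexts, but probably not for the long contexts coding workloads want.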
By who, exactly? Citations needed
VMs have a full OS that needs to be maintained (patched, upgraded when EOL, etc.).
Hypervisors traditionally cost a metric crapton of money per core. Yes, Proxmox is pretty good, but it's the exception, not the norm. They're also relatively slow in spinning up new VMs (kind of by definition, it takes a lot of time to emulate a full blown replica of hardware vs just starting a process in a cgroup/jail).
And most of all, VMs are just solving the wrong problem. You don't care about emulating hardware, you care about running some workload. Maybe it needs specific hardware or a virtual version of it, but more likely than not, it's a regular batch processor or API that can happily run in a container with almost none of the overhead of a full VM.
Modern VMs don't emulate hardware. When a VM has a hard drive or a network device, there's no sophisticated code to trick the VM into believing this is real hardware. Virtio drivers are about the VM writing data to a memory area and assuming it's written to disk / sent to the network (because in the background the hypervisor reads the same memory area and does the job).
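Loosely, the virtio idea looks like this (a toy sketch, my code: real virtqueues have descriptor tables, avail/used rings, and interrupt suppression, all collapsed here into one shared queue):

```python
from collections import deque

shared_ring = deque()   # stands in for a virtqueue in shared memory
backend_disk = []       # what the hypervisor actually writes to

def guest_write(data: bytes):
    """Guest side: 'writing to disk' is just placing a buffer in the ring;
    no fake hardware registers are poked."""
    shared_ring.append(data)

def hypervisor_poll():
    """Host side: drain the same memory area and perform the real I/O."""
    while shared_ring:
        backend_disk.append(shared_ring.popleft())

guest_write(b"block 0")
guest_write(b"block 1")
hypervisor_poll()
```

Both sides just agree on a memory layout; that's why virtio devices are so much cheaper than emulating a real NIC or disk controller.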
They provide pretend hardware which isn't really necessary.
by whom?
I tend to loathe firecracker posts because they're all just thinly veiled ads for Amazon services.
Firecracker is not included in the standard Linux KVM/QEMU duo and has sparse documentation. You cannot deploy a firecracker image like a traditional VM; in fact, there are no tools to assist in creating a firecracker VM, and the filesystem for the VM must be ext4.
TL;DR: this is all fun stuff if you're 200% cloud, but most companies run a ton of on-prem VMs as well.