For anybody confused: "Vortex" is the underlying data format, not the database/whatever this website (by the creators of Vortex) is pushing.
No surprise there's nothing to look at, since it's basically a press release posted on their blog.
I'm excited to start doing some experimentation with Vortex to see how it can improve our products.
Great stuff, congrats to Will and team!
Application error: a client-side exception has occurred while loading vortex.dev (see the browser console for more information).
Console: unable to create webgl context
You may be interested in https://github.com/vortex-data/vortex which of course has an overview and links to their docs and benchmark pages.
+ No hardware acceleration enabled.
+ Multiple graphics cards, and browser can't decide which to use.
+ Race conditions that occasionally mount a 3D canvas onto a 2D context (this often happens with Unity).
I would think that a GPU isn't just sitting there waiting on a process that's in turn waiting for one query to finish before starting the next, but that a bunch of parallel queries and scans would be running, fed from many DB and object-store servers, keeping the GPUs as utilized as possible. Given how expensive GPUs are, it seems like a good trade to buy more servers to keep them fed, even if you also want to make the servers and DB/object-store reads faster.
First is storage: network-attached storage is usually the bottleneck for uncached data. Then there is the CPU work of decoding the data. Spiral claims that their table format is ready to load by the GPU, so they can bypass various CPU-bound decoding stages. Once you eliminate the storage and CPU bottlenecks, the remaining bottleneck is usually the PCI bus that sits between host memory and the GPU, and they can't solve that themselves. (And no amount of parallelization helps when the bus is saturated.) What they can do is use the network, the host bus, and the GPU more efficiently by compressing and packing data with greater mechanical sympathy.
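You can't widen the PCIe bus, but you can at least keep it busy. A minimal sketch of what that overlap looks like, assuming PyTorch and a CUDA GPU (the tensor sizes and names here are made up):

```python
import torch

# Pinned (page-locked) host memory enables truly asynchronous
# host-to-device copies over PCIe.
batches = [torch.randn(1 << 20).pin_memory() for _ in range(8)]

copy_stream = torch.cuda.Stream()
device_batches = []

for cpu_batch in batches:
    with torch.cuda.stream(copy_stream):
        # non_blocking=True lets the copy overlap with compute that is
        # still running on the default stream.
        device_batches.append(cpu_batch.to("cuda", non_blocking=True))

# Make sure all copies have landed before touching the data.
torch.cuda.current_stream().wait_stream(copy_stream)
print(sum(b.sum() for b in device_batches))
```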
They've left unanswered how they're going to commercialize it, but my guess is that they're going to use a proprietary fork of Vortex that provides extra performance or features, or perhaps they'll offer commercial services or integrations that make it easier to use. The open-source release gives its customers a Reason to Believe, in marketing parlance.
Seems that they are targeting a low-to-no-overhead path from S3 bucket to GPU: comparable compression with faster random access, streamed decoding of data in flight from S3, and zero-copy transfer to the GPU.
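A rough sketch of the ranged-read half of that, assuming boto3 (the bucket, key, and chunk size are made up; actual zero-copy to the GPU would need something like GPUDirect Storage on top):

```python
import boto3

s3 = boto3.client("s3")
CHUNK = 8 * 1024 * 1024  # 8 MiB ranged GETs

def stream_object(bucket: str, key: str):
    size = s3.head_object(Bucket=bucket, Key=key)["ContentLength"]
    for start in range(0, size, CHUNK):
        end = min(start + CHUNK, size) - 1
        resp = s3.get_object(Bucket=bucket, Key=key,
                             Range=f"bytes={start}-{end}")
        # Each chunk can be decoded while the next GET is in flight.
        yield resp["Body"].read()

for chunk in stream_object("my-bucket", "data/part-0.vortex"):
    ...  # hand off to the decode/upload pipeline
```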
Not 100% clear on the details, but I doubt they can actually saturate the CPU/GPU bus; more likely they just saturate GPU utilization, which itself depends on multiple possible bottlenecks but generally not on bus bandwidth.
That's not criticism: it literally means you can't do better unless you improve the GPU utilization of your AI model.
When I read "possible extension through embedded wasm encoders" I can already imagine the C++ linker hell required to get this thing included in my project.
I also don't think a lot of people need "ai scale".
If only any tools had supported that.
If you want a modern Parquet, then you want the Lance format (or LanceDB for DB-like CRUD semantics).
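For reference, a minimal (untested) sketch of the Lance Python API as I understand it, via pip install pylance; the path and schema here are made up:

```python
import lance
import pyarrow as pa

# Write an Arrow table out as a Lance dataset...
table = pa.table({"id": [1, 2, 3], "text": ["a", "b", "c"]})
lance.write_dataset(table, "demo.lance")

# ...then do random access by row index, which Parquet is bad at.
ds = lance.dataset("demo.lance")
print(ds.take([0, 2]).to_pydict())
```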
EDIT> Maybe it's like how some people call the 4th dimension time when in fact there is a 4th spatial dimension. So if this is the 3rd data dimension, what is the 4th one?
Who knows, maybe a Web 3.1 will deliver us from enshittification.
So it's "optimized for machines to consume" meaning the GPU.
Their use case was training ML models, where you need to feed massive datasets to the GPU.
They seem to claim that training is now bottlenecked by how quickly you can feed the GPU, and that otherwise the GPU is basically "waiting on IO" most of the time rather than actually computing, because the time goes into grabbing the next piece of data, transforming it for GPU consumption, and then feeding it into the GPU.
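In PyTorch terms, it's the difference between a naive read-decode-upload loop and a loader with enough parallelism to stay ahead of the GPU. A rough sketch (the dataset here is a stand-in for real decoding work):

```python
import torch
from torch.utils.data import DataLoader, Dataset

class ToyDataset(Dataset):
    """Stand-in for a real dataset that decodes/transforms samples on the CPU."""
    def __len__(self):
        return 10_000
    def __getitem__(self, idx):
        return torch.randn(1024)  # pretend this was decoded from storage

loader = DataLoader(
    ToyDataset(),
    batch_size=256,
    num_workers=8,      # parallel CPU workers overlap decoding with GPU compute
    pin_memory=True,    # page-locked buffers enable async host-to-device copies
    prefetch_factor=4,  # keep batches queued ahead of the GPU
)

for batch in loader:
    batch = batch.to("cuda", non_blocking=True)
    # ...forward/backward pass would go here
```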
But I'm not an expert, this is just my take from the article.
... i'm gonna make revolutionary claims and grandiose statements like "built for the ai era".
"No compromises", but isn't "externalising" a large video the equivalent of storing a pointer, as in the first example? I can't really see any other way to understand what that means (it goes to an external system and you store where it is).
No comments.
You'll notice that a large portion of the paper analyzes Vortex, both standalone and embedded. Definitely worth a read.
> P.S. If you're still managing data in spreadsheets, this post isn't for you. Yet.
---
Since I discovered the ECS (entity-component-system) pattern, I've been curious about backing it with a database. One of the big issues seems to be IO on the database side. I wonder if Spiral might solve that.
Then you could save every single state change and scroll back and forth. But I'm not sure if that's what you were looking for.
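A toy illustration of the idea, purely hypothetical: an append-only log of component changes that you can replay up to any tick.

```python
from collections import defaultdict

log = []  # (tick, entity, component, value), appended in tick order

def set_component(tick, entity, component, value):
    log.append((tick, entity, component, value))

def world_at(tick):
    """Rebuild the world state by replaying the log up to `tick`."""
    world = defaultdict(dict)
    for t, entity, component, value in log:
        if t > tick:  # relies on the log being sorted by tick
            break
        world[entity][component] = value
    return world

set_component(0, "player", "pos", (0, 0))
set_component(1, "player", "pos", (1, 0))
print(world_at(0)["player"]["pos"])  # (0, 0) -- scrolled back in time
```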
Postgres (and MongoDB) are the king and prince of data due to their transactional capabilities.
Basically I'm not sure where the product is hiding under all of this bluster, but this doesn't feel very "hacker"-y.
The landing pages of both Spiral and Vortex are GPU-hogging animations devoid of any technical information. Empty nothing-statements like "machine scale". They claim 100x improvements but don't link to any metrics.
Maybe this is a "don't hate the player, hate the game" situation, but somehow the collective of like-minded AI engineers decided to upvote this post to #1 on HN.
Of course I don't know what benchmarks or performance metrics they might have for the db layer, but it is something.
If this is true I'm inclined to believe their claims.
And if this module provides a benefit I'm sure it will find its way into our stack, just like PostgreSQL did. And PostgreSQL never had $22M to begin with - no shiny marketing, just technological skills.
The whole "donated by spiral" on the vortex.dev website also gives big tax write-off vibes.
IMO the best case is that this turns into a MongoDB scenario, but with the current track record of tech grifters enshittifying everything, they might find a creative new way.
I've never heard of this sort of OSS work being used as a tax write-off. Could someone please either clarify, or enlighten me?
I have no idea who exactly is behind this, but it definitely does not seem like a no-name open-source genius; I assume it is some lucky AI grifter. They have two nicely designed, expensive marketing websites. They have all the legal documents for the parent LLC in Delaware.
The Delaware corp "donates" the multi-million-dollar tech to the Linux Foundation and uses it as a tax write-off to offset gains from some other lucky AI-grifter play the person made.
Just the chutzpah of comparing yourself to something like PostgreSQL is what gets me. Why can't they just be rich and leave the people doing actual work for the common good alone? No, they must make big blog posts claiming they are the next big thing after PostgreSQL.
So many red flags...
Basically in the US you need a legally recognized entity to hold intellectual property. "Donating" the project involves setting up a "Series LLC" that is nested underneath the top-level Linux Foundation corporation, and donating the IP into it.
Check out https://docs.linuxfoundation.org/lfx/project-control-center/... and Ctrl-F "LF Projects, LLC"
But I think my argument still stands. The Linux Foundation is a 501(c)(6) nonprofit; see https://www.linuxfoundation.org/legal/bylaws
So you might still be able to do an "intellectual property transfer" to them and use it as a tax write-off. "LF Projects LLC" is then the new owner, while the operating company only keeps the ongoing hosting contracts for the websites.
Edit: Not sure if a donation to a 501(c)(6) can be used as a write-off without some other legal loophole. A quick AI search told me that only donations to a 501(c)(3) qualify for the tax write-off.
I'm sure there are some good tax lawyers behind this; who am I, a mere mortal, to understand it. I am just jealous.
The motivation is to move the IP and trademark into a separate organization so it's no longer owned by Spiral. This means we can't re-license it later, we'd have to fork it, because the Vortex trademark and all that is controlled by LF.
"Donated" is the Linux Foundation's terminology.
Sadly, the last time I filed a tax return there was no way to itemize a GitHub repo. Alas.
The gist seems to be that they can overcome network latency issues when dealing with huge numbers of smallish objects in S3-like storage systems that need to be fed into GPUs? Yeah, those formats and systems were not designed to feed that type of processor. You’re doing it wrong if this is your problem.
After a lot of nonsense, it sounds like they just reformat the data into something more efficient instead. But they forget about the network latency and blame CPUs for slowing things down? And what was that sidetrack about S3 permissions?
I wouldn’t jump right onto this… well, it’s not clear what this even is exactly. But you can probably wait it out.
How is this significant? Surely either the network or the GPU calculations are the bottleneck here?