That obviously has a lot of interesting use cases, but my first assumption was that this could be used to quickly/easily search your filesystem with some prompt like "Play the video from last month where we went camping and saw a flock of turkeys". But that would require having an actual vector DB running on your system which you could use to quickly look up files using an embedding of your query, no?
I guess what I'm asking is: how does VectorVFS enable search besides iterating through all files and iteratively comparing file embeddings with the embedding of a search query? The project description says "efficient and semantically searchable" and "eliminating the need for external index files or services" but I can't think of any more efficient way to do a search without literally walking the entire filesystem tree to look for the file with the most similar vector.
Edit: reading the docs [1] confirmed this. The `vfs search TERM DIRECTORY` command:
> will automatically iterate over all files in the folder, look for supported files and then embed the file or load existing embeddings directly from the filesystem."
[1]: https://vectorvfs.readthedocs.io/en/latest/usage.html#vfs-se...
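For the curious, here is roughly what that linear scan looks like; a minimal sketch only, not VectorVFS's actual code, and the xattr key name and query vector are made up:

    import os
    import numpy as np

    XATTR_KEY = "user.vectorvfs.embedding"   # hypothetical attribute name

    def brute_force_search(root, query_vec, top_k=5):
        """Walk every file under root and rank by cosine similarity. O(N) in files."""
        scored = []
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    raw = os.getxattr(path, XATTR_KEY)   # embedding stored on the inode
                except OSError:
                    continue                             # unsupported file or no embedding yet
                vec = np.frombuffer(raw, dtype=np.float32)
                sim = float(vec @ query_vec) / (np.linalg.norm(vec) * np.linalg.norm(query_vec) + 1e-9)
                scored.append((sim, path))
        return sorted(scored, reverse=True)[:top_k]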
xattrs had better be forgotten already. It was just as dumb an idea as macOS resource forks.
For example, POSIX tar files have a defined file format that starts with a header struct: https://www.gnu.org/software/tar/manual/html_node/Standard.h...
You can see that at byte offset 257 is `char magic[6]`, which contains `TMAGIC`, which is the byte string "ustar\0". Thus, if a file has the bytes 'ustar\0' at offset 257 we can reasonably assume that it's a tar file. Almost every defined file type has some kind of string of 'magic' predefined bytes at a predefined location that lets a program know "yes, this is in fact a JPEG file" rather than just asserting "it says .jpg so let's try to interpret this bytestring and see what happens".
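Here's a minimal sketch of that check, assuming nothing fancier than the POSIX header layout:

    def looks_like_tar(path):
        """Return True if the file carries the 'ustar' magic at byte offset 257."""
        with open(path, "rb") as f:
            f.seek(257)
            magic = f.read(6)
        return magic.startswith(b"ustar")   # TMAGIC is "ustar\0"; old GNU tar pads with a space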
As for how it's similar: I don't think it actually is, I think that's a misunderstanding. The metadata that this vector FS is storing is more than "this is a JPEG" or "this is a Word document", as I understand it, so comparing it to magic(5) is extremely reductionist. I could be mistaken, however.
It's just a tool that can read "magic bytes" to figure out what a file contains. Very different from what VectorVFS is.
how come?
(besides good luck not forgetting to rsync those xattrs)
I want to point out that this isn't suitable for any of the actual things you'd use a vector database for. There's no notion of a search index. It's always an O(N) linear search through all of your files: https://github.com/perone/vectorvfs/blob/main/vectorvfs/cli....
Still, fun idea :)
If you gotta gather the data from a lot of different inodes, it is a different story.
Ain't no such thing as zero-overhead indexing. Just because you can't articulate where the overhead is doesn't make it disappear.
The reason I gave (which was accepted) was that the process of creating a proof of concept and iterating on it rapidly is vastly easier in Python (for me) than it is in Go. In essence, it would have taken me at least a week, possibly more, to write the program I ended up with in Golang, but it only took me a day to write it in Python, and, now that I understand the problem and have a working (production-ready) prototype, it would probably only take me another day to rewrite it in Golang.
Also, a large chunk of the functionality in this Python script seems to be libraries - pillow for image processing, but also pytorch and related vision/audio/codec libraries. Even if similar production-ready Rust crates are available (I'm not sure if they are), this kind of thing is something Python excels at and which these modules are already optimized for. Most of the "work" happening here isn't happening in Python, by and large.
1. https://weaviate.io/developers/weaviate/installation/embedde... 2. https://weaviate.io/developers/academy/py/vector_index/flat
For example, if we simply had the ability to have "ordered" files inside folders, that would instantly make it practical for a folder structure to represent "Documents". After all, documents are nothing but a list of paragraphs and images, so if we simply had ordering in file systems we could have document editors which use individual files for each paragraph of text or image. It would be amazing.
Also think about use cases like Jupyter Notebooks. We could stop using the JSON file format and just make it a folder structure instead, with each cell (node) being a file. All social media messages and chatbot conversations could be easily saved as folder structures.
I've heard many file copy tools ignore xattrs, so I've never tried to use them for this purpose. Maybe we've had the capability all along and nobody thought to use it in a big way that became popular yet. Maybe I should consider xattrs and take them seriously.
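For what it's worth, xattrs are easy to poke at from Python on Linux (the file name here is just an example); whether they survive a copy depends on the tool - I believe GNU cp needs --preserve=xattr and rsync needs -X:

    import os

    path = "photo.jpg"   # any existing file on an xattr-capable filesystem
    os.setxattr(path, "user.note", b"camping trip, flock of turkeys")
    print(os.getxattr(path, "user.note"))   # b'camping trip, flock of turkeys'
    print(os.listxattr(path))               # ['user.note', ...]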
Re: ordered files: depends on FS. e.g. filesystems which use B+ trees will tend to have files (in directories) in lexical order. So in some cases you may not need a new FS:
echo 'for f in *.txt; do cat "$f"; done' > doc.sh; chmod +x doc.sh
=> `doc.sh` in the dir produces a 'document' (add newlines/breaks as needed, or pipe through a Markdown processor); symlink to some standardized filename like 'Process', etc. That said... wouldn't it be nice to have ridiculously easy pluggable features like
echo "finish this poem: roses are red," > /auto-llm/poem.txt; cat ..
:)

[1]: chaotic notes: https://kfs.mkj.lt/#welcome (see bullet point list below)
Does VectorVFS do retrieval, or store embeddings in EXT4?
Is retrieval logic obscured by VectorVFS?
If VectorVFS did retrieval with non-opaque embeddings, how would one debug why a file surfaced?
Storing embeddings with the file is an interesting concept... we already do it for some file formats (e.g. EXIF), whereas this one is generalized... yet you would need some actual database to load this data into to process it at scale.
Another issue I see is support for different models and embedding formats to make this data really portable - like being able to take my file, drop it into any system, and have its embedding "seamlessly" integrate.
It's important to remember that the cloud was also invented by the old school, and understanding the oscillation between client/server architectures vs local, and its implications for data and files, is interesting too.
More questions means more learning until I learned there's no one right or wrong, just what works best, where, when, for how long, and what the tradeoffs are.
Quick wins/decisions are often bandaids that pile up in a different way.
- hard links (only tar works for backup)
- small file size (or inodes run out before disk space)
http://root.rupy.se

It's very useful for globally distributed real-time data that doesn't need the P in CAP for writes.
(no new data can be created if one node is offline = you can login, but not register)
Let me ask another question: is this intended for production use, or is it more of a research project? Because as a user I care about things like speed, simplicity, flexibility, and robustness.
Yeah, that's the kind of thing that I've wanted, but not really had the programming skill/experience/patience to make. There have been a couple of similar projects, but nothing that seems popular enough to be worthwhile spending time using.
I'd be surprised if cloud storage services like OneDrive don't already compute some kind of vector for every file you store. But an online web service isn't the same as being built into the core of the OS.
I invented it because I found searching conventional file systems that support extended attributes to be unbearably slow.
Microsoft saw the tech support nightmare this could generate, and abandoned the project.
It was also complex, ran poorly, and would have required developers to integrate their applications with it.
Microsoft had long solved the problem of blobs and metadata in ESE and SharePoint's use of MS SQL for binary + metadata storage.
I mean, for some definitions of “just”, “SQL database”, and “arbitrary data.” :) It was a schematised graph database implemented on top of a slimmed-down version of SQL Server. The query language was not SQL-based.
> It was abandoned due to The Cloud.
It was discontinued circa 2007. The cloud was much less of a Thing back then. I don’t recall that factoring at all into the decision to cancel the project, though it would have been prescient.
(Disclaimer: I was on the WinFS team at Microsoft.)
But fair enough, I grabbed my Beta 1 copy from \\products; it was fun to play with. I wish they'd seen it through. Microsoft had plenty of 'slimmed down' versions of SQL Server, e.g. the CRM add-in for Outlook, so that isn't quite a unique feature of WinFS.
Maybe all the nuances aren't fully communicated publicly when a project is cancelled, but I don’t recall having a sense that what was said publicly was any different than our understanding internally. But that was almost 20 years ago.
The majority of the teams I was on during my time there were ‘internal startups’: Mira, NetGen, WinFS, MatrixDB. Like startups anywhere, projects being unceremoniously cancelled was par for the course.
When BigQuery was still in alpha I had to ingest ~15 billion HTTP requests a day (headers, bodies, and metadata). None of the official tooling was ready, so I wrote a tiny bash script that:
1. uploaded the raw logs to Cloud Storage, and
2. tracked state with three folders: `pending/`, `processing/`, `done/`.
A cron job cycled through those directories and quietly pushed petabytes every week without dropping a byte. Later, Google’s own pipelines—and third-party stacks like Logstash—never matched that script’s throughput or reliability.

Lesson: reach for the filesystem first; add services only once you’ve proven you actually need them.
[1] https://en.wikipedia.org/wiki/Everything_is_a_file [2] https://en.wikipedia.org/wiki/Unix_philosophy
I would add that filesystems are superior to data formats (XML, JSON, YAML, TOML) for many use cases such as configuration or just storing data.
- Hierarchy is directories,
- Keys are file names,
- Values are the contents of the files,
- Other metadata goes in hidden files.
It will work forever, and you can leverage ZFS, Git, rsync, or Syncthing much better. If you want, a fancy shell like Nushell will bring the experience pretty close to a database.
Most important, you don't need fancy editor plugins or to learn XPath, jq, or yq.
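As a rough sketch of what reading such a config tree might look like (the layout and paths are made up):

    import os

    def load_config(root):
        """Directory = section, file name = key, file contents = value."""
        cfg = {}
        for entry in os.scandir(root):
            if entry.name.startswith("."):      # hidden files hold other metadata
                continue
            if entry.is_dir():
                cfg[entry.name] = load_config(entry.path)
            else:
                with open(entry.path) as f:
                    cfg[entry.name] = f.read().rstrip("\n")
        return cfg

    # e.g. load_config("/etc/myapp") -> {"server": {"host": "0.0.0.0", "port": "8080"}, ...}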
1. For config, it spreads the config across a bunch of nested directories, making it hard to read and write it without some sort of special tool that shows it all to you at once. Sure, you can edit 50 files from all sorts of directories in your text editor, but that’s pretty painful.
2. For data storage, lots of small files will waste partial storage blocks in many file systems. Some do coalesce small files, but many don’t.
3. For both, it’s often going to be higher performance to read a single file from start to finish than a bunch of files. Most file systems will try to keep file blocks in mostly sequential order (defrag’d), whereas they don’t typically do that for multiple files in different directories. SSD makes this mostly a non-issue these days, however. You still have the issue of openings, closings, and more read calls, however.
It really depends how comfortable you are using the shell and which one you use.
cat, tree, sed, grep, etc. will get you quite far, and one might argue they are simpler to master than vim and the various formats. Actually, mastering VSCode also takes a lot of effort.
> 2. For data storage, lots of small files will waste partial storage blocks in many file systems. Some do coalesce small files, but many don’t.
> 3. For both, it’s often going to be higher performance to read a single file from start to finish than a bunch of files. Most file systems will try to keep file blocks in mostly sequential order (defrag’d), whereas they don’t typically do that for multiple files in different directories. SSDs make this mostly a non-issue these days. You still have the overhead of opens, closes, and more read calls, however.
Agreed, but for most use cases here it really doesn't matter, and if I need to optimise storage I will need a database anyway.
And I sincerely believe that most micro-optimisations at the filesystem level are cancelled out by running most editors with data-format support enabled...
I'm being slightly hypocritical because I've made plenty of use of the filesystem as a configuration store. In code it's quite easy to stat one path relative to a directory, or open it and read it, so it's very tempting.
We were building Reblaze (started 2011), a cloud WAF / DDoS-mitigation platform. Every HTTP request—good, bad, or ugly—had to be stored for offline anomaly-detection and clustering.
Traffic profile
- Baseline: ≈ 15 B requests/day
- Under attack: the same 15 B can arrive in 2-3 hours
Why BigQuery (even in alpha)?

It was the only thing that could swallow that firehose and stay query-able minutes later — crucial when you’re under attack and your data source must not melt down.
Pipeline (all shell + cron)
Edge nodes → write JSON logs locally, and a local cron job pushes them to Cloud Storage
Tiny VM with a cron loop
- Scans `pending/`, composes many small blobs into one “max-size” blob in `processing/`.
- Executes `bq load …` into the customer’s isolated dataset.
- On success, moves the blob to `done/`; on failure, drops it back to `pending/`.
Downstream ML/alerting pulls straight from BigQuery.

That handful of `gsutil`, `bq`, and `mv` commands moved multiple petabytes a week without losing a byte. Later pipelines—Dataflow, Logstash, etc.—never matched its throughput or reliability.
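For the curious, here is a rough reconstruction of that loop; not the original script, the bucket and dataset names are placeholders, and it's Python rather than bash for readability:

    import subprocess, sys, time

    PENDING    = "gs://example-logs/pending"
    PROCESSING = "gs://example-logs/processing"
    DONE       = "gs://example-logs/done"
    TABLE      = "customer_123.requests"        # isolated per-customer dataset

    def gsutil(*args):
        subprocess.run(["gsutil", *args], check=True)

    def run_once():
        # 1. list small blobs waiting to be loaded
        ls = subprocess.run(["gsutil", "ls", f"{PENDING}/*.json"],
                            capture_output=True, text=True)
        blobs = ls.stdout.split()[:32]          # gsutil compose takes at most 32 sources
        if not blobs:
            return
        batch = f"{PROCESSING}/batch-{int(time.time())}.json"
        gsutil("compose", *blobs, batch)        # 2. compose into one "max-size" blob
        gsutil("-m", "rm", *blobs)              # state now lives in the composed blob
        try:
            subprocess.run(["bq", "load", "--source_format=NEWLINE_DELIMITED_JSON",
                            TABLE, batch], check=True)      # 3. load into BigQuery
        except subprocess.CalledProcessError:
            gsutil("mv", batch, f"{PENDING}/")  # failure: drop it back to pending/
            sys.exit(1)
        gsutil("mv", batch, f"{DONE}/")         # success: move it to done/

    if __name__ == "__main__":
        run_once()                              # cron invokes this periodically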
Maybe with micro-kernels we'll finally fix this.
Almost all of the operations done on actual filesystems are not database like, they are close to the underlying hardware for practical reasons. If you want a database view, add one in an upper layer.
There was no query language for updating files, or even inspecting anything about a file that was not published in the EAs (or implicitly so, as with adapters); there were no multi-file transactions, no joins, nothing. Just rich metadata support in the FS.
However, I think it is reasonable to think that with way more time and money, these things would meet up. Think about it as digging a tunnel from both sides of the mountain.
Whenever we're talking about interfaces, coordination success or failure is the name of the game.
Directories are a shitty underpowered way to organize data?
No good transactions
Conflation of different needs such as atomic replace vs log-structured
I would like to use a better database instead.
Could you provide reference information to support this background assertion? I'm not totally familiar with filesystems under the hood, but at this point doesn't storage hardware maintain an electrical representation relatively independent from the logical given things like wear leveling?
- You can reason about block offsets. If your writes are 512B-aligned, you are assured of minimal write amplification (see the sketch after this list).
- If your writes are append-only, log-structured, that makes SSD compaction a lot more straightforward
- No caching guarantees by default. Again, even SSDs cache writes. Block writes are not atomic even with SSDs. The only way to guarantee atomicity is via write-ahead logs.
- The NVMe layer exposes async submission/completion queues, to control the io_depth the device is subjected to, which is essential to get max perf from modern NVMe SSDs. Although you need to use the right interface to leverage it (libaio/io_uring/SPDK).
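A minimal sketch of the first point, block-aligned direct I/O on Linux (the path is illustrative, and exact O_DIRECT alignment rules vary by filesystem and device):

    import mmap, os

    BLOCK = 4096
    buf = mmap.mmap(-1, BLOCK)                  # anonymous mmap gives a page-aligned buffer
    buf.write(b"\x00" * BLOCK)

    fd = os.open("/var/tmp/aligned.log", os.O_WRONLY | os.O_CREAT | os.O_DIRECT, 0o644)
    try:
        os.pwrite(fd, buf, 0)                   # offset and length are both multiples of BLOCK
        os.fsync(fd)                            # O_DIRECT skips the page cache; metadata still needs a flush
    finally:
        os.close(fd)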
Not all devices use 512 byte sectors, and that is mostly a relic from low-density spinning rust;
> If your writes are append-only, log-structured, that makes SSD compaction a lot more straightforward
Hum, no. Your volume may be a sparse file on a SAN system; in fact that is often the case in cloud environments; also, most cached RAID controllers may have different behaviours here - unless you know exactly what you're targeting, you're shooting blind.
> No caching guarantees by default. Again, even SSDs cache writes. Block writes are not atomic even with SSDs. The only way to guarantee atomicity is via write-ahead logs.
Not even that way. Most server-grade controllers (with battery) will ack an fsync immediately, even if the data is not on disk yet.
> The NVMe layer exposes async submission/completion queues, to control the io_depth the device is subjected to, which is essential to get max perf from modern NVMe SSDs.
That's the storage domain, not the application domain. In most cloud systems, you have the choice of using direct attached storage (usually with a proper controller, so what is exposed is actually the controller's features, not the individual NVMe queues), or SAN storage - a sparse file on a filesystem on a system that is at the end of a TCP endpoint. One of those provides easy backups, redundancy, high availability and snapshots, and the other one you roll your own.
To say that that's not true would require more than cherry-picking examples of where some filesystem assumption may be tenuous; it would require demonstrating how a DBMS can do better.
> Not all devices use 512 byte sectors
4K then :). Files are block-aligned, and the predominant block size changed once in 40 years from 512B to 4K.
> Hum, no. Your volume may be a sparse file on a SAN system
Regardless, sequential writes will always provide better write performance than random writes. With POSIX, you control this behavior directly. With a DBMS, you control it by swapping out InnoDB for RocksDB or something.
> That's the storage domain, not the application domain
It is a storage domain feature accessible to an IOPS-hungry application via a modern Linux interface like io_uring. NVMe-oF would be the networked storage interface that enables that. But this is for when you want to outperform a DBMS by 200X, aligned I/O is sufficient for 10-50X. :)
On the contrary, filesystems are a specialized database; hardware interface optimizations are done at volume or block device level, not at filesystem level; every direct hardware IO optimization you may have on a kernel-supported filesystem is a leaky VFS implementation and an implementation detail, not a mandatory filesystem requirement.
> 4K then :). Files are block-aligned, and the predominant block size changed once in 40 years from 512B to 4K.
They are, but when the IO is pipelined through a network or serial link, intermediary buffer sizes are different; also, any enterprise-grade controller will have enough buffer space that the difference between a 4k block and a 16k one is negligible;
> Regardless, sequential writes will always provide better write performance than random writes. With POSIX, you control this behavior directly. With a DBMS, you control it by swapping out InnoDB for RocksDB or something.
Disk offsets are linear offsets, not physical; the current system still works as a bastardization of the notion of logical blocks, not physical ones; there is no guarantee that what you see as a sequential write will actually be one locally, let alone in a cloud environment. In fact, you have zero guarantee that your EBS volume is not actually heavily fragmented on the storage system;
> But this is for when you want to outperform a DBMS by 200X, aligned I/O is sufficient for 10-50X. :)
When you want to outperform some generic DBMS, that is; a filesystem is itself a very specific DBMS.
The NVMe layer is not the same as the POSIX filesystem; there is no reason we need to throw that in as part of knocking the POSIX filesystem off its privileged position.
Overall you are talking about individual files, but remember that what really distinguishes the filesystem is directories. Other databases, even relational ones, can have binary blob "leaf data" with the properties you speak about.
I think "guaranteed" is too strong a word given the number of filesystems and flags out there, but "largely" you get aligned I/O.
> The NVMe layer is not the same as the POSIX filesystem; there is no reason we need to throw that in as part of knocking the POSIX filesystem off its privileged position.
I'd say that the POSIX filesystem lives in an ecosystem that makes leveraging NVMe-layer characteristics a viable option. More on that with the next point.
> Overall you are talking about individual files, but remember that what really distinguishes the filesystem is directories. Other databases, even relational ones, can have binary blob "leaf data" with the properties you speak about.
I think regardless of how you use a database, your interface is declarative. You always say "update this row" vs "fseek to offset 1048496 and fwrite 128 bytes and then fsync the page cache". Something needs to do the translation from "update this row" to the latter, and that layer will always be closer to hardware.
Mature database implementations also bypass a lot of kernel machinery to get closer to the underlying block devices. The layering of DB on top of FS is a failure.
Common usage does this by convention, but that's just sloppy thinking and populist extensional defining. I posit that any rigorous, thought-out, not overfit intensional definition of a database will, as a matter of course, also include file systems.
Would you store all your ~/ in something like SQLite database?
Actually yeah that sounds pretty good.
For Desktop/Finder/Explorer you'd just need a nice UI.
Searching Documents/projects/etc would be the same just maybe faster?
All the arbitrary stuff like ~/.npm/**/* would stop cluttering up my ls -la in ~ and could be stored in their own tables whose names I genuinely don't care about. (This was the dream of ~/Library, no?)
[edit] Ooooh, I get it now. This doesn't solve namespacing or traversal.
That's "just" API. FS is "just" a KV store with a weird crufty API and a few extra tricks (bind mounts or whatever).
I think the primary issue is the difference in performance between different strategies. It would be interesting to have a FS with different types of folders similar to how (for example) btrfs is generally CoW but you can turn that off via an attribute.
1. Distributed filesystems do often use databases for metadata (FoundationDB for 3FS being a recent example)
2. Using a B+ tree for metadata is not much different from having a sorted index
3. Filesystems are a common enough use case that skipping the abstraction complexity to co-optimize the stack is warranted
Persistent file systems are essentially key-value stores, usually with optimizations for enumerating keys under a namespace (also known as listing the files in a directory). IMO a big problem with POSIX filesystems is the lack of atomicity and lock guarantees when editing a file. This and a complete lack of consistent networked API are the key reasons few treat file systems as KV stores. It's a pity, really.
"Userspace vs not" is a different argument from "consistency vs not" or "atomicity vs not" or "POSIX vs not". Someone still needs to solve that problem. Sure instead of SQLite over POSIX you could implement POSIX over SQLite over raw blocks. But you haven't gained anything meaningful.
> Persistent file systems are essentially key-value stores
I think this is reductive enough to be equivalent to "a key-value store is a thin wrapper over the block abstraction, as it already provides a key-value interface, which is just a thin layer over taking a magnet and pointing it at an offset".
Persistent filesystems can be built over key-value stores. This is especially common in distributed filesystems. But they can also circumvent a key-value abstraction entirely.
> IMO a big problem with POSIX filesystems is the lack of atomicity
Atomicity requires write-ahead logging + flushing a cache. I fail to see why this needs to be mandatory, when it can be effectively implemented at a higher layer.
> This and a complete lack of consistent networked API
A consistent networked API would require you to hit the metadata server for every operation. No caching. Your system would grind to a halt.
Finally, nothing in the POSIX spec prohibits an atomic filesystem or consistency guarantees. It is just that no one wants to implement these things that way because it overprovisions for one property at the expense of others.
This was an attempt to possibly explain the microkernel point GP made, which only really matters below the FS.
> I think this is reductive enough to be equivalent to "a key-value store is a thin wrapper over the block abstraction, as it already provides a key-value interface, which is just a thin layer over taking a magnet and pointing it at an offset".
I disagree with this premise. Key-value stores are an API, not an abstraction over block storage (though many are or can be configured to be so). File systems are essentially a superset of a KV API with a multitude of "backing stores". Saying KV stores are always backed by blocks is overly reductive, no?
> Atomicity requires write-ahead logging + flushing a cache. I fail to see why this needs to be mandatory, when it can be effectively implemented at a higher layer.
You're confusing durability with atomicity. You don't need a log to implement atomicity, you just need a way to lock one or more entities (whatever the unit of atomic updates is). A CoW filesystem in direct mode (zero page caching) would need neither but could still support atomic updates to file (names).
> A consistent networked API would require you to hit the metadata server for every operation. No caching. Your system would grind to a halt.
Sorry, I don't mean consistent in the ACID context, I mean consistent in the loosely defined API shape context. Think NFS or 9P.
I also disagree with this to some degree: pipelined operations would certainly still be possible and performant but would be rather clunky. End-to-end latency for get->update-write, the common mode of operation, would be pretty awful.
> Finally, nothing in the POSIX spec prohibits an atomic filesystem or consistency guarantees. It is just that no one wants to implement these things that way because it overprovisions for one property at the expense of others.
I didn't say it did, but it doesn't require it, which means it effectively doesn't exist as far as the users of FS APIs are concerned. Rename is the only operation for which POSIX requires atomicity. However, without a CAS-like operation you can't safely implement a lock without several extra syscalls.
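To make that concrete, here's a sketch of the two primitives in question: atomic replace via rename, and a crude lock via exclusive create, which is about the closest thing POSIX gives you to compare-and-swap (file names are made up):

    import os

    def atomic_write(path, data: bytes):
        tmp = path + ".tmp"
        fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        try:
            os.write(fd, data)
            os.fsync(fd)              # make the new contents durable first
        finally:
            os.close(fd)
        os.replace(tmp, path)         # atomic: readers see the old or the new file, never half of one

    def try_lock(lockfile):
        try:
            fd = os.open(lockfile, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            return None               # someone else holds the lock
        return fd                     # caller must close and unlink to release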
You seem unhappy with POSIX because its guarantees feel incomplete and ad hoc (they are). You like databases because their guarantees are more robust (also true). DBMS over POSIX enables all the guarantees that you like. I'd want to invoke the end-to-end systems argument here and say that this is how systems are supposed to work: POSIX is closer to the hardware, and as a result it is messier. It's the same reason TCP in-order guarantees are layered above the IP layer.
Some of your points re: how the lower layers work seem incorrect, but that doesn't matter in the interest of the big picture. The suggestion (re: microkernels) seems to be that POSIX has a privileged position in the system stack, and that somehow prevents a stronger alternative from existing. I'd say that your gripes with POSIX may be perfectly justified, but nothing prevents a DBMS from owning a complete block device, completely circumventing the filesystem. POSIX is the default, but it is not really privileged by any means.
It almost seems like a ridiculous idea to me for a database component author to want to write their own filesystem instead of improving their DB feature set. I hear the gripes in this thread about filesystems, but they almost sound like service-level user issues, not deeper technical issues. What I mean by that is the I/O strategies I've seen from the few open source storage engines I've looked at don't at all seem hindered by the filesystem abstractions that are currently offered. I don't know what a DBMS has to gain from different filesystem abstractions.
The safest way to put the FS on a level playing field with other interfaces is to make the kernel not know about it, just as it doesn't know about, say, SQL.
I'll try to do an example. The kernel doesn't currently know about SQL. Instead, you e.g. connect to a socket and start talking to Postgres. Imagine if FS stuff was the same thing: you connect to a socket, and then issue various commands to read and write files. Ignore perf for a moment; it works, right?
Now, one counter-argument might be "hold up, what is this socket you need to connect to, isn't that part of a file system? Is there now an all-userspace inner filesystem inside a still kernel-supported 'meta filesystem'?" Well, the answer to that is that maybe the Unix idea of making communication channels like pipes and (to a lesser extent) sockets live in the filesystem was a bad idea. Or rather, there may be nothing wrong with saying a directory can have a child which may be such a communication channel, but there is a problem with saying that every such communication channel should live inside some directory.