Seems to be a columnar storage format that addresses some shortcomings in parquet. Thing is, though, that of all these formats the real winning feature is compatibility, which is (obviously) very hard to improve on, as anything new immediately loses.
Parquet is unfortunately very good just by virtue of being first, and so widely supported. The most widely used parquet version is the oldest version from 2013 (as per the paper itself), so parquet itself couldn't even supplant parquet. If you want to improve on it, you need to bring some serious results, which I don't think f3 does.
Also, my main gripe with parquet (single table per file) is not even addressed, so, also the name is a bit hyped up.
Also also, it seems to go out of its way to include a compiled wasm binary for decoding, yet requires flatbuffers to parse that blob? Kind of defeats the purpose.
Its main result seems to be improved random access which, although certainly welcome, is not the point of columnar storage, as columnar storage was invented to exchange random access for something else: fast analytics. F3 seems to sacrifice fast analytics for the wasm decoder. I don't get it.
Maybe I'm being too cynical. Can someone help me out here?
This is really more of an expectation that has been put on file formats by the query engines. Spark/Datafusion/DuckDB wouldn't really know what to do with a multi-table file.
> Parquet is unfortunately very good just by virtue of being first, and so widely supported
IMO that is not how technology works. It is great that Parquet is so good at a lot of things, but that does not mean just because it came first that it deserves to be the only analytic file format forever.
> Its main result seems to be improved random access which, although certainly welcome, is not the point of columnar storage, as columnar storage was invented to exchange random access for something else: fast analytics
Fast analytics, as well as newer ML-shaped workloads, are inherently a mix of batch scans and random access.
Some of the authors of F3 previously authored another paper that goes into the details of the shortcomings of Parquet
https://www.vldb.org/pvldb/vol17/p148-zeng.pdf
All of the newer formats that popped up recently (Vortex, Lance, F3 now) have been working on solving the problems outlined in that paper.
Lance has some interesting ideas, Vortex focuses on extensibility and performance by replacing all of Parquet's black-box encoders with fully transparent encodings. This solves the tradeoff between bulk and element decoding, allowing you to have efficient full scans and really fast random access.
E.g. Langchain recently rebuilt a system that used to be all Parquet files to use Vortex and saw a massive speedup, which they talk about more here: https://www.langchain.com/blog/introducing-smithdb
Disclaimer: I work on Vortex, so a lot of these questions about "what is the point of building a new format" are things that I have grappled with myself.
Sure it would, you can attach a multi-table sqlite database in duckdb
> that does not mean just because it came first
I agree with most of your points, I am not stating my opinion but my observations. I am the target audience here, I want to use this, but I don't really care too much about the file format itself, at least not as much as I care about the data inside.
That means access, which means compatibility with my tooling.
Compatibility is hard to beat.
This is the Concorde of file formats.
FWIW I think if you are just doing pure analytics and nothing else, Parquet will probably continue to do the job for you just fine, and you don't need to touch your workloads at all.
These new formats I think will find a niche where people aren't just running Spark jobs, but doing lots of systems building over large tables. If you're building a PB-scale data warehouse, you care a lot about the file format b/c it is a big factor in your performance curve, and you're willing to ship new experimental codecs in response to new datatypes you want to support that the system wasn't originally designed for, or you want to use a newly invented compressor.
So I'm with you, I'm very unconvinced that parquet (and the various things that are parquet or essentially-parquet under the hood) are the end of the line here.
> IMO that is not how technology works. It is great that Parquet is so good at a lot of things, but that does not mean just because it came first that it deserves to be the only analytic file format forever.
I think you and vouwfietsman (https://news.ycombinator.com/item?id=48649412) are actually saying the same thing in different words—I think their "unfortunately" means "it is unfortunate that, by virtue of coming first, this now has a support lead that will make it difficult for anyone else to catch up."
When I was working with parquet, I imagined a .parquetz file format which was just a zip file containing any number of uncompressed parquet files. So you could sling multiple tables around in a single file, and still use range requests to access them.
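A rough sketch of why that would work, assuming the members are written with no zip-level compression (ZIP_STORED): each table's bytes then sit contiguously in the archive, so a reader can find the byte range from the zip directory and fetch only that slice. The `.parquetz` name, the helper names and the placeholder payloads are hypothetical; real Parquet bytes would go where the placeholders are.

```python
import io
import struct
import zipfile


def build_archive(tables: dict[str, bytes]) -> bytes:
    """Pack each table's bytes into a zip without zip-level compression."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
        for name, payload in tables.items():
            zf.writestr(name, payload)
    return buf.getvalue()


def member_byte_range(archive: bytes, name: str) -> tuple[int, int]:
    """Return (start, end) of a stored member's raw bytes inside the archive."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        info = zf.getinfo(name)
    # The local file header is 30 fixed bytes followed by the file name and
    # an extra field; their lengths live at offsets 26 and 28 of the header.
    name_len, extra_len = struct.unpack_from("<HH", archive, info.header_offset + 26)
    start = info.header_offset + 30 + name_len + extra_len
    return start, start + info.compress_size


# Placeholder payloads standing in for real Parquet files.
tables = {"orders.parquet": b"PAR1-orders-PAR1", "users.parquet": b"PAR1-users-PAR1"}
archive = build_archive(tables)
start, end = member_byte_range(archive, "users.parquet")
users_bytes = archive[start:end]  # over HTTP this would be a single Range request
```

Over object storage the reader would first fetch the zip central directory from the end of the file, then issue range requests inside the member, so the Parquet footer and column chunks stay individually addressable.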
Frankly it's a change from the usual ChatGPT generated slop that most landing pages are these days.
> "Each self-describing F3 file includes both the data and meta-data, as well as WebAssembly (Wasm) binaries to decode the data. Embedding the decoders in each file requires minimal storage (kilobytes) and ensures compatibility on any platform in case native decoders are unavailable."
If this will pan out security-wise I don't know. I'm more worried that it will be so slow that no one will use it. Interesting idea, though, and I can see applications outside of the "big data" realm this apparently targets.
You could have some kind of OOM killer, but that will be a "footgun" that people who are actually doing "big data" will constantly shoot.
This pretty much kills any ingestion pipeline where the source is untrusted.
“Some code is untrusted” does not mean code should never be executed. There are more use cases with trusted sources than untrusted.
Either engines should impose some limit (e.g. VARCHAR(2000) to cap the length at 2000, though some other engines support unlimited BLOBs), or the decoder should give a hint about the maximum length it will yield. Unfortunately the current research-level project does not have such considerations implemented yet...
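To make the engine-side limit concrete, here is a minimal sketch that uses zlib purely as a stand-in for an untrusted decoder (F3's embedded decoders are Wasm, not zlib, and `bounded_decompress` is a hypothetical helper). The idea is to ask for at most one byte more than the cap and reject the value if the decoder actually produces it.

```python
import zlib


def bounded_decompress(blob: bytes, max_output: int) -> bytes:
    """Decode blob, refusing to materialize more than max_output bytes."""
    d = zlib.decompressobj()
    # max_length caps how much output this call may produce; asking for one
    # extra byte lets us tell "exactly at the limit" from "over the limit".
    out = d.decompress(blob, max_output + 1)
    if len(out) > max_output:
        raise ValueError("decoded value exceeds the configured limit")
    if not d.eof:
        raise ValueError("truncated or malformed stream")
    return out


small = zlib.compress(b"hello")
bomb = zlib.compress(b"\0" * 10_000_000)  # tiny on disk, 10 MB when decoded

decoded = bounded_decompress(small, 2000)
try:
    bounded_decompress(bomb, 2000)
    bomb_rejected = False
except ValueError:
    bomb_rejected = True
```

The oversized value is rejected after roughly 2 KB of output rather than after 10 MB has been allocated, which is the difference between a limit enforced while decoding and an OOM killer firing afterwards.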
Not that Wasm engines don't have bugs, but the whole point is to have an extremely solid, well-specified and efficient implementation of a widely accepted bytecode format. We can scope down the capabilities given to any program to a minimal set.
As a random example that's an area of personal interest to me, I know of 3 distinct methods of achieving userland ROP execution on the Nintendo Switch 2, and all three rely on the (ab)use of a scripting engine (even if they aren't a vulnerability in the scripting engine itself).
But seriously, if your format requires extensibility to the point that it embeds a bytecode, especially a Turing-complete bytecode, what format are you going to choose? Just design a new one? That's how you end up with a scripting engine with three ROP exploits.
They don't currently either, do they? It's the tight coupling of the interface layer, no? I'm not sure this would be faster or more secure, so reliability might be the best use case?
I've heard that kind of sentiment many times before. It's not a good (thought-terminating) mindset to have for any secure software.
There are several WASM implementations, WASM is just a format. "Pure functions" are pure at a superficial level. Many people say that they don't mutate global state, but they do ... it's just hidden. The decoders "not needing a lot of access" doesn't matter if the WASM engine is pwned through arbitrary code execution inside the environment, or if it's contorted to bypass the access control you are mentioning through various side-effects.
You need to run a WASI environment for that.
Then there is a helper in this case to de-serialize, "primitive_array_from_buffers()"
https://github.com/future-file-format/F3/blob/bd92506447dc13...
Hell, Node.js didn't even get this ability until LAST MONTH:
https://nodejs.org/en/blog/release/v26.1.0
You'd have to write a second library to interface the C ABI with Node via NAPI just to consume it.
What do you mean by C bindings? C bindings to what?
When I'm working with data I'm working in a specific set of languages. Usually one. Yeah, other people might be working in other languages, but no individual author really needs a language-agnostic way of accessing data beyond compile time. Add to that the likely runtime boundaries that may need to be crossed instead of e.g. inlined by the compiler because it's in-language and dealing with known offsets or tags (depends on the data format of course). To the other commenter's point, am I going to have to sandbox all data access code just to be sure it's not able to do something unexpected? There's a lot of complexity here. And the inherent risk is going to slow down the operation that should be the simplest and fastest: interpreting bytes.
So this is really about making a file that is forwards compatible in a way that lets you push the standards more than existing formats.
That's so untrue! People need language-agnostic ways to access data all the time, and people work with data accessing them from multiple languages all the time!
If I have parquet files I can load them in duckdb, in pandas and polars, process them with various independent tools, and loads of other things... and people do that.
This is also why people like something like an SQL database, your data is not locked to some specific language / lib for access.
It isn't necessary, because setting timeouts or other resource restrictions works way better to prevent DoS.
It isn't sufficient, because even if you can prove that a program will halt at some point, this alone doesn't tell you how long it will take. What good does it do to know that the program will run for 10 years before it halts? By that time, service will already have been denied. Even turning hash table lookups from O(1) to O(n) (still very much terminating!) can result in a DoS.
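A toy illustration of the resource-restriction side, assuming the decoder can be driven one step at a time (modelled here as a Python iterator; `run_with_fuel` and `BudgetExceeded` are made-up names, though Wasm runtimes commonly offer a similar "fuel" or instruction-count limit). The slow decoder below provably halts, yet only the budget keeps it from tying up the host.

```python
from typing import Iterator


class BudgetExceeded(Exception):
    """Raised when a decoder uses more steps than the host allows."""


def run_with_fuel(steps: Iterator[int], fuel: int) -> list[int]:
    """Collect decoder output, aborting once more than `fuel` steps are taken."""
    out = []
    for value in steps:
        if fuel == 0:
            raise BudgetExceeded(f"gave up after {len(out)} steps")
        fuel -= 1
        out.append(value)
    return out


# Finishes well within its budget.
quick = run_with_fuel(iter(range(5)), fuel=10)

# Terminates eventually (10**12 steps), but is cut off after 1000.
try:
    run_with_fuel(iter(range(10**12)), fuel=1000)
    cut_off = False
except BudgetExceeded:
    cut_off = True
```

Nothing here needs a termination proof: the host decides how much work a file is allowed to cost and stops paying once that is spent.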
Yes, which is why nobody uses PDFs.
Maybe I would pick the eBPF VM instead, with all its limiting and verifying mechanics.
> This security update resolves a publicly disclosed vulnerability in Microsoft Windows. The vulnerability could allow remote code execution if a user opens a specially crafted document or visits a malicious Web page that embeds TrueType font files.
> This security update is rated Critical for all supported releases of Microsoft Windows. For more information, see the subsection, Affected and Non-Affected Software, in this section.
> The security update addresses the vulnerability by modifying the way that a Windows kernel-mode driver handles TrueType font files. For more information about the vulnerability, see the Frequently Asked Questions (FAQ) subsection for the specific vulnerability entry under the next section, Vulnerability Information.
[1] https://www.bleepingcomputer.com/news/security/facebook-disc...
[2] https://blakecrosley.com/blog/truetype-hinting-swift-migrati...
A file is a bag of bytes. You can send those bytes to different things, like a text editor's content-stream, or as the input to a WASM interpreter.
What you decide to do with the bytes in a file is your own prerogative. Each byte is whatever you make of it.
The WASM encoders/decoders are embedded resources that exist as byte offsets in the file metadata, not header info.
Compare that to JSON. The parser NEVER needs to execute arbitrary instructions. Parser might have bugs, but it avoids a whole class of issues.
> the attacker can embed whatever WASM payload they want into the file since the file will be “opened” by “execute this offset into the file”.
And then do what with it? WASM physically cannot interact with the underlying host or perform I/O -- you need a WASI environment for that.
I'd say at worst it's setup for poor security
Doing `head foo.exe` is quite different than `run foo.exe`
If I encode executable instructions in "image.png" and then send them to an interpreter that runs those instructions, the file extension doesn't matter.
What context am I missing?
Shortcomings of Parquet are mentioned as overcome by this, which ones? Certainly not wide tool support...
Why should one leave Parquet or ORC for this structure?
I found this post interesting,
- https://medium.com/@reliabledataengineering/f3-the-future-pr...
I guess one use case is that I come up with a video compression scheme that's better than H.265, but not all platforms support it, so I embed a decoder that would allow me to play it back on legacy hardware. But that also shows the weakness of the idea: it's unlikely that legacy hardware will perform well doing software-only decode for video formats from the future. If we rolled this idea out in the 1990s, it would not have allowed watching Netflix on an i386.
In the same vein, I doubt this would have allowed me to open Word 2021 files in Word 97. There's no 1-to-1 mapping between the data structures. So if this kind of compat isn't slam-dunk, what's the goal?
The downsides are clear. First, it's probably a maintenance nightmare: if your decoder has a bug that needs fixing, how do you patch all the files that already embed it? And then, there's size overhead and security risks. We're adding a considerable attack surface to every format parser. It's more opportunities for remote code execution, resource exhaustion attacks, and so on. Again, this is not always wrong, but what's the benefit?
I admit I only skimmed the beginning of the paper, and maybe the format is less general than it sounds.
Like, can this file be efficiently mmap'd? Maybe if it emulates tar internally, but you don't know until you run it. Can it be seeked to specific bytes to only decompress part? It only supports a pre-release version of ISO-36898533 seeking, and your file library dropped support for it 6 years ago. If I rewrite 1MB in the middle, can it only change those pages on disk (and maybe an index), or do I have to rewrite the whole thing? Well the wasm blob supports 97 different APIs for it (there are 35 copies of one with different names), so it's larger than the data (but nobody paid attention to that), so you have 19 options that you recognize, but your CPU's native WASM accelerator only handles two or three so you've still got to specialize your code heavily.
At least with "*.tar.gz" you have some idea of what's possible.
I think you might get some traction if you post the advantages over parquet and other files directly on the readme, so that if someone goes to https://github.com/future-file-format/f3 they see why they should try it.
Mention the advantages and post metrics. Cherry pick the metrics! There's probably a good use case for this but, from the current readme, it's not clear who should use this and why.
Additionally, putting the decoding logic inside a WASM binary introduces an active execution layer into what should be cold storage.
the same sandboxing capability exists for WASM as well.
it is actually better for long-term archival: you don't need to carry the decompression program, since it will be a part of the archive file itself
In the "future."
Nimble? Lance? Also in the future. Maybe.
I'll use Parquet in the present.
Also, f3 is already “fight-flash-fraud”.
F3: Open-source data file format for the future [pdf] - https://news.ycombinator.com/item?id=45437759 - Oct 2025 (125 comments)
plus this bit:
An Open File Format for storing the information from a forge - https://news.ycombinator.com/item?id=44043253 - May 2025 (1 comment)
I see many replies criticizing F3 as an operational data format, like Parquet. Of course it can't be made as fast in the general case, or as compatible to the existing infrastructure.
OTOH F3 would be easy to decode into almost any of today's accepted formats, and likely to any of tomorrow's data formats. That's where being self-describing and self-unpacking would be important.
It doesn't explain what the project does (a file format for what? Name dropping other things I haven't heard of isn't useful)
There are no examples. It links to a flatbuffer schema which is at least well commented, but is full of deep implementation details.
The point is that within 2-3 minutes I'm not convinced why I should care, and I still don't know enough about what this is to even think back to it if I encounter a scenario in the future where it would be useful.
> designed with efficiency, interoperability, and extensibility in mind. It provides a data organization that rectifies the layout shortcomings of the last-generation formats like Parquet,
This is all marketing speak that says nothing.
> maintaining good interoperability and extensibility (a.k.a future-proof) via embedded Wasm decoders
What does this even mean? Providing a decoder is no guarantee of futureproofness.
BTW, while we're on the topic. I don't do social media. I occasionally type up a text or post on a technical board. Maybe 98% of my textual interaction these days is with LLMs. I would not be surprised if my prose changes to resemble theirs over time. I suppose that's symbiosis for ya. It's possible that your AI-dar might get even more ineffective.
Has nimble/velox had any better luck lately? I forget what stories someone shared, but, it seemed to have such big intent, then real trouble actually getting released. I want to say someone was saying the lawyers ended up not letting a lot of the work get released. Nimble is the one competitor benchmarked against here that beats them, and is also extensible (to some degree?), so I'd love to know how things have gone for the past 6-12 months for nimble/velox. https://news.ycombinator.com/item?id=39995112 https://github.com/facebookincubator/nimble/ https://materializedview.io/p/nimble-and-lance-parquet-kille...
F3: Open-source data file format for the future
Previous discussion: