What we are lacking now is tooling that gives you insight into how you should configure Iceberg. Does something like this exist? I have been looking for something that would show me the query plan that is developed from Iceberg metadata, but didn't find anything. It would go a long way toward showing where the bottleneck is for queries.
Regarding reading materials, I found this DuckDB post to be especially helpful in realizing how parquet could be better leveraged for efficiency: https://duckdb.org/2024/03/26/42-parquet-a-zip-bomb-for-the-...
The optimal file size for Parquet tends to be about 1 GiB, so once again the "many small files" problem of Hadoop remains.
Then it's things like: can you organise your data in such a way as to take advantage of RLE, etc.?
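To make that concrete, here's a minimal sketch with pyarrow (column names and sizes are made up): sort on a low-cardinality column before writing so RLE actually has runs to collapse, and prefer few large files over many small ones.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Made-up data: "device_id" is low-cardinality, so sorting on it gives
# run-length encoding long runs of repeated values to collapse.
table = pa.table({
    "device_id": ["b", "a", "b", "a"] * 250_000,
    "reading": list(range(1_000_000)),
})

# Unsorted data defeats RLE even when the encoding is enabled.
table = table.sort_by([("device_id", "ascending")])

# Prefer few, large files (~1 GiB) with sizeable row groups over many
# small files -- the "small files" problem mentioned above.
pq.write_table(table, "readings.parquet", row_group_size=1_000_000)
```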
I worry with Iceberg that people think it's just a case of "use an Iceberg table in Snowflake" and boom, amazingly fast querying of data in S3!
Is Iceberg "easy" to set up? No.
Can you get set up in a week? Yes.
If you really need a datalake, spending a week setting it up is not so bad. We have a guide[0] here that will get you started in under an hour.
For smaller (e.g. under 10 TB) data where you don't need real-time, DuckDB is becoming a really solid option. Here's one setup[1] we've played around with using Arrow Flight.
If you don't want to mess with any of this, we[2] spin it all up for you.
0 - https://www.definite.app/blog/cloud-iceberg-duckdb-aws
I have a vision for a way to make it work. I made another comment here. Your blog posts were helpful; I dug a bit into the Duck Takes Flight code in Python and Rust.
You can set up a data lake, save data, and start running queries in about 10 minutes with this setup.
My company has 100s of data pipelines that are executed infrequently.
For this use case Athena is ridiculously cheap and easy to use vs most other solutions.
And sometimes, if your query is CPU-intensive but the queried data size is not huge, you can get ridiculous value for money: many CPU-days of compute in 10 minutes for just $5, if your query covers 1 TB after partitioning.
Query size limits are also configurable.
Obviously it depends on what data you are working on, but not having to set up and pay for a computational cluster is a huge cost saving.
A lot of people would worry about "vendor lock-in" here, but it's certainly convenient.
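For illustration, a minimal boto3 sketch of a partition-pruned Athena query (database, table, and bucket names are all placeholders). The WHERE clause on the partition column is what keeps the scanned bytes, and therefore the bill, small at ~$5/TB scanned.

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# All names are placeholders. Filtering on the partition column (dt) means
# Athena prunes partitions instead of scanning them, and since pricing is
# per TB scanned, that pruning is the whole cost story.
resp = athena.start_query_execution(
    QueryString="""
        SELECT sensor_id, avg(value) AS avg_value
        FROM events
        WHERE dt = '2024-01-15'  -- partition column: pruned, not scanned
        GROUP BY sensor_id
    """,
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print(resp["QueryExecutionId"])
```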
I feel like this could be a game changer for the ecosystem. It's more CPU- and network-heavy for writes, but the reads are always fast. And the writes are still faster than pyiceberg.
I want to hear opinions, or reasons why this could never work.
Also, https://paimon.apache.org/ seems to be better for streaming use cases.
Iceberg (like the Delta table format) is really OLAP-optimized, being built on a columnar file format, Parquet. This means it will be slow to do writes compared to a traditional row-based datastore, and it doesn't really have normal/optimal OLTP indexing.
Fast OLTP + fast OLAP + low latency is best done via HTAP-type databases, which store data in both row and columnar form and let you declare your latency tolerance in the SELECT clause; the query engine will pick the OLTP engine if it knows there are still some OLAP writes queued up that entered the system more than <latency-timeframe> ago but aren't fully on disk yet.
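As a toy sketch of that routing logic (all names invented; real HTAP systems do this inside the planner, not client-side):

```python
import time

class HtapRouter:
    """Toy model of the routing described above: pick the row (OLTP) or
    columnar (OLAP) engine based on the caller's declared latency tolerance."""

    def __init__(self) -> None:
        # Arrival times of writes accepted but not yet flushed to the columnar store.
        self.pending_columnar_writes: list[float] = []

    def pick_engine(self, latency_tolerance_s: float) -> str:
        now = time.time()
        # If any queued write is older than the tolerance window, the columnar
        # engine would serve stale results, so fall back to the row store.
        stale = any(now - t > latency_tolerance_s
                    for t in self.pending_columnar_writes)
        return "row_store" if stale else "columnar_store"

router = HtapRouter()
router.pending_columnar_writes.append(time.time() - 30)  # a 30s-old queued write
print(router.pick_engine(latency_tolerance_s=5))   # -> row_store
print(router.pick_engine(latency_tolerance_s=60))  # -> columnar_store
```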
Various vendors do have HTAP, but all with proprietary storage engines and query engines; Iceberg alone doesn't get you there. I haven't seen discussion of this, and I don't know if anyone has tried to write both Hudi and Iceberg/Delta in parallel so they could do HTAP; maybe they use pure Hudi instead?
I'd have to re-look at Hudi to see if its deferred compaction is more like this. XTable doesn't seem to target this issue.
I'm using a basic implementation that's not backed by Iceberg, just Parquet files in Hive partitions that I can query using DuckDB.
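Something like this, roughly (bucket layout and column names are invented); with hive_partitioning enabled, DuckDB turns directory names into columns and prunes whole directories before reading any Parquet:

```python
import duckdb

# Invented layout: s3://bucket/data/sensor=<id>/date=<yyyy-mm-dd>/part.parquet
con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")

# hive_partitioning=true turns directory names into queryable columns, and
# the WHERE clause prunes whole directories before any Parquet is read.
rows = con.sql("""
    SELECT *
    FROM read_parquet('s3://bucket/data/**/*.parquet', hive_partitioning = true)
    WHERE sensor = 'sensor_42'
      AND date BETWEEN '2024-01-01' AND '2024-01-31'
""").fetchall()
```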
Another option, if you have enough local storage, would be to use something like JuiceFS, which creates a virtual file system where files are initially written to a local cache before JuiceFS uploads the data to your S3 provider as larger chunks.
SeaweedFS can do something similar if you configure it the right way. But both options require that you have enough storage outside of your object storage.
https://github.com/seaweedfs/seaweedfs/wiki/Cloud-Drive-Bene...
https://github.com/seaweedfs/seaweedfs/wiki/Cloud-Tier
https://github.com/seaweedfs/seaweedfs/wiki/Benchmarks
https://github.com/seaweedfs/seaweedfs/wiki/Words-from-Seawe...
https://github.com/seaweedfs/seaweedfs/wiki/Amazon-S3-API
...your true issue seems to be that you're using the filesystem as the "only" storage layer in play, but you also need time and entity querying(!?!).
>> we need to be able to query on a per sensor basis and a timespan
...look at the "Cloud-Tier" wiki page. If you're truly in an "everything's hot all the time" situation, you really should be using a database. If you're pulling "usually recent stuff, occasionally old stuff" then fronting with something like SeaweedFS seems like it might "just" transparently reduce your overall costs.
Really, I'd nudge towards "write .txt ; compact ... ; SELECT ... && cat .txt".
Basically, keep your inbound writes cached to (e.g.) seaweed as unit files. "Compact" them every hour by appending rows to some appropriate database (I mean: migrate to using litefs, turso, postgres, something like that). When you read, you may need to supplement "tip" data from your incoming files, but the majority should be hitting a "real" remote database; there's plenty to choose from!
A nifty note: sqlite can connect to multiple DBs at once: https://www.sqlite.org/lang_attach.html ... https://stackoverflow.com/posts/10020/revisions
...something like `select * from raw union (select * from one_hour) union (select * from today) union (select * from historical) ...`
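A minimal sketch of that read path with Python's sqlite3 (file and table names invented; each tier is assumed to hold an identically shaped `events` table):

```python
import sqlite3

# Invented file names: raw.db holds just-landed rows; the others are the
# hourly/daily/historical compaction targets described above.
con = sqlite3.connect("raw.db")
for alias, path in [("one_hour", "one_hour.db"),
                    ("today", "today.db"),
                    ("historical", "historical.db")]:
    con.execute(f"ATTACH DATABASE '{path}' AS {alias}")

# UNION ALL stitches the tiers back into one logical table at read time.
rows = con.execute("""
    SELECT * FROM events
    UNION ALL SELECT * FROM one_hour.events
    UNION ALL SELECT * FROM today.events
    UNION ALL SELECT * FROM historical.events
""").fetchall()
```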
Of course, for a simpler solution, you could also use Aurora: a clean, scalable Postgres that can survive zone failures.
Then spin up DuckDB and do some performance tests. I'm not sure this will work; there is some overhead to reading Parquet, which is why small files and small row groups are discouraged.
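If you want to measure that overhead, a quick timing harness along these lines (paths are made up) should show the gap between one big file and many small ones:

```python
import time
import duckdb

def time_scan(glob: str) -> float:
    """Time a full count(*) scan over whatever the glob matches."""
    start = time.perf_counter()
    duckdb.sql(f"SELECT count(*) FROM read_parquet('{glob}')").fetchall()
    return time.perf_counter() - start

# Same data, two layouts (paths are made up): per-file open and footer-read
# costs usually make the many-small-files layout measurably slower.
print("one big file:    ", time_scan("big/data.parquet"))
print("many small files:", time_scan("small/*.parquet"))
```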
[1] https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-tab...
Hopefully this will come at some point. Product looks very cool otherwise.
Do you ever go back and reaggregate older data into bigger, sorted files? That is, maybe you originally partitioned by hour, but stale data is so infrequently accessed that you could roll up into partitions per week/month/whatever. Depending on the specifics, you might save some space from less per-file overhead and better compression.
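As a rough sketch with DuckDB (paths and sort keys invented): rewrite a month of hourly files into a single sorted file, so you pay one file's overhead and give the encoder long runs to compress:

```python
import duckdb

# Invented layout: hourly files under lake/month=2024-01/hour=*/.
# Rewriting a month into one sorted file trades many small footers for one,
# and the ORDER BY gives the encoder long runs (better RLE/dictionary pages).
duckdb.sql("""
    COPY (
        SELECT *
        FROM read_parquet('lake/month=2024-01/hour=*/*.parquet')
        ORDER BY sensor_id, ts
    )
    TO 'lake/rollup/month=2024-01.parquet' (FORMAT PARQUET)
""")
```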
No more files. You might be able to avoid per-usage pricing just by hosting this on a regular VPS.
> To use all fsspec features, either install via pip install ratarmount[fsspec] or pip install ratarmount[full]. It should also suffice to simply pip install fsspec if ratarmountcore is already installed.
You can also do this with a landing table, or even branches + WAP (write-audit-publish).
This pain is too real, and too close to home. I've seen this outcome turn an entire business off of consuming its data via Hadoop, because it turns into a wasteland of delayed deliveries, broken datasets, ops teams who cannot scale, and architects overselling overly robust designs.
I've tried to scale Hadoop down to the business user with visual ETL tools like Alteryx, but there again compatibility between Alteryx and Hadoop sucks via ODBC connectors. I came from an AWS-based stack into a poorly leapfrogged data stack, and it's hard not to pull my hair out between the business struggling to use it and infra + ops not keeping up. Now these teams want to push to Iceberg or BigQuery while ignoring the mountains of tech debt they have created.
Don't get me wrong, Hadoop isn't a bad idea; it's just complex and a time suck, and unless you have the time to dedicate to properly deploying these solutions (which most businesses do not), your implementation will suffer and your business will suffer.
"While the parallels to Hadoop are striking, we also have the opportunity to avoid its pitfalls." No one in IT learns from their failures unless they are writing the checks; most will flip before they feel the pain.
> Yet, competing table formats like Delta Lake and Hudi mirror this fragmentation. [...]
> Just as Spark emerged as the dominant engine in the Hadoop ecosystem, a dominant table format and catalog may appear in the Iceberg era.
I think extremely few people are making bets on any other open-source table format now; that consolidation already happened in 2023-2024 (see e.g. Databricks, who have their own competing format, leaning heavily into Iceberg, or the adoption from all of the major data warehouse providers).
They seem to be the only vendor crazy enough to try to fast-follow Databricks, who is clearly driving the increasingly elaborate and sophisticated Delta ecosystem (check the GitHub traffic…)
But Microsoft + Databricks is a lot of momentum for Delta.
On the merits of open & simple, I agree, better for everyone if Iceberg wins out—as Iceberg and not as some Frankenstandard mashed together with Delta by the force of 1,000 Databricks engineers.
Regardless of the relative merits now, I think everyone agrees that a few years ago Slack was clearly superior. Microsoft could have certainly bought Slack instead of pumping probably billions into development, marketing, discounts to destroy them.
I think Microsoft could and would consider buying Databricks—$80–100B is a lot, but not record-shattering.
If I were them, though, I’d spend a few billion competing as an experiment, first.
The Trump folks have given mixed messages on the Biden-era FTC; I'd put the odds that with the right tap dancing (sigh) Microsoft could make a blockbuster like this in the B2B space work.
Iceberg out of the box is NOT good at streaming use cases; unlike formats like Hudi or Paimon, the table format does not have the concept of a merge/index. However, the beauty of Iceberg is that it is very unopinionated, so it is indeed possible to design an engine that stream-writes to Iceberg. As far as I know, this is how engines like Upsolver were implemented: 1. Keep an in-memory buffer to track incoming rows before flushing a version to Iceberg (every 10s to a few minutes). 2. Build an indexing structure to write position deletes/deletion vectors instead of equality deletes. 3. The writer will also try to merge small files and optimize the table.
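A minimal sketch of step 1, the buffer-and-flush part, using pyiceberg (catalog and table names are invented, and the schema is assumed to match); steps 2 and 3 are engine-level work that pyiceberg won't do for you:

```python
import time
import pyarrow as pa
from pyiceberg.catalog import load_catalog

# Invented names: a configured catalog "default" with a table "db.events".
catalog = load_catalog("default")
table = catalog.load_table("db.events")

buffer: list[dict] = []
last_flush = time.time()
FLUSH_INTERVAL_S = 10  # step 1: commit every ~10s, not per row

def write(row: dict) -> None:
    """Buffer rows in memory and flush them as one Iceberg snapshot."""
    global last_flush
    buffer.append(row)
    if time.time() - last_flush >= FLUSH_INTERVAL_S:
        # One append = one snapshot; batching keeps the snapshot and
        # small-file counts manageable.
        table.append(pa.Table.from_pylist(buffer))
        buffer.clear()
        last_flush = time.time()
```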
And stay tuned, we at https://www.mooncake.dev/ are working on a solution to mirror a postgres table to iceberg, and keep them always up-to-date.
It has a lot of knobs to fiddle with (more than Delta Lake, which tries very hard to come up with good defaults), but even if you don't touch any of them, you already end up with tables that are as good as Hive's, except now your writers don't break your readers.
This is already a massive boon that lets you escape the rigidity of a timetable schedule for your data pipelines. Anything else you can come up with (switching your table to MOR and rewriting it as a separate step, etc.) is a further improvement.