When people see "duckdb", they're going to think they can slap it into their local analytics workflows, but it turns out that's not a good idea.
> A lightweight data processing framework built on DuckDB and 3FS.
https://github.com/deepseek-ai/smallpond
https://news.ycombinator.com/item?id=43200793
I didn't find anything of value in this article.
Did enjoy https://mehdio.substack.com/p/duckdb-goes-distributed-deepse... some, which eventually talks about smallpond being built on Ray, and… Smallpond actually running multiple partitioned duckdb instances?! Wow.
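For anyone wondering what "multiple partitioned duckdb instances" looks like in practice, here is a minimal sketch of the general pattern. It is not smallpond's actual API. It uses Python's built-in sqlite3 as a stand-in for DuckDB so it runs without installing anything, and the helper names (partition_rows, aggregate_partition, partitioned_sum) are made up for illustration. Rows are hash-partitioned by key, each partition is aggregated by its own independent in-memory engine, and the partial results are merged.

```python
import sqlite3
import zlib


def partition_rows(rows, num_partitions):
    """Hash-partition (key, value) rows so each key lands in exactly one partition."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in rows:
        # crc32 gives a hash that is stable across runs, unlike built-in hash() on str.
        index = zlib.crc32(key.encode("utf-8")) % num_partitions
        partitions[index].append((key, value))
    return partitions


def aggregate_partition(rows):
    """Aggregate one partition with its own in-memory SQL engine (sqlite3 as a stand-in)."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE events (key TEXT, value INTEGER)")
        conn.executemany("INSERT INTO events VALUES (?, ?)", rows)
        return conn.execute(
            "SELECT key, SUM(value) FROM events GROUP BY key"
        ).fetchall()
    finally:
        conn.close()


def partitioned_sum(rows, num_partitions=4):
    """Merge per-partition results; no key spans two partitions, so a plain dict works."""
    totals = {}
    for part in partition_rows(rows, num_partitions):
        for key, total in aggregate_partition(part):
            totals[key] = total
    return totals


if __name__ == "__main__":
    sample = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
    print(partitioned_sum(sample, num_partitions=3))
```

In a real deployment each partition would go to a separate worker (the linked post says smallpond uses Ray for this); the loop here just runs them one after another.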
> Using smallpond and 3FS depends largely on your data size and infrastructure:
> Under 10TB: smallpond is likely unnecessary unless you have very specific distributed computing needs. A single-node DuckDB instance or simpler storage solutions will be simpler and possibly more performant.
> 10TB to 1PB: smallpond begins to shine. You'd set up a cluster with several nodes, leveraging 3FS or another fast storage backend to achieve rapid parallel processing.
> Over 1PB (Petabyte-Scale): smallpond and 3FS were explicitly designed to handle massive datasets. At this scale, you'd need to deploy a larger cluster with substantial infrastructure investments.
Makes it very easy to determine if this would be useful for me and how much work I would expect to do to use it.
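The quoted thresholds boil down to a three-way lookup. A rough sketch, assuming decimal units (1 TB = 10^12 bytes) and a made-up function name; the tier labels paraphrase the quote and are not anything smallpond itself exposes:

```python
TB = 10**12  # assumption: decimal terabyte
PB = 10**15  # assumption: decimal petabyte


def suggested_setup(size_bytes):
    """Map a dataset size to the tier described in the quoted guidance."""
    if size_bytes < 10 * TB:
        return "single-node DuckDB"
    if size_bytes <= 1 * PB:
        return "smallpond cluster with several nodes"
    return "larger smallpond + 3FS cluster"


if __name__ == "__main__":
    for size in (500 * 10**9, 50 * TB, 3 * PB):
        print(size, "->", suggested_setup(size))
```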
IMO pretty obvious, surface-level information with some prose on each bullet.
(because obviousness is subjective and depends on the knowledge, experience, and context of the audience)
IMO means “in my opinion.” I used that phrase to express how the following statement is my opinion and not a universal truth. My “audience” in this case is myself.
I do that because otherwise there’s always a comment saying how things like “obvious” can be subjective.
I also used the word “pretty” to, again, soften the word “obvious” so that readers don’t think that it’s a universal truth.
go on...
Like people talking about 1 Gbit iSCSI, where no one thought to say that 120 MB/s, which is technically slower than ATA/133 from twenty years ago, might be the bottleneck. Obviously 10 Gbit will be "as fast as a local drive"!
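The arithmetic is quick to check. A back-of-the-envelope sketch, where the function name and the efficiency parameter are my own; the 0.96 efficiency is just the value that reproduces the 120 MB/s figure above, not a measured number:

```python
ATA133_MB_PER_S = 133.0  # nominal ATA/133 interface rate


def link_mb_per_s(gbit_per_s, efficiency=1.0):
    """Convert a link speed in Gbit/s to MB/s, optionally scaled for protocol overhead."""
    return gbit_per_s * 1000 / 8 * efficiency


if __name__ == "__main__":
    raw_1g = link_mb_per_s(1)            # 125.0 MB/s before any overhead
    usable_1g = link_mb_per_s(1, 0.96)   # roughly 120 MB/s after assumed overhead
    raw_10g = link_mb_per_s(10)          # 1250.0 MB/s before any overhead
    print(raw_1g, usable_1g, raw_10g)
    print("1 Gbit slower than ATA/133:", raw_1g < ATA133_MB_PER_S)
```

Even the raw 1 Gbit figure sits below ATA/133, which is the point being made.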
Yes, exactly right! This means you need to buy additional hardware, like network cards[0], and possibly GBICs and fiber-optic cabling.
Adding EC2 instances is trivial; setting up 3FS is hard.
Don’t feel bad. I just didn’t find the AI-generated bullet points as impressive as the commenter I was replying to did.
One benefit for me personally: you should be able to move from local dev to cloud more easily.
I guess it comes down to how well written the distributed workflows are. There's a lot to get wrong, but in theory it should be able to achieve very impressive numbers.
My reasoning behind this is Dask, which uses pandas under the hood and is capable of better benchmarks than Spark. I think this is partly down to some good optimisations, but also simply that pandas is faster than Spark's row-based model. DuckDB is more than 10x faster than pandas on some benchmarks, so you can see where this is going...
Edit: I've since changed the title above to the article title, in keeping with the site guidelines (https://news.ycombinator.com/newsguidelines.html). It has been taking me a while to figure out what we're looking at here!
1. the particular release was sudden, unexpected, and not highly pre-advertised or post-advertised — as in an album being "dropped" by a band (where the band more often "releases" albums.) Usage of "dropped" here evokes the feeling that the releaser is casually "dropping" the thing in the public square and walking away, leaving it there to be studied. A band would release an album by going on tour selling it; or they might just drop an album on Spotify one day.
2. the particular release was a single limited production run / limited-time event — where people were anticipating something would be released at a certain specific time, but there was no advance statement from the releaser of exactly what people would be getting. Strong analogy with the NYE "ball drop" — the release is an event that people count down to or line up for. (Think: dropping a new limited-edition colorway of a product people ravenously collect — sneakers, Stanley cups, etc.)
3. the particular release was a bounded-in-size batch or "tranche" of production, all put out to be purchased at once, where "once they sell out, they sell out" for now, with the expectation that the releaser is producing more but that this will take time, during which the item will remain sold out. (Often, the item has actually been produced in quantity, and this limited dribbling-out and repeated fast selling-out is purely a marketing technique to induce hype and demand.) This usage isn't a figurative extension of the literal verb "drop" but rather a shortening of the word "airdrop", as in military resupply and/or NFTs. You would be more likely to see this phrased as "[X] dropped another [Y]" or "[X] dropped more [Y]"; or perhaps "there was a drop of [Y] today."
- https://boards.straightdope.com/t/where-did-the-term-album-d... (2009)
- https://www.talkbass.com/threads/when-did-release-become-dro... (2013)
But it _has_ spread much faster outside of the music scene these last few years, e.g. describing software and products.
Are the claims valid for <10TB, 10TB to 1PB, and over 1PB?