Understanding Smallpond and 3FS (www.definite.app)

262 points by mritchie712 4 months ago | 8 comments

ok123456 4 months ago
https://github.com/deepseek-ai/smallpond
- dang 4 months ago
  We should probably be having a thread about that actual release, so I've re-upped https://news.ycombinator.com/item?id=43200793, will move most of the comments thither, and will post links to this blog post and the other one that people have been referencing.
  - mritchie712 4 months ago
    The repo had already been posted. The reason I wrote the post is it's a bit hard to understand how you'd actually use smallpond for analytics.
    When people see "duckdb", they're going to think they can slap it into their local analytics workflows, but it turns out that's not a good idea.
    dang 4 months ago
    That makes sense and I didn't mean to imply that there was anything wrong with either your post or submitting it to HN! Both are good. It's just that it makes more sense for the community to first discuss the main thing itself.
- westurner 4 months ago
  smallpond: https://github.com/deepseek-ai/smallpond :
  > A lightweight data processing framework built on DuckDB and 3FS.
- mritchie712 4 months ago
  updated.
jauntywundrkind 4 months ago
Smallpond runs on their RDMA-powered 3FS ("Fire-Flyer File System") filesystem.
https://github.com/deepseek-ai/smallpond
https://news.ycombinator.com/item?id=43200793
I didn't find anything of value in this article.
Did enjoy https://mehdio.substack.com/p/duckdb-goes-distributed-deepse... some, which eventually talks about smallpond being built on Ray, and… Smallpond actually running multiple partitioned duckdb instances?! Wow.
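For readers wondering what "multiple partitioned DuckDB instances" could look like in principle, here is a minimal sketch of hash-partitioned aggregation. It is not smallpond's API and uses no DuckDB or Ray at all: a plain Python function stands in for each per-partition engine, and every name in it (`hash_partition`, `sum_by_key`, `partitioned_sum`, the `ticker`/`qty` columns) is made up for illustration.

```python
from collections import defaultdict
from zlib import crc32


def hash_partition(rows, key, n_partitions):
    """Split rows into n_partitions buckets by a stable hash of row[key]."""
    parts = [[] for _ in range(n_partitions)]
    for row in rows:
        idx = crc32(str(row[key]).encode()) % n_partitions
        parts[idx].append(row)
    return parts


def sum_by_key(rows, key, value):
    """Stand-in for one independent engine instance: GROUP BY key, SUM(value)."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += row[value]
    return dict(totals)


def partitioned_sum(rows, key, value, n_partitions=3):
    """Aggregate each partition on its own, then merge the partial results.

    Because rows are partitioned on the grouping key, each key lives in
    exactly one partition, so merging is a plain dictionary update.
    """
    merged = {}
    for part in hash_partition(rows, key, n_partitions):
        merged.update(sum_by_key(part, key, value))
    return merged


rows = [
    {"ticker": "A", "qty": 1},
    {"ticker": "B", "qty": 2},
    {"ticker": "A", "qty": 3},
    {"ticker": "C", "qty": 5},
]
result = partitioned_sum(rows, "ticker", "qty")
```

In a real distributed setup each partition would be handed to a separate worker instead of being processed in a loop; the point of the sketch is only that partitioning on the grouping key lets the partial results be merged without further reconciliation.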
memco 4 months ago
Love this straightforward analysis of use cases:
> Using smallpond and 3FS depends largely on your data size and infrastructure:
> Under 10TB: smallpond is likely unnecessary unless you have very specific distributed computing needs. A single-node DuckDB instance or simpler storage solutions will be simpler and possibly more performant.
> 10TB to 1PB: smallpond begins to shine. You'd set up a cluster with several nodes, leveraging 3FS or another fast storage backend to achieve rapid parallel processing.
> Over 1PB (Petabyte-Scale): smallpond and 3FS were explicitly designed to handle massive datasets. At this scale, you'd need to deploy a larger cluster with substantial infrastructure investments.
Makes it very easy to determine if this would be useful for me and how much work I would expect to do to use it.
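The quoted thresholds can be written down as a tiny lookup. This is only a sketch of the article's rule of thumb as quoted above; the function name `suggest_setup`, the decimal TB/PB constants, and the returned strings are assumptions made for illustration, not anything from smallpond itself.

```python
TB = 10**12  # decimal terabyte, assumed for illustration
PB = 10**15  # decimal petabyte, assumed for illustration


def suggest_setup(data_bytes: int) -> str:
    """Map a dataset size to the rough guidance quoted from the article."""
    if data_bytes < 10 * TB:
        # Under 10TB: distributed processing is likely unnecessary.
        return "single-node DuckDB"
    if data_bytes <= 1 * PB:
        # 10TB to 1PB: a cluster with several nodes and fast storage.
        return "smallpond cluster with several nodes"
    # Over 1PB: a larger cluster with substantial infrastructure.
    return "larger smallpond + 3FS cluster"
```

For example, `suggest_setup(2 * TB)` falls in the first bucket and `suggest_setup(50 * TB)` in the second.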
- dartos 4 months ago
  I very much felt like that entire portion of the article was AI generated, actually.
  IMO it was pretty obvious, surface-level information with some prose on each bullet.
  - xixixao 4 months ago
    Saying something is “obvious” without specifying an audience is meaningless.
    (because obviousness is subjective and depends on the knowledge, experience, and context of the audience)
    dartos 4 months ago
    Notice the “IMO pretty” before the word “obvious”
    IMO means “in my opinion.” I used that phrase to express how the following statement is my opinion and not a universal truth. My “audience” in this case is myself.
    I do that because otherwise there’s always a comment saying how things like “obvious” can be subjective.
    I also used the word “pretty” to, again, soften the word “obvious” so that readers don’t think that it’s a universal truth.
  - genewitch 4 months ago
    with some "no s, sherlock" on the ">1PB will require additional infra."
    go on...
    like people talking about 1gbit iSCSI, and no one thought to say that 120MB/s, which is technically slower than ATA/133 which came out twenty years ago, might be the bottleneck. Obviously 10gbit will be "as fast as a local drive"!
    Yes, exactly right! This means you need to buy additional hardware, like network cards[0], and possibly gbic and fiber optics.
    mritchie712 4 months ago
    I updated the post. In this case, I meant "exotic" infra... e.g. 3FS isn't like adding more EC2 instances.
    Adding ec2 instances is trivial, setting up 3FS is hard.
    7thpower 4 months ago
    You’ve been wanting to get this off your chest for a while, haven’t you?
  - fs111 4 months ago
    The authors are Chinese so they may simply use AI to make it sound right in English
    varispeed 4 months ago
    I had a Chinese co-worker and something like this was actually his style of writing, with no use of AI; I know because I sat next to him a few times while he was writing documents.
  - mritchie712 4 months ago
    some was AI generated, but I made sure everything was accurate. I'd normally rewrite everything, but I wrote this quickly before I had to leave the house. Didn't think it'd be on the front page!
    dartos 4 months ago
    Not judging you for using AI for a post like this!
    Don’t feel bad. I just didn’t think AI generated bullet points were as impressive as the comment I was replying to did.
- jimmyl02 4 months ago
  I wonder at which scale spark fits into this picture and what the tradeoffs / benefits would be
  - mritchie712 4 months ago
    spark is certainly the incumbent for this sort of thing.
    one benefit for me personally: you should be able to move from local dev to cloud more easily.
  - benrutter 4 months ago
    Yeah I reeeaaally want to see benchmarks! Single node duckdb is absolutely insane (as in fast) performance wise, especially compared to something like Spark. There's been a lot of speed focussed work in the project and I don't know of any faster data processing (I'm not counting traditional SQL since a lot of the speed benefits there come from indexing etc and essentially doing additional work ahead of time).
    I guess it comes down to how well written the distributed workflows are, there's a lot to get wrong, but in theory it should be able to achieve very impressive numbers.
    My reasoning comes from Dask, which uses pandas under the hood and is capable of better benchmarks than Spark. I think this is partly down to some good optimisations, but also simply that pandas is faster than Spark's row-based model. DuckDB is, on some benchmarks, more than 10x faster than pandas, so you can see where this is going...
DannyPage 4 months ago
“Releases” is used in the article - instead of “drops” - and would be a clearer title.
- dang 4 months ago
  Ok, fixed now. (Submitted title was "DeepSeek Drops Distributed DuckDB")
  Edit: I've since changed the title above to the article title, in keeping with the site guidelines (https://news.ycombinator.com/newsguidelines.html). It has been taking me a while to figure out what we're looking at here!
- conqrr 4 months ago
  Drop in the context of Databases isn't even close to anything being released or launched. Drop = Delete. Release is a much better word for this context.
  - joshuaturner 4 months ago
    Even in the context of an application stack - my initial read had me believing they were moving away from DuckDB
  - mritchie712 4 months ago
    yeah, I thought drop was amusing in this case paired with the tautogram
    freehorse 4 months ago
    It was, but people here prioritise lexical unambiguity over fun.
- dboreham 4 months ago
  Not only clearer, but 180 degrees different in meaning.
  - 4ndrewl 4 months ago
    I thought "dropped" these days meant released? Not helpful I know...
    kaashif 4 months ago
    I was surprised because I thought the title meant they dropped support or something. Weird because I'd never heard of distributed DuckDB.
    derefr 4 months ago
    In denotation, "dropped" can be used equivalently to "released", yes; but in connotation, using "dropped" instead of "released" implies either that:
    1. the particular release was sudden, unexpected, and not highly pre-advertised or post-advertised — as in an album being "dropped" by a band (where the band more often "releases" albums.) Usage of "dropped" here evokes the feeling that the releaser is casually "dropping" the thing in the public square and walking away, leaving it there to be studied. A band would release an album by going on tour selling it; or they might just drop an album on Spotify one day.
    2. the particular release was a single limited production run / limited-time event — where people were anticipating something would be released at a certain specific time, but there was no advance statement from the releaser of exactly what people would be getting. Strong analogy with the NYE "ball drop" — the release is an event that people count down to or line up for. (Think: dropping a new limited-edition colorway of a product people ravenously collect — sneakers, Stanley cups, etc.)
    3. the particular release was a bounded-in-size batch or "tranche" of production, all put out to be purchased at once where "once they sell out, they sell out" for now, with the expectation that the releaser is producing more but that this will take time, during which the item will remain sold out. (Often, the item has actually been produced in quantity, and this limited dribbling-out and repeated fast selling-out is purely a marketing technique to induce hype and demand.) This usage isn't a figurative extension of the literal verb "drop" but rather a shortening of the word "airdrop", as in military resupply and/or NFTs. You would be more likely to see this phrased as "[X] dropped another [Y]" or "[X] dropped more [Y]"; or perhaps "there was a drop of [Y] today."
    SteveDR 4 months ago
    Yes, most young people would say an artist “dropped” new music instead of saying that they released new music. Still a bad title though
    rvnx 4 months ago
    Dropped could mean abandoned
    0xCMP 4 months ago
    I think to be clearer it would have been written "DeepSeek Drops Distributed version of DuckDB". Otherwise it looks like they used DuckDB (the distributed one?) and they have something new or better they're using now.
    KaoruAoiShiho 4 months ago
    Dropped could also mean they used to use it but stopped, that's also pretty common parlance in software...
- stavros 4 months ago
  Yes but then you lose the alliteration.
  - mritchie712 4 months ago
    yes, sorry, I simply couldn't resist
- djeastm 4 months ago
  This is one of my "Kids these days..." moments. I've been caught several times mistaking the meaning of this new slang.
  - BHSPitMonkey 4 months ago
    Not _so_ new:
    - https://boards.straightdope.com/t/where-did-the-term-album-d... (2009)
    - https://www.talkbass.com/threads/when-did-release-become-dro... (2013)
    But it _has_ spread much faster outside of the music scene these last few years, e.g. describing software and products.
  - wigster 4 months ago
    drop should be un-dropped.
- mritchie712 4 months ago
  Sorry, I couldn't resist the tautogram.
- farts_mckensy 4 months ago
  What is meant is pretty clear to anyone under the age of 50.
  - ivandenysov 4 months ago
    I’m anyone and it wasn’t clear to me
    farts_mckensy 4 months ago
    [flagged]
    throitallaway 4 months ago
    Not everyone is immersed in pop culture, not everyone is a native English speaker, etc. It doesn't cost anything to be kind.
mritchie712 4 months ago
After posting, I started thinking about how you could push Iceberg (or delta) partitions into smallpond. Spinning up 3FS will be a lot of work, but distributing compute on an existing Iceberg catalog would be worth trying.
maknee 4 months ago
What are the results from running smallpond and 3fs?
Are claims valid for <10TB, 10TB -> 1PB and over 1PB?
xnx 4 months ago
"drops" seems to be a fairly recent contronym meaning both "released" and "discontinued".