SELECT *
FROM Population
WHERE weight > 0
ORDER BY -LN(1.0 - RANDOM()) / weight
LIMIT 100 -- Sample size.
Can anyone from ClickHouse verify that the lazy-materialization optimization speeds up queries like this one? (I want to make sure the randomization in the ORDER BY clause doesn't prevent the optimization.)
EXPLAIN plan actions = 1
SELECT *
FROM amazon.amazon_reviews
WHERE helpful_votes > 0
ORDER BY -log(1 - (rand() / 4294967296.0)) / helpful_votes
LIMIT 3
Lazily read columns: review_body, review_headline, verified_purchase, vine, total_votes, marketplace, star_rating, product_category, customer_id, product_title, product_id, product_parent, review_date, review_id
Note that there is a setting query_plan_max_limit_for_lazy_materialization (default value 10) that controls the maximum n for which lazy materialization kicks in for LIMIT n.
Do you know any example query where lazy materialization is detrimental to performance?
However, I managed to look past that, and oh my god, it is so fast. It's like the tool is optimized for raw speed, and whatever you do with it is up to you.
I get why some projects create new dialects and languages, since that way there is less ambiguity and they're harder to use incorrectly, but I think ClickHouse made the right tradeoffs here.
At least in terms of capability and reputation, it was already well known by 2021 and certainly not legacy or bulky. On HN, ClickHouse is submitted very often and regularly reaches the front page; by comparison, when I submitted MySQL stories multiple times, no one was interested.
Edit: On another note, Umami finally supports ClickHouse! [1] Not sure how they're implementing it, because it still requires Postgres, but it should hopefully be a lot more scalable.
"this query sorts all 150 million values in the helpful_votes column (which isn’t part of the table’s sort key) and returns the top 3, in just 70 milliseconds cold (with the OS filesystem cache cleared beforehand) and a processing throughput of 2.15 billion rows/s"
I clearly need to update my mental model of what might be a slow query against modern hardware and software. Looks like that's so fast because in a columnar database it only has to load that 150 million value column. I guess sorting 150 million integers in 70ms shouldn't be surprising.
(Also "Peak memory usage: 3.59 MiB" for that? Nice.)
This is a really great article - very clearly explained, good diagrams, I learned a bunch from it.
I find sorting 150M integers at all to be surprising. The query asks for finding the top 3 elements and returning those elements, sorted. This can be done trivially by keeping the best three found so far and scanning the list. This should operate at nearly the speed of memory and use effectively zero additional storage. I don’t know whether Clickhouse does this optimization, but I didn’t see it mentioned.
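For k = 3, a minimal sketch of that single-scan idea (purely illustrative, not ClickHouse's actual code) could look like this in C++:

    #include <array>
    #include <cstdint>
    #include <utility>
    #include <vector>

    // Single-pass top-3: keep the best three values seen so far.
    // O(n) time, O(1) extra space; the result comes out sorted descending.
    std::array<int64_t, 3> top3(const std::vector<int64_t>& values) {
        std::array<int64_t, 3> best = {INT64_MIN, INT64_MIN, INT64_MIN};
        for (int64_t v : values) {
            if (v > best[2]) {  // beats the current 3rd best
                best[2] = v;
                if (best[2] > best[1]) std::swap(best[2], best[1]);
                if (best[1] > best[0]) std::swap(best[1], best[0]);
            }
        }
        return best;  // best[0] >= best[1] >= best[2]
    }

(The real query also has to carry the row identity along with the value, but that doesn't change the shape of the loop.)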
Generically, one can find the kth best of n elements in time O(n):
https://en.m.wikipedia.org/wiki/Selection_algorithm
And one can scan again to find the top k, plus some extra if the kth best wasn’t unique, but that issue is manageable and, I think, adds at most a factor of 2 overhead if one is careful (collect up to k elements that compare equal to the kth best and collect up to k that are better than it). Total complexity is O(n) if you don’t need the result sorted or O(n + k log k) if you do.
If you’re not allowed to mutate the input (which probably applies to Clickhouse-style massive streaming reads), you can collect the top k in a separate data structure, and straightforward implementations are O(n log k). I wouldn’t be surprised if using a fancy heap or taking advantage of the data being integers with smallish numbers of bits does better, but I haven’t tried to find a solution or disprove the existence of one.
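As a point of reference, here's a sketch of the straightforward O(n log k) single-pass version over a read-only input, using a min-heap of size k (again just an illustration):

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <functional>
    #include <queue>
    #include <vector>

    // Top-k over a read-only stream with a size-k min-heap: O(n log k) time, O(k) space.
    std::vector<int64_t> topK(const std::vector<int64_t>& values, std::size_t k) {
        // Min-heap: the smallest element of the current top-k sits on top.
        std::priority_queue<int64_t, std::vector<int64_t>, std::greater<int64_t>> heap;
        for (int64_t v : values) {
            if (heap.size() < k) {
                heap.push(v);
            } else if (v > heap.top()) {  // v displaces the smallest of the current top-k
                heap.pop();
                heap.push(v);
            }
        }
        std::vector<int64_t> result;
        while (!heap.empty()) {
            result.push_back(heap.top());
            heap.pop();
        }
        std::reverse(result.begin(), result.end());  // largest first
        return result;
    }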
Overall, ClickHouse reads blocks of a fixed size (64k), finds the top elements in each block, and then takes the top of the tops until it converges.
[1] https://danlark.org/2020/11/11/miniselect-practical-and-gene...
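I haven't read the ClickHouse source, so take this as a rough sketch of the block-wise "top of the top" idea rather than the real implementation (the block size, the 4*k pruning bound, and the function names are made up for illustration; assumes k >= 1):

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <functional>
    #include <vector>

    // Rough sketch of block-wise top-k: take the top k of every fixed-size block,
    // collect them as candidates, and periodically shrink the candidates back down.
    std::vector<int64_t> blockwiseTopK(const std::vector<int64_t>& values,
                                       std::size_t k, std::size_t blockSize = 65536) {
        auto keepLargest = [](std::vector<int64_t>& v, std::size_t count) {
            if (v.size() > count) {
                // Quickselect-style partition: the `count` largest end up in front.
                std::nth_element(v.begin(), v.begin() + count - 1, v.end(),
                                 std::greater<int64_t>());
                v.resize(count);
            }
        };
        std::vector<int64_t> candidates;
        for (std::size_t begin = 0; begin < values.size(); begin += blockSize) {
            std::size_t end = std::min(begin + blockSize, values.size());
            std::vector<int64_t> block(values.begin() + begin, values.begin() + end);
            keepLargest(block, k);                   // top k of this block
            candidates.insert(candidates.end(), block.begin(), block.end());
            keepLargest(candidates, 4 * k);          // arbitrary bound to cap memory use
        }
        keepLargest(candidates, k);                  // final "top of the top"
        std::sort(candidates.begin(), candidates.end(), std::greater<int64_t>());
        return candidates;
    }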
That doesn't seem to guarantee correctness. If you don't track all of the unique values, at least, you could be throwing away one of the most common values.
The wiki entry seems to be specifically about the smallest, rather than largest values.
As noted a couple times in this thread, there are all kinds of tradeoffs here, and I can’t imagine quickselect being even close to competitive for k that is small enough to fit in cache. Quickselect will, in general, scan a large input approximately twice. For k = 3, the answer fits in general-purpose registers or even in a single SIMD register, and a single scan with brute force accumulation of the answer will beat quickselect handily and will also beat any sort of log-time heap.
(In general, more advanced and asymptotically better algorithms often lose to simpler brute force algorithms when the parameters in question are smallish.)
However, you can find the top k elements in O(n) time and O(k) space in a single pass.
One simple way: you keep a buffer of up to 2*k elements. You scan your stream of n items one by one. Whenever your buffer gets full, you pare it back down to k elements with your favourite selection algorithm (like quickselect).
As a minor optimisation, you only add items to your buffer if they improve on the worst element already in it (or while you haven't reached k elements in your buffer yet).
As an empirical question, you can also experiment with the size of the buffer. Theoretically any multiple of k will do (even 1.1*k or so), but in practice they give you different constant factors for space and time.
if x > worst then worst = x
The deletion comes in a big batch, where we don't mind paying a linear cost to prune and rebuild our bucket.
Oh, and in our case it's even simpler: the worst element in our buffer only updates during the pruning phase. By construction, we otherwise only ever insert elements that are better than the worst, so we don't have to update it. (Outside of the first k elements, which we take anyway to get started; but if you want, you can handle those as a special case.)
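Putting the pieces together, here's a rough sketch of the 2*k-buffer approach, with std::nth_element standing in for quickselect (illustrative only; assumes k >= 1):

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <functional>
    #include <vector>

    // Streaming top-k with a buffer of up to 2*k elements: O(n) time, O(k) space.
    // `worst` is the k-th best seen so far; items not beating it are skipped,
    // and the buffer is pruned back to k with a selection pass when it fills up.
    std::vector<int64_t> streamingTopK(const std::vector<int64_t>& stream, std::size_t k) {
        std::vector<int64_t> buffer;
        buffer.reserve(2 * k);
        int64_t worst = INT64_MIN;  // only meaningful once we have k elements
        for (int64_t x : stream) {
            if (buffer.size() < k) {
                buffer.push_back(x);
                if (buffer.size() == k)
                    worst = *std::min_element(buffer.begin(), buffer.end());
            } else if (x > worst) {
                buffer.push_back(x);
                if (buffer.size() == 2 * k) {
                    // Prune: keep the k largest; worst becomes the new k-th best.
                    std::nth_element(buffer.begin(), buffer.begin() + k - 1,
                                     buffer.end(), std::greater<int64_t>());
                    buffer.resize(k);
                    worst = buffer[k - 1];  // k-th largest of everything seen so far
                }
            }
        }
        // Final prune + sort (this is the optional k*log(k) step).
        std::size_t take = std::min(k, buffer.size());
        std::partial_sort(buffer.begin(), buffer.begin() + take, buffer.end(),
                          std::greater<int64_t>());
        buffer.resize(take);
        return buffer;
    }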
https://en.wikipedia.org/wiki/Streaming_algorithm#Frequent_e...
So if someone tells you that one item in the stream is repeated so often that it occurs at least p% of the time (say 10%), then these algorithms can find such an element. But if, e.g., there are multiple elements that occur more than p% of the time, they are not guaranteed to give you the one that occurs most often. Nor are they guaranteed to give you any meaningful output if the assumption is violated and no element occurs at least p% of the time.
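For the curious, a classic algorithm in that family is Misra-Gries; a compact sketch (the function name and signature are mine):

    #include <cstddef>
    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    // Misra-Gries "Frequent" sketch: with c counters, any element occurring more than
    // n/(c+1) times in a stream of length n is guaranteed to survive as a candidate.
    // Candidates can be false positives; a second pass is needed for exact counts.
    std::unordered_map<int64_t, std::size_t>
    frequentCandidates(const std::vector<int64_t>& stream, std::size_t counters) {
        std::unordered_map<int64_t, std::size_t> counts;
        for (int64_t x : stream) {
            auto it = counts.find(x);
            if (it != counts.end()) {
                ++it->second;                  // already tracked: bump its counter
            } else if (counts.size() < counters) {
                counts.emplace(x, 1);          // free slot: start tracking it
            } else {
                // No free slot: decrement every counter, evicting those that hit zero.
                for (auto iter = counts.begin(); iter != counts.end(); ) {
                    if (--iter->second == 0) iter = counts.erase(iter);
                    else ++iter;
                }
            }
        }
        return counts;
    }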
The basic idea is to maintain a buffer of size 2k, run an in-place unsorted top-k on it, drop the smaller half (i.e. the lowest k elements), then stream in the next k elements from the main list. Each iteration takes O(k), but you're processing k elements at a time, so the overall runtime is O(n).
When you’re done, you can of course sort for an additional k*log(k) cost.
Memory like DDR4 can do 25GB/s [2]. It can go through 600MB in 600MB / 25,000MB/s = 24ms.
L1/L2 cache can do about 1TB/s per CPU [3]. There are 32 CPUs, so that's roughly 32TB/s of aggregate L1/L2 bandwidth; at 32TB/s, 600MB can be processed in 0.018ms. With a 3ms budget, they can go over the 600MB of data about 166 times.
Rank selection algorithms like QuickSelect and Floyd-Rivest have O(N) complexity. It's entirely possible to process 600MB in 70ms.
[1] https://www.tomshardware.com/features/ssd-benchmarks-hierarc...
[2] https://www.transcend-info.com/Support/FAQ-292
[3] https://www.intel.com/content/www/us/en/developer/articles/t...
You could host so much from your MacBook. The average HN startup could be hosted on a $200 mini PC in a closet for the first couple of years, if not more - and I'm going expensive here, for the extra RAM you'll want so you don't have to restart every hour when you have a memory leak.
The real problem is that most engineers don't understand the degree of overprovisioning they do for code that's simple and doing stupid things, written in an inefficient 4th-order language on top of 5 different useless (imo) abstractions.
/s
I've seen Spark clusters replaced by a single container using less than one CPU core and a few hundred MB of RAM.
But you actually need more than compute. You might need a database, cache, message broker, scheduler, email sending, and a million other things. You can always DIY those with FOSS software, but that takes time. If you have more money than time, get off-the-shelf services that provide them with guarantees and maintenance; if not, the DIY route is also great for learning.
At least in the cloud I can actually have hundreds of GiB of RAM. If I wanted that on my MacBook, it would be even more expensive than my cloud bill.
https://dspace.mit.edu/bitstream/handle/1721.1/34929/MIT-CSA...
It's awesome that ClickHouse is adopting it now, but a shame that it's not standard in anything that does analytics processing.
[1] https://en.wikipedia.org/wiki/Usage_share_of_operating_syste...
https://clickhouse.com/blog/chdb-embedded-clickhouse-rocket-...
What a nice touch. Technical information and diagrams in this were top notch, but the fact there was also some kind of narrative threaded in really put it over the top for me.
I'm pretty sure they did not even bother to properly compress the dataset; with some tweaking it could probably have been much smaller than 30GB. The speed shows that reading the data is slower than decompressing it.
Reminds me of that Cloudflare article where they had a similar idea about encryption being free (reading is slower than decrypting) and found a bug that, once fixed, made that behavior materialize.
The compute engine (chdb) is a wonder to use.
They're not "doing something wrong". They are designed differently for different target workloads.
Row-based -> OLTP -> "Fetch the entire records from order table where user_id = XYZ"
Column-based -> OLAP -> "Compute the total amount of orders from the order table grouped by month/year"
It's mostly transactions that make things slow: various isolation levels, failures if stale data was read in a transaction, etc.
I understand the difference; it's just a shame there's nothing close in read or write rate, even on an index structure that has a copy of the columns.
I'm aware that similar partitioning is available and that it improves write and read rates, but not by these magnitudes.
But credit where it's due: ClickHouse is obviously an industry leader.
The CH contributors are really stellar, from multiple companies (Altinity, Tinybird, Cloudflare, ClickHouse).
PS: I work for ClickHouse.
Contribution is strong and varied enough that I think we're good for the long term.
DuckDB has unfortunately been leaning away from pure OSS - the UI they released is entirely hosted on MotherDuck's servers (which, while an awesome project, makes me feel like the project will be cannibalized by proprietary extensions).
You can scale the pure open source project really far. And if you need more than that, you presumably have the money to pay for it.
People need to stop thinking that open source means leeching for free. Open source is about sharing knowledge and building trust with each other. But we are still living in a business world where we are competing.
I think people should be aware that the Core Storage work is not being done in the open anymore. The process of letting the open source core grow useless while selling proprietary add-ons is fine. I just have a problem with people calling this open source.
Some prefer using open source. Are they leeches?
Regarding scaling open source projects, did Linux scale?
Who is "they" in this sentence? AFAIK the UI was released by MotherDuck, a private company.
But we chose ClickHouse and now we just pump in data with little to no optimization.
This is how we consume Langfuse traces!