52 points by HermitX 11 hours ago | 3 comments
  • j-pb 5 hours ago
    Whenever I read join optimisation articles for SQL-based systems it feels... off.

    There is too much heuristic fiddling involved, and way too many niche algorithms that get cobbled together with an optimiser.

    As if we're missing the theory to actually solve the stuff, so we're instead hobbling along by covering as many corner cases as we can, completely missing some elegant and profound beauty.

    • mamcx 2 hours ago
      Because a SQL query encompasses an arbitrary combination of MANY different sub-programs that are expected to be auto-solved.

      Attempt to implement them manually and you'll see how hard it is.

      PLUS, you not only need to account for the general solution, but also for what could be best given the current data set.

      And you can't compile the plan statically (dynamically, sure).

      And it should work interactively, so hopefully planning finishes faster than running the actual query.

      P.S.: Joins are normally the focus, but other constructs are also challenging. For example, just deciding whether and which indexes to pick can be hard when you have dozens of predicates.

      And yes, your optimizer should (eventually!) survive being fed hundreds of joins, predicates, aggregates, sorts and arbitrary expressions.

      * I worked on the optimizer of a database. EVERYTHING is tricky!

    • sethev 2 hours ago
      There have been hints in the research that this might be the case, but so far they haven't really beaten the heuristic approach in practice (outside of special cases).

      For example, there's a class of join algorithms called 'worst-case optimal' - which is not a great name, but basically means that these algorithms run in time proportional to the worst-case output size. These algorithms ditch the two-at-a-time approach that databases typically use and join multiple relations at the same time.

      One example is the leapfrog trie join which was part of the LogicBlox database.
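      To make that concrete, here is a toy attribute-at-a-time sketch of the idea on the classic triangle query. It is only an illustration using hash sets with made-up relations; the actual leapfrog triejoin walks sorted trie iterators and "leapfrogs" between them.

          # Toy sketch of the attribute-at-a-time idea behind worst-case
          # optimal joins, on the triangle query
          #   Q(a, b, c) = R(a, b) JOIN S(b, c) JOIN T(a, c).
          # Not LogicBlox's implementation: the real leapfrog triejoin uses
          # sorted iterators instead of hash sets.
          from collections import defaultdict

          def triangle_join(R, S, T):
              """R, S, T are lists of 2-tuples: R(a, b), S(b, c), T(a, c)."""
              # Index each relation by its first attribute.
              R_by_a, S_by_b, T_by_a = defaultdict(set), defaultdict(set), defaultdict(set)
              for a, b in R:
                  R_by_a[a].add(b)
              for b, c in S:
                  S_by_b[b].add(c)
              for a, c in T:
                  T_by_a[a].add(c)

              out = []
              # Bind one attribute at a time, intersecting every relation that
              # mentions it, instead of materialising a pairwise join first.
              for a in R_by_a.keys() & T_by_a.keys():   # a occurs in R and T
                  for b in R_by_a[a] & S_by_b.keys():   # b occurs in R and S
                      for c in S_by_b[b] & T_by_a[a]:   # c occurs in S and T
                          out.append((a, b, c))
              return out

          # Never builds the (possibly huge) intermediate R JOIN S.
          print(triangle_join(R=[(1, 2)], S=[(2, 3)], T=[(1, 3)]))  # [(1, 2, 3)]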

    • jasonwatkinspdx 5 hours ago
      Optimal join ordering is NP-hard.
      • nine_k 3 hours ago
        It must be equivalent to the knapsack problem, for which many faster close-to-optimal algorithms exist. Am I missing something?
        • layer8 3 hours ago
          It’s not equivalent. Knapsack is weakly NP-hard, while optimal join order is strongly NP-hard. Also, algorithms that only approximate an optimal solution don’t generally carry over between NP-hard problems, unless they are structurally very similar.
    • dapperdrake 4 hours ago
      This really depends on your data's geometry.

      Just look at when SAS programmers are advised to use a merge or a format.

      Even the hash-join vs merge-join choice really depends on your data's cardinality (read: sizes), indices, etc., as sketched below.

      EDIT: Other comments also point out that even restricted special cases of join ordering are already NP-hard to optimize. You really want all the educated guesses you can get.
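
      A minimal sketch of that kind of educated guess, with invented cost constants and a made-up cost model (no particular database's):

          # Toy cost-based pick between hash join and merge join.
          # The constants and formulas are illustrative assumptions only.
          import math

          def pick_join(left_rows, right_rows, both_sorted_on_key):
              # Hash join: build on the smaller input, probe with the larger.
              hash_cost = 1.5 * min(left_rows, right_rows) + 1.0 * max(left_rows, right_rows)

              # Merge join: cheap if both inputs already arrive sorted
              # (e.g. index scans); otherwise pay an n log n sort per side.
              if both_sorted_on_key:
                  merge_cost = 1.0 * (left_rows + right_rows)
              else:
                  merge_cost = sum(n * math.log2(max(n, 2)) for n in (left_rows, right_rows))

              return "hash join" if hash_cost <= merge_cost else "merge join"

          print(pick_join(1_000, 10_000_000, both_sorted_on_key=False))    # hash join
          print(pick_join(1_000_000, 1_000_000, both_sorted_on_key=True))  # merge join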

    • Sesse__ 5 hours ago
      This post certainly has too much heuristic fiddling! Instead of a coherent framework, it takes a bunch of second-rate heuristics and tries to use… well, all of them. “Generate at most ten plans of this and one of that”? It also has pages and pages talking about the easier parts, for some reason (like maintaining maps, or that a Cartesian product and an inner join are basically the same thing), and things that are just wrong (like “prefer antijoins”, which is bad in most databases since they are less-reorderable than almost any other join; not that you usually have much of a choice in choosing the join type in the first place).

      There _are_ tons of corner cases that you need to address since there are some super-hard problems in there (in particular, robust cardinality estimation of join outputs is a problem so hard that most of academia barely wants to touch it, despite its huge importance), but it doesn't need to be this bad.
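
      For a feel of why cardinality estimation is the hard part: the textbook starting point is the uniformity-and-independence estimate below, and every violation of those assumptions (skew, correlation) compounds with each additional join. The numbers are invented for illustration.

          # Classic equi-join cardinality estimate:
          #   |R JOIN S| ~= |R| * |S| / max(NDV_R(key), NDV_S(key))
          # assuming uniform value distributions and containment of key sets.

          def estimate_join_rows(rows_r, rows_s, ndv_r, ndv_s):
              return rows_r * rows_s / max(ndv_r, ndv_s)

          # 1M orders join 10k customers on customer_id: estimate is 1M rows,
          # which is right when every order has exactly one matching customer.
          print(estimate_join_rows(1_000_000, 10_000, ndv_r=10_000, ndv_s=10_000))

          # With skew (say 40% of orders belonging to one customer) or with
          # correlated predicates, the true size of the next join on that key
          # can be far larger, and the error grows with every join stacked on top.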

      • willvarfar 4 hours ago
        Can join cardinality be tackled with cogroup and not expanding the rows until the final write?
    • akoboldfrying 4 hours ago
      Well, it's to be expected that heuristics are needed, since the join ordering subproblem is already NP-hard -- in fact, a special case of it, restricted to left-deep trees and with selectivity a function of only the two immediate child nodes in the join, is already NP-hard, since it amounts to the problem of finding a lowest-cost path in an edge-weighted graph that visits each vertex exactly once, which is basically the famous Traveling Salesperson Problem. (Vertices become tables, edge weights become selectivity scores; the only difficulty in the reduction is dealing with the fact that the TSP wants to include the cost of the edge "back to the beginning", while our problem doesn't -- but this can be dealt with by creating another copy of the vertices and a special start vertex; ask me for the details if you're interested.)
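
      A toy brute force over left-deep orders, where each step's cost depends only on the adjacent pair of tables in the order, makes the cheapest-path flavour concrete. All cardinalities and selectivities below are invented; real optimizers use dynamic programming (e.g. DPccp) or heuristics rather than trying all n! orders.

          # Brute-force left-deep join ordering with pairwise selectivities.
          # Cost = sum of intermediate result sizes; numbers are made up.
          from itertools import permutations

          tables = {"a": 1_000, "b": 10_000, "c": 100}                  # base row counts
          sel = {("a", "b"): 0.001, ("b", "c"): 0.01, ("a", "c"): 0.5}  # join selectivities

          def selectivity(x, y):
              return sel.get((x, y), sel.get((y, x), 1.0))  # 1.0 = cross product

          def left_deep_cost(order):
              rows, cost = tables[order[0]], 0
              for i, t in enumerate(order[1:], start=1):
                  # Only the adjacent pair in the order drives each factor,
                  # which is what gives the search its path-finding shape.
                  rows *= tables[t] * selectivity(order[i - 1], t)
                  cost += rows
              return cost

          best = min(permutations(tables), key=left_deep_cost)
          print(best, left_deep_cost(best))  # ('a', 'b', 'c') 20000
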
    • RaftPeople 4 hours ago
      > completely missing some elegant and profound beauty.

      Requires some dynamic SQL to construct, but the beauty is that you can use the SQL engine for this solution:

          select top 1 *
          from (select
                  t1.id, t2.id, ..., tn.id,
                  sum(t1.cost + t2.cost + ... + tn.cost) as total_cost
                from join_options t1
                cross join join_options t2
                ...
                cross join join_options tn
                group by t1.id, t2.id, ..., tn.id) t0
          order by t0.total_cost

  • Ibrahim26 5 hours ago
    Yeah bro, I'm a student, so let me learn something new.