52 points by HermitX 11 hours ago | 3 comments
  • j-pb 5 hours ago
    Whenever I read join optimisation articles for SQL-based systems it feels... off.

    There is too much heuristic fiddling involved, and way too many niche algorithms that get cobbled together with an optimiser.

    As if we're missing the theory to actually solve the stuff, so we're instead hobbling along by covering as many corner cases as we can, completely missing some elegant and profound beauty.

    • mamcx 2 hours ago
      Because a SQL query encompasses an arbitrary combination of MANY different sub-programs that are expected to be auto-solved.

      Attempt to implement them manually and you'll see how hard it is.

      PLUS, you not only need to account for the general solution, but also for what could be best given the current data set.

      And you can't compile the plan statically (dynamically, sure).

      And it should work interactively, so hopefully planning finishes faster than running the actual query.

      P.S.: Joins are normally the focus, but other constructs are also challenging. For example, just deciding whether and which indexes to pick can be hard when you have dozens of predicates.

      And yes, your optimizer should (eventually!) survive being fed hundreds of joins, predicates, aggregates, sorts and arbitrary expressions.

      * I worked on the optimizer of a database. EVERYTHING is tricky!

    • sethev 2 hours ago
      There have been hints in the research that this might be the case, but so far they haven't really beaten the heuristic approach in practice (outside of special cases).

      For example, there's a class of join algorithms called 'worst-case optimal' - which is not a great name, but basically means that these algorithms run in time proportional to the worst-case output size. These algorithms ditch the two-at-a-time approach that databases typically use and join multiple relations at the same time.

      One example is the leapfrog trie join which was part of the LogicBlox database.
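      To make that concrete, here is a toy attribute-at-a-time sketch of the idea on the classic triangle query. It is only an illustration using hash sets with made-up relations; the actual leapfrog triejoin walks sorted trie iterators and "leapfrogs" between them.

          # Toy sketch of the attribute-at-a-time idea behind worst-case
          # optimal joins, on the triangle query
          #   Q(a, b, c) = R(a, b) JOIN S(b, c) JOIN T(a, c).
          # Not LogicBlox's implementation: the real leapfrog triejoin uses
          # sorted iterators instead of hash sets.
          from collections import defaultdict

          def triangle_join(R, S, T):
              """R, S, T are lists of 2-tuples: R(a, b), S(b, c), T(a, c)."""
              # Index each relation by its first attribute.
              R_by_a, S_by_b, T_by_a = defaultdict(set), defaultdict(set), defaultdict(set)
              for a, b in R:
                  R_by_a[a].add(b)
              for b, c in S:
                  S_by_b[b].add(c)
              for a, c in T:
                  T_by_a[a].add(c)

              out = []
              # Bind one attribute at a time, intersecting every relation that
              # mentions it, instead of materialising a pairwise join first.
              for a in R_by_a.keys() & T_by_a.keys():   # a occurs in R and T
                  for b in R_by_a[a] & S_by_b.keys():   # b occurs in R and S
                      for c in S_by_b[b] & T_by_a[a]:   # c occurs in S and T
                          out.append((a, b, c))
              return out

          # Never builds the (possibly huge) intermediate R JOIN S.
          print(triangle_join(R=[(1, 2)], S=[(2, 3)], T=[(1, 3)]))  # [(1, 2, 3)]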

    • jasonwatkinspdx 5 hours ago
      Optimal join ordering is NP-hard.
      • nine_k 3 hours ago
        It must be equivalent to the knapsack problem, for which many faster close-to-optimal algorithms exist. Am I missing something?
        • layer8 3 hours ago
          It’s not equivalent. Knapsack is weakly NP-hard, while optimal join order is strongly NP-hard. Also, algorithms that only approximate an optimal solution don’t generally carry over between NP-hard problems, unless they are structurally very similar.
    • dapperdrake 4 hours ago
      This really depends on your data's geometry.

      Just look at when SAS programmers are advised to use a merge or a format.

      Even the hash-join vs merge-join choice really depends on your data's cardinality (read: sizes), indices, etc., as sketched below.

      EDIT: Other comments also point out that even restricted special cases of join ordering are already NP-hard to optimize. You really want all the educated guesses you can get.
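
      A minimal sketch of that kind of educated guess, with invented cost constants and a made-up cost model (no particular database's):

          # Toy cost-based pick between hash join and merge join.
          # The constants and formulas are illustrative assumptions only.
          import math

          def pick_join(left_rows, right_rows, both_sorted_on_key):
              # Hash join: build on the smaller input, probe with the larger.
              hash_cost = 1.5 * min(left_rows, right_rows) + 1.0 * max(left_rows, right_rows)

              # Merge join: cheap if both inputs already arrive sorted
              # (e.g. index scans); otherwise pay an n log n sort per side.
              if both_sorted_on_key:
                  merge_cost = 1.0 * (left_rows + right_rows)
              else:
                  merge_cost = sum(n * math.log2(max(n, 2)) for n in (left_rows, right_rows))

              return "hash join" if hash_cost <= merge_cost else "merge join"

          print(pick_join(1_000, 10_000_000, both_sorted_on_key=False))    # hash join
          print(pick_join(1_000_000, 1_000_000, both_sorted_on_key=True))  # merge join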

    • Sesse__ 5 hours ago
      This post certainly has too much heuristic fiddling! Instead of a coherent framework, it takes a bunch of second-rate heuristics and tries to use… well, all of them. “Generate at most ten plans of this and one of that”? It also has pages and pages talking about the easier parts, for some reason (like maintaining maps, or that a Cartesian product and an inner join are basically the same thing), and things that are just wrong (like “prefer antijoins”, which is bad in most databases since they are less-reorderable than almost any other join; not that you usually have much of a choice in choosing the join type in the first place).

      There _are_ tons of corner cases that you need to address since there are some super-hard problems in there (in particular, robust cardinality estimation of join outputs is a problem so hard that most of academia barely wants to touch it, despite its huge importance), but it doesn't need to be this bad.
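
      For a feel of why cardinality estimation is the hard part: the textbook starting point is the uniformity-and-independence estimate below, and every violation of those assumptions (skew, correlation) compounds with each additional join. The numbers are invented for illustration.

          # Classic equi-join cardinality estimate:
          #   |R JOIN S| ~= |R| * |S| / max(NDV_R(key), NDV_S(key))
          # assuming uniform value distributions and containment of key sets.

          def estimate_join_rows(rows_r, rows_s, ndv_r, ndv_s):
              return rows_r * rows_s / max(ndv_r, ndv_s)

          # 1M orders join 10k customers on customer_id: estimate is 1M rows,
          # which is right when every order has exactly one matching customer.
          print(estimate_join_rows(1_000_000, 10_000, ndv_r=10_000, ndv_s=10_000))

          # With skew (say 40% of orders belonging to one customer) or with
          # correlated predicates, the true size of the next join on that key
          # can be far larger, and the error grows with every join stacked on top.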

      • willvarfar 4 hours ago
        Can join cardinality be tackled with cogroup and not expanding the rows until the final write?
    • akoboldfrying 4 hours ago
      Well, it's to be expected that heuristics are needed, since the join ordering subproblem is already NP-hard -- in fact, a special case of it, restricted to left-deep trees and with selectivity a function of only the two immediate child nodes in the join, is already NP-hard, since it amounts to the problem of finding a lowest-cost path in an edge-weighted graph that visits each vertex exactly once, which is basically the famous Traveling Salesperson Problem. (Vertices become tables, edge weights become selectivity scores; the only difficulty in the reduction is dealing with the fact that the TSP wants to include the cost of the edge "back to the beginning", while our problem doesn't -- but this can be dealt with by creating another copy of the vertices and a special start vertex; ask me for the details if you're interested.)
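
      A toy brute force over left-deep orders, where each step's cost depends only on the adjacent pair of tables in the order, makes the cheapest-path flavour concrete. All cardinalities and selectivities below are invented; real optimizers use dynamic programming (e.g. DPccp) or heuristics rather than trying all n! orders.

          # Brute-force left-deep join ordering with pairwise selectivities.
          # Cost = sum of intermediate result sizes; numbers are made up.
          from itertools import permutations

          tables = {"a": 1_000, "b": 10_000, "c": 100}                  # base row counts
          sel = {("a", "b"): 0.001, ("b", "c"): 0.01, ("a", "c"): 0.5}  # join selectivities

          def selectivity(x, y):
              return sel.get((x, y), sel.get((y, x), 1.0))  # 1.0 = cross product

          def left_deep_cost(order):
              rows, cost = tables[order[0]], 0
              for i, t in enumerate(order[1:], start=1):
                  # Only the adjacent pair in the order drives each factor,
                  # which is what gives the search its path-finding shape.
                  rows *= tables[t] * selectivity(order[i - 1], t)
                  cost += rows
              return cost

          best = min(permutations(tables), key=left_deep_cost)
          print(best, left_deep_cost(best))  # ('a', 'b', 'c') 20000
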
    • RaftPeople 4 hours ago
      > completely missing some elegant and profound beauty.

      Requires some dynamic SQL to construct, but the beauty is that you can use the SQL engine for this solution:

          select top 1 *
          from (select
                  t1.id, t2.id, ..., tn.id,
                  sum(t1.cost + t2.cost + ... + tn.cost) as total_cost
                from join_options t1
                cross join join_options t2
                ...
                cross join join_options tn
                group by t1.id, t2.id, ..., tn.id) t0
          order by t0.total_cost

  • Ibrahim26 5 hours ago
    Yeah bro, I'm a student, so let me learn something new.