4 points by Tomte, 9 hours ago | 1 comment
  • PaulHoule, 9 hours ago
    When you convert BM25 or other classical IR scores to a probability with this toolbox:

    https://scikit-learn.org/stable/modules/calibration.html

    you find a lot of unsatisfying things, such as never getting a relevance probability better than about p=0.7, and even that is very rare. There are many specific problems in IR where that kind of probability would be really helpful, such as combining results that came from different sources or returning a stream of new documents from a collection. But it was an early decision in TREC not to reward ranking functions for being good probability estimators, or even for being good at the top-1 or top-3 positions, but rather to reward them for still being enriched in relevant results when you go deep (like 1000 results deep) into the ranking.
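    A minimal sketch of what the comment describes, using the two calibration methods covered in the linked scikit-learn guide (Platt/sigmoid scaling and isotonic regression). The BM25-style scores and relevance labels below are synthetic, made up purely for illustration:

    ```python
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.isotonic import IsotonicRegression

    rng = np.random.default_rng(0)

    # Synthetic retrieval scores: relevant docs tend to score higher,
    # but the two populations overlap heavily (typical for BM25).
    scores_rel = rng.normal(12.0, 3.0, 500)    # scores of relevant docs
    scores_irr = rng.normal(7.0, 3.0, 2000)    # scores of irrelevant docs
    scores = np.concatenate([scores_rel, scores_irr])
    labels = np.concatenate([np.ones(500), np.zeros(2000)])

    # Platt scaling: fit a sigmoid p(relevant | score) on the 1-D score.
    platt = LogisticRegression().fit(scores.reshape(-1, 1), labels)
    p_platt = platt.predict_proba([[15.0]])[0, 1]

    # Isotonic regression: monotone, non-parametric score -> probability map.
    iso = IsotonicRegression(out_of_bounds="clip").fit(scores, labels)
    p_iso = iso.predict([15.0])[0]

    print(f"P(relevant | score=15): Platt={p_platt:.2f}, isotonic={p_iso:.2f}")
    ```

    With heavily overlapping score distributions like these, even a high raw score maps to a modest calibrated probability, which is the "never much better than p=0.7" effect the comment points to.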

    • softwaredoug, 9 hours ago
      Interesting! I did not know this about TREC's decision or the scikit-learn calibration module.