For each document, there is a secret hidden score "s" which is the "fundamental relevance according to the LLM". Then, when we sample (q, d1, d2) from the LLM, the LLM follows the statistical property that:
- The "fundamental hidden preference" is `pref = s_{d1} - s_{d2}`, usually ranging between -4 and 4.
- The LLM samples from a normal distribution around `pref` with stddev ~0.2, which is some "inner noise" the LLM experiences before coming to a judgement.
- The noisy preference passes through a sigmoid to get a sampled_score \in [0, 1].
- There is an additional 2% noise. i.e., 0.98 * sampled_score + 0.02 * random.random()
When we use Maximum Likelihood Estimation to find the most likely predicted "hidden scores" \hat{s} for each document, and then sample pairwise matrices according to `0.98 * sigmoid( \hat{s}_1 - \hat{s}_2 + N(0, 0.02) ) + Uniform(0.02)`, we get a pairwise matrix with virtually identical statistical properties to the observed pairwise matrices.
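For concreteness, a minimal sketch of that generative model in Python (the constants are the ones quoted above; the function names are just illustrative):

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample_pairwise_judgement(s_d1, s_d2, inner_noise_sd=0.2, uniform_mix=0.02):
    """One simulated LLM judgement for (q, d1, d2) under the model above."""
    pref = s_d1 - s_d2                                # hidden preference, roughly in [-4, 4]
    noisy_pref = random.gauss(pref, inner_noise_sd)   # "inner noise" before the judgement
    sampled_score = sigmoid(noisy_pref)               # squash into [0, 1]
    return (1 - uniform_mix) * sampled_score + uniform_mix * random.random()  # 2% uniform noise
```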
1) 0.02 * random.random() != N(0, 0.02)
2) "The LLM will sample a normal distribution": this only depends on your c parameter; the absolute scale doesn't matter in either Bradley-Terry or Elo. So quoting +-4 and claiming the LLM reasons on a standard-normal scale is ridiculous.
3) > then we get a pairwise matrix with virtually identical statistical properties to the observed pairwise matrices. >>> Then did you ask yourselves: if the sampled pairwise matrix is "statistically identical" to the observed pairwise matrix, why even bother? You can simply use the observed pairwise matrix...
But we did need to work on numerical stability!
I have our calculations here: https://hackmd.io/@-Gjw1zWMSH6lMPRlziQFEw/B15B4Rsleg
tl;dr: Wikipedia iterates on <e^elo>, but that can go to zero or infinity. Iterating on <elo> stays between -4 and 4 in all of our observed pairwise matrices, so it's very well-bounded.
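To make the point concrete, here is a generic sketch (not the exact update from the hackmd note) of the standard MM/Zermelo fixed-point iteration for Bradley-Terry, with the parameters carried as log-strengths ("elos") so the stored values stay bounded even when the raw strengths would over- or underflow:

```python
import math

def fit_elos(wins, n_iters=200):
    """
    wins[i][j] = number of times document i beat document j.
    Assumes every document has at least one win (the all-wins/all-losses
    degenerate case breaks plain MLE anyway, as noted elsewhere in the thread).
    Returns natural-log "elos", mean-centred.
    """
    n = len(wins)
    elo = [0.0] * n
    for _ in range(n_iters):
        p = [math.exp(e) for e in elo]                  # strengths e^elo, used only inside the update
        new_elo = []
        for i in range(n):
            w_i = sum(wins[i])                          # total wins of i
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            new_elo.append(math.log(w_i / denom))       # store the log, not the raw strength
        mean = sum(new_elo) / n
        elo = [e - mean for e in new_elo]               # fix the gauge: mean elo = 0
    return elo
```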
Code: https://github.com/Neywiny/merge-sort Conference/abstract presentation: https://www.spiedigitallibrary.org/conference-proceedings-of...
It was actually done to counter Elo-based approaches, so there are some references in the readme on how to prove who's better. I haven't run this code in 5 years and haven't developed on it in maybe 6, but I can probably fix any issues that come up. My co-author looks to have diverged a bit; I haven't checked out his code: https://github.com/FrankWSamuelson/merge-sort . There may also be a fork by the FDA itself, not sure. This work was done for the FDA's medical imaging device evaluation division.
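For anyone who hasn't seen the trick, the core idea (a generic sketch, not code from that repo) is to plug the 2AFC judgement in as the comparator of an ordinary merge sort, so a full ranking falls out of O(n log n) forced choices:

```python
from functools import cmp_to_key

def rank_items(items, prefer):
    """
    items: the things to rank.
    prefer(a, b): returns whichever of a, b the judge (human or LLM) picks as better.
    Returns items sorted best-first using O(n log n) pairwise judgements.
    """
    def cmp(a, b):
        return -1 if prefer(a, b) is a else 1
    # Python's built-in sort (Timsort) is merge-sort based, so this is
    # effectively a merge sort driven by forced-choice comparisons.
    return sorted(items, key=cmp_to_key(cmp))
```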
It's definitely the best if not only way to get extremely high signal, and a score assignment that actually converges the more you sample.
In terms of the "F" in 2AFC, we actually have this amusing snippet from our prompt:
> Do NOT output a score of 0.0, ensure to focus on which document is superior, and provide a negative or positive float between -1.0 and 1.0.
It's not cheap and it's not fast, but it definitely works pretty well!
Kind of a lot of work compared to just dumping the text of 2 profiles into a context window along with a vague description of what I want, and having the LLM make the binary judgment.
When we were exploring the mathematical foundations, we considered ELO scoring against a "Universal Corpus" based on the natural entropy of human language (Obviously that's intractable, but sometimes this term cancels out like in the DPO proof).
But eventually we figured out a method using cross-query comparisons to assign an "ELO bias" to all document ELOs within a given query's candidate list. This normalizes things correctly: when a candidate list is all bad, the ELOs shift low, and when the candidate list is all good, the ELOs shift high, even when the relative ELOs within the list are identical.
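I don't know the authors' exact construction, but one way to make an "ELO bias" concrete is: fit Elo scores within each query's candidate list as usual (mean-zero per list), then sample some cross-query comparisons ("is d1 more relevant to q1 than d2 is to q2?") and fit a single offset per query against them. A rough sketch, with all names hypothetical:

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fit_query_biases(cross_pairs, elo, n_steps=20000, lr=0.05):
    """
    cross_pairs: list of ((qa, da), (qb, db), y), where y = 1 if the judge said
                 da is more relevant to qa than db is to qb, else 0.
    elo[(q, d)]: within-query Elo, already fit per query (mean-zero in each list).
    Returns bias[q]; the calibrated score is elo[(q, d)] + bias[q], so an all-bad
    candidate list gets pushed down and an all-good list gets pushed up.
    """
    bias = {q: 0.0 for (a, b, _) in cross_pairs for q in (a[0], b[0])}
    for _ in range(n_steps):
        (qa, da), (qb, db), y = random.choice(cross_pairs)
        p = sigmoid((elo[(qa, da)] + bias[qa]) - (elo[(qb, db)] + bias[qb]))
        g = p - y                        # gradient of the logistic loss w.r.t. the score gap
        bias[qa] -= lr * g
        bias[qb] += lr * g
    return bias
```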
There are some conceptual gaps, and this sentence is misleading in general. First, it implies that Bradley-Terry is some sort of Elo variant, which is not true. The Elo rating was introduced nearly 10 years later, in a completely different domain.
They are two different ranking systems: Bradley-Terry uses a ratio-based score function, while Elo uses a logistic one. The scales of the scores are completely different, as is their sensitivity to score differences.
Possibly Bradley-Terry is preferred by the authors due to its simpler likelihood evaluation, and because the update doesn't depend on the order of the pairwise evaluations.
There are also variants of the Elo rating that use MLE (optimized Elo), and more recently Bayesian Elo. For post-hoc, time-invariant scores there is the randomized Elo rating, and so on.
People like Elo ratings because they are simple to understand. Most of the time they forget that they were developed specifically for chess tournaments. All the variants above, and dozens more, try to improve (fix) some aspect of the Elo rating, because in their application there is no 100% clear determination of a winner, the update scale parameter is too small or too large, matches are played simultaneously, different numbers of matches are played, and so on.
Also, say one document is always preferred by all LLMs, so it has only wins: then MLE will give a flat marginal likelihood for that document, and the update parameter (c) will go to infinity.
LambdaMART's approach seems better in that respect.
https://medium.com/@nikhilbd/pointwise-vs-pairwise-vs-listwi...
I like the pairwise approach, but in the field I'm interested in there can be a lot of relevance at the document level (we historically use scoring based on TF-IDF), yet we tend to get a corpus of documents that then needs involved human analysis to pull out the relevant sections. It seems that paragraph-level vectors are probably at the right conceptual level for refinement.
Ultimately I guess, what is considered a document is somewhat arbitrary. But I wondered if you'd looked at - or if someone here knows about - MLs for retrieval that consider documents at a mix of conceptual levels to improve retrieval. So, pairwise paragraph-level after a broader retrieval would be a simple example.
I guess for looking at CV/resumes that might relate to finding someone who was gardener at Google and then later used ML for graphic design, vs someone who did ML at Google ... which might be a similar document vector (poor example, but you get the picture).
Currently I'm seeing document level references to source material, snippets based on keywords, but not paragraph level referencing as you'd have for legal decisions.
It's such a great and simple algorithm. I feel like it deserves to be more widely known.
I used it at Dyson to evaluate really subjective things like how straight a tress of hair is - pretty much impossible to say if you just look at a photo, but you can ask a bunch of people to compare two photos and say which looks straighter, then you can get an objective ranking.
In our training pipeline, we had to convert the fixed-point iteration to be on <elo> directly (rather than <e^elo>) for numerical stability. I have a post on that here: https://hackmd.io/x3_EkXGKRdeq-rNHo_RpZA
Bradley-Terry also very cleanly turns into a loss function that you can do gradient descent on, which will cause your model to efficiently learn Elo scores! Our calculations are at: https://hackmd.io/eOwlF7O_Q1K4hj7WZcYFiw
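For reference, a minimal PyTorch-style version of such a loss (a generic Bradley-Terry / logistic pairwise loss on the reranker's scalar outputs, not the exact code from the note):

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(score_winner: torch.Tensor, score_loser: torch.Tensor) -> torch.Tensor:
    """
    score_winner / score_loser: model outputs s(q, d) for the preferred and
    non-preferred document of each pair. Minimizing -log sigmoid(diff) pushes
    the score gap toward the (log-scale) Elo gap.
    """
    diff = score_winner - score_loser
    return F.softplus(-diff).mean()   # softplus(-x) == -log(sigmoid(x)), numerically stable
```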
So don't say it "E.L.O." (unless you're talking about the band, I guess), say "ee-low"
The link in the article to the full blog explaining rerankers is 404ing for me.
A question for you, as an expert in search ranking, about o3 and source-quality thresholds when performing web search: could we implement an ELO-style cutoff where the system defaults to “I don’t know” rather than citing low-ranked sources?
Currently o3’s main weakness is mixing high-quality sources with poor ones when it uses web search within the same response. The answer sounds authoritative throughout, but parts are backed by unreliable sources. This makes it harder to trust even the well-sourced portions (e.g. believing the US election is next year, which was not a hallucination but a poorly date-formatted source it used). It also makes the response a lot slower.
Would a hard quality threshold be better than the current approach of seamlessly blending good and bad sources?
My questions: what languages do your models currently support? Did you perform multilingual benchmarks? Couldn't find an answer on the website
For a slightly different take using a similar intuition, see our paper [at ACL 2024](https://arxiv.org/abs/2402.14860) on ranking LLMs which may be of interest.
Our HuggingFace space has some examples: https://huggingface.co/spaces/ibm/llm-rank-themselves
We found that MSE after Elo adjustment worked equally well. And MSE lets you shuffle (q, d) pairs across the dataset, which has good statistical properties (versus contrastive, which makes you sample the same query many times within a single minibatch).
In this case "InfoNCE" isn't applicable, because the reranker's output is a scalar, not a vector. So that's why we checked both Bradley-Terry and MSE.
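A sketch of what that buys you in the training loop (names hypothetical, not our exact pipeline): once each (q, d) pair has a precomputed, bias-adjusted Elo target, training is plain scalar regression, so the DataLoader can shuffle pairs freely across queries:

```python
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

def train_epoch(model, pair_features, elo_targets, optimizer, batch_size=256):
    # pair_features[i] encodes a single (q, d) pair; elo_targets[i] is its Elo label.
    # Nothing in the batch needs to be grouped by query, unlike a contrastive loss.
    loader = DataLoader(TensorDataset(pair_features, elo_targets),
                        batch_size=batch_size, shuffle=True)
    for x, y in loader:
        pred = model(x).squeeze(-1)   # one scalar score per (q, d) pair
        loss = F.mse_loss(pred, y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```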
I like that it works with `sentence_transformers`
Edit: ok, done. Submitted title was "Show HN: Improving RAG with chess Elo scores".
We did experiment with a Bradley-Terry loss function (https://hackmd.io/eOwlF7O_Q1K4hj7WZcYFiw), but we found that even better was to calculate Elo scores, do cross-query bias adjustment, and then MSE loss to predict the Elo score itself.
-> The fundamental presumption is the same Thurstone model
The Thurstone model is similar, and as you said, it assumes a normal (as opposed to logistic) distribution, using a probit link function. It predates both models, and due to computational constraints you can call Bradley-Terry and the Elo rating computationally convenient approximations of the Thurstone model.
-> We did experiment with a Bradley-Terry loss function (https://hackmd.io/eOwlF7O_Q1K4hj7WZcYFiw)
The math is correct, thanks for sharing. Indeed, if you do it with incremental updating, you lose differentiability, since the next winning probability depends on the previous updates. Call it what you want, but note that this is not truly an Elo rating, which leads to misunderstanding. It is Bradley-Terry, given that you do batch updates, which you then take extra steps to connect to an Elo score, as shown in the link.
Lastly, the normal and logistic distributions will lead to log(0) in evaluations, which results in an infinite loss. As I can see from your comment above, you try to add Uniform(0.02) as an ad-hoc fix. A more elegant fix is to use a heavy-tailed distribution such as the Cauchy.
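For what it's worth, a minimal sketch of the Cauchy suggestion: swap the logistic link for the Cauchy CDF, whose heavy tails keep the negative log-likelihood from exploding at extreme score differences:

```python
import math

def win_prob_logistic(diff):
    return 1.0 / (1.0 + math.exp(-diff))

def win_prob_cauchy(diff):
    # Cauchy CDF: 0.5 + arctan(diff)/pi. Tails decay polynomially, not exponentially.
    return 0.5 + math.atan(diff) / math.pi

# At diff = -20 (a "sure loss" that the model got wrong):
#   -log(win_prob_logistic(-20)) ~= 20.0   (loss grows linearly in |diff|)
#   -log(win_prob_cauchy(-20))   ~= 4.1    (loss grows only logarithmically)
```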