For each document, there is a secret hidden score "s" which is the "fundamental relevance according to the LLM". Then, when we sample (q, d1, d2) from the LLM, the LLM follows the statistical property that:
- The "fundamental hidden preference" is `pref = s_{d1} - s_{d2}`, usually ranging between -4 and 4.
- The LLM samples from a normal distribution around `pref` with stddev ~0.2, which is some "inner noise" the LLM experiences before coming to a judgement.
- The noisy preference passes through a sigmoid to get a sampled_score \in [0, 1].
- There is an additional 2% noise. i.e., 0.98 * sampled_score + 0.02 * random.random()
When we use Maximum Likelihood Estimation to find the most likely predicted "hidden scores" \hat{s} for each document, and then sample pairwise matrices according to `0.98 * sigmoid( \hat{s}_1 - \hat{s}_2 + N(0, 0.02) ) + Uniform(0.02)`, we get a pairwise matrix with virtually identical statistical properties to the observed pairwise matrices.
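For concreteness, a minimal sketch of that generative model in Python (the constants are the ones quoted above; the function names are just illustrative):

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample_pairwise_judgement(s_d1, s_d2, inner_noise_sd=0.2, uniform_mix=0.02):
    """One simulated LLM judgement for (q, d1, d2) under the model above."""
    pref = s_d1 - s_d2                                # hidden preference, roughly in [-4, 4]
    noisy_pref = random.gauss(pref, inner_noise_sd)   # "inner noise" before the judgement
    sampled_score = sigmoid(noisy_pref)               # squash into [0, 1]
    return (1 - uniform_mix) * sampled_score + uniform_mix * random.random()  # 2% uniform noise
```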
1) 0.02 * random.random() != N(0, 0.02)
2) "The LLM will sample a normal distribution": this only depends on your c parameter; the absolute scale doesn't matter in either Bradley-Terry or Elo. So quoting +-4 and claiming the LLM reasons on a standard-normal scale is ridiculous.
3) > then we get a pairwise matrix with virtually identical statistical properties to the observed pairwise matrices. >>> Then did you ask yourselves: if the sampled pairwise matrix is "statistically identical" to the observed pairwise matrix, why even bother? You can simply use the observed pairwise matrix...
But we did need to work on numerical stability!
I have our calculations here: https://hackmd.io/@-Gjw1zWMSH6lMPRlziQFEw/B15B4Rsleg
tl;dr: Wikipedia iterates on <e^elo>, but that can go to zero or infinity. Iterating on <elo> stays between -4 and 4 in all of our observed pairwise matrices, so it's very well-bounded.
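To make the point concrete, here is a generic sketch (not the exact update from the hackmd note) of the standard MM/Zermelo fixed-point iteration for Bradley-Terry, with the parameters carried as log-strengths ("elos") so the stored values stay bounded even when the raw strengths would over- or underflow:

```python
import math

def fit_elos(wins, n_iters=200):
    """
    wins[i][j] = number of times document i beat document j.
    Assumes every document has at least one win (the all-wins/all-losses
    degenerate case breaks plain MLE anyway, as noted elsewhere in the thread).
    Returns natural-log "elos", mean-centred.
    """
    n = len(wins)
    elo = [0.0] * n
    for _ in range(n_iters):
        p = [math.exp(e) for e in elo]                  # strengths e^elo, used only inside the update
        new_elo = []
        for i in range(n):
            w_i = sum(wins[i])                          # total wins of i
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            new_elo.append(math.log(w_i / denom))       # store the log, not the raw strength
        mean = sum(new_elo) / n
        elo = [e - mean for e in new_elo]               # fix the gauge: mean elo = 0
    return elo
```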
Code: https://github.com/Neywiny/merge-sort Conference/abstract presentation: https://www.spiedigitallibrary.org/conference-proceedings-of...
It was actually done to counter Elo-based approaches, so there are some references in the readme on how to prove who's better. I haven't run this code in 5 years and haven't developed on it in maybe 6, but I can probably fix any issues that come up. My co-author looks to have diverged a bit; I haven't checked out his code: https://github.com/FrankWSamuelson/merge-sort . There may also be a fork by the FDA itself, not sure. This work was done for the FDA's medical imaging device evaluation division.
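For anyone who hasn't seen the trick, the core idea (a generic sketch, not code from that repo) is to plug the 2AFC judgement in as the comparator of an ordinary merge sort, so a full ranking falls out of O(n log n) forced choices:

```python
from functools import cmp_to_key

def rank_items(items, prefer):
    """
    items: the things to rank.
    prefer(a, b): returns whichever of a, b the judge (human or LLM) picks as better.
    Returns items sorted best-first using O(n log n) pairwise judgements.
    """
    def cmp(a, b):
        return -1 if prefer(a, b) is a else 1
    # Python's built-in sort (Timsort) is merge-sort based, so this is
    # effectively a merge sort driven by forced-choice comparisons.
    return sorted(items, key=cmp_to_key(cmp))
```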
It's definitely the best if not only way to get extremely high signal, and a score assignment that actually converges the more you sample.
In terms of the "F" in 2AFC, we actually have this amusing snippet from our prompt:
> Do NOT output a score of 0.0, ensure to focus on which document is superior, and provide a negative or positive float between -1.0 and 1.0.
It's not cheap and it's not fast, but it definitely works pretty well!
Kind of a lot of work compared to just dumping the text of 2 profiles into a context window along with a vague description of what I want, and having the LLM make the binary judgment.
When we were exploring the mathematical foundations, we considered ELO scoring against a "Universal Corpus" based on the natural entropy of human language (Obviously that's intractable, but sometimes this term cancels out like in the DPO proof).
But eventually we figured out a method using cross-query comparisons to assign an "ELO bias" to all document ELOs within a given query's candidate list. This normalizes things correctly: when a candidate list is all bad, the ELOs shift low, and when the candidate list is all good, the ELOs shift high, even when the relative ELOs within the list are identical.
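I don't know the authors' exact construction, but one way to make an "ELO bias" concrete is: fit Elo scores within each query's candidate list as usual (mean-zero per list), then sample some cross-query comparisons ("is d1 more relevant to q1 than d2 is to q2?") and fit a single offset per query against them. A rough sketch, with all names hypothetical:

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fit_query_biases(cross_pairs, elo, n_steps=20000, lr=0.05):
    """
    cross_pairs: list of ((qa, da), (qb, db), y), where y = 1 if the judge said
                 da is more relevant to qa than db is to qb, else 0.
    elo[(q, d)]: within-query Elo, already fit per query (mean-zero in each list).
    Returns bias[q]; the calibrated score is elo[(q, d)] + bias[q], so an all-bad
    candidate list gets pushed down and an all-good list gets pushed up.
    """
    bias = {q: 0.0 for (a, b, _) in cross_pairs for q in (a[0], b[0])}
    for _ in range(n_steps):
        (qa, da), (qb, db), y = random.choice(cross_pairs)
        p = sigmoid((elo[(qa, da)] + bias[qa]) - (elo[(qb, db)] + bias[qb]))
        g = p - y                        # gradient of the logistic loss w.r.t. the score gap
        bias[qa] -= lr * g
        bias[qb] += lr * g
    return bias
```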
There are some conceptual gaps, and this sentence is misleading in general. First, it implies that Bradley-Terry is some sort of Elo variant, which is not true. The Elo rating was introduced nearly 10 years later, in a completely different domain.
They are two different ranking systems: Bradley-Terry uses a ratio-based score function, while Elo uses a logistic one. The scales of the scores are completely different, as is their sensitivity to score differences.
Possibly Bradley-Terry is preferred by the authors due to its simpler likelihood evaluation, and because the update doesn't depend on the order of the pairwise evaluations.
There are also variants of the Elo rating that use MLE (optimized Elo), and more recently Bayesian Elo. For post-hoc, time-invariant scores there is the randomized Elo rating, and so on.
People like Elo ratings because they are simple to understand. Most of the time they forget that they were developed specifically for chess tournaments. All the variants above, and dozens more, try to improve (fix) some aspect of the Elo rating, because in their application there is no 100% clear determination of a winner, the update scale parameter is too small or too large, matches are played simultaneously, different numbers of matches are played, and so on.
Also, say one document is always preferred by all LLMs, so it has only wins: then MLE will give a flat marginal likelihood for that document, and the update parameter (c) will go to infinity.
LambdaMART's approach seems better in that respect.
https://medium.com/@nikhilbd/pointwise-vs-pairwise-vs-listwi...
I like the pairwise approach, but in the field I'm interested in there can be a lot of relevance at the document level (we historically use scoring based on TF-IDF), yet we tend to get a corpus of documents that then needs involved human analysis to pull out the relevant sections. It seems that paragraph-level vectors are probably at the right conceptual level for refinement.
Ultimately I guess, what is considered a document is somewhat arbitrary. But I wondered if you'd looked at - or if someone here knows about - MLs for retrieval that consider documents at a mix of conceptual levels to improve retrieval. So, pairwise paragraph-level after a broader retrieval would be a simple example.
I guess for looking at CV/resumes that might relate to finding someone who was gardener at Google and then later used ML for graphic design, vs someone who did ML at Google ... which might be a similar document vector (poor example, but you get the picture).
Currently I'm seeing document level references to source material, snippets based on keywords, but not paragraph level referencing as you'd have for legal decisions.
It's such a great and simple algorithm. I feel like it deserves to be more widely known.
I used it at Dyson to evaluate really subjective things like how straight a tress of hair is - pretty much impossible to say if you just look at a photo, but you can ask a bunch of people to compare two photos and say which looks straighter, then you can get an objective ranking.
In our training pipeline, we had to convert the fixed-point iteration to be on <elo> directly (rather than <e^elo>) for numerical stability. I have a post on that here: https://hackmd.io/x3_EkXGKRdeq-rNHo_RpZA
Bradley-Terry also very cleanly turns into a loss function that you can do gradient descent on, which will cause your model to efficiently learn Elo scores! Our calculations are at: https://hackmd.io/eOwlF7O_Q1K4hj7WZcYFiw
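For reference, a minimal PyTorch-style version of such a loss (a generic Bradley-Terry / logistic pairwise loss on the reranker's scalar outputs, not the exact code from the note):

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(score_winner: torch.Tensor, score_loser: torch.Tensor) -> torch.Tensor:
    """
    score_winner / score_loser: model outputs s(q, d) for the preferred and
    non-preferred document of each pair. Minimizing -log sigmoid(diff) pushes
    the score gap toward the (log-scale) Elo gap.
    """
    diff = score_winner - score_loser
    return F.softplus(-diff).mean()   # softplus(-x) == -log(sigmoid(x)), numerically stable
```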
So don't say it "E.L.O." (unless you're talking about the band, I guess), say "ee-low"
The link in the article to the full blog explaining rerankers is 404ing for me.
A question for you, as an expert in search ranking, about o3 and source-quality thresholds when performing web search: could we implement an ELO-style cutoff where the system defaults to “I don’t know” rather than citing low-ranked sources?
Currently o3’s main weakness is mixing high-quality sources with poor ones when it uses web search within the same response. The answer sounds authoritative throughout, but parts are backed by unreliable sources. This makes it harder to trust even the well-sourced portions (e.g. believing the US election is next year, which was not a hallucination but a poorly date-formatted source it used). It also makes the response a lot slower.
Would a hard quality threshold be better than the current approach of seamlessly blending good and bad sources?
My questions: what languages do your models currently support? Did you perform multilingual benchmarks? Couldn't find an answer on the website
For a slightly different take using a similar intuition, see our paper [at ACL 2024](https://arxiv.org/abs/2402.14860) on ranking LLMs which may be of interest.
Our HuggingFace space has some examples: https://huggingface.co/spaces/ibm/llm-rank-themselves
We found that MSE after Elo adjustment worked equally well. And MSE lets you shuffle (q, d) pairs across the dataset, which has good statistical properties (versus contrastive, which makes you sample the same query many times within a single minibatch).
In this case "InfoNCE" isn't applicable, because the reranker's output is a scalar, not a vector. So that's why we checked both Bradley-Terry and MSE.
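A sketch of what that buys you in the training loop (names hypothetical, not our exact pipeline): once each (q, d) pair has a precomputed, bias-adjusted Elo target, training is plain scalar regression, so the DataLoader can shuffle pairs freely across queries:

```python
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

def train_epoch(model, pair_features, elo_targets, optimizer, batch_size=256):
    # pair_features[i] encodes a single (q, d) pair; elo_targets[i] is its Elo label.
    # Nothing in the batch needs to be grouped by query, unlike a contrastive loss.
    loader = DataLoader(TensorDataset(pair_features, elo_targets),
                        batch_size=batch_size, shuffle=True)
    for x, y in loader:
        pred = model(x).squeeze(-1)   # one scalar score per (q, d) pair
        loss = F.mse_loss(pred, y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```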
I like that it works with `sentence_transformers`
Edit: ok, done. Submitted title was "Show HN: Improving RAG with chess Elo scores".
We did experiment with a Bradley-Terry loss function (https://hackmd.io/eOwlF7O_Q1K4hj7WZcYFiw), but we found that even better was to calculate Elo scores, do cross-query bias adjustment, and then MSE loss to predict the Elo score itself.
-> The fundamental presumption is the same Thurstone model
The Thurstone model is similar, and as you said, it assumes a normal (as opposed to logistic) distribution, using a probit link function. It predates both models, and due to computational constraints you can call Bradley-Terry and the Elo rating computationally convenient approximations of the Thurstone model.
-> We did experiment with a Bradley-Terry loss function (https://hackmd.io/eOwlF7O_Q1K4hj7WZcYFiw)
The math is correct, thanks for sharing. Indeed, if you do it with incremental updating, you lose differentiability, since the next winning probability depends on the previous updates. Call it what you want, but note that this is not truly an Elo rating, which leads to misunderstanding. It is Bradley-Terry, given that you do batch updates, which you then take extra steps to connect to an Elo score, as shown in the link.
Lastly, the normal and logistic distributions will lead to log(0) in evaluations, which results in an infinite loss. As I can see from your comment above, you try to add Uniform(0.02) as an ad-hoc fix. A more elegant fix is to use a heavy-tailed distribution such as the Cauchy.
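For what it's worth, a minimal sketch of the Cauchy suggestion: swap the logistic link for the Cauchy CDF, whose heavy tails keep the negative log-likelihood from exploding at extreme score differences:

```python
import math

def win_prob_logistic(diff):
    return 1.0 / (1.0 + math.exp(-diff))

def win_prob_cauchy(diff):
    # Cauchy CDF: 0.5 + arctan(diff)/pi. Tails decay polynomially, not exponentially.
    return 0.5 + math.atan(diff) / math.pi

# At diff = -20 (a "sure loss" that the model got wrong):
#   -log(win_prob_logistic(-20)) ~= 20.0   (loss grows linearly in |diff|)
#   -log(win_prob_cauchy(-20))   ~= 4.1    (loss grows only logarithmically)
```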