We noticed that most people choose embedding models based on popular benchmark scores. However, widely used benchmarks like MTEB are often overly clean, generic, and in many cases have been memorized by embedding models during training. To address this, we introduce representative generative benchmarking: custom evaluation sets built from your own data that reflect the queries users actually make in production.
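The core loop can be sketched in a few lines: take documents from your own corpus, generate a realistic query for each one, and measure whether retrieval returns the source document for its query. The sketch below is a minimal illustration using the `chromadb` Python client; the `generate_query()` helper, the collection name, and the recall@k scoring are illustrative assumptions (in practice the queries would come from an LLM), not the exact method from the report or the CLI.

```python
import chromadb

def generate_query(document: str) -> str:
    # Hypothetical stand-in: in practice you would prompt an LLM to write
    # a query a real user might plausibly issue that this document answers.
    # Here we simply reuse the document's first sentence as a placeholder.
    return document.split(".")[0]

# Documents drawn from your own corpus (toy examples here).
docs = [
    "Refunds are issued within 5 business days of a return being received.",
    "The premium plan includes priority support and unlimited seats.",
]
ids = [f"doc-{i}" for i in range(len(docs))]

# Index the documents in an in-memory Chroma collection.
client = chromadb.Client()
collection = client.create_collection(name="custom-benchmark")
collection.add(ids=ids, documents=docs)

# Build the golden set: one generated query per document, labeled with
# the document it was generated from.
golden = [(generate_query(d), doc_id) for d, doc_id in zip(docs, ids)]

# Score retrieval as recall@k over the generated queries.
k, hits = 5, 0
for query, expected_id in golden:
    results = collection.query(query_texts=[query], n_results=k)
    if expected_id in results["ids"][0]:
        hits += 1
print(f"recall@{k}: {hits / len(golden):.2f}")
```

Because the queries are generated from your own documents, the resulting score reflects how the embedding model behaves on your data rather than on a public benchmark it may have seen during training.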
We just published our in-depth technical report on this, and you can run a custom benchmark locally with the Chroma CLI.