And Grok 4 is a great example where they're just completely lying about the practical results. Elon wants to claim this is the smartest model, but it's like... 3rd or 4th, at best.
Benchmarks, for a variety of reasons, now seem inadequate to capture models' actual strength, so I decided to run Grok 4 and o3 (and Grok 4 Heavy + o3-pro) through a gauntlet of questions that I think demonstrate real, practical differences between the two.
Hope this is helpful!