Bullshit benchmark for LLMs (twitter.com)
1 point by gpvos 7 hours ago | 1 comment
noemit 7 hours ago
The underlying data looks sparse. If there are only a few questions per "category" of bullshit, the results can easily be gamed to favor one model over another.
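
To put a rough number on the small-sample concern, here is a minimal sketch that computes a Wilson score interval for a per-category pass rate. The question counts (5, 20, 200) and the 80% observed rate are made-up illustrations, not figures from the benchmark, and `wilson_interval` is a hypothetical helper rather than anything the benchmark ships.

```python
import math

def wilson_interval(correct, total, z=1.96):
    """Approximate 95% Wilson score interval for an observed pass rate."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half

# Hypothetical category sizes; the observed pass rate is fixed at 80%.
for n in (5, 20, 200):
    lo, hi = wilson_interval(round(0.8 * n), n)
    print(f"n={n:>3}: observed 80%, plausible range {lo:.2f} to {hi:.2f}")
```

With only a handful of questions per category the plausible range is much wider than with a couple hundred, so adding or swapping a few questions could plausibly reorder models that are close to each other.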