Hacker News
METR can barely measure Claude Mythos – 50% task horizon now exceeds 16 hours
(hugonomy.com)
1 point by GlyphWeaver_a 3 hours ago | 2 comments
overthinker_jp 3 hours ago
Capability benchmarks may become less meaningful once agents operate across long execution horizons with external tools and permissions. The governance problem starts shifting toward execution boundaries and observability.
GlyphWeaver_a 3 hours ago [dead]