Hacker News
Show HN: RewardHackBench: Using sandboxes to stop agents from cheating
(github.com)
8 points by rotemtam 5 hours ago | 3 comments
yonSpektor 5 hours ago
Curious what the distribution of hacking strategies looked like across different models — would expect RL-heavy vs RLHF models to cheat very differently.
adamgold7 5 hours ago
Love this. We are actually looking at reward hacking from a cybersecurity perspective - refreshing (unless you're from Israel).
Any collaborators that want to join us?
matankleyman1 3 hours ago
That's one of the biggest long-term issues with agents, and no one has any real interest in talking about it.