8 points by rotemtam 5 hours ago | 3 comments

yonSpektor 5 hours ago
Curious what the distribution of hacking strategies looked like across different models — would expect RL-heavy vs RLHF models to cheat very differently.
adamgold7 5 hours ago
love this. we are actually looking at reward hacking from a cyber security perspective - refreshing (unless you're from Israel).
Any collaborators that want to join us?
matankleyman1 3 hours ago
that's one of the biggest long-term issues with agents, and no one has any real interest in talking about it.