8 points by rotemtam 5 hours ago | 3 comments
  • yonSpektor 5 hours ago
    Curious what the distribution of hacking strategies looked like across different models — would expect RL-heavy vs RLHF models to cheat very differently.
  • adamgold7 5 hours ago
    love this. we are actually looking at reward hacking from a cyber security perspective - refreshing (unless you're from Israel).

    Any collaborators that want to join us?

  • matankleyman1 3 hours ago
    that's one of the biggest long-term issues with agents that no one has real interest in talking about.