1 pointby CodeReclaimers4 hours ago2 comments
  • cold_harbor4 hours ago
    reward hacking = the model finding the fastest path to a high score, not the behavior you wanted. same reason RLHF reward models degrade with too many optimization steps.
    • CodeReclaimers4 hours ago
      Agreed. The wrinkle I thought was worth writing up is: there's no learned reward model here and no training at all. The "reward" is wall-clock executiion time and the model is frozen; the search is happening at inference time, not in an RL loop. So the usual "the proxy is a fuzzy approximation that degrades under optimization pressure" story doesn't apply.

      This was on a ~200-line surface I thought I'd locked down, and it still got gamed in a way I might not have caught right away if it wasn't a nearly impossible run time (~45usec). So anyways...you apparently don't need a soft proxy or a lot of steps for this kind of thing to show up.

  • CodeReclaimers4 hours ago
    [flagged]