2 points by yakirmat 5 hours ago | 2 comments
  • Sasisundar09 4 hours ago
    Curious how you are handling benchmark reliability. Have you seen cases where evaluations pass but production behavior fails?
  • yakirmat 5 hours ago
    [flagged]