15 points by gmays 7 hours ago | 4 comments
  • nikisweeting 5 hours ago
    We can definitely make harder evals; the problem is that a good eval set is indistinguishable from good training data / market edge, so no one is incentivized to share their best eval sets publicly.
  • WarmWash 6 hours ago
    Start front-loading the models with 5k, 10k, 50k, or 100k tokens of messy, quasi-related context, and then run the benchmarks.

    These models are ridiculously powerful with a blank slate. It's when they get loaded down with all the necessary (and inevitably unnecessary) context to complete the task that they really start to crumble and fold.
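    A minimal sketch of what that benchmark setup could look like: pad each task with a target amount of loosely related distractor text before the real question, then score the model at each padding size. The `query_model` call and the distractor lines are hypothetical placeholders, not any real harness.

    ```python
    import random

    # Hypothetical distractor lines standing in for "messy quasi-related context".
    DISTRACTORS = [
        "The deploy script was last touched in the Q3 migration.",
        "Ticket #4521 tracks a flaky test in the billing module.",
        "Note: the staging cluster uses a different config format.",
    ]

    def build_padded_prompt(task: str, n_filler_tokens: int, seed: int = 0) -> str:
        """Prepend roughly n_filler_tokens of filler (token ~ whitespace-split
        word) ahead of the real task, so context size is the only variable."""
        rng = random.Random(seed)
        filler, count = [], 0
        while count < n_filler_tokens:
            line = rng.choice(DISTRACTORS)
            filler.append(line)
            count += len(line.split())
        return "\n".join(filler) + "\n\n" + task

    task = "Q: What does `parse_config` return on invalid input?"
    for size in (0, 5_000, 50_000, 100_000):
        prompt = build_padded_prompt(task, size)
        # score = query_model(prompt)  # hypothetical model call; plot score vs. size
        print(size, len(prompt.split()))
    ```

    Holding the task fixed and sweeping only the padding size isolates context-handling from task difficulty, which is the degradation the comment describes.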

    • jballanc 5 hours ago
      We need benchmarks that can distinguish between continuous learning and long-context extrapolation.
  • UltraSane 4 hours ago
    This is the least true thing ever. All LLMs are terrible at ARC-AGI-3. Every video game can be used as a benchmark. You could rank LLMs on how long they can keep a game of Dwarf Fortress running or how fast they can beat GTA5.
    • ttoinou 4 hours ago
      We already have specialized AI to play video games
      • UltraSane 4 hours ago
        We are talking about LLMs. A true AGI would be able to beat every video game.
        • conception 4 hours ago
          Until Arc-Battletoads is passed I’m not buying it.
          • UltraSane an hour ago
            More like ARC-SegaMasterSystem-ALF