6 points by bix6 7 hours ago | 6 comments
  • PaulHoule 6 hours ago
    Evaluation is harder than you think because of statistics.

    Like if you want to accurately know if one model is better than another you have to test it on hundreds if not thousands of examples which are carefully graded in difficulty, not in the training sets, etc.

    Practically you might try model A and model B, use each one 2-3 times on different tasks, and walk out with the impression that A is really good and B sux. But it could be that model A got lucky: either you happened to ask it to do things it is good at, or it just stumbled onto the right answer anyway.

    See https://arxiv.org/html/2410.12972v1 and https://arxiv.org/pdf/2505.14810 -- those papers are considering a general space of tasks but you could totally do the same kind of eval for the tasks you care about.
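
    To make the sample-size point concrete, here is a minimal sketch (not taken from the linked papers) of a paired comparison on your own tasks. It assumes you have run both models on the same examples and graded each answer pass/fail; the function names `wilson_interval`, `sign_test_p_value`, and `compare` are made up for illustration.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a pass rate."""
    if n == 0:
        raise ValueError("need at least one graded example")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def sign_test_p_value(a_only, b_only):
    """Two-sided exact sign test on the examples where the models disagree.

    a_only: examples only model A got right; b_only: only model B got right.
    Examples both models passed or both failed carry no information here.
    """
    n = a_only + b_only
    if n == 0:
        return 1.0
    k = min(a_only, b_only)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def compare(results_a, results_b):
    """results_a / results_b: pass/fail booleans for the SAME examples, same order."""
    if len(results_a) != len(results_b):
        raise ValueError("both models must be graded on the same examples")
    a_only = sum(1 for a, b in zip(results_a, results_b) if a and not b)
    b_only = sum(1 for a, b in zip(results_a, results_b) if b and not a)
    n = len(results_a)
    return {
        "a_only": a_only,
        "b_only": b_only,
        "a_rate_ci": wilson_interval(sum(map(bool, results_a)), n),
        "b_rate_ci": wilson_interval(sum(map(bool, results_b)), n),
        "p_value": sign_test_p_value(a_only, b_only),
    }


if __name__ == "__main__":
    # Three tries where A succeeds every time and B never does.
    print(compare([True, True, True], [False, False, False]))
```

    With only three tries, even a 3-0 sweep gives a p-value of 0.25 and a pass-rate interval for A that runs from roughly 0.44 to 1.0, so the "A is great, B sux" impression is not yet distinguishable from luck. The same function on a few hundred graded examples gives a much tighter answer.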

    • bix6 5 hours ago
      Have you implemented any of this in practice? E.g., are you benchmarking models?
      • PaulHoule 3 hours ago
        I've done some for classification, ranking, and other sorts of non-generative tasks.
  • freedomben 6 hours ago
    This is a hard problem for me as well. Right now I've just been using the best model available (like Opus, or GPT 5.5, or Gemini Pro) but it's not ideal. My problem is that anytime I step down, the results are subtly worse, and sometimes I don't notice immediately depending on what I'm doing.

    As far as Opus vs. GPT 5.5 etc, I generally decide with:

    1. Code? -> Opus

    2. Docs? -> GPT

    3. Real-time or recent information needed? -> Gemini

    It's far from perfect though. Would love to hear others' thoughts.
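
    Written down as code, the three rules above look roughly like this. It is only a sketch of the rule of thumb: the function name `pick_model`, the boolean flags, and the `default` fallback are assumptions for illustration, and the returned strings are just the labels used in the list.

```python
def pick_model(is_code=False, is_docs=False, needs_recent_info=False,
               default="Opus"):
    """Apply the rules in the order listed; the first match wins.

    `default` is a hypothetical fallback for tasks that match no rule.
    """
    if is_code:
        return "Opus"
    if is_docs:
        return "GPT"
    if needs_recent_info:
        return "Gemini"
    return default


if __name__ == "__main__":
    print(pick_model(is_code=True))
    print(pick_model(needs_recent_info=True))
```

    Checking the rules in list order means a task that is both code and docs goes to Opus, which may or may not be what you want.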

    • bix6 5 hours ago
      Opus eats tokens so fast that I try to minimize it, but compared to Sonnet I definitely see fewer issues in my larger projects. Sonnet has gone off the rails a few times.
  • noashavit 4 hours ago
    Gemini for recent search and Google Workspace automation

    Perplexity for deep research

    Claude Opus for coding, Sonnet for writing

    Gemma4 for local AI overviews and analysis

    Qwen coder for local prototyping

  • shouvik12 6 hours ago
    For short, stateless stuff (definitions, formatting, quick lookups) I have never noticed a meaningful difference between models. But for anything that requires reasoning across a lot of prior context, it's usually Claude Sonnet or Opus. It feels like the vibe will soon take me to Codex, though.
  • OutrageousTea 6 hours ago
    [flagged]
  • jabeer 6 hours ago
    [flagged]