1 point by vtail 2 hours ago | 1 comment
  • vunderba 2 hours ago
    For reference, have you seen the Artificial Analysis Image Arena Leaderboard? They also show you two images from anonymized models (revealed after you vote) and calculate crowdsourced Elo ratings.

    https://huggingface.co/spaces/ArtificialAnalysis/Text-to-Ima...
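    For anyone unfamiliar with how these crowdsourced rankings work, here's a minimal sketch of the standard Elo update applied to a pairwise vote (illustrative only; I have no idea what the arena actually runs under the hood, and the K-factor of 32 is just the common chess default):

    ```python
    def elo_update(r_winner: float, r_loser: float, k: float = 32) -> tuple[float, float]:
        """Update two ratings after a single head-to-head vote.

        The expected score is the logistic curve used by standard Elo:
        a 400-point gap means the stronger side is expected to win ~91%.
        """
        expected_win = 1 / (1 + 10 ** ((r_loser - r_winner) / 400))
        # Winner gains in proportion to how "surprising" the win was.
        r_winner += k * (1 - expected_win)
        r_loser -= k * (1 - expected_win)
        return r_winner, r_loser
    ```

    Each vote nudges both models' scores; upsets move ratings more than expected wins, and the total rating mass is conserved.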

    • vtail 2 hours ago
      Thanks - and no, I haven't seen this one. I like their edit-mode dashboard - showing the original image plus two edits; I was thinking about doing something like this.

      I'm also a bit surprised they have gpt-image-1.5 so high above Nano Banana 2 - my limited testing shows that, at least for visual styles, people like Nano Banana more.

      • vunderba 2 hours ago
        Yeah, I think that's part of the issue with a single "squashed" comparative metric. Some users are going to grade higher based on overall visual fidelity, while others are going to value prompt following.

        For a point of reference, I run a pretty comprehensive image model comparison site heavily weighted in favor of prompt adherence.

        https://genai-showdown.specr.net

        EDIT: FWIW, I agree with your assessment. OpenAI's models have always been very strong in prompt adherence but visually weak (gpt-image-1 had the famous "piss filter" until they finally pushed out gpt-image-1.5).

        • vtail 2 hours ago
          Very cool site - I think I saw it before here on HN, and I liked it a lot.

          Did you manually review all the edit results yourself, or do you have some kind of automated procedure?

          • vunderba an hour ago
            Thanks. So I have a bespoke Python program that basically does this:

            - Takes the platonic set of prompts

            - Uses model-specific tuning directives with LLMs to create a batch of prompt variations, so each model gets a diverse set of natural-language phrasings to "roll" generations with

            But I still have to manually review each of the final images - which is pretty time-consuming. I've tried automating it with VLMs (like Qwen3-VL), but unfortunately they can miss small details and didn't provide as much value as I was hoping.
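            The two automated steps above could be sketched roughly like this (a toy mock-up, not my actual code - `generate_variants` stands in for the real LLM call, and the base prompt is invented):

            ```python
            import random

            # The fixed "platonic" prompt set (invented example prompt).
            BASE_PROMPTS = ["a red cube balanced on a glass sphere"]

            def generate_variants(prompt: str, n: int = 3) -> list[str]:
                # Stand-in for an LLM call that rewrites the prompt using
                # model-specific tuning directives; here we just apply
                # canned phrasing templates.
                templates = [
                    "{p}",
                    "Photorealistic render: {p}",
                    "A detailed scene of {p}",
                ]
                return [t.format(p=prompt) for t in random.sample(templates, n)]

            def build_queue() -> list[str]:
                # Expand every base prompt into its variations, producing
                # the generation queue to "roll" against each model.
                queue = []
                for prompt in BASE_PROMPTS:
                    queue.extend(generate_variants(prompt))
                return queue
            ```

            The manual-review bottleneck at the end is exactly where a reliable VLM judge would slot in, if one existed.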