https://huggingface.co/spaces/ArtificialAnalysis/Text-to-Ima...
I'm also a bit surprised they have gpt-image-1.5 so far above Nano Banana 2 - in my limited testing, at least in terms of visual style, people prefer Nano Banana.
For a point of reference, I run a pretty comprehensive image model comparison site that's heavily weighted toward prompt adherence.
https://genai-showdown.specr.net
EDIT: FWIW, I agree with your assessment. OpenAI's models have always been very strong on prompt adherence but visually weak (gpt-image-1 had the famous "piss filter" until they finally pushed out gpt-image-1.5).
Did you manually review all the edit results yourself, or do you have some kind of automated procedure?
It's semi-automated. The pipeline:

- Takes the platonic set of prompts
- Uses model-specific tuning directives with an LLM to create a batch of prompt variations, so each model gets a diverse set of natural-language phrasings to "roll" generations against (see the sketch below)
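The variation step looks roughly like this - a minimal sketch assuming an OpenAI-compatible LLM endpoint, where the model name, directive, and prompt are all placeholders rather than my actual pipeline:

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set

    # Hypothetical example inputs; the real prompt set and directives differ.
    PLATONIC_PROMPT = "a red fox reading a newspaper on a park bench"
    TUNING_DIRECTIVE = "Keep prompts terse; this model ignores flowery adjectives."

    def make_variations(prompt: str, directive: str, n: int = 5) -> list[str]:
        """Ask an LLM for n natural-language rewordings of one prompt."""
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model id
            messages=[
                {"role": "system", "content": directive},
                {"role": "user", "content": (
                    f"Rewrite this image prompt {n} different ways, one per "
                    f"line, preserving every concrete detail: {prompt}"
                )},
            ],
        )
        lines = resp.choices[0].message.content.splitlines()
        return [ln for ln in lines if ln.strip()][:n]

    variations = make_variations(PLATONIC_PROMPT, TUNING_DIRECTIVE)

Each variation then gets sent to the image model under test.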
But I still have to manually review each of the final images, which is pretty time-consuming. I've tried automating that step with VLMs (like Qwen3-VL), but unfortunately they can miss small details and didn't provide as much value as I'd hoped.
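For context, the grading attempt was along these lines - again a sketch, assuming Qwen3-VL served behind an OpenAI-compatible endpoint (e.g. vLLM); the URL, model id, filename, and rubric here are illustrative:

    import base64
    from openai import OpenAI

    # Local OpenAI-compatible server; URL and model id are placeholders.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

    def grade_image(path: str, criteria: str) -> str:
        """Ask the VLM whether a generated image satisfies the prompt's criteria."""
        with open(path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode()
        resp = client.chat.completions.create(
            model="Qwen/Qwen3-VL",  # placeholder model id
            messages=[{
                "role": "user",
                "content": [
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{b64}"}},
                    {"type": "text",
                     "text": "Check the image against each criterion below. "
                             "Answer PASS or FAIL with a one-line reason.\n"
                             + criteria},
                ],
            }],
        )
        return resp.choices[0].message.content

    verdict = grade_image("gen_0001.png",
                          "- the fox is red\n- the fox is holding a newspaper")

The failure mode was exactly the small stuff - a VLM will happily pass an image that's wrong in one fine-grained detail, which is precisely what the benchmark is trying to catch.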