[1] The photo of the outfit: https://share.google/mHJbchlsTNJ771yBa
For example, I think there would be a lot of businesses in the US that would be too afraid of backlash to use AI-generated imagery for an itinerary like the one at https://qianwen-res.oss-accelerate-overseas.aliyuncs.com/Qwe...
In my (very personal) opinion, they're part of a very small group of organizations that sell inference under a sane and successful business model.
How? By magic? You fell for 'Deepseek V3 is as good as SOTA'?
What Linux tools are you guys using for image generation models like Qwen's diffusion models, since LMStudio only supports text gen?
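Right now my fallback is a bare diffusers script, roughly the sketch below (assumptions on my side: the Qwen/Qwen-Image checkpoint on Hugging Face and a CUDA GPU with enough VRAM), but I'd prefer something with an actual UI:

    # Minimal text-to-image run with Hugging Face diffusers.
    # Assumed setup: the "Qwen/Qwen-Image" checkpoint and a CUDA GPU; adjust to your hardware.
    import torch
    from diffusers import DiffusionPipeline

    pipe = DiffusionPipeline.from_pretrained(
        "Qwen/Qwen-Image", torch_dtype=torch.bfloat16
    )
    pipe.to("cuda")

    image = pipe(
        prompt="a street cafe at golden hour, candid photo",
        num_inference_steps=30,
    ).images[0]
    image.save("out.png")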
Sad state of affairs, and it seems they're enshittifying quicker than expected, but it was always a question of when, not if.
But if you translate the actual prompt, the term "riding" doesn't even appear. The prompt describes the exact thing you see in excruciating detail.
"... A muscular, robust adult brown horse standing proudly, its forelegs heavily pressing between the shoulder blades and spine of a reclining man ... and its eyes sharp and focused, exuding a primal sense of power. The subdued man is a white male, 30-40 years old, his face covered in dust and sweat ... his body is in a push-up position—his palms are pressed hard against the cracked, dry earth, his knuckles white, the veins in his arms bulging, his legs stretched straight back and taut, his toes digging into the ground, his entire torso trembling slightly from the weight ..."
Yeah, going by the workflow they walk through earlier in the blog post, the prompt they share there seems to be generated from a different input, and then that prompt is passed to the actual model. So the workflow is something like "User prompt input -> Expand input with LLMs -> Send expanded prompt to image model".
So I think "human riding a horse" is the user prompt, which gets expanded to what they share in the post, and that expansion is what the model actually uses. This is also how they've presented all their previous image models: by passing user input through an LLM for "expansion" first.
Seems poorly thought out not to make it 100% clear what the actual human-written prompt is, though; not sure why they wouldn't share that upfront.
LinkedIn is filled with them now.
Much like the pointless ASCII diagrams in GitHub readmes (big rectangle with bullet points flows to another...), the diagrams are cognitive slurry.
See Gas Town for non-Qwen examples of how bad it can get:
https://news.ycombinator.com/item?id=46746045
(Not commenting on the other results of this model outside of diagramming.)
Thank you for this phrase. I don't think bad diagrams are limited to AI in any way, and this perfectly describes all "this didn't make things any clearer" cases.
(I don’t even know if I’m being sarcastic)
A muscular, robust adult brown horse stands proudly, its forelegs heavily pressing between the shoulder blades and spine of a reclining man. Its hind legs are taut, its neck held high, its mane flying against the wind, its nostrils flared, and its eyes sharp and focused, exuding a primal sense of power. The subdued man is a white male...
Do western AI models mostly default to white people?
Embarrassing image? I'm white, why would I be embarrassed over that image? It's a computer-generated image with no real people in it; how could it be embarrassing for any living human?
In another post you talked about people getting mad at the image without context. What context are we missing, exactly? I do not feel ill-informed or angry. But I could indeed be missing something, so can you explain the context? If you were to say it's because of the LLM adding more context, then that could be plausible, but why the medieval robe and hemp rope? I know how sensitive Western companies have been about getting rid of negative racial stereotypes in their models, going as far as to avoid and modify certain training data. Would you accept an LLM producing negative stereotypes, or tending to put one particular racial group into a submissive situation more than others?
I really do find it unlikely that the LLM would take the prompt "A human male being ridden by a horse", add all those other details, and go straight for a darker, somber tone and a dynamic of domination and submission rather than a more humorous description.
Why? I don't see that. Are black people embarrassed if a black person commits a crime, yet not embarrassed if a white person commits a crime? That sounds very contrived to me and not at all how things work in reality.
> If one's own race is being denigrated then one may indeed feel embarrassment
I also don't understand this. Why would every white person feel any sort of embarrassment over images denigrating white people? Feel hate, anger or lots of other emotions, that'd make sense. But I still don't understand why "embarrassment" or shame is even on the table, embarrassment over what exactly? That there are racists?
No, they mostly default to black people even in historical contexts where they are completely out of place, actually. [1]
"Google paused its AI image-generator after Gemini depicted America's founding fathers and Nazi soldiers as Black. The images went viral, embarrassing Google."
[1] https://www.npr.org/2024/03/18/1239107313/google-races-to-fi...
You're referring to a case of one version of one model. That's not "mostly" or "default to".
I just tried this prompt:
> Generate a photo of the founding fathers of a future, non-existing country. Five people in total.
with Nano Banana Pro (the SOTA). I tried the same prompt 5 times and every time black people are the majority. So yeah, I think the parent comment is not that far off.
But for an out of context imaginary future... why would you choose non-black people? There's about the same reason to go with any random look.
(I suspect you tried a prompt about the original founding fathers, and found it didn't make that mistake any more.)
Anyway, you're tagged as "argued Musk salute wasn't nazi", so your ability to parse history is a little damaged.
"I just tried this prompt:
> Generate a photo of the founding fathers of a future, non-existing country. Five people in total.
I tried the same prompt 5 times and every time black people are the majority"
Do you understand the concept of "mostly defaulting" to something, and how that is directly related to a group of "people [being] the majority"?
> Anyway, you're tagged as "argued Musk salute wasn't nazi", so your ability to parse history is a little damaged.
I don't really care what communists think since you aren't rational people. If you have any actual statement to make and for me to deconstruct again while pointing out your inability to follow through with basic logic or facts, please let me know.
What the actual fuck
---
A desolate grassland stretches into the distance, its ground dry and cracked. Fine dust is kicked up by vigorous activity, forming a faint grayish-brown mist in the low sky.
Mid-ground, eye-level composition: A muscular, robust adult brown horse stands proudly, its forelegs heavily pressing between the shoulder blades and spine of a reclining man. Its hind legs are taut, its neck held high, its mane flying against the wind, its nostrils flared, and its eyes sharp and focused, exuding a primal sense of power. The subdued man is a white male, 30-40 years old, his face covered in dust and sweat, his short, messy dark brown hair plastered to his forehead, his thick beard slightly damp; he wears a badly worn, grey-green medieval-style robe, the fabric torn and stained with mud in several places, a thick hemp rope tied around his waist, and scratched ankle-high leather boots; his body is in a push-up position—his palms are pressed hard against the cracked, dry earth, his knuckles white, the veins in his arms bulging, his legs stretched straight back and taut, his toes digging into the ground, his entire torso trembling slightly from the weight.
The background is a range of undulating grey-blue mountains, their outlines stark, their peaks hidden beneath a low-hanging, leaden-grey, cloudy sky. The thick clouds diffuse a soft, diffused light, which pours down naturally from the left front at a 45-degree angle, casting clear and voluminous shadows on the horse's belly, the back of the man's hands, and the cracked ground.
The overall color scheme is strictly controlled within the earth tones: the horsehair is warm brown, the robe is a gradient of gray-green-brown, the soil is a mixture of ochre, dry yellow earth, and charcoal gray, the dust is light brownish-gray, and the sky is a transition from matte lead gray to cool gray with a faint glow at the bottom of the clouds.
The image has a realistic, high-definition photographic quality, with extremely fine textures—you can see the sweat on the horse's neck, the wear and tear on the robe's warp and weft threads, the skin pores and stubble, the edges of the cracked soil, and the dust particles. The atmosphere is tense, primitive, and full of suffocating tension from a struggle of biological forces.
I assume our brains are used to stuff we don't notice consciously, and reject very mild errors. I've stared at the picture a bit now and the finger holding the balloon is weird. The out-of-place snowman feels weird. If you follow the background blur around, it isn't at the same depth everywhere. Everything that reflects has reflections of things I can't see in the scene.
I don't feel good staring at it now, so I had to stop.
Like focus stacking, specifically.
I’m always surprised when people bother to point out more-subtle flaws in AI images as “tells”, when the “depth-of-field problem” is so easily spotted, and has been there in every AI image ever since the earliest models.
But I found that this results in more professional-looking images, not more realistic photos.
Adding something like "selfie, Instagram, low resolution, flash" can lead to a... worse image that nonetheless looks more realistic.
[0] I think I did this one with Z-Image Turbo on my 4060 Ti
Also Imagen 4 and Nano Banana Pro are very different models.
But anyway, realistic environments like a street cafe are not suited to test for photorealism. You have to use somewhat more fantastical environments.
I don't have access to z-image, but here are two examples with Nano Banana Pro:
"A person in the streets of Atlantis, portrait shot." https://i.ibb.co/DgMXzbxk/Gemini-Generated-Image-7agf9b7agf9...
"A person in the streets of Atlantis, portrait shot (photorealistic)" https://i.ibb.co/nN7cTzLk/Gemini-Generated-Image-l1fm5al1fm5...
These are terribly unrealistic. Far more so than the Flux.2 Pro image above.
> Also Imagen 4 and Nano Banana Pro are very different models.
No, Imagen 4 is a pure diffusion model. Nano Banana Pro is a Gemini scaffold which uses Imagen to generate an initial image, and then Gemini 3 Pro writes prompts to edit the image for much better prompt alignment. The prompts above are very simple, so there is little for Gemini to alter, which is why the results look basically identical to plain Imagen 4. Both pictures (especially the first) have the signature AI look of Imagen 4, which is different from other models like Imagen 3.
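Roughly, the kind of scaffold I mean (purely my guess at the architecture; every name here is a hypothetical stand-in, not Google's actual API):

    # Speculative generate-then-edit scaffold; all callables are made-up stand-ins.
    def scaffold_generate(prompt, base_t2i, edit_model, critic_llm, max_rounds=3):
        image = base_t2i(prompt)             # initial render from a pure text-to-image model
        for _ in range(max_rounds):
            fix = critic_llm(prompt, image)  # LLM checks prompt alignment, writes an edit instruction (or None)
            if fix is None:                  # critic is satisfied, stop editing
                break
            image = edit_model(image, fix)   # an image-editing model applies the correction
        return image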
By the way, here is GPT Image 1.5 with the same prompts:
"A person in the streets of Atlantis, portrait shot." https://i.ibb.co/Df8nDHFL/Chat-GPT-Image-10-Feb-2026-14-17-1...
"A person in the streets of Atlantis, portrait shot (photorealistic)" https://i.ibb.co/Nns4pdGX/Chat-GPT-Image-10-Feb-2026-14-17-2...
The first is very fake and the second is a strong improvement, though still far from the excellent cafe shots above (fake studio lighting, unrealistic colors etc).
I disagree; the Nano Banana Pro result is in a completely different league compared to Flux.2 and Z-Image.
>But anyway, realistic environments like a street cafe are not suited to test for photorealism
Why? It's the perfect setting in my opinion.
Btw, I don't think you are using Nano Banana Pro, probably the standard Nano Banana. I'm getting this from your prompt: https://i.ibb.co/wZHx0jS9/unnamed-1.jpg
>Nano Banana Pro is a Gemini scaffold which uses Imagen to generate an initial image, then Gemini 3 Pro writes prompts to edit the image for much better prompt alignment.
First of all, how would you know the architecture details of gemini-3-pro-image? Second, how could the model modify the image if Gemini itself is just rewriting the prompt (like the old ChatGPT + DALL-E setup)? Imagen 4 is just a text-to-image model, not an editing one, so that doesn't make sense; Nano Banana Pro can edit images (like the ones you provide).
I strongly disagree. But even if you are right, the difference between the cafe shots and the Atlantis shots is clearly much, much larger than the difference between the different cafe shots. The Atlantis shots are super unrealistic. They look far worse than the cafe shots of Flux.2 Pro.
> Why? It's the perfect settings in my opinion
Because it's too easy, obviously. We don't need an AI to make fake realistic photos of realistic environments when we can easily photograph those ourselves. Unrealistic environments are more discriminative because they are much more likely to produce garbage that doesn't look photorealistic.
> Btw I don't think you are using nano banana pro, I'm getting this from your prompt: https://i.ibb.co/wZHx0jS9/unnamed-1.jpg
I'm definitely using Nano Banana Pro, and your picture has the same strong AI look to it that is typical of NBP / Imagen 4.
> First of all, how would you know the architecture details of gemini-3-pro-image? Second, how could the model modify the image if Gemini itself is just rewriting the prompt (like the old ChatGPT + DALL-E setup)? Imagen 4 is just a text-to-image model, not an editing one, so that doesn't make sense; Nano Banana Pro can edit images (like the ones you provide).
There were discussions about it previously on HN. Clearly NBP is using Gemini reasoning, and clearly the style of NBP strongly resembles Imagen 4 specifically. There is probably also a special editing model involved, just like in Qwen-Image-2.0.