Qwen-Image-2.0: Professional infographics, exquisite photorealism(qwen.ai)

169 pointsby meetpateltech6 hours ago16 comments

tianqi4 hours ago
I've seen many comments describing the "horse riding man" example as extremely bizarre (which it actually is), so I'd like to provide some background context here. The "horse riding man" is a Chinese internet meme originating from an entertainment awards ceremony, when the renowned host Tsai Kang-yong wore an elaborate outfit featuring a horse riding on his back[1]. At the time, he was embroiled in a rumor about his unpublicized homosexual partner, whose name sounded "Ma Qi Ren" which coincidentally translates to "horse riding man" in Mandarin. This incident spread widely across Chinese internet and turned into a meme. So they used "horse riding man" as an example isn't entirely nonsensical, though the image per se is undeniably bizarre and carries an unsettling vibe.
[1] The photo of the outfit: https://share.google/mHJbchlsTNJ771yBa
- Lerc40 minutes ago
  On the topic of modern Chinese culture, Is there the same hostility towards AI generated Imagery in China as there seems to be in America?
  For example I think there would be a lot of businesses in the US that would be too afraid of backlash to use AI generated imagery for an itinerary like the one at https://qianwen-res.oss-accelerate-overseas.aliyuncs.com/Qwe...
  - tianqi7 minutes ago
    Since China has a population of 1.4 billion people with vastly differing levels of cognition, I find it difficult to claim I can summarize "modern Chinese culture". But within my range of observation, no. Chinese not only have no hostility toward AI but actively pursues and reveres it with fervor. They widely perceive AI as an advanced force, a new opportunity for everyone, a new avenue for making money, and a new chance to surpass others. At most, some of the consumers might associate businesses using AI generated content with a budget-conscious brand image, but not hostile.
- yorwba3 hours ago
  There's also the "horse riding astronaut" challenge in image generation: https://garymarcus.substack.com/p/horse-rides-astronaut-redu...
- badhorseman3 hours ago
  Why not ask for simply a man or even an Han man given the race of Tsai Kang-yong. Why a white man and why a man wearing medieval clothing. Gives your head a wobble.
  - DustinEchoes41 minutes ago
    Yep, it’s the only image on the entire page with a non-Chinese person in it. Given the prompt, the message is clear.
  - 3 hours ago
    undefined
raincole5 hours ago
It's crazy to think there was a fleeting sliver of time during which Midjourney felt like the pinnacle of image generation.
- Mashimo5 hours ago
  What ever happend to midjourney?
  - Lalabadie14 minutes ago
    No external funding raised. They're not on the VC path, so no need to chase insane growth. They still have around 500M USD in ARR.
    In my (very personal) opinion, they're part of a very small group of organizations that sell inference under a sane and successful business model.
  - wongarsu4 hours ago
    They have image and video models that are nowhere near SOTA on prompt adherence or image editing but pretty good on the artistic side. They lean in on features like reference images so objects or characters have a consistent look, biasing the model towards your style preferences, or using moodboards to generate a consistent style
  - raincole5 hours ago
    Not much, while everything happened at OpenAI/Google/Chinese companies. And that's the problem.
    KeplerBoy4 hours ago
    How is it a problem? There simply doesn't seem to be a moat or secret sauce. Who cares which of these models is SOTA? In two months there will be a new model.
    waldarbeiter4 hours ago
    There seems to be a moat like infrastructure/gpus and talent. The best models right now come from companies with considerable resources/funding.
    esperent2 hours ago
    Right, but that's a short term moat. If they pause on their incredible levels of spending for even 6 months, someone else will take over having spent only a tiny fraction of what they did. They might get taken over anyway.
    raincolean hour ago
    > someone else will take over having spent only a tiny fraction of what they did
    How. By magic? You fell for 'Deepseek V3 is as good as SOTA'?
    Gud37 minutes ago
    By reverse engineering, sheer stupidity from the competition, corporate espionage, ‘stealing’ engineers and sometimes a stroke of genius, the same as it’s always been
inanothertime4 hours ago
I recently tried out LMStudio on Linux for local models. So easy to use!
What Linux tools are you guys using for image generation models like Qwen's diffusion models, since LMStudio only supports text gen.
- sequence721 minutes ago
  If you're on an AMD platform Lemonade (https://lemonade-server.ai/) added image generation in version 9.2 (https://github.com/lemonade-sdk/lemonade/releases/tag/v9.2.0).
- eurekin2 hours ago
  Practically anybody actually creating with this class of models (diffusion based mostly) is using ComfyUI. Community takes care of quantization, repackaging into gguf (most popular) and even speed optimizing (lighting loras, layers skip). It's quite extensive
- embedding-shape4 hours ago
  Everything keeps changing so quickly, I basically have my own Python HTTP server with a unified JSON interface, then that can be routed to any of the impls/*.py files for the actual generation, then I have of those per implementation/architecture basically. Mostly using `diffusers` for the inference, which isn't the fastest, but tends to have the new model architectures much sooner than everyone else.
- guai8884 hours ago
  ComfyUI is the best for stable diffusion
  - embedding-shape3 hours ago
    FWIW you can use non-sd models in ComfyUI too, the ecosystem is pretty huge and supports most of the "mainstream" models, not only the stable diffusion ones, even video models and more too.
- ilaksh4 hours ago
  I have my own MIT licensed framework/UI: https://github.com/runvnc/mindroot. With Nano Banana via runvnc/googleimageedit
- PaulKeeble3 hours ago
  Ollama is working on adding image generation but its not here yet. We really do need something that can run a variety of models for images.
  - embedding-shape3 hours ago
    Yeah, I'm guessing they were bound to leave behind the whole "Get up and running with large language models" mission sooner or later, which was their initial focus, as investors after 2-3 years start making you to start thinking about expansion and earning back the money.
    Sad state of affairs and seems they're enshittifying quicker than expected, but was always a question of when, not if.
sandbach4 hours ago
The Chinese vertical typography is sadly a bit off. If punctuation marks are used at all, they should be the characters specifically designed for vertical text, like ︒(U+FE12 PRESENTATION FORM FOR VERTICAL IDEOGRAPHIC FULL STOP).
fguerraz5 hours ago
I found the horse revenge-porn image at the end quite disturbing.
- embedding-shape4 hours ago
  I think they call it "horse riding a human" which could have taken two very different directions, and the direction the model seems to have taken was the least worst of the two.
  - wongarsu4 hours ago
    At first I thought it's a clever prompt because you see which direction the model takes it, and whether it "corrects" it to the more common "human riding a horse" similar to the full wine glass test.
    But if you translate the actual prompt the term riding doesn't even appear. The prompt describes the exact thing you see in excruciating detail.
    "... A muscular, robust adult brown horse standing proudly, its forelegs heavily pressing between the shoulder blades and spine of a reclining man ... and its eyes sharp and focused, exuding a primal sense of power. The subdued man is a white male, 30-40 years old, his face covered in dust and sweat ... his body is in a push-up position—his palms are pressed hard against the cracked, dry earth, his knuckles white, the veins in his arms bulging, his legs stretched straight back and taut, his toes digging into the ground, his entire torso trembling slightly from the weight ..."
    embedding-shape3 hours ago
    > But if you translate the actual prompt the term riding doesn't even appear. The prompt describes the exact thing you see in excruciating detail.
    Yeah, as they go through their workflow earlier in the blog post, that prompt they share there seems to be generated by a different input, then that prompt is passed to the actual model. So the workflow is something like "User prompt input -> Expand input with LLMs -> Send expanded prompt to image model".
    So I think "human riding a horse" is the user prompt, which gets expanded to what they share in the post, which is what the model actually uses. This is also how they've presented all their previous image models, by passing user input through a LLM for "expansion" first.
    Seems poorly thought out not to make it 100% clear what the actual humanly-written prompt is though, not sure why they wouldn't share that upfront.
  - chakintosh30 minutes ago
    Is it related to "Mr Hands" ?
- blitzar3 hours ago
  Wont someone think of the horses.
dsrtslnd235 hours ago
unfortunately no open weights it seems.
- embedding-shape3 hours ago
  To be fair, didn't they release open weights image model only like a ~month ago? Think last one was in December 2025.
wiether4 hours ago
I use gen-AI to produce images daily, but honestly the infographics are 99% terrible.
LinkedIn is filled with them now.
- smcleod4 hours ago
  To be fair it hasn't made LinkedIn any worse than it already was.
  - nurettin4 hours ago
    To be fair, it is hard to make LinkedIn any worse.
    embedding-shape3 hours ago
    I was gonna make a joke about "Wish granted, now Microsoft owns it" but then I remembered that they already do. Reality sometimes makes better jokes than what we can come up with.
- viraptor3 hours ago
  Informatics are as bad as the author allows though. There's few people who could make or even describe a good infographic, so that's what we see in the results too.
- mdrzn3 hours ago
  Infographics and full presentations are a NanoBananaPro exclusive so far.
- usefulposter4 hours ago
  Correct.
  Much like the pointless ASCII diagrams in GitHub readmes (big rectangle with bullet points flows to another...), the diagrams are cognitive slurry.
  See Gas Town for non-Qwen examples of how bad it can get:
  https://news.ycombinator.com/item?id=46746045
  (Not commenting on the other results of this model outside of diagramming.)
  - viraptor3 hours ago
    > cognitive slurry
    Thank you for this phrase. I don't think that bad diagrams are limited to the AI in any way and this perfectly describes all "this didn't make things any clearer" cases.
cocodill5 hours ago
interesting riding application picture
- rwmj4 hours ago
  "Guy being humped by a horse" wouldn't have been my first choice for demoing the capabilities of the model, but each to their own I guess.
  - viraptor3 hours ago
    It looks like a marketing move. It's a good quality, detailed picture. It's going to get shared a lot. I would assume they knew exactly what they were doing. Nothing like a bit of controversy for extra clicks.
    brookst2 hours ago
    Because every ML researcher is a viral social media expert.
    (I don’t even know if I’m being sarcastic)
    viraptor2 hours ago
    This is not some random ML researcher doing fun things at home. Qwen is backed by Alibaba cloud. They likely have whole departments of marketing people available.
goga-piven4 hours ago
Why is the only image featuring non-Asian men the one under the horse?
- z3dd4 hours ago
  they explicitly called for that in the prompt
  - goga-piven4 hours ago
    Exactly why did they choose this prompt with a white person and not an Asian person, as in all the other examples?
  - wtcactus4 hours ago
    But why? That image actually puzzled me. Does it have some background context? Some historical legend or something of the like?
    joeycodes4 hours ago
    It is Lunar New Year season right now, 2026 is year of the horse, there is celebratory horse imagery everywhere in many Asian countries right now, so this image could be interpreted as East trampling West. I have no way to know the intention of the person at Qwen who wrote this, but you can form your own conclusions from the prompt:
    A muscular, robust adult brown horse stands proudly, its forelegs heavily pressing between the shoulder blades and spine of a reclining man. Its hind legs are taut, its neck held high, its mane flying against the wind, its nostrils flared, and its eyes sharp and focused, exuding a primal sense of power. The subdued man is a white male...
    wtcactus4 hours ago
    So, it’s just racism, pure and simple.
    badhorseman3 hours ago
    [dead]
- andruby4 hours ago
  Is the problem the position/horse or that Qwen mostly shows asian people?
  Do western AI models mostly default to white people?
  - goga-piven4 hours ago
    Well, what if some western models showcase white people in all good-looking images and the only embarrassing image features Asian people? wouldn't that be considered racism?
    embedding-shape3 hours ago
    > and the only embarrassing image
    Embarrassing image? I'm white, why would I be embarrassed over that image? It's a computer generated image with no real people in it, how could it be embarrassing for alive humans?
    badhorseman2 hours ago
    I assume you feel the same about Nazi propaganda or Racist caricatures of black people. since it's not real and just a drawing.
    embedding-shapean hour ago
    Yeah, why would I feel embarrassed over either of those things? I get angry when I see nazi propaganda, feel hopeless sometimes when I see racist caricatures, but never "embarrassed", that wouldn't make much sense. What would I be embarrassed about exactly?
    badhorseman35 minutes ago
    Indeed if ones own race is not being denigrated one would not feel embarrassed, although one may be embarrassed that racist material was created by their people. If ones own race is being denigrated then one may indeed feel embarrassment and perhaps also the anger and hopelessness. As for why exactly embarrassment if the purpose is to degrade by pointing some reason why the author holds your people in contempt and you are indeed hopeless as to stop it, shame and embarrassment is often what is felt.
    In another post you talked about people getting mad at the image without context What context are we missing exactly. I do not feel ill informed or angry. But I could indeed be missing something, can you explain the context? If you where to say it's because of the LLM adding more context then that could be plausible, but why the medieval and hemp-rope? I know how sensitive the western companies have been on their models getting rid of negative racial stereo-types, going as far as to avoid and modify certain training data, would you accept an LLM producing negative stereotypes or tending to put one particular racial group into a submissive situation then others?
    I really do feel like the idea that the LLM would just take the prompt A human male being ridden by a horse to include all those other details and go straight for a darker, somber tone and expression and a dynamic of domination and submission rather then a more humorous description, unlikely.
    embedding-shape30 minutes ago
    > although one may be embarrassed that racist material was created by their people
    Why? I don't see that. Are black people embarrassed if a black person commits a crime, yet not embarrassed if a white person commits a crime? That sounds very contrived to me and not at all how things work in reality.
    > If ones own race is being denigrated then one may indeed feel embarrassment
    I also don't understand this. Why would every white person feel any sort of embarrassment over images denigrating white people? Feel hate, anger or lots of other emotions, that'd make sense. But I still don't understand why "embarrassment" or shame is even on the table, embarrassment over what exactly? That there are racists?
  - wtcactus4 hours ago
    > Do western AI models mostly default to white people?
    No, they mostly default to black people even in historical contexts where they are completely out of place, actually. [1]
    "Google paused its AI image-generator after Gemini depicted America's founding fathers and Nazi soldiers as Black. The images went viral, embarrassing Google."
    [1] https://www.npr.org/2024/03/18/1239107313/google-races-to-fi...
    viraptor4 hours ago
    > they mostly default to black people
    You're referring to a case of one version of one model. That's not "mostly" or "default to".
    raincole4 hours ago
    Out of curiosity I just tried this prompt:
    > Generate a photo of the founding fathers of a future, non-existing country. Five people in total.
    with Nano Banana Pro (the SOTA). I tried the same prompt 5 times and every time black people are the majority. So yeah, I think the parent comment is not that far off.
    viraptor3 hours ago
    Luck? 1 black person, 3 south Asian in total for me.
    But for an out of context imaginary future... why would you choose non-black people? There's about the same reason to go with any random look.
    wtcactus3 hours ago
    So, the answer to the question "Do western AI models mostly default to white people?" is clearly a resounding: no, they don't.
    viraptor3 hours ago
    No. But neither black people. Or anyone specifically. So we got to a nice balance it seems.
    KingMob3 hours ago
    I mean it's still far off, because they said "historical context", i.e., the actual past, but your prompt is about a hypothetical future.
    (I suspect you tried a prompt about the original founding fathers, and found it didn't make that mistake any more.)
    wtcactus3 hours ago
    The question was "Do western AI models mostly default to white people?" and the answer is no, they don't, they mostly default to black people. And some examples are so egregious, that even in historical settings, they replace the white people by black people.
    KingMob3 hours ago
    Yes, but you said "they mostly default to black people". Nice try at moving the goal posts.
    Anyway, you're tagged as "argued Musk salute wasn't nazi", so your ability to parse history is a little damaged.
    wtcactus18 minutes ago
    Exactly where were my goal posts moved? I stated yet again "mostly default to black people". You are the one trying to move the goal posts because you just answered this message:
    "I just tried this prompt:
    > Generate a photo of the founding fathers of a future, non-existing country. Five people in total.
    I tried the same prompt 5 times and every time black people are the majority"
    Do you understand the concept of "mostly defaulting" to something, and how that is directly related to a group of "people [being] the majority"?
    > Anyway, you're tagged as "argued Musk salute wasn't nazi", so your ability to parse history is a little damaged.
    I don't really care what communists think since you aren't rational people. If you have any actual statement to make and for me to deconstruct again while pointing out your inability to follow through with basic logic or facts, please let me know.
skerit5 hours ago
> Qwen-Image-2.0 not only accurately models the “riding” action but also meticulously renders the horse’s musculature and hair > https://qianwen-res.oss-accelerate-overseas.aliyuncs.com/Qwe...
What the actual fuck
- wongarsu4 hours ago
  For reference, below is the prompt translated (with my highlighting of the part that matters). They did very much ask for this version of "horse riding a man", not the "horse sitting upright on a crawling human" version
  ---
  A desolate grassland stretches into the distance, its ground dry and cracked. Fine dust is kicked up by vigorous activity, forming a faint grayish-brown mist in the low sky.
  Mid-ground, eye-level composition: A muscular, robust adult brown horse stands proudly, its forelegs heavily pressing between the shoulder blades and spine of a reclining man. Its hind legs are taut, its neck held high, its mane flying against the wind, its nostrils flared, and its eyes sharp and focused, exuding a primal sense of power. The subdued man is a white male, 30-40 years old, his face covered in dust and sweat, his short, messy dark brown hair plastered to his forehead, his thick beard slightly damp; he wears a badly worn, grey-green medieval-style robe, the fabric torn and stained with mud in several places, a thick hemp rope tied around his waist, and scratched ankle-high leather boots; his body is in a push-up position—his palms are pressed hard against the cracked, dry earth, his knuckles white, the veins in his arms bulging, his legs stretched straight back and taut, his toes digging into the ground, his entire torso trembling slightly from the weight.
  The background is a range of undulating grey-blue mountains, their outlines stark, their peaks hidden beneath a low-hanging, leaden-grey, cloudy sky. The thick clouds diffuse a soft, diffused light, which pours down naturally from the left front at a 45-degree angle, casting clear and voluminous shadows on the horse's belly, the back of the man's hands, and the cracked ground.
  The overall color scheme is strictly controlled within the earth tones: the horsehair is warm brown, the robe is a gradient of gray-green-brown, the soil is a mixture of ochre, dry yellow earth, and charcoal gray, the dust is light brownish-gray, and the sky is a transition from matte lead gray to cool gray with a faint glow at the bottom of the clouds.
  The image has a realistic, high-definition photographic quality, with extremely fine textures—you can see the sweat on the horse's neck, the wear and tear on the robe's warp and weft threads, the skin pores and stubble, the edges of the cracked soil, and the dust particles. The atmosphere is tense, primitive, and full of suffocating tension from a struggle of biological forces.
  - badhorseman3 hours ago
    The significance of the hemp-rope is that it is symbol of morning and loss of ones decedent.
- embedding-shape3 hours ago
  I like how sometimes I get angry at a LLM for not understanding what I meant, but then I realize that I just forgot to mention it in the context. It's fun to see the same thing happen in humans reading websites too, where they don't understand the context yet react with strong feelings anyways.
Deukhoofd5 hours ago
The text rendering is quite impressive, but is it just me or do all these generated 'realistic' images have a distinctly uncanny feel to it. I can't quite put my finger on it what it is, but they just feel off to me.
- finnjohnsen25 hours ago
  I agree. They makes me nauseous. The same kind of light nausea as car sickness.
  I assume our brains are used to stuff which we dont notice conciously, and reject very mild errors. I've stared at the picture a bit now and the finger holding the baloon is weird. The out of place snowman feels weird. If you follow the background blur around it isnt at the same depth everywehere. Everything that reflects, has reflections that I cant see in the scene.
  I dont feel good staring at it now so I had to stop.
  - jbl0ndie4 hours ago
    Sounds like you're describing the uncanny valley https://en.wikipedia.org/wiki/Uncanny_valley
- elorant4 hours ago
  The lighting is wrong, that's what's telling to me. They look too crisp. No proper shadows, everything looks crystal clear.
  - techpression4 hours ago
    It’s the HDR era all over again, where people edited their photos to lack all contrast and just be ultra flat.
- brookst2 hours ago
  Everything is weightless. When real people stand and gesture there’s natural muscle use, hair and clothing drape, papers lay flat on surfaces.
- likium5 hours ago
  At least for the real life pictures, there’s no depth of field. Everything is crystal clear like it’s composited.
  - derefr5 hours ago
    > like its composited
    Like focus stacking, specifically.
    I’m always surprised when people bother to point out more-subtle flaws in AI images as “tells”, when the “depth-of-field problem” is so easily spotted, and has been there in every AI image ever since the earliest models.
    Mashimo5 hours ago
    I had no problems getting images with blurry background with the appropriate prompts. Something like "shallow depth of fields, bokeh, DSLR" can lead to good results. https://cdn.discordapp.com/attachments/1180506623475720222/1... [0]
    But I found that that results in more professional looking images, and not more realistic photos.
    Adding something like "selfy, Instagram, low resolution, flash" can lead to a .. worse image that looks more realistic.
    [0] I think I did this one with z image turbo on my 4060 ti
    afro884 hours ago
    The blur isn't correct though. Like the amount of blur is wrong for the distance, zoom amount etc. So the depth of field is really wrong even if it conforms to "subject crisp, background blurred"
  - albumen5 hours ago
    Every photoreal image on the demo page has depth of field, it’s just subtle.
- BoredPositron5 hours ago
  Qwen always suffered from their subpar rope implementation and qwen 2 seems to suffer from it as well. The uncanny feel is down to the sparsity of text to image token and the higher in resolution you go the worse it gets. It's why you can't take the higher ends of the MP numbers serious no matter the model. At the moment there is no model that can go for 4k without problems you will always get high frequency artifacts.
- belter5 hours ago
  Agree, looks like the same effect they are applying on YouTube Shorts...
- GaggiX5 hours ago
  For me the only model that can really generate realistic images is nano banana pro (also known as gemini-3-pro-image). Other models are closing the gap, this one is pretty meh in my opinion in realistic images.
  - Mashimo5 hours ago
    You can get flux and maybe z-image to do so, but you have to experiment with the promt a bit. Or maybe get an LoRa to help.
    cubefox4 hours ago
    The examples I saw of z-image look much more realistic than Nano Banana Pro, which is likely using Imagen 4 (plus editing) internally, which isn't very realistic. But Nano Banana Pro has obviously much better prompt alignment than something like z-image.
    GaggiX4 hours ago
    Are you sure you are not confusing nano banana pro for nano banana, z-image still has a bit of AI look that I do not find with nano banana pro, example for a comparison: https://i.ibb.co/YFtxs4hv/594068364-25101056889517041-340369...
    Also Imagen 4 and Nano Banana Pro are very different models.
    cubefox2 hours ago
    In your example, z-image and Nano Banana Pro look basically equally photorealistic to me. Perhaps the NBP image looks a bit more real because it resembles an unstaged smartphone shot with wide angle. Anyway, the difference is very small. I agree the lighting in Flux.2 Pro looks a bit off.
    But anyway, realistic environments like a street cafe are not suited to test for photorealism. You have to use somewhat more fantastical environments.
    I don't have access to z-image, but here are two examples with Nano Banana Pro:
    "A person in the streets of Atlantis, portrait shot." https://i.ibb.co/DgMXzbxk/Gemini-Generated-Image-7agf9b7agf9...
    "A person in the streets of Atlantis, portrait shot (photorealistic)" https://i.ibb.co/nN7cTzLk/Gemini-Generated-Image-l1fm5al1fm5...
    These are terribly unrealistic. Far more so than the Flux.2 Pro image above.
    > Also Imagen 4 and Nano Banana Pro are very different models.
    No, Imagen 4 is a pure diffusion model. Nano Banana Pro is a Gemini scaffold which uses Imagen to generate an initial image, then Gemini 3 Pro writes prompts to edit the image for much better prompt alignment. The prompts above a very simple, so there is little for Gemini to alter, so they look basically identical to plain Imagen 4. Both pictures (especially the first) have the signature AI look of Imagen 4, which is different from other models like Imagen 3.
    By the way, here is GPT Image 1.5 with the same prompts:
    "A person in the streets of Atlantis, portrait shot." https://i.ibb.co/Df8nDHFL/Chat-GPT-Image-10-Feb-2026-14-17-1...
    "A person in the streets of Atlantis, portrait shot (photorealistic)" https://i.ibb.co/Nns4pdGX/Chat-GPT-Image-10-Feb-2026-14-17-2...
    The first is very fake and the second is a strong improvement, though still far from the excellent cafe shots above (fake studio lighting, unrealistic colors etc).
    GaggiX41 minutes ago
    >In your example, z-image and Nano Banana Pro look basically equally photorealistic to me
    I disagree, nano banana pro result is on a completely different league compare to flux.2 and z-image.
    >But anyway, realistic environments like a street cafe are not suited to test for photorealism
    Why? It's the perfect settings in my opinion.
    Btw I don't think you are using nano banana pro, probably standard nano banana, I'm getting this from your prompt: https://i.ibb.co/wZHx0jS9/unnamed-1.jpg
    >Nano Banana Pro is a Gemini scaffold which uses Imagen to generate an initial image, then Gemini 3 Pro writes prompts to edit the image for much better prompt alignment.
    First of all how should you know the architecture details of gemini-3-pro-image, second of all how the model can modify the image if gemini itself is just rewriting the prompt (like old chatgpt+dalle), imagen 4 is just a text-to-image model, not an editing one, it doesn't make sense, nano banana pro can edit images (like the ones you can provide).
    cubefox31 minutes ago
    > I disagree, nano banana pro result is on a completely different league.
    I strongly disagree. But even if you are right, the difference between the cafe shots and the Atlantis shots is clearly much, much larger than the difference between the different cafe shots. The Atlantis shots are super unrealistic. They look far worse than the cafe shots of Flux.2 Pro.
    > Why? It's the perfect settings in my opinion
    Because it's too easy obviously. We don't need an AI to make fake realistic photos of realistic environments when we can easily photograph those ourselves. Unrealistic environments are more discriminative because they are much more likely to produce garbage that doesn't look photorealistic.
    > Btw I don't think you are using nano banana pro, I'm getting this from your prompt: https://i.ibb.co/wZHx0jS9/unnamed-1.jpg
    I'm definitely using Nano Banana Pro, and your picture has the same strong AI look to it that is typical of NBP / Imagen 4.
    > First of all how should you know the architecture details of gemini-3-pro-image, second of all how the model can modify the image if gemini itself is just rewriting the prompt (like old chatgpt+dalle), imagen 4 is just a text-to-image model, not an editing one, it doesn't make sense, nano banana pro can edit images (like the ones you can provide).
    There were discussions about it previously on HN. Clearly NBP is using Gemini reasoning, and clearly the style of NBP strongly resembles Imagen 4 specifically. There is probably also a special editing model involved, just like in Qwen-Imahe-2.0.
cubefox5 hours ago
The complex prompt following ability and editing is seriously impressive here. They don't seem to be much behind OpenAI and Google. Which is backed op by the AI Arena ranking.
4 hours ago
undefined
badhorseman4 hours ago
[dead]
yieldcrv4 hours ago
when the horsey tranq hits
singularfutur4 hours ago
Another closed model dressed up as "coming soon" open source. The pattern is obvious: generate hype with a polished demo, lock the weights, then quietly move on. Real open source doesn't need a press release countdown.
- kkzz993 hours ago
  Good that we have the arbitrator of what "real open source" is and isn't over here.
- yorwba2 hours ago
  Where do you see a press release countdown? Alibaba consistently doesn't release weights for their biggest models, but they also don't pretend that they do.
  - singularfutur14 minutes ago
    fair enough, I read too quickly the article.