I think it's the fine-tuning, because you can find AI photos that look more like real ones. I guess people prefer obviously fake-looking 'picturesque' photos to more realistic ones? Maybe it's just because the money is in selling to people generating marketing materials? NB is clearly the only model here that permits a half-eaten burrito to actually appear to have been bitten.
It looks like they took the page down now though...
I get that it allows ensuring you're testing the model's capabilities rather than the prompt, but most models are being post-trained with very different prompting formats.
I use Seedream in production, so I was a little suspicious of the gap: I passed ByteDance's official prompting guide, OP's prompt, and your feedback to Claude Opus 4.5 and got this prompt to create a new image:
> A partially eaten chicken burrito with a bite taken out, revealing the fillings inside: shredded cheese, sour cream, guacamole, shredded lettuce, salsa, and pinto beans all visible in the cross-section of the burrito. Flour tortilla with grill marks. Taken with a cheap Android phone camera under harsh cafeteria lighting. Compostable paper plate, plastic fork, messy table. Casual unedited snapshot, slightly overexposed, flat colors.
Then I generated with n=4 and the 'standard' prompt expansion setting for Seedream 4.0 Text To Image:
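For anyone who wants to reproduce this, here's roughly what that batched call looks like. Treat it as a sketch only: the endpoint, model id, and parameter names below are placeholders, since the actual field names differ depending on which platform hosts Seedream for you.

```python
import requests

# Placeholder endpoint/field names: swap in your provider's actual
# Seedream 4.0 text-to-image API (hosted platforms use different schemas).
API_URL = "https://example.com/v1/images/generations"
API_KEY = "YOUR_KEY"

prompt = (
    "A partially eaten chicken burrito with a bite taken out, revealing the "
    "fillings inside: shredded cheese, sour cream, guacamole, shredded lettuce, "
    "salsa, and pinto beans all visible in the cross-section of the burrito. "
    "Flour tortilla with grill marks. Taken with a cheap Android phone camera "
    "under harsh cafeteria lighting. Compostable paper plate, plastic fork, "
    "messy table. Casual unedited snapshot, slightly overexposed, flat colors."
)

resp = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "seedream-4.0",        # placeholder model id
        "prompt": prompt,
        "n": 4,                         # sample several images; single rolls are noisy
        "prompt_expansion": "standard", # placeholder name for the expansion toggle
    },
    timeout=120,
)
resp.raise_for_status()
for i, image in enumerate(resp.json()["data"]):
    print(i, image.get("url"))
```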
They're still not perfect (they don't adhere to the fillings being inside, for example), but they're massively better than OP's result.
This shows that a) random chance plays a big part, so you want more than one sample, and b) you don't have to "cheat" by spending massive amounts of time hand-iterating on a single prompt to get a better result.
Including a "total rolls" count is a very valuable metric, since it helps indicate how steerable the model is.
But individual users usually iterate/pick, so just sharing a blurb about your preference is probably enough if you choose 1 of n
It's just not as plasticky and oversaturated as the others.
The table grain is the only thing that gives it away; if it weren't for that, no one without advance warning would notice that it's not real.
I agree with you. The Nano Banana Pro burrito is almost perfect; the wood-grain direction/perspective is the only questionable element.
Almost no one would ID that as being AI.
And yeah, the focal plane is wonky. If you try to draw a box around what's in focus, you end up with something that doesn't make sense given where the "camera" is: the focal plane runs at a diagonal, so the salsa is all in perfect focus, but one of the beans, which appears to be the exact same distance away, is subtly out of focus.
I mean, it's not bad, but it doesn't actually look like a real burrito either. That said, I'm not sure how much I'd notice at a casual glance.
Earlier this week I did some A/B testing with AV1 and HEVC video encoding. At similar bit rates there was a difference, but I had to know what to look for and rapidly cycle between the same frame from both files, and even then... barely. The difference disappeared when I hit play, and that's after knowing what to look for.
For anyone curious: if you are targeting 5-10 Mbps from a Blu-ray source, AV1 will end up slightly smaller (5-10%) with slightly more retention of film grain in darker areas. Target 10 Mbps with a generous buffer (25 MB) and a max bit rate (25 Mbps), and you'll get really efficient bit rates in dark scenes while building up a reserve of bandwidth for confetti-like situations. The future is bright for hardware video encoding/decoding with royalty-free codecs. Conclusion: prefer AV1 at 5-10 Mbps, but it's no big deal if it's not an option.
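If it helps anyone replicate this, those targets translate roughly into the following ffmpeg invocation. It's a sketch, not a recipe: the filenames and preset are placeholders, the buffer value is my own interpretation of "25 MB" (ffmpeg's -bufsize takes bits, not bytes), and maxrate/bufsize pass-through depends on your ffmpeg and SVT-AV1 versions.

```python
import subprocess

# Capped-VBR AV1 encode along the lines described above.
# Filenames, preset, and exact buffer size are assumptions.
cmd = [
    "ffmpeg", "-i", "input.mkv",
    "-c:v", "libsvtav1",
    "-preset", "6",          # speed/quality trade-off; tune to taste
    "-b:v", "10M",           # ~10 Mbps average target
    "-maxrate", "25M",       # ceiling for confetti-like scenes
    "-bufsize", "50M",       # rate-control buffer (ffmpeg units are bits)
    "-c:a", "copy",          # leave audio untouched
    "output.mkv",
]
subprocess.run(cmd, check=True)
```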
The “partially eaten” part of the prompt is interesting… everyone knows what a half-eaten burrito looks like, but clearly the computers struggle.
For some reason, ever since DALL-E 2, every model has seemed to generate obviously fake food and/or misinterpret the fun constraints... until Nano Banana. Now I can generate fractal Sierpiński triangle peanut butter and jelly sandwiches.
I can kind of see what you mean, in that it went for realism in the aesthetics but not the object... still, that last one would probably fool me if I were scrolling.
Even ignoring the Heinz bean outliers, these are all decidedly Scottsdale. With one exception. All hail Nano Banana.
Do people get burritos with beans in them more or less as pictured? Aesthetically, it looks like it'd be pretty appealing to someone who loves beans, compared to what I had in mind, but again I'm really in no position to judge these images based on bean appearance.
1. The text encoders (e.g. CLIP) are primitive and have difficulty with nuance such as "partially eaten", and model training can only partially overcome that; the sketch after this list shows one way to check it yourself. It's the same issue as with the now-obsolete "half-filled" wine glass test.
2. Most models are diffusion-based, which means they denoise the entire image simultaneously. If the model fails to account for the nuance in the first few denoising steps, it can't go back and fix it.
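On point 1, here's a quick way to see how little a CLIP-style text encoder distinguishes the nuance. This is a sketch assuming the Hugging Face transformers library and the openai/clip-vit-base-patch32 checkpoint (any CLIP variant would do): embed two prompts that differ only in the "partially eaten" detail and check how close they land.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

# Compare CLIP text embeddings for prompts that differ only in the
# "partially eaten" nuance. A cosine similarity near 1.0 means the encoder
# barely distinguishes them. Checkpoint choice here is an assumption.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

prompts = ["a photo of a burrito", "a photo of a partially eaten burrito"]
inputs = processor(text=prompts, return_tensors="pt", padding=True)
with torch.no_grad():
    emb = model.get_text_features(**inputs)
emb = emb / emb.norm(dim=-1, keepdim=True)
print("cosine similarity:", (emb[0] @ emb[1]).item())
```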
I believe some image-generation AIs were RLHFed like chatbot LLMs, but more to improve aesthetics than prompt adherence.