https://genai-showdown.specr.net/image-editing
It scored slightly higher than BFL's Kontext model, coming in around the middle of the pack at 6 / 12 points.
I’ll also be introducing an additional numerical metric soon, so we can add more nuance to how we evaluate model quality as they continue to improve.
If you're solely interested in seeing how Flux 2 Pro stacks up against the Nano Banana Pro, and another Black Forest model (Kontext), see here:
https://genai-showdown.specr.net/image-editing?models=km,nbp...
Note: It should be called out that BFL seems to support a more formalized JSON structure for more granular edits so I'm wondering if accuracy would improve using it.
It's pretty obvious that OpenAI is terrible at it -- it is known for its unmissable touch. However, for Flux it really depends on the style. They already posted at some point that they changed their training to avoid averaging different styles together, which is the ultimate AI look. But this is at odds with the goal to directly generate images that are visually appealing, so the style matching is going to be a problem for a while, at least.
Generative: https://genai-showdown.specr.net
Editing: https://genai-showdown.specr.net/image-editing
Style is mostly irrelevant for editing, since the goal is to integrate seamlessly with the existing image. The focus is on performing relatively surgical edits or modifications to existing imagery while minimizing changes to the rest of the image. It is also primarily concerned with realism, though there are some illustrative examples (the JAWS poster, Great Wave off Kanagawa).
This contrasts with the generative section, though even there the emphasis is on prompt adherence, and style/fidelity take a backseat (which is honestly what 99% of existing generative benchmarks already focus on).
If you look for example at "Mermaid Disciplinary Committee", every single image is in a very different style, each of which you can consider the model's default for that specific prompt. It's quite obvious that these styles were 'baked into' the models, and it's not clear how much you can steer them toward a specific style. If you look at "The Yarrctic Circle", a lot more models default to a kind of "generic concept art" style (the "by greg rutkowski" meme), but even then I would classify the results as at least 5 distinct styles. So for me this benchmark is not checking style at all, unless you consider style to be just around 4 categories (cartoon, anime, realistic, painterly).
So regarding image editing, I did my own tests at the first release of Flux tools, and found that it was almost impossible to get any decent results on some specific styles, specifically cartoon and concept art styles. I think the tools focus on what imaginary marketing people would want (like "put this can of sugary beverage into an idyllic scene") rather than such use cases. So editing like "color this" or other changes would just be terrible, and certainly unusable.
https://woolion.art/assets/img/ai/ai_editing.webp
The order is original, ChatGPT, Flux.
Still, you can see that ChatGPT just throws everything out and doesn't make even a minimal attempt at respecting the style. Flux is quite bad, but it follows the design much more closely (although it gets completely confused by it), so it seems that with a whole lot of work you could get something out of it.
"Using the attached images as stylistic references, create an image of X"
It falls down pretty hard.
https://imgur.com/a/failed-style-transfer-nb-pro-o3htsKn
Now you should be able to see that the generated image is stylistically not even close to the references (which are early works by Yoichi Kotabe). Pay careful attention to the characters.
With locally hostable models, you can try things like Reference/Shuffle ControlNets but that's not always successful either.
Seedream is also very good and makes me think the next version will challenge Google for SOTA image gen
Increasingly feels like image gen is a solved problem
Also, it doesn't feel solved to me at all. There is no general model; perhaps it cannot reasonably exist. I think these tests and benchmarks are smart, but they don't show the whole picture.
Domain-specific image generation tasks still require domain-specific models. For art purposes, SD1.5 with specialized and finely tuned checkpoints will still provide the best results by far. It is also limited, but I think it dampened the hype for new image generators significantly.
I understand most outputs could be fine tuned for most domains, but still felt sd1.5 had a resolution ceiling, and a complexity ceiling no matter how good the fine tuning
There's not much of a reason to use SD 1.5 over SDXL if image quality is paramount.
A lot of people (myself included) use a pipeline that involves using Flux to get the basic action / image correct, then SDXL as a refiner and finally a decent NMKD-based upscaler.
I don't know if I'm going to get as granular as 1-10 only because the finer the scoring - the more potential for subjectivity. That's why it was initially set up as a "Minimum Passing Criteria Rule Set" along with a Pass/Fail grade.
A suggestion from a previous HN post was something along the lines of (0 Fail, 0.5 Technical Pass, 1.0 Proficient Pass).
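For what it's worth, here is a minimal sketch of how that three-tier scheme could be tallied. The `Grade` and `total_score` names and the sample results are made up for illustration; they are not taken from the actual site.

```python
from enum import Enum


class Grade(Enum):
    """Three-tier grading from the suggestion above."""
    FAIL = 0.0
    TECHNICAL_PASS = 0.5
    PROFICIENT_PASS = 1.0


def total_score(grades):
    """Sum the point value of every graded test."""
    return sum(g.value for g in grades)


# Hypothetical results for a 12-test suite (not real benchmark data).
grades = (
    [Grade.PROFICIENT_PASS] * 4
    + [Grade.TECHNICAL_PASS] * 4
    + [Grade.FAIL] * 4
)
print(f"{total_score(grades)} / {len(grades)}")
```

The coarse tiers keep most of the simplicity of Pass/Fail while still separating a barely acceptable result from a clean one.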
If their new fancy model is only middle of the pack, and they're not as open source as the Chinese Qwen image models (or ByteDance / Alibaba / Lightricks video models), what's the point?
It's not just prompt adherence; the image quality of Flux models has been pretty bad. Plastic skin, inhumanly chiseled chins, that general faux "AI" aura.
Indeed, the Flux samples in your test suite that "pass" look God-awful. It might "pass" from a technical standpoint, but there's no way I'd choose Flux to solve my workflows. It looks bad.
(I wonder if they lack people on their data team with good aesthetic taste. It may be as simple as that.)
I think this company is struggling. They're pinned between Google and the Chinese. It's a tough, unenviable spot to be in.
I think a lot of the foundation model companies in media are having a really hard time: RunwayML, PikaLabs, LumaLabs. Some of them have pivoted hard away from solving media for everyone. I don't think they can beat the deep-pocketed hyperscalers or the Chinese ecosystem.
BFL just raised a massive round, so what do I know? I just can't help but feel that even though Runway raised similar money, they're struggling really hard now. And I would really not want to be fighting against Google who is already ahead in the game.
in fact, it seems like BFL has benefited a lot by becoming the go-to alternative for big enterprise customers who don't want to be dependent on google
That's why they raised the massive round, then.
But this just leads to more questions - I have to wonder if and for how long this is just going to be to plug in a gap for Meta's own AI product offering. At some point they'll want to build their own in-house models or perhaps just acquire BFL. Zuckerberg would not be printing AI data centers if that wasn't the case.
From a PLG standpoint, Flux isn't really what graphics designers are choosing for their work. The generations look worse than OpenAI's "piss filter". But aesthetics might not be the play the team is going after.
Hopefully they don't just raise all of this dry powder energy and burn it trying to race Google. They should start listening to designers and get in their good graces if their intent is to build tools for art and graphics design work.
A good press release would consist of lots of good looking images and a video of workflows that save artists time. This press release doesn't connect with graphics designers at all and it reads as if they aren't even the audience.
If it's something else, more "enterprise", that BFL is after, then maybe I don't know the strategy or game plan.
the Chinese models are great, but no serious enterprise developer is going to bet their image workloads at scale in production on Chinese models if the market evolves anything like past developer infrastructure
I wonder if this architectural change makes it easier to use other vision models such as the ones in Llama 3 and 4, or possibly a future Llama 5.
Flux 2 Pro only scored a single point higher than the Kontext models they released over half a year ago.
The text-to-image side was even more frustrating. It often felt like it was actively fighting me, as evidenced by the high number of re-rolls required before it passed some of the tests (Cubed⁵, for example).
If they have so much data, then why do Flux model outputs look so God-awful bad?
They have plastic skin, weird chins, and have that "AI" aura. Not the good AI aura, mind you. The cheap automated YouTube video kind that you immediately skip.
Flux 2 seems to suffer from the exact same problems.
Midjourney is ancient. Their CEO is off trying to build a 3D volume and dating companion or some nonsense and leaving the product without guidance and much change. It almost feels abandoned. But even so, Midjourney has 10,000x better aesthetics despite having terrible prompt adherence and control. Midjourney images are dripping with magazine spread or Pulitzer aesthetics. It's why Zuckerberg went to them to license their model instead of quasi "open source" BFL.
Even SDXL looks better, and that's a literal dinosaur.
Most of the amazing things you see on social media either come from Midjourney or SDXL. To this day.
I’m not saying you are wrong in effect, but for reference, SDXL was released just slightly over 2 years ago, and it took about a year to have great fine-tunes.
LTX's first model felt two years behind SOTA when it launched, but they viewed it as a success and kept going.
The investment initially is low and can scale with confidence.
BFL goes radio silent and then drops stuff. Now they're dropping stuff that is clearly middle of the pack.
I'd take it with a grain of salt; these people are chainsaw jugglers and know what they're doing, so any sort of major hiccup was probably planned for. They'd have plan b and c, at a minimum, and be ready to switch - the work isn't deterministic, so you have to be ready for failures. (If you sense an imminent failure, don't grab the spinny part of the chainsaw, let it fall and move on.)
a ‘major training run’ only becomes major after you sample from it iteratively every few thousand steps, check it’s good, fix your pipeline, then continue
almost by design, major training runs don’t fail
if I had to guess, like most labs, they’ve probably had to reallocate more time and energy to their image models than expected since the AI image editing market has exploded in size this year, and will do video later
If they found that their architecture worked better on static images then it is better to pivot to that than wasting the effort. Especially if you have a trained model that is good at producing static images and bad at generating video.
Almost all of the control in image-to-video comes through an image. And image models still need a lot of work and innovation.
On a real physical movie set, think about all of the work that goes into setting the stage. The set dec, the makeup, the lighting, the framing, the blocking. All the work before calling "action". That's what image models do and must do in the starting frame.
We can get way more influence out of manipulating images than video. There are lots of great video models and it's highly competitive. We still have so much need on the image side.
When you do image-to-video, yes you control evolution over time. But the direction is actually lower in terms of degrees of freedom. You expect your actors or explosions to do certain reasonable things. But those 1024x1024xRGB pixels (or higher) have way more degrees of freedom.
Image models have more control surface area. You exercise control over more parameters. In video, staying on rails or certain evolutionary paths is fine. Mistakes can not just be okay, they can be welcome.
It also makes sense that most of the work and iteration goes into generating images. It's a faster workflow with more immediate feedback and productivity. Video is expensive and takes much longer. Images are where the designer or director can influence more of the outcomes with rapidity.
Image models still need way more stylistic control, pose control (not just ControlNets for limbs, but facial expressions, eyebrows, hair - everything), sets, props, consistent characters and locations and outfits. Text layout, fonts, kerning, logos, design elements, ...
We still don't have models that look as good as Midjourney. Midjourney is 100x more beautiful than anything else - it's like a magazine photoshoot or dreamy Instagram feed. But it has the most lackluster and awful control of any model. It's a 2021-era model with 2030-level aesthetics. You can't place anything where you want it, you can't reuse elements, you can't have consistent sets... But it looks amazing. Flux looks like plastic, Imagen looks cartoony, and OpenAI GPT Image looks sepia and stuck in the 90's. These models need to compete on aesthetics and control and reproducibility.
That's a lot of work. Video is a distraction from this work.
I've heard chairs of animation departments say they feel like this puts film departments under them as a subset rather than the other way around. It's a funny twist of fate, given that the tables turned on them ages ago.
Photorealistic models are just learning the rules of camera optics and physics. In other "styles", the models learn how to draw Pixar shaded volumes, thick lines, or whatever rules and patterns and aesthetics you teach.
Different styles can reinforce one another across stylistic boundaries and mixed data sets can make the generalization better (at the cost of excelling in one domain).
"Real life", it seems, might just be a filter amongst many equally valid interpretations.
Porn, obviously, though if you look at what's popular on civitai.com, a lot of it isn't photo-realistic. That might change as photo-realistic models are fully out of the uncanny valley.
Presumably personalized advertising, but this isn't something we've seen much of yet. Maybe this is about to explode into the mainstream.
Perhaps stock-photo type images for generic non-personalized advertising? This seems like a market with a lot of reach, but not much depth.
There might be demand for photos of family vacations that didn't actually happen, or removing erstwhile in-laws from family photos after a divorce. That all seems a bit creepy.
I could see some useful applications in education, like "Draw a picture to help me understand the role of RNA." But those don't need to be photo-realistic.
I'm sure people will come up with more and better uses for AI-generated images, but it's not obvious to me there will be more demand for images that are photo-realistic, rather than images that look like illustrations.
I don't have an argument to make on the main point, but Civitai has a whole lot of structural biases built into it (both intentionally and as side effects of policies that probably aren't intended to influence popularity in the way they do) that I would hesitate to use "what is popular on Civitai" as a guide to "what is attractive to (or commercially viable in) the market", either for AI imagery in general or for AI imagery in the NSFW domain specifically.
Replace commercial stock imagery. My local Home Depot has a banner by one of the cash registers with an AI house replete with mismatched trim and weird structural design but it's passable at a glance.
Midjourney is one aesthetically pleasing data point in a wide spectrum of possibilities and market solutions.
Creator economy is huge and is outgrowing Hollywood and the Music Industry combined.
There's all sorts of use cases in marketing, corporate, internal comms.
There are weird new markets. A lot of people simply subscribe to Midjourney for "art therapy" (a legit term) and use it as a social media replacement.
The giants are testing whether an infinite scroll of 100% AI content can beat human social media. Jury's out, but it might start to chip away at Instagram and TikTok.
Corporate wants certain things. Disney wants to fine tune. They're hiring companies like MoonValley to deliver tailored solutions.
Adobe is building tools for agencies and designers. They are only starting to deliver competent models (see their conference videos), and they're going about this a very different way.
ChatGPT gets the social trend. Ghibli. Sora memes.
> Porn, obviously, though if you look at what's popular on civitai.com, a lot of it isn't photo-realistic.
Civitai is circling the drain. Even before the unethical and religious Visa blacklisting, the company was unable to steer itself to a Series A. Stable Diffusion and local models are still way too hard for 99.99% of people and will never see the same growth as a Midjourney or OpenAI that have zero sharp edges and that anyone in the world can use. I'm fairly certain an "OnlyFans but AI" will arise and make billions of dollars. But it has to be so easy that a trucker who didn't learn to code can use it from their 11-year-old Toshiba.
> Presumably personalized advertising, but this isn't something we've seen much of yet.
Carvana pioneered this almost five years ago. I'll try to find the link. This isn't going to really take off though. It's creepy and people hate ads. Carvana's use case was clever and endearing though.
If I want an "illustration" I can type in "illustration of a cat". Though of course that's still quite unspecific. There are countless possible unrealistic styles for pictures (e.g. line art, manga, oil painting, vector art etc), and the reasonable thing is that the users should specify which of these countless unrealistic styles they want, if they want one. If I just type in "cat" and the model gives me, say, a water color picture of a cat, it is highly improbable that this style happens to be actually what I wanted.
I think we'll probably need a few more hardware generations before it becomes feasible to use chatgpt 5 level models with integrated image generation. The underlying language model and its capabilities, the RL regime, and compute haven't caught up to the chat models yet, although nano-banana is certainly doing something right.
See my third comparison in Nano Banana blog post: https://quesma.com/blog/nano-banana-pro-intelligence-with-to...
Some notes:
- Running my nuanced Nano Banana prompts through Flux 2, Flux 2 definitely has better prompt adherence than Flux 1.1, but in all cases the image quality was worse/more obviously AI generated.
- The prompting guide for Flux 2 (https://docs.bfl.ai/guides/prompting_guide_flux2) encourages JSON prompting by default, which is new for an image generation model that has the text encoder to support it. It also encourages hex color prompting, which I've verified works.
- Prompt upsampling is an option, but it's one that's pushed in the documentation (https://github.com/black-forest-labs/flux2/blob/main/docs/fl...). This does allow the model to deductively reason, e.g. if asked to generate an image of a Fibonacci implementation in Python it will fail hilariously if prompt upsampling is disabled, but get somewhere if it's enabled: https://x.com/minimaxir/status/1993361220595044793
- The Flux 2 API will flag anything tangentially related to IP as sensitive even at its lowest sensitivity level, which is different from the Flux 1.1 API. If you enable prompt upsampling, it won't get flagged, but the results are...unexpected. https://x.com/minimaxir/status/1993365968605864010
- Costwise and generation-speed-wise, Flux 2 Pro is on par with Nano Banana, and adding an image as an input pushes the cost of Flux 2 Pro higher than Nano Banana. The cost discrepancy increases if you try to utilize the advertised multi-image reference feature.
- Testing Flux 1.1 vs. Flux 2 generations does not result in objective winners, particularly around more abstract generations.
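To illustrate the JSON and hex color prompting mentioned above, here is a rough sketch of building such a prompt as a string. The key names (`scene`, `style`, `color_palette`, `subjects`) are placeholders I picked for illustration, not the schema from the BFL guide, and no API call is made; check the prompting guide for the fields it actually recommends.

```python
import json
import re

# Matches six-digit hex colors such as "#1F3A5F".
HEX_COLOR = re.compile(r"^#[0-9A-Fa-f]{6}$")

# Hypothetical structured prompt; the key names are illustrative only.
prompt = {
    "scene": "a ceramic mug on a wooden desk by a window",
    "style": "product photo, soft natural light",
    "color_palette": ["#1F3A5F", "#F2E9DC"],
    "subjects": [{"description": "matte mug", "color": "#1F3A5F"}],
}

# Catch malformed colors before spending money on a generation.
for color in prompt["color_palette"]:
    if not HEX_COLOR.match(color):
        raise ValueError(f"not a 6-digit hex color: {color}")

# The serialized JSON is what would be sent as the text prompt.
prompt_text = json.dumps(prompt, indent=2)
print(prompt_text)
```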
I am curious to see how the Apache 2.0 distilled variant performs but it's still unlikely that the economics will favor it unless you have a specific niche use case: the engineering effort needed to scale up image inference for these large models isn't zero cost.
I personally prefer Qwen's performance here. I'm waiting to see other folks' takes.
The Qwen folks are also a lot more transparent, spend time community building, and iterate on releases much more rapidly. In the open rather than behind closed doors.
I don't like how secretive BFL is.
EDIT: Seeing a few generations on /r/StableDiffusion generating IP from the open weights model.
Glad to see that they're sticking with open weights.
That said, Flux 1.x was 12B params, right? So this is about 3x as large plus a 24B text encoder (unless I'm misunderstanding), so it might be a significant challenge for local use. I'll be looking forward to the distill version.
Downloading over 100GB of model weights is a tough sell for the local-only hobbyists.
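A back-of-the-envelope check on that download size, as a sketch only. The parameter counts are the guesses from the comment above (12B times roughly 3, plus a 24B text encoder), not official figures, and the estimate ignores the VAE, activations, and file overhead.

```python
# Assumptions taken from the comment above, not official figures.
FLUX1_PARAMS_B = 12          # Flux 1.x size, in billions of parameters
SIZE_MULTIPLIER = 3          # "about 3x as large"
TEXT_ENCODER_PARAMS_B = 24   # text encoder size, in billions of parameters


def weights_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes), ignoring overhead."""
    return params_billion * bytes_per_param


total_b = FLUX1_PARAMS_B * SIZE_MULTIPLIER + TEXT_ENCODER_PARAMS_B
for name, nbytes in [("bf16", 2), ("fp8", 1)]:
    print(f"{name}: ~{weights_gb(total_b, nbytes):.0f} GB")
```

Under those assumptions the 16-bit weights come to roughly 120 GB, which lines up with the "over 100GB" figure, and 8-bit quantization about halves it.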
That's great, and I love the little laptop for the amount of x86 perf it can pack into so little cooling, but my used Epyc box of ~the same price is usually faster for AI (despite the complete lack of video card) and able to load models 3x the size (well, before RAM prices doubled this last month) because it has modular 12 channel RAM and memory speeds this low don't really need a GPU to keep up with the matrix math. Meanwhile, Flux is already slow when it's on actual real high bandwidth dedicated GPU memory VRAM.
So the only option will be [klein] on a single GPU... maybe? Since we don't have much information.
It takes about 40GB with the fp8 version fully loaded. With enough system RAM available, though, ComfyUI can partially load models into VRAM during inference and swap as needed (at reduced speed), so it can run on systems with too little VRAM to fully load the model. The NVidia page linked in the BFL announcement specifically highlights NVidia working with ComfyUI to improve this existing capability to enable Flux.2.
The pricing structure on the Pro variant is...weird:
> Input: We charge $0.015 for each megapixel on the input (i.e. reference images for editing)
> Output: The first megapixel is charged $0.03 and then each subsequent MP will be charged $0.015
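Worked out as code, the quoted rates look like this. It is a sketch under one assumption: sizes are whole megapixels, since the quote doesn't say how fractional megapixels are rounded. The function name is my own.

```python
# Rates quoted above for the Pro variant, in USD per megapixel (MP).
INPUT_PER_MP = 0.015
FIRST_OUTPUT_MP = 0.03
EXTRA_OUTPUT_MP = 0.015


def flux2_pro_cost(output_mp: int, input_mp: int = 0) -> float:
    """Estimate the cost of one request in USD.

    Assumption: sizes are whole megapixels; the quoted pricing does not
    say how fractional megapixels are rounded.
    """
    if output_mp < 1 or input_mp < 0:
        raise ValueError("need at least 1 output MP and non-negative input MP")
    output_cost = FIRST_OUTPUT_MP + EXTRA_OUTPUT_MP * (output_mp - 1)
    input_cost = INPUT_PER_MP * input_mp
    return round(output_cost + input_cost, 4)


print(flux2_pro_cost(1))               # text-to-image, 1 MP output
print(flux2_pro_cost(1, input_mp=1))   # edit with one 1 MP reference image
print(flux2_pro_cost(4, input_mp=2))   # 4 MP output, two 1 MP references
```

So a plain 1 MP generation is $0.03, a single 1 MP reference image bumps it to $0.045, and a 4 MP output with two references lands at $0.105.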
Qwen-Image-Edit-2511 is going to be released next week. And it will be Apache 2.0 licensed. I suspect that was one of the factors in the decision to release FLUX.2 this week.
Yeah, CLIP here was essentially useless. You can even completely zero the weights through which the CLIP input is ingested by the model and it barely changes anything.
This method was used in tons of image generation models. Not saying it's superior or even a good idea, but it definitely wasn't "weird".
https://huggingface.co/black-forest-labs/FLUX.2-dev/blob/mai...
So, it’s not open source.
Has anyone found this? For me, the link doesn't lead to the model.
In the case of Flux 2 Pro, adding just one image increases the total cost to be greater than a Nano Banana generation.
Wow, the Krea relationship soured? These are both a16z companies and they've worked on private model development before. Krea.1 was supposed to be something to compete with Midjourney aesthetics and get away from the plastic-y Flux models with artificial skin tones, weird chins, etc.
This list of partners includes all of Krea's competitors: HiggsField (current aggregator leader), Freepik, "Open"Art, ElevenLabs (which now has an aggregator product), Leonardo.ai, Lightricks, etc. but Krea is absent. Really strange omission.
I wonder what happened.
They put our logo after we pointed it out.
Nice eye!
it's pointless to compare in pure output when one is set in stone and the other can be built upon.