Today I'm discovering there is a tier of API access with virtually no content moderation available to companies working in that space. I have no idea how to go about requesting that tier of access, but have spoken to 4 different defense contractors in the last day who seem to already be using it.
I kid; more real-world use cases would be concept images for a new product or for marketing campaigns.
What an impossibly weird thing to "need" an LLM for.
The real answer is probably way, way more mundane - generating images for marketing, etc.
Then include this image in the training dataset of another net, labeled "civilian". Train that new neural net so that it achieves a lower false-positive rate when asked "is this target military?"
Definitely a premium.
(fwiw for anyone curious how to implement it, it's the 'moderation' parameter in the JSON request you'll send, I missed it for a few hours because it wasn't in Dalle-3)
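For anyone who wants to see where that parameter sits, here's a minimal sketch of the request body. The field name follows the comment above (`moderation` on the image generation call, which DALL·E 3 didn't have); treat the accepted values ("auto" as default, "low") as an assumption to verify against current docs.

```python
# Hypothetical sketch: where the 'moderation' field goes in the JSON body
# for an image generation request with gpt-image-1. Values are assumed
# ("auto" default, "low"); check the live API reference before relying on this.
import json

def build_image_request(prompt, moderation="auto"):
    """Assemble the JSON body for an image generation call."""
    return json.dumps({
        "model": "gpt-image-1",
        "prompt": prompt,
        "moderation": moderation,  # "auto" (default) or "low"
    })

body = build_image_request("a red bicycle leaning on a brick wall", moderation="low")
```

The point is just that it's a top-level request field, easy to miss if you ported code from a DALL·E 3 integration.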
I just took any indication that the parent post meant absolute zero moderation as them being a bit loose with their words and excitable with how they understand things, there were some signs:
1. it's unlikely they completed an API integration quickly enough to have an opinion on military / defense image generation moderation yesterday, so they're almost certainly speaking about ChatGPT. (this is additionally confirmed by image generation requiring tier 5 anyway, which they would have been aware of if they had integrated)
2. The military / defense use cases for image generation are not provided (and the steelman'd version in other comments is nonsensical, i.e. we can quickly validate you can still generate kanban boards or wireframes of ships)
3. The poster passively disclaims being in military / defense themself (grep "in that space")
4. it is hard to envision cases of #2 that do not require universal moderation for OpenAI's sake. i.e. let's say their thought process is along the lines of: defense/military ~= what I think of as CIA ~= black ops ~= image manipulation on social media; thus, the time I said "please edit this photo of the ayatollah to have him eating pig and say I hate allah" means it's overmoderated for defense use cases
5. It's unlikely OpenAI wants to be anywhere near PR resulting from #4. Assuming there is a super secret defense tier that allows this, it's at the very least unlikely that the poster's defense contractor friends were blabbing about the exclusive, completely unmoderated access they had, to the poster, within hours of release. They're pretty serious about that secrecy stuff!
6. It is unlikely the lack of ability to generate images using GPT Image 1 would drive the military to Chinese models (there aren't Chinese LLMs that do this! And even if there were, there are plenty of good ol' American diffusion models!)
OP was clearly implying there is some greater ability only granted to extra special organizations like the military.
With all possible respect to OP, I find this all very hard to believe without additional evidence. If nothing else, I don't really see a military application of this API (specifically, not AI in general). I'm sure it would help them create slide decks and such, but you don't need extra special zero moderation for that.
I can't provide additional evidence (it's defense, duh), but the #1 use I've seen is generating images for computer vision training, mostly to feed GOFAI algorithms that have already been validated for target acquisition. Image gen algorithms have a pretty good idea of what a T72 tank and different camouflage look like, and they're much better at generating unique photos combining the two. It's actually a great use of the technology, because hallucinations help improve the training data (i.e. the final targeting should be invariant to a T72 tank with a machine gun on the wrong side or with too many turrets, etc.)
That said, due to compartmentalization, I don't know the extent to which image gen is used in defense, just my little sliver of it.
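The loop described above — generate varied images of a known class, label them, append to a classifier's training set — is simple to sketch. Everything here is a hypothetical stand-in (`generate_image` is a placeholder, the prompts are illustrative), not anyone's actual pipeline:

```python
# Sketch of the synthetic-training-data loop described above: generate
# varied images per class, attach the class label, and collect a dataset
# for a downstream classifier. `generate_image` is a hypothetical stub.
def generate_image(prompt):
    return {"prompt": prompt}  # placeholder for a real image-gen call

def build_synthetic_dataset(class_prompts):
    dataset = []
    for label, prompts in class_prompts.items():
        for p in prompts:
            sample = generate_image(p)
            sample["label"] = label  # e.g. "military" vs "civilian"
            dataset.append(sample)
    return dataset

dataset = build_synthetic_dataset({
    "military": ["T72 tank in winter camouflage", "T72 tank in desert camouflage"],
    "civilian": ["farm tractor on a dirt road"],
})
```

The hallucination-tolerance argument above is about the label, not the pixels: a slightly wrong tank still carries the "military" label, which can make the downstream classifier more invariant to cosmetic variation.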
GCCH is typically 6-12 months behind in feature set.
Now I'm just wondering what the hell defense contractors need image generation for that isn't obviously horrifying...
“Please move them to some desert, not the empire state building.”
“The civilians are supposed to have turbans, not ballcaps.”
I'm confused.
The bias leans towards overfitting the data, which is acceptable in some use cases - such as missile or drone design, which doesn't need broad comparisons like 747s or artillery to complete its training.
Kind of like neural net backpropagation, but in terms of the model/weights.
Prompt: “a cute dog hugs a cute cat”
https://x.com/terrylurie/status/1915161141489136095
I also then showed a couple of DALL·E 3 images for comparison in a comment.
“Auto” is just whatever the best quality is for a model. So in this case it’s the same as “high”.
This prompt is best served by Midjourney, Flux, Stable Diffusion. It'll be far cheaper, and chances are it'll also look a lot better.
The place where gpt-image-1 shines is if you want to do a prompt like:
"a cute dog hugs a cute cat, they're both standing on top of an algebra equation (y=\(2x^{2}-3x-2\)). Use the first reference image I uploaded as a source for the style of the dog. Same breed, same markings. The cat can contrast in fur color. Use the second reference image I uploaded as a guide for the background, but change the lighting to sunset. Also, solve the equation for x."
gpt-image-1 doesn't make the best images, and it isn't cheap, and it isn't fast, but it's incredibly -- almost insanely -- powerful. It feels like ComfyUI got packed up into an LLM and provided as a natural language service.
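For the curious, a prompt-plus-reference-images request like the one above maps onto the images *edit* endpoint rather than plain generation. This is a bare sketch of the multipart form fields; the field names follow the OpenAI images API as I understand it (repeated `image[]` entries for multiple references), but verify against the current API reference before building on it.

```python
# Sketch of a gpt-image-1 edit call with multiple reference images, shown
# as raw multipart form fields rather than an SDK call. Field names are
# assumptions based on the OpenAI images API; verify against current docs.
def build_edit_fields(prompt, reference_paths):
    """Collect the form fields for POST /v1/images/edits."""
    fields = [
        ("model", "gpt-image-1"),
        ("prompt", prompt),
        ("quality", "high"),
    ]
    # Multiple reference images go in under a repeated "image[]" key.
    for path in reference_paths:
        fields.append(("image[]", path))
    return fields

fields = build_edit_fields(
    "a cute dog hugs a cute cat; use the first image for the dog's style "
    "and the second for the background, relit at sunset",
    ["dog_style.png", "background.png"],
)
```

The instruction-following parts of the prompt (style transfer, contrast constraints, even solving the equation) all live in the prompt string itself — that's the "ComfyUI in an LLM" quality the comment below describes.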
https://github.com/Alasano/gpt-image-1-playground
Openai's Playground doesn't expose all the API options.
Mine covers all the options, and has built-in mask creation and cost tracking as well.
Enhance headshots for putting on Linkedin.
A pro/con of the multimodal image generation approach (with an actually good text encoder) is that it rewards intense prompt engineering more so than others, and if there is a use case that can generate more than $0.17/image in revenue, that's positive marginal profit.
- specificity (a diagram that perfectly encapsulates the exact set of concepts you're asking about)
AI companies are still in their "burning money" phase.
Enshittification is not on the horizon yet, but it's inevitable.
Prompting the model is also substantially different from, and more difficult than, traditional models — unsurprisingly, given the way the model works. The traditional image-prompting tricks don't work out of the box, and I'm struggling to get something that works without significant prompt augmentation (which is what I suspect was used for the ChatGPT image generations).
I think in terms of image generation, ChatGPT is the biggest leap since Stable Diffusion's release. LoRA/ControlNet/Flux are forgettable in comparison.
The AI field looks awfully like {OpenAI, Google, The Irrelevant}.
Now for the bad part: I don't think Black Forest Labs, StabilityAI, MidJourney, or any of the others can compete with this. They probably don't have the money to train something this large and sophisticated. We might be stuck with OpenAI and Google (soon) for providing advanced multimodal image models.
Maybe we'll get lucky and one of the large Chinese tech companies will drop a model with this power. But I doubt it.
This might be the first OpenAI product with an extreme moat.
Yeah. I'm a tad sad about it. I once thought the SD ecosystem proves open-source won when it comes to image gen (a naive idea, I know). It turns out big corps won hard in this regard.
I started Accomplice v1 back in 2021 with this goal in mind and raised some VC money but it was too early.
Now, with these latest imagen-3.0-generate-002 (Gemini) and gpt-image-1 (OpenAI) models – especially this API release from OpenAI – I've been able to resurrect Accomplice as a little side project.
Accomplice v2 (https://accomplice.ai) is just getting started back up again – I honestly decided to rebuild it only a couple weeks ago, in preparation for today, once I saw ChatGPT's new image model – but so far there are 1,000s of free-to-download PNGs, and any SVGs that have already been vectorized are free too (it costs a credit to vectorize).
I generate new icons every few minutes from a huge list of "useful icons" I've built. It will be 100% pay-as-you-go. And for a credit, paid users can vectorize any PNGs they like, tweak them using AI, upload their own images to vectorize and download, or create their own icons (with my prompt injections baked in to get you good icon results).
Do multi-modal models make something like this obsolete? I honestly am not sure. In my experience with Accomplice v1, a lot of users didn't know what to do with a blank textarea, so the thinking here is there's value in doing some of the work for them upfront with a large searchable archive. Would love to hear others' thoughts.
But I'm having fun again either way.
It definitely captures the style - but any reasonably complicated prompt was beyond it.
maybe OpenAI thinks the model business is over and they need to start sherlocking all the way from the top down to final apps (thus their interest in buying out Cursor, finally ending up with Windsurf)
Idk, this feels like a new offering between a full raw API and a final product, where you abstract some of it away for a few cents, and they're basically bundling their SOTA LLMs with their image models for extra margin.
In case you didn’t know, it’s not just wrapping in an LLM. The image model they’re referencing is a model that’s directly integrated into the LLM for functionality. It’s not possible to extract, because the LLM outputs tokens which are part of the image itself.
That said, they’re definitely trying to focus on building products over raw models now. They want to be a consumer subscription instead of commodity model provider.
waiting for some FOSS multi-modal model to come out eventually too
great to see openAI expanding into making actual usable products i guess
I suspect what I'll do with the API is iterate at medium quality and then generate a high quality image when I'm done.
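That iterate-cheap / finalize-expensive workflow is easy to wrap. A minimal sketch, where `generate` is a hypothetical placeholder for the real API call and only the `quality` parameter matters:

```python
# Sketch of the draft-at-medium, finalize-at-high workflow described above.
# `generate` is a stand-in for a real gpt-image-1 API call.
def generate(prompt, quality):
    # placeholder for a real image-generation request
    return {"prompt": prompt, "quality": quality}

def iterate_then_finalize(draft_prompts, final_prompt):
    previews = [generate(p, quality="medium") for p in draft_prompts]  # cheap drafts
    final = generate(final_prompt, quality="high")  # one costly render at the end
    return previews, final
```

Given that high-quality output costs several times medium, paying the high rate only once per accepted composition is a meaningful saving.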
Can you elaborate? This was not my experience - retesting the prompts that I used for my GenAI image shootout against gpt-image-1 API proved largely similar.
Here's my dog in a pelican costume: https://bsky.app/profile/simonwillison.net/post/3lneuquczzs2...
While there's a market need for fast diffusion, that's already been filled and is now a race to the bottom. There's nobody else that can do what OpenAI does with gpt-image-1. This model is a truly programmable graphics workflow engine. And this type of model has so much more value than mere "image generation".
gpt-image-1 replaces ComfyUI, inpainting/outpainting, LoRAs, and in time one could imagine it replaces Adobe Photoshop and nearly all the things people use it for. It's an image manipulation engine, not just a diffusion model. It understands what you want on the first try, and it does a remarkably good job at it.
gpt-image-1 is a graphics design department in a box.
Please don't think of this as a model where you prompt things like "a dog and a cat hugging". This is so much more than that.
It's actually more than that. It's about 16.7 cents per image.
$0.04/image is the pricing for DALL-E 3.
They didn't show low/med/high quality, they just said an image was a certain number of tokens with a price per token that led to $0.16/image.
- OpenAI is notorious for blocking copyrighted characters. They do prompt keyword scanning, but also run a VLM on the results so you can't "trick" the model.
- Lots of providers block public figures and celebrities.
- Various providers block LGBT imagery, even safe for work prompts. Kling is notorious for this.
- I was on a sales call with someone today who runs a father's advocacy group. I don't know what system he was using, but he said he found it impossible to generate an adult male with a child. In a totally safe for work context.
- Some systems block "PG-13" images of characters that are in bathing suits or scantily clad.
None of this is porn, mind you.
This is the god model in images right now.
I don't think open source diffusion models can catch up with this. From what I've heard, this model took a huge amount of money to train that not even Black Forest Labs has access to.
As for LoRAs and fine tuning and open source in general; if you've ever been to civit.ai it should be immediately obvious why those things aren't going away.
As an example, some users (myself included) of a generative image app were trying to make a picture of a person in the pouch of a kangaroo.
No matter what we prompted, we couldn’t get it to work.
GPT-4o did it in one shot!
And you seem to be right, though the only reference I can find is in one of the example images of a whiteboard posted on the announcement[0].
It shows: tokens -> [transformer] -> [diffusion] pixels
hjups22 on Reddit[1] describes it as:
> It's a hybrid model. The AR component generates control embeddings that then get decoded by a diffusion model. But the control embeddings are accurate enough to edit and reconstruct the images surprisingly well.
[0]https://openai.com/index/introducing-4o-image-generation/
[1]https://www.reddit.com/r/MachineLearning/comments/1jkt42w/co...
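The quoted description — an autoregressive transformer emitting control embeddings, then a diffusion decoder rendering pixels — can be caricatured as a two-stage pipeline. Everything below is a toy stub purely to show the data flow; none of it reflects OpenAI's actual implementation:

```python
# Toy caricature of the hybrid design quoted above: an AR stage emits
# "control embeddings" token by token, then a diffusion-style decoder turns
# them into pixels. Both stages are stubs; the point is only the two-stage
# structure, not any real architecture.
import random

def autoregressive_stage(prompt, n_tokens=16, dim=8):
    """Stub AR stage: one 'control embedding' vector per generated token."""
    rng = random.Random(len(prompt))  # deterministic stand-in, not a model
    return [[rng.random() for _ in range(dim)] for _ in range(n_tokens)]

def diffusion_decode(embeddings, size=(4, 4)):
    """Stub decoder: a real one would iteratively denoise conditioned on
    the embeddings; here we just spread their mean across a tiny 'image'."""
    mean = sum(sum(e) for e in embeddings) / (len(embeddings) * len(embeddings[0]))
    return [[mean] * size[1] for _ in range(size[0])]

embeddings = autoregressive_stage("a cute dog hugs a cute cat")
pixels = diffusion_decode(embeddings)
```

The claim in the quote is that the intermediate embeddings are rich enough to support accurate editing and reconstruction — the AR stage carries the semantics, the diffusion stage just renders.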
> Unlike DALL·E, which operates as a diffusion model, 4o image generation is an autoregressive model natively embedded within ChatGPT.
[0]https://cdn.openai.com/11998be9-5319-4302-bfbf-1167e093f1fb/...
Why would that be more likely? It seems like some implementation of ByteDance's VAR.
Text input tokens (prompt text): $5 per 1M tokens
Image input tokens (input images): $10 per 1M tokens
Image output tokens (generated images): $40 per 1M tokens
In practice, this translates to roughly $0.02, $0.07, and $0.19 per generated image for low, medium, and high-quality square images, respectively.
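The arithmetic is just output tokens times the per-token rate, since output tokens dominate the bill. The per-tier token counts below (272 / 1056 / 4160 for a 1024×1024 image) are my assumption from OpenAI's published figures, so treat them as illustrative; the high tier lands at roughly the 16.7 cents quoted elsewhere in the thread.

```python
# Rough per-image cost check for gpt-image-1. Token counts per quality tier
# for a 1024x1024 image are assumed from OpenAI's published figures; treat
# them as illustrative rather than authoritative.
OUTPUT_RATE_PER_M = 40.0  # $ per 1M image output tokens

TOKENS_PER_SQUARE_IMAGE = {"low": 272, "medium": 1056, "high": 4160}

def image_cost(quality):
    """Dollar cost of the output tokens for one square image."""
    return TOKENS_PER_SQUARE_IMAGE[quality] * OUTPUT_RATE_PER_M / 1_000_000

# high: 4160 * 40 / 1e6 = $0.1664 per image
```

Input tokens (the prompt, any reference images) add a little on top, which is why quoted per-image figures run slightly higher than the pure output-token cost.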
that's a bit pricy for a startup.
Anyone know of an AI model for generating SVG images? Please share.
One note with these: most of the production ones are actually diffusion models that get run through an image->svg model after. The issue with this is that the layers aren't set up semantically like you'd expect if you were crafting these by hand, or if you were directly generating SVGs. The results work, but they aren't perfect.
The first paper linked in the prior comment is the latest one from the SVGRender group, but I'm not sure if any runnable model weights are out yet for it (SVGFusion).
What's the current state of the art for API generation of an image from a reference plus modifier prompt?
Say, in the 1c per HD (1920*1080) image range?
The interesting part of this GPT-4o API is that it doesn't need to learn them. But given the cost of `high` quality image generation, it's much cheaper to train a LoRA for Flux 1.1 Pro and generate from that.
I remember meeting someone on Discord 1-2 years ago (?) working on a GoDaddy effort to build customer-generated icons using bespoke foundation image-gen models. Suppose that kind of bespoke model at that scale is ripe for replacement by gpt-image-1, given the instruction-following ability / steerability?
Does this mean this also does video in some manner?
You might say, "but chatGPT is already as dead simple an interface as you can imagine". And the answer to that is, for specific tasks, no general interface is ever specific enough. So imagine you want to use this to create "headshots" or "linkedin bio photos" from random pictures of yourself. A bespoke interface, with options you haven't even considered already thought through for you, and some quality control/revisions baked into the process, is something someone might pay for.
I can’t see a way to do this currently, you just get a prompt.
This, I think, is the most powerful way to use the new image model since it actually understands the input image and can make a new one based on it.
Eg you can give it a person sitting at a desk and it can make one of them standing up. Or from another angle. Or on the moon.
PermissionDeniedError: Error code: 403 - {'error': {'message': 'To access gpt-image-1, please complete organization verification
So the options are: 1) nerf the model so it can't produce images like that, or 2) use some type of KYC verification.
Upload a picture of a friend -> OK. Upload my own picture -> I can't generate anything involving real people.
Also after they enabled global chat memory I started seeing my other chats leaking into the images as literal text. Disabled it since.
EDIT: Oh, yes, that's what it appears to be. Is it better? Why would I switch?