What it is: A 6B-parameter diffusion model that runs surprisingly fast. On my RTX 4090, I'm getting results in under a second with 8–9 sampling steps. The VRAM footprint is modest enough to run locally without enterprise hardware.
The interesting part: Text rendering actually works. If you've tried generating images with text using other models, you know the pain – garbled letters, missing characters, nonsensical glyphs. This one handles both English and Chinese text with decent accuracy. Not perfect, but noticeably better than what I've seen elsewhere.
Technical bits:
- Single-stream DiT architecture
- Works with ComfyUI (there's a workflow floating around)
- LoRA training is supported
- The model weights are on Hugging Face under Tongyi-MAI

What I'm using it for: Mostly quick mockups and thumbnail generation where readable text matters. The speed makes iteration painless.
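If you'd rather drive it from Python than ComfyUI, something like the following should work, assuming the weights ship in a standard diffusers layout. This is a sketch, not a tested recipe: the repo id below is a placeholder (I haven't named the exact model), and the pipeline class, dtype, and step count are my assumptions based on the numbers above.

```python
import torch
from diffusers import DiffusionPipeline  # pip install diffusers

# Placeholder repo id -- look up the actual model name under the
# Tongyi-MAI org on Hugging Face before running this.
pipe = DiffusionPipeline.from_pretrained(
    "Tongyi-MAI/<model-id>",
    torch_dtype=torch.bfloat16,  # bf16 keeps the 6B model within a 24 GB 4090
).to("cuda")

image = pipe(
    'A storefront sign that reads "OPEN 24 HOURS"',
    num_inference_steps=9,  # the 8-9 step range mentioned above
).images[0]
image.save("sign.png")
```

LoRA weights, where supported, can usually be layered on with `pipe.load_lora_weights(...)` in diffusers, but check the model card for the blessed workflow.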
Curious if anyone else has been playing with this. Would love to hear about edge cases or interesting use cases you've found.