Discussion about this post

JP

Coming to this late but text rendering being the standout makes sense given the Qwen2.5-VL encoder. Most diffusion models treat text as visual noise to reconstruct. Using a proper VLM encoder gives it actual language understanding of what it needs to draw. Curious whether the dual-model workflow holds up past a certain complexity threshold or if it starts falling over.

Vic

Ha! Qwen has been popping up in various places for me lately, but I haven't been able to find the time to look at it. So happy you've done just that. It looks like Krea is best for image then, but Qwen for text. It would be amazing if someone would come up with a model that delivered on all aspects.

I've been using Omnigen and CogStudio for the last couple of weeks, inside Pinokio. I'm absolutely amazed at how good they both are. Text-to-image with the multimodal Omnigen is prompt-adherent without sacrificing quality. Then, taking the image and animating it in CogStudio, the results have been great and very welcome, especially after months of trial and error with the big, well-known T2I, I2V and V2V models.

3 more comments...
