Qwen Image Model - New Open Source Leader?
Text rendering and prompt adherence on a new level
There has been some excitement over the last week or two around the new model in the Qwen series by Alibaba. Qwen Image is a 20B-parameter MMDiT (Multimodal Diffusion Transformer) model - that's 3 billion parameters more than HiDream - open-sourced under the Apache 2.0 license.
As well as the core model, it uses the Qwen2.5-VL LLM for text encoding and has a specialised VAE (Variational Autoencoder). It is claimed to render readable, multilingual text in much longer passages than previous models, and the VAE is trained to preserve small fonts, text edges and layout. Using Qwen2.5-VL as the text encoder should mean better language, vision and context understanding.
The encoder uses two streams, one focussed on semantics (what the image means) and one on reconstruction (what it looks like). It's not clear to me how that compares with, say, HiDream I1, which can use four different encoders including Llama, which is also an LLM.
These improvements come at a cost: size. The full BF16 model is 40GB, with the FP16 version of the text encoder coming in at an additional 16GB. FP8 versions are more reasonable at 20GB for the model and 9GB for the text encoder. If those sizes are still too large for your setup, there are distilled versions available from links on the ComfyUI guide, and City96 has also created various GGUF versions available for download from Hugging Face.
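Those download sizes are roughly what you would expect from the parameter count: file size is approximately parameters multiplied by bytes per parameter, which is why FP8 halves the download compared to BF16 (the text encoder and VAE are stored separately). A quick back-of-the-envelope check in Python:

```python
# Rough sanity check on the quoted file sizes:
# size ~= parameter count x bytes per parameter (text encoder and VAE not included).
params = 20e9  # Qwen Image has roughly 20 billion parameters

for precision, bytes_per_param in [("BF16", 2), ("FP8", 1)]:
    size_gb = params * bytes_per_param / 1e9
    print(f"{precision}: ~{size_gb:.0f} GB")
# BF16: ~40 GB, FP8: ~20 GB - matching the quoted sizes
```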
Using Qwen Image
My current setup is only a 4060 Ti OC with 16GB of VRAM and 64GB of system memory, so there was no way the full model would run; I settled for the FP8 versions, which in general have run fine. As a very rough guide I am getting about 8s per step with Qwen Image, compared to 7s for HiDream and 3.7s for Flux.1 Dev.
All the testing below is on the FP8 version; it would be interesting to know whether the full BF16 model gives any noticeable improvement and whether the trade-off in performance and hardware requirements would be worth it. When I have time I may try the full version on Runpod or MimicPC.
You can use an accelerator LoRA such as Lightx2v to reduce the number of steps to 8; however, even without an accelerator the quality is good at 20-30 steps, so although the per-step time is slower than HiDream, the overall time is quicker, as I generally run HiDream at 50 steps (the rough arithmetic below shows the difference).
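To put the "slower per step but quicker overall" point into numbers, here is the calculation using my rough per-step timings and typical step counts (the Flux step count is an assumption, as I have not standardised on one):

```python
# Back-of-the-envelope generation times from my rough per-step figures on a 16GB 4060 Ti.
timings = {
    "Qwen Image (FP8)": (8.0, 30),   # ~8s/step, 30 steps
    "HiDream I1":       (7.0, 50),   # ~7s/step, 50 steps
    "Flux.1 Dev":       (3.7, 25),   # ~3.7s/step, ~25 steps (assumed)
}

for model, (sec_per_step, steps) in timings.items():
    print(f"{model}: ~{sec_per_step * steps / 60:.1f} min per image")
# Qwen ~4.0 min, HiDream ~5.8 min, Flux ~1.5 min
```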
The ComfyUI documentation has some basic workflows for using Qwen Image, and this is where I started. The workflow approach for Qwen Image is no different to other models, and it supports negative prompts, which adds another angle of control. The Shift parameter controls the level of detail/blur (I have kept it at 3.1), while the CFG parameter acts in the normal way in terms of prompt adherence; the default is 4, but I have been using 2.5. I have not tested this, but if you set CFG to 1 and avoid negative prompts then it apparently runs a bit faster.
The guidance for steps is 50, but I am finding 30 steps is fine. Sampler/scheduler selection is a personal preference based on content, but dpmpp_2m/sgm_uniform have worked well for me.
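I work in ComfyUI, but for anyone who prefers scripting, a minimal sketch of roughly equivalent settings using the Hugging Face diffusers pipeline might look like the following. It is based on the diffusers example on the Qwen/Qwen-Image model card (the true_cfg_scale argument plays the role of CFG here; the example prompt is just an illustration, and Shift maps to the scheduler rather than a call argument):

```python
import torch
from diffusers import DiffusionPipeline

# Load the full-precision checkpoint; swap in an FP8/GGUF variant if VRAM is tight.
pipe = DiffusionPipeline.from_pretrained(
    "Qwen/Qwen-Image", torch_dtype=torch.bfloat16
).to("cuda")

image = pipe(
    prompt="A woman at a music festival wearing a t-shirt that reads 'QWEN IMAGE'",
    negative_prompt=" ",          # Qwen Image supports negative prompts
    width=1328,
    height=1328,
    num_inference_steps=30,       # 30 steps has been fine for me; the guidance says 50
    true_cfg_scale=2.5,           # default is 4, I have been using 2.5
    generator=torch.Generator("cuda").manual_seed(42),
).images[0]

image.save("qwen_image_test.png")
```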
Semi-Realistic Anime
The first test is an anime prompt, which I was expecting Qwen Image to handle well. Although Flux.1 Krea would not normally be a model I would use for anime, I have used it here for consistency, as all the later comparisons are also with Krea. As a second comparison I have included HiDream I1.



As expected, it is a strong result from Qwen Image. The softness is not particularly to my liking and it is not as detailed as Krea or HiDream. Qwen Image does the best job of the text on the t-shirt; HiDream is also good, but that was partly due to the HiRes Fix upscaling.
Krea has gone for quite a busy image with more of an artistic style, and it isn't a great output. HiDream sits between the two: more detailed than Qwen but not as artistic as Krea. None of these images used LoRAs, which often would play a role, so for completeness the image below is Flux.1 Dev using the Anime Art v3 LoRA.
Note the text on the t-shirt is virtually non-existent, but it is my preferred output if you like that style.
Artistic Watercolour


Qwen Image again generates quite a soft image, more anime than watercolour, and without a great deal of detail. Krea does a much better job in this case; however, both models have added butterflies and bees which look like they were stuck on afterwards. In this example both models required several runs before they produced an acceptable composition and anatomy. On a couple of runs they both added two hats, and the number of fingers was excessive!
Photorealistic
There has been some discussion around the softness and lack of detail in photorealistic images from Qwen Image, leading to the smooth, plastic AI skin look. I think it depends a lot on the individual image, and Qwen probably sits between Flux.1 Dev and HiDream: in some cases the raw Qwen output has looked fine, in others it has been rather plastic.
To add more detail and realism, people have suggested running the Qwen output back through a second sampling pass with a different model, typically Flux.1 Krea or Wan2.2. This feels like a bit of a faff, but I thought I would give it a try.
Testing this becomes very subjective, as there are many factors and settings involved, so treat it as a proof of principle. I was a little sceptical as to how well it would work, especially with Wan2.2, as I have found that model tends to produce more of a 'video' look than a photographic one. A rough sketch of the idea is shown below.
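In ComfyUI this is simply a second KSampler with a low denoise value, but to show the shape of the two-stage idea outside ComfyUI, here is a hedged diffusers sketch: generate with Qwen Image, then run the result through a Flux img2img pass at a low strength so it only adds texture rather than changing the composition. The FLUX.1-Krea-dev repo id and the prompt are assumptions on my part; check the Krea model card before using it.

```python
import torch
from diffusers import DiffusionPipeline, FluxImg2ImgPipeline

prompt = "Candid photo of a woman laughing at an outdoor market, natural light"

# Stage 1: base generation with Qwen Image (strong prompt adherence, softer detail).
qwen = DiffusionPipeline.from_pretrained(
    "Qwen/Qwen-Image", torch_dtype=torch.bfloat16
).to("cuda")
base = qwen(
    prompt=prompt, negative_prompt=" ",
    num_inference_steps=30, true_cfg_scale=2.5,
).images[0]
del qwen
torch.cuda.empty_cache()  # free VRAM before loading the second model

# Stage 2: low-denoise img2img pass with Flux.1 Krea to add skin/texture detail.
krea = FluxImg2ImgPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-Krea-dev",   # assumption: Krea checkpoint repo id
    torch_dtype=torch.bfloat16,
).to("cuda")
refined = krea(
    prompt=prompt,
    image=base,
    strength=0.25,            # low denoise: refine texture, keep composition
    num_inference_steps=30,   # only roughly strength * steps are actually run
    guidance_scale=3.5,
).images[0]

refined.save("qwen_plus_krea.png")
```

The strength value is the lever mentioned later on: push it up and the second model changes more of the image, pull it down and you keep more of the Qwen composition.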




For these tests I have used a Qwen workflow shared by Tenofas, who has created some excellent workflows. Initially I used Flux.1 Krea as the second stage, and then, with the latest update, Wan2.2. It is worth pointing out that the second stage using Krea runs at about 7s per step on my machine, whereas with Wan2.2 using the Q8 model it is about 26s per step. Overall, with the first and second stages, then FaceDetailer and the Ultimate SD Upscaler, you can be looking at 40 minutes for a generation!
I also tried the Wan2.2 Q5_K_M GGUF model to see if that was faster, but it gave poor results and was actually a little slower at 27s per step!
With the images above, on a small screen you would be hard pushed to see the difference between the first image, which is straight Qwen, and the second and third images, which use the two-stage process. If you click to enlarge them you should be able to see there are differences, with a little more realism in the skin detail and an overall slightly sharper image - it is subtle though, and will depend on how much you let the second stage influence the image.
As a comparison, the fourth image is the same prompt using just Flux.1 Krea; the composition will obviously be different, so it is there just to compare detail. Krea can overdo the graininess and skin detail, whereas Qwen plus Wan2.2 or Krea is perhaps a little underdone, but this is adjustable by changing the denoise amount on the second stage.
Using the same approach I returned to a couple of images I created a week or two ago using Flux.1 Krea and recreated them using Qwen with Krea, then Qwen with Wan2.2.






In these two cases the use of Krea or Wan2.2 after Qwen did make more of a difference. As to which is better, Krea or Wan2.2, I think that comes down to personal preference, as they have a slightly different style.
As to whether the two-stage process is worth the time and effort, I'm not sure I would default to it, as it is so slow for fairly marginal gains. In some ways I still prefer the pure Krea images, but the Qwen versions do have a natural feel to them, looking a little more like a snapshot than a professional photograph.
Summary
Quick testing of any model is never going to be particularly conclusive, as there are so many aspects to tweak and it takes a while to learn how best to use a model. There is no doubt that Qwen Image is good with text and is probably the leader, but it can still get a bit weird if you make the text too complex. The header image on this post was created with Qwen and it correctly created the two separate pieces of text on the first attempt, something that would be rare with Flux.
Prompt adherence is good. Is it better than HiDream? I haven't decided. It is certainly better than Flux.1 Krea. It can take things a little too literally though, so you have to be careful what you wish for! I haven't found a definitive answer on the maximum number of tokens; I have seen 1,024 mentioned, but it seems able to deal with pretty long prompts.
The composition is generally good, but it has its moments, and the anatomy is not always 100% (check out some of the hands in the festival image at the top!). Creativity-wise I think it is better than HiDream, but so far I haven't found it to be that different to Flux. That said, the ability to run longer prompts and be more specific helps guide the model, so it puts more emphasis on a good prompt.
Using the dual-model sampling approach produced better results than I expected, and I can see it being useful when you have something complex or specific which needs strong prompt adherence. If the core image isn't detailed enough then the second stage using Wan2.2 or Krea can subtly improve things; however, it is a pain to run, especially if you are limited on VRAM, and much slower than a single-model approach.
Being an open-source model with published weights, LoRAs are already starting to appear, and they could really put Qwen Image at the front. Whether LoRA training will be possible on anything less than 24GB of VRAM is not clear.
So far I have only looked at basic image creation. Qwen Image Edit has also recently been released, and I will look at that in a separate post, comparing it with Flux.1 Kontext.
Overall I can see myself using Qwen Image, but its size and performance probably keep it in the same category as HiDream for me at the moment. I do want to test the GGUF versions, though, to see if they are a good trade-off. Qwen Image is not so amazingly better that I want to use it all the time, so I am more likely to use it for specific tasks, such as when there is a lot of text or when I cannot get a good result with one of the Flux models, but that view may change as I experiment further.



