OmniGen2: Unified Multimodal Model
Are Dual Distinct Decoding Pathways the route to the perfect model?
As diffusion models develop, comparing them feels increasingly subjective: there are so many different factors - the prompt, the workflow, the settings, the particular type of image and personal preference. We all have a slightly different take on what we think looks best. It’s also about what works best for you and your set-up, be it hardware limitations or the need for rapid mock-ups.
Currently I’m primarily working with Flux variants and HiDream, whilst also using Flux Kontext and Fill. My foray into OmniGen2 is more recent, and I mention this as I feel I may have a subconscious bias towards Flux, perhaps because I have some good workflows running and it performs well on my set-up.
OmniGen2 Basics
OmniGen2 is a new model that heads down the ‘unified’ route, claiming both image generation and image manipulation in a single model, as opposed to Flux, which splits these across several models - Pro/Dev/Schnell for image generation, along with Flux Fill and Flux Kontext for image manipulation.
Physically OmniGen2 comes in at just under 8GB, and although the documentation refers to needing 17GB of VRAM, I have had no issues running it on 16GB; in fact it runs pretty quickly - quite a bit faster than Flux Dev and Kontext.
OmniGen2 is a 7 billion parameter model compared to Flux Dev at 12 billion and HiDream at 17 billion parameters, but do these numbers really matter?
Alongside being a unified model, the two other big headlines for OmniGen2 are that it is a true open source model under the Apache 2.0 licence, and that it features two distinct decoding pathways for the text and image modalities. The technical blurb says it “utilises unshared parameters and a decoupled image tokenizer”; in plain English I take that to mean text and image outputs are generated through separate sets of weights rather than one shared decoder. I have talked about the challenge of text in images in a separate post, Dear AI: What Even Is ‘B1ork!’ Supposed to Mean?, so any improvement there is a good thing. However, Flux is generally listed as a hybrid multimodal model and HiDream as a multimodal model, so is OmniGen2 any different or better?
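To make that idea a little more concrete, here is a minimal, purely conceptual sketch in PyTorch of a shared backbone feeding two separate decoding heads. This is not the actual OmniGen2 architecture or code, and all the names and sizes are made up for illustration - it is just the general shape of "unshared parameters" for text and image decoding.

```python
# Purely conceptual sketch of "two distinct decoding pathways with unshared
# parameters" -- NOT the actual OmniGen2 architecture, just the general idea.
import torch
import torch.nn as nn

class DualPathwayDecoder(nn.Module):
    def __init__(self, hidden=1024, text_vocab=32000, image_latent=16):
        super().__init__()
        # Shared multimodal backbone (OmniGen2 builds on Qwen-VL-2.5)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        # Two separate heads with no weight sharing: one emits text tokens,
        # the other emits image latents for a diffusion-style image decoder.
        self.text_head = nn.Linear(hidden, text_vocab)
        self.image_head = nn.Linear(hidden, image_latent)

    def forward(self, x):
        h = self.backbone(x)                              # shared representation
        return self.text_head(h), self.image_head(h)      # unshared decoding paths

# Example: a batch of 2 sequences of 10 "tokens"
tokens = torch.randn(2, 10, 1024)
text_logits, image_latents = DualPathwayDecoder()(tokens)
```

The point is simply that the text pathway and the image pathway do not fight over the same weights, which is what the OmniGen2 blurb seems to be getting at.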
The documentation on OmniGen2 talks about improved performance across four areas:
Visual Understanding: Inherits the robust ability to interpret and analyse image content from its Qwen-VL-2.5 foundation.
Text-to-Image Generation: Creates high-fidelity and aesthetically pleasing images from textual prompts.
Instruction-guided Image Editing: Executes complex, instruction-based image modifications with high precision, achieving state-of-the-art performance among open-source models.
In-context Generation: A versatile capability to process and flexibly combine diverse inputs—including humans, reference objects, and scenes—to produce novel and coherent visual outputs.
OmniGen2 is supported natively in ComfyUI; you just need to download the model and the associated CLIP file, as it uses Qwen-VL-2.5 as its text encoder. There are some basic workflows on the ComfyUI wiki page, along with links to the required files.
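If you prefer to fetch the files from a script rather than a browser, something along these lines works - note that the repo ID and filenames below are placeholders rather than the confirmed names, so take the real links from the ComfyUI wiki page.

```python
# Sketch only: repo_id and filename values are placeholders, not the confirmed
# names -- check the ComfyUI wiki page for the actual download links.
from huggingface_hub import hf_hub_download

# Diffusion model goes in ComfyUI/models/diffusion_models
hf_hub_download(
    repo_id="Comfy-Org/OmniGen2_repackaged",       # placeholder repo id
    filename="omnigen2_fp16.safetensors",          # placeholder filename
    local_dir="ComfyUI/models/diffusion_models",
)

# Qwen-VL-2.5 text encoder goes in ComfyUI/models/text_encoders
hf_hub_download(
    repo_id="Comfy-Org/OmniGen2_repackaged",       # placeholder repo id
    filename="qwen_2.5_vl_fp16.safetensors",       # placeholder filename
    local_dir="ComfyUI/models/text_encoders",
)
```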
Text to Image
The first examples are straightforward image generation from a text prompt. The images on the left are from Flux Dev, the ones on the right are from OmniGen2. Obviously, being two completely different models with slightly different workflows, the image composition varies; what’s more important is the prompt adherence and the detailing in the image. Both the Flux and OmniGen2 versions have gone through a detailer and an upscaler with the same post-processing.


To me the OmniGen2 portrait image looks more AI generated than the Flux one. It also looks softer and less detailed - it is not bad, but it doesn’t have a natural feel. Maybe some workflow tweaks or refined model versions will help. It also took a couple of runs on the portrait for OmniGen2, as the first one had a random arm above the head, so anatomy details still occasionally create problems.


The mermaid image from OmniGen2 lacks the creative flow that Flux has. The style is simpler, less interesting and far less detailed. Again it is not a bad image, and if it were not side by side with the Flux version you would probably think it was OK.


The last image is a more significant fail. I tried it multiple times with different seeds and schedulers but could not get anything like the Flux image. The overall quality is low in terms of face, clothing and background detail, and there are anatomical issues with the feet and hands.
Maybe OmniGen2 is trying to stretch too far, or I am missing something on the workflow side, but based on the output I have created from OmniGen2 so far I would not move away from Flux or HiDream for image generation.
Style Change
With rather mixed results on image generation, I was hoping the image manipulation would shine. The first test is a simple hair change - the first image is the source, the second is from Flux Kontext and the third is from OmniGen2.



This first test was not great. I tried several times to see if I could get a better result by changing the prompt, but in each case the woman’s pose changed to some degree, along with the background and setting. The essence of the face and clothing is OK, and the image itself is fine, albeit a little soft and blurry. Some of the other results were way off.
The second test was a style change, making the source image more like an oil painting. For these tests no LoRAs are used, just what the raw model can produce.


The result this time was interesting: the composition was much closer to the original, with just some changes to the face. OmniGen2 has taken a softer approach to ‘impressionist oil painting’, while Kontext keeps closer to the impressionist style.
Next, a watercolour painting style.


Without a LoRA both are a little basic; Flux Kontext has more depth and detail. The OmniGen2 output is interesting but heads towards anime rather than watercolour.
Lastly in this section an anime style, again with no LoRAs.


Flux has gone with a simple, safe option, although some of the background is more semi-realistic than anime. OmniGen2 has taken a more cyberpunk approach but has overdone the purple tint and lost detail in the eyes.
Overall, although OmniGen2 doesn’t have the same level of finesse as Flux Kontext, it produces some interesting results, and perhaps as LoRAs appear for it the results will improve.
Background and Character Consistency
I have been really impressed with Flux.1 Kontext (Dev) in this area, as it has consistently maintained backgrounds and character details when changing poses, to the point that it was almost as good as using a LoRA on faces. It’s a high bar for OmniGen2 to reach.
When changing objects, poses and so on, you have to be very specific in the prompt, and OmniGen2 and Flux Kontext prefer slightly different terms, so the prompt for each tends to differ slightly.



The source image is on the left, next is a Flux Kontext output and then an OmniGen2 output. The Flux prompt included “shorten sleeves”, which led to the t-shirt rather than the hoodie - it is quite a challenging image to manipulate as a lot of detail is hidden.
In the limited testing I have performed, the background in OmniGen2 images is not maintained in the same way as with Flux Kontext; it has the same style but the details are different. I have read somewhere that Flux Kontext only touches the parts of the image it needs to, rather than fully recreating the image, so perhaps that’s why it does such a good job.
OmniGen2 has done a reasonable job of maintaining the feel of the image, but some character features and clothing detail are not maintained as well as with Flux Kontext.



In this second example Flux Kontext again does a very good job of maintaining the background and style. The character consistency is also excellent. OmniGen2 creates a similar feel, but the background details are different. The facial consistency on this image is good, if somewhat blurry, but the lower half of the body has different jeans and no rip.
OmniGen2 in this scenario seems to sit somewhere between Flux Redux and Flux Kontext, creating similar styled images but not a true manipulation of the original image.
Image Text
As I have discussed before, text is difficult for diffusion models. Flux improved over previous models but still struggles with all but the most basic of text. HiDream has improved things further; however, I sometimes use Flux Kontext to clean up text in generated images. This test was a very basic one: a billboard with large text reading ‘ComfyUI’ and smaller text underneath reading ‘Flux vs OmniGen2’.


The first image is Flux: it gets the ‘ComfyUI’ piece correct but misses the second piece of text altogether. The OmniGen2 image gets the ‘Flux vs OmniGen2’ piece correct but messes up ‘ComfyUI’. The OmniGen2 image structure and text do not quite work either, but that probably relates to the text-to-image component rather than just the text.
I could show several more examples, but this post is getting rather too long already. Suffice it to say, text representation continues to improve but is still hit and miss, so it is unlikely you will get what you want first time. In the few text replacement tests I have done, OmniGen2 has not been as successful as Flux Kontext.
Final Thoughts
New and updated models are always good to see and help to keep the technology evolving. In this case, for me at least, OmniGen2 is interesting but doesn’t yet deliver in the same way as Flux Dev, Flux Kontext or HiDream.
It is early days, and when you look at how Flux has evolved with refined models there is plenty of potential for improvement. It is also worth noting that I have only performed limited testing, and these manipulation models do require very precise prompts, so others may experience different results.
In the interests of post length there are other aspects I have not covered here, object replacement for example, and the fact that there is so much one model can do from a text prompt shows how far things have come over the last year. When I started with AI image generation I thought basic infill using a mask was clever.
With training data sets being released and active development work taking place I’m sure OmniGen2 will continue to evolve and improve.