Flux.1 Kontext (Dev) Multimodal Image Editing
Unleashing AI image manipulation locally on your computer
The Black Forest Labs suite of diffusion models has gained great popularity since its release last year, with several cloud and local versions available covering image generation, inpainting/outpainting and redux (image variation).
The latest Kontext model, which is all about image editing, was released in Pro and Max forms around a month ago but was only accessible via cloud services or in ComfyUI via an API. The release of a Dev version allows it to be run locally on your own GPU and fully integrated into ComfyUI workflows.
My experience so far with multimodal image manipulation models, including HiDream E1, has been somewhat disappointing, with the promised consistency not living up to expectations and some very odd generations. I'm hoping Flux.1 Kontext (Dev) can move things forward.
Getting Started
Firstly, the models and hardware requirements. The main Black Forest Labs Flux.1 Kontext Dev model is just under 24GB in size. There is a ComfyUI FP8 scaled version which is 12GB, and quantised versions are already appearing.
The template workflows state that 20GB of VRAM is needed to run the FP8 version and 32GB of VRAM to run the full version. I have a 16GB RTX 4060 Ti OC card (with 64GB of system RAM), so I nervously started with the FP8 version, but it ran fine at about 5-6 secs per iteration. Next I thought I would give the full version a try and, to my surprise, it also ran fine and still managed 6-7 secs per iteration, so that is what I have used subsequently.
To use the new models locally you need to ensure ComfyUI is up to date as there are some updated nodes. There is a good document on the ComfyUI wiki covering the details of the model, text encoders and VAE required, along with how to use the basic workflows which are included as templates.
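If you would rather script the downloads than fetch the files by hand, the sketch below uses the huggingface_hub library to pull everything into the usual ComfyUI model folders. The repo and file names are my assumptions and should be checked against the wiki page; the main Black Forest Labs repo is gated, so you need to be logged in with a Hugging Face token.

```python
# Minimal download sketch using huggingface_hub -- repo and file names are
# assumptions, check them against the ComfyUI wiki page before running.
from huggingface_hub import hf_hub_download

COMFY_MODELS = "/path/to/ComfyUI/models"  # adjust to your install

# Main diffusion model (full ~24GB weights; gated repo, requires an HF token)
hf_hub_download("black-forest-labs/FLUX.1-Kontext-dev",
                "flux1-kontext-dev.safetensors",
                local_dir=f"{COMFY_MODELS}/diffusion_models")

# Text encoders shared with other Flux workflows
for name in ("clip_l.safetensors", "t5xxl_fp8_e4m3fn.safetensors"):
    hf_hub_download("comfyanonymous/flux_text_encoders", name,
                    local_dir=f"{COMFY_MODELS}/text_encoders")

# Flux VAE
hf_hub_download("black-forest-labs/FLUX.1-Kontext-dev", "ae.safetensors",
                local_dir=f"{COMFY_MODELS}/vae")
```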
I have done some testing using the example workflows but I have also integrated the new nodes into some existing workflows so I can use upsampling and other nodes which are not in the basic templates.
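For anyone who wants to drive these workflows from a script rather than the UI, ComfyUI also exposes a simple HTTP API: export the workflow with "Save (API Format)" and POST it to the /prompt endpoint. The sketch below is illustrative only - the workflow filename and the node id holding the prompt text are hypothetical and will differ in your own export.

```python
# Illustrative sketch of queueing a Kontext edit via ComfyUI's HTTP API.
# The workflow file and the node id "6" are hypothetical -- check your own
# API-format export for the correct ids.
import json
import urllib.request

with open("kontext_edit_api.json") as f:
    workflow = json.load(f)

# Point the positive prompt node at the edit we want to make
workflow["6"]["inputs"]["text"] = "Change the woman's hair to short blonde hair"

req = urllib.request.Request(
    "http://127.0.0.1:8188/prompt",
    data=json.dumps({"prompt": workflow}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(resp.read().decode())  # returns a prompt_id you can use to poll for the result
```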
Simple Changes
As a first test I made a very simple change, switching the colour and style of part of an image - in this case the hair.


At first glance the second image is a pretty good modification. Looking closely, there are some slight differences in skin tone and minor changes in the background, but they are small and no different to what you often see when running multiple iterations of images. Other tests have shown similarly impressive results, much better than having to mess around with inpainting.
Complex Object Removal
For the second test I moved onto object removal and, given that basic object removal is now commonplace in a lot of tools, I thought I would try a really hard image. The source is a photograph I took in Greece last year.
There are lots of people, scaffolding and even a crane. This was a tough ask.
The result isn't perfect, but it was a lot better than I expected. There are some remnants of the crane and a few pieces of scaffolding which it missed, but as a first attempt from a very simple text prompt it did a very good job.
Style Transfer
Next up is style transfer, something which can be done using IP adapters although they do have a tendency to change the image structure a bit. I tested just listing a style reference in the prompt and also adding in a LoRA to assist with the style.






Style transfer was a little bit disappointing as I found it hard to get the output I wanted, though this could just be a case of working on better prompts. The consistency is there, better than using an IP Adapter, but the styles were not varied enough. That's not to say it was poor, just not as good as in other areas.
It did vary though, as the style transfer below worked well, converting to a more cyberpunk style.


Consistent Characters
A very powerful aspect of these multimodal models is the potential to generate consistent characters across images without the need for LoRAs and other techniques. Previous models I have tried have struggled in this area.




The first image is the original Flux Dev source; the others were generated with Flux Kontext with no LoRA or face-swapper type tricks, just a simple prompt to change the pose. They are not perfect, but for a first run with no tweaking they are fairly impressive in terms of the consistency which has been carried forward and the way the character detail has been maintained.
Text Manipulation
Flux is fairly good at text but does create some random things, especially with longer text (see my post on text in AI Images), so a model which can replace and improve text would be very useful. Being multimodal, Kontext has features to help it understand text better than a straight diffusion model.


A very quick test on a low quality image did work OK on the second attempt - the first was close but had “Confy” rather than “Comfy”.
Combining Images
This is another area which can be achieved with IP Adapters but hopefully Flux Kontext will provide more control. It is the area I have done the least testing on so the example below is a very quick test.
It took several attempts to create the combined image, as initially Kontext wanted to create two entirely separate images; eventually I managed to get it to create one image with the characters combined onto a single background.



Final Thoughts
So far from initial testing I’m very impressed with what Flux.1 Kontext (Dev) delivers in a complex area. Its ability to maintain the image structure and detail is excellent.
My learnings so far are that you have to be very explicit and clear with the prompt; just minor changes can have a big impact. The ComfyUI Wiki page linked above includes some good dos and don'ts around prompting.
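As a purely illustrative example (these prompt pairs are mine, not from the wiki), the difference between a vague instruction and an explicit one looks something like this:

```python
# Hypothetical prompt pairs illustrating the "be explicit" advice --
# the explicit versions spell out what to change AND what to keep.
prompt_examples = {
    "hair":  ("change her hair",
              "Change the woman's hair to short platinum blonde hair, "
              "keeping her face, clothing and the background unchanged"),
    "style": ("make it cyberpunk",
              "Convert the scene to a neon-lit cyberpunk style at night, "
              "keeping the composition and the characters' poses the same"),
}
```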
I like the fact that existing LoRAs also seem to work to some degree (although this needs more testing) and that you can integrate the nodes into existing workflows, taking advantage of the polishing techniques you already use. Contrary to what the notes in the workflows say, I have had no issues running the full model on 16GB of VRAM with good performance.
It is certainly a model I will be doing more work with.