Generative AI for video and images must be more littered with hyperbole than any other subject, though maybe some of it is justified. It's almost impossible to keep on top of new models at the rate they are appearing, each one another step forward.
I have been using Pinokio with Wan2.1 for i2v and t2v video generation, along with the flf2v variant. Having tried the LTX, Skyreels and Hunyuan models, I tend to come back to Wan2.1 as the best performance/quality compromise for running locally. I have also experimented with Google Veo 2 recently, but not with any great success.
The release of more VACE (All-in-One Video Creation and Editing) models sets the scene for a better AI video creation experience, offering t2v, i2v, v2v, control nets, character consistency and object manipulation all in one model. A big ask.
For the little test below I used an image created with Flux.1 Dev as the starting point for a set of VACE FusioniX extended videos: an initial i2v pass, then follow-on videos of between 5s and 10s, maintaining the background and character with new text prompts. As I am only just starting testing, this one doesn't use a control net or a LoRA. I did experiment with a reference image, but it was only one run and the result wasn't what I was expecting.
I tried TeaCache both enabled at 1.5x and disabled: no obvious difference in quality, but a bit of a speed improvement with it on. In a similar way I tried 'Skip Layer Guidance' on and off and didn't see a great change in quality, but again there was some speed improvement. I need to do more thorough testing on these settings.
All of this was run locally on a machine with an RTX 4060 Ti 16GB OC and 64GB of system memory, using the Pinokio container for Wan2.1. The runs did seem to be faster than a standard Wan2.1 generation, but I haven't properly compared yet.
If you are running directly in ComfyUI, the core model is available on Hugging Face; you will need to use, or build, a workflow depending on what you are trying to do.
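If you go that route, a minimal sketch of pulling the weights with huggingface_hub is below. The repo id and the target directory are assumptions on my part, not something from this test setup, so check the model card and your own ComfyUI model paths before using it.

```python
# Sketch: download VACE weights from Hugging Face for a ComfyUI install.
# Both the repo_id and local_dir below are assumed examples; verify the exact
# repository name on Hugging Face and where your ComfyUI expects the files.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Wan-AI/Wan2.1-VACE-14B",  # assumed repo id, check the model card
    local_dir="ComfyUI/models/diffusion_models/Wan2.1-VACE-14B",  # assumed path
)
```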
The test video starts well, with good quality and feel, but the anime styling is slowly lost as the video is extended each time (roughly every 5 seconds). The colour balance and style also seem to change a bit. I tried a couple of different tests and it does look like it is the repeated extension that slowly erodes the original style. I could not get one continuous video to work well beyond about 15-20 seconds, hence there are a couple of videos spliced together, but they are all generated from the same source.
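For the splicing itself, something like the following is enough; it is a rough sketch using ffmpeg's concat demuxer (the clip filenames are hypothetical), and stream copying avoids a re-encode so no further quality is lost at the join.

```python
# Sketch: splice separately generated clips into one video with ffmpeg.
# Assumes ffmpeg is on the PATH and all clips share the same codec, resolution
# and frame rate (true when they come from the same generation pipeline).
import subprocess
from pathlib import Path

clips = ["clip_01.mp4", "clip_02.mp4", "clip_03.mp4"]  # hypothetical filenames

# The concat demuxer reads a text file listing the input clips in order.
list_file = Path("clips.txt")
list_file.write_text("".join(f"file '{c}'\n" for c in clips))

# "-c copy" joins the streams without re-encoding.
subprocess.run(
    ["ffmpeg", "-f", "concat", "-safe", "0", "-i", str(list_file),
     "-c", "copy", "spliced.mp4"],
    check=True,
)
```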
The overall composition is good, but if you look closely there are issues with motion in places (blurring, smearing and unnatural movement). Some runs also produced very weird results where the woman was floating in the air. With most models I have found that fast, active movement proves challenging, particularly when combined with a complex scenario such as climbing.
I had problems getting the right prompt for some actions and camera movement; I'm not sure whether that was an issue with prompt adherence or just poor prompting on my part.
I'm certainly impressed by this model and want to do some more testing, especially with control nets, LoRAs and reference images. It's worth noting that the model was quick to appear within Pinokio, and the latest container enables a number of additional features, including a new cache approach called Mag Cache (which I haven't tested yet) and a collection of LoRAs. It could well become my go-to video model for the moment.