A Fast Forward in Video Generation
Video diffusion models are stepping up to change video creation
Alongside diffusion models for still image generation, there have been some incredible strides forward recently in models for video generation. Video generation is far more intensive, with much higher hardware requirements than still image generation, which limits most current generators to only a few seconds of video at a time. However, what they are now capable of is incredible and a sign of how fast this area will evolve.
The online, closed-source, commercial Kling AI model has gained popularity for its incredible quality and detail. It is aimed at the professional market, but you can run some lower-quality generations for free. The video below was created using the basic Kling AI option from an image generated with ComfyUI and Flux, plus an additional text prompt in Kling AI describing the video motion.
Kling AI now has an option to control camera angle and a motion brush to define how objects move. Multi-element video allows you to add, delete and swap elements between videos. It is becoming a professional tool very quickly.
Not in the same league but still impressive, the open-source video diffusion model space has three main players that can all be used within ComfyUI: LTX Video, Wan2.1 and Hunyuan. There is also the very recently released SkyReels-V1 from Skywork, an open-source fine-tune of Hunyuan with ComfyUI support, which is next on my list to experiment with.
I have tried all three of the others, but not to the level where I could do a true comparison. Simplistically, LTX is the speed king - much faster than Wan2.1 or Hunyuan, but with compromises on quality, motion and prompt adherence. Wan2.1 is seen as the best for quality and motion, and Hunyuan has a bit of a speciality in multi-person scenes. For the moment I have settled on more thorough experimentation with Wan2.1, but will be revisiting the others when I have time.
Running any of these video models locally is demanding on hardware, requiring not only a recent video card but also as much VRAM as you can afford. Wan2.1 does have several different models, with the lower-resolution 480p model running on smaller VRAM cards. I have got the 480p and 720p models running on a 4060 Ti OC 16GB VRAM card; however, it can easily run out of memory, especially if you start to push up the number of frames. It is also very slow, so you do need patience for video generation - you can often be looking at an hour plus for a 480p generation and longer for 720p.
As an alternative I have started to use MimicPC to run ComfyUI with the video models on much higher-spec hardware on an on-demand basis. This has allowed me to access 24GB and 48GB of VRAM on A10G and L40S cards. Processing is still slow in comparison to normal images, but for full 720p video I am getting times of around 100 seconds per step for 81 frames, which with 20 steps gives a total generation time of a little over 30 minutes. Dropping the resolution using a square or 4:3 ratio can bring the per-step time down to around 50 seconds for 81 frames. It is worth noting, though, that 24GB of VRAM still leads to out-of-memory errors at 720p when you increase the number of frames above about 50.
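As a rough back-of-envelope sketch of those numbers (assuming Wan2.1's default output rate of 16 frames per second, which is my understanding of the standard workflow):

```python
# Rough generation-time estimate for a diffusion video run.
def estimate(seconds_per_step: float, steps: int, frames: int, fps: float = 16.0):
    """Return (total processing time in seconds, clip length in seconds)."""
    return seconds_per_step * steps, frames / fps

total_s, clip_s = estimate(seconds_per_step=100, steps=20, frames=81)
print(f"~{total_s / 60:.0f} minutes of processing for a ~{clip_s:.0f} second clip")
# -> ~33 minutes of processing for a ~5 second clip
```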
The ComfyUI Wiki has guides on how to use the video workflows and which models to install, so I will only mention a couple of points here. Firstly, the standard workflows output to a webp file, which is somewhat annoying: most video editors do not recognise webp, so you first have to convert from webp to mp4 or another recognised format, and there are not many good options for doing the conversion.
If you want to run the conversion locally then ImageMagick (which requires FFmpeg) is one solution, although I have had issues with corruption in the output. A decent online converter is EZGif, but there are others.
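Another option, if you are comfortable scripting it, is a small Python conversion along these lines. This is only a sketch, assuming Pillow, numpy, imageio and the imageio-ffmpeg backend are installed; the 16 fps value matches Wan2.1's default output and would need adjusting for other models:

```python
import numpy as np
import imageio.v2 as imageio          # mp4 writing needs the imageio-ffmpeg backend
from PIL import Image, ImageSequence

def webp_to_mp4(src: str, dst: str, fps: int = 16) -> None:
    """Re-encode an animated webp (e.g. a ComfyUI output) as an mp4 clip."""
    animation = Image.open(src)
    with imageio.get_writer(dst, fps=fps) as writer:
        for frame in ImageSequence.Iterator(animation):
            writer.append_data(np.array(frame.convert("RGB")))

# Example filename only - use whatever your workflow actually produced.
webp_to_mp4("ComfyUI_00001_.webp", "ComfyUI_00001_.mp4")
```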
Currently the maximum number of frames is officially 81; although there are plenty of reports of people using higher numbers, whether this works reliably is open to question.
I have primarily tested i2v (image to video) so far, but there are also workflows for t2v (text to video) and flf2v (first to last frame video) which I have yet to experiment with. The flf2v flow is particularly interesting as it gives far more control over where the video finishes, allowing for better potential of stitching clips together (a rough way to do the stitching is sketched below).
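Once the clips are in mp4 form, stitching them back to back can be done in any video editor, or with a small script such as this sketch, which reuses the same imageio setup as above and assumes both clips share the same resolution and a 16 fps rate:

```python
import imageio.v2 as imageio          # needs the imageio-ffmpeg backend

def concat_clips(sources: list[str], dst: str, fps: int = 16) -> None:
    """Append the frames of each source mp4, in order, into one output clip."""
    with imageio.get_writer(dst, fps=fps) as writer:
        for src in sources:
            with imageio.get_reader(src) as reader:
                for frame in reader:
                    writer.append_data(frame)

concat_clips(["part1.mp4", "part2.mp4"], "stitched.mp4")
```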
The video below is two separate generations stitched together using the same initial source image with different prompts.
This second video shows some impressive detail and quality but the motion, particularly of the jaws, is not quite right.
This video was generated in a 4:3 ratio (720 x 960) and took 40 seconds per step.
And finally, an exploding robot just for fun - it would pass for a video game animation.
It might be early days for diffusion-model video generation, but with such an exciting start and the speed of development it is definitely going to disrupt current approaches.