
High-Resolution Video Generation with Video Latent Diffusion Models

Image credit: NVIDIA Toronto AI Lab

Overview

Researchers from LMU Munich, NVIDIA, Vector Institute, University of Toronto, and University of Waterloo have collaborated to develop Video Latent Diffusion Models (Video LDMs) to enable efficient high-resolution video generation, a resource-intensive task. Video LDMs are an extension of Latent Diffusion Models (LDMs), which allow for high-quality image synthesis with reduced computational demands by training diffusion models in a compressed lower-dimensional latent space.

The researchers first pre-trained an LDM on images and then turned it into a video generator by introducing a temporal dimension and fine-tuning it on encoded video sequences. They also temporally aligned diffusion-model upsamplers, yielding temporally consistent video super-resolution models. The Video LDM generates sparse keyframes at a low frame rate, which another latent diffusion model then temporally upsamples in two interpolation passes.
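The core architectural idea — interleaving new temporal layers with the frozen spatial layers of a pretrained image LDM, and blending the two paths with a learnable mixing factor — can be pictured with a minimal sketch. This is illustrative only, not the authors' code; the names `spatial_layer`, `temporal_layer`, and `alpha` are assumptions for clarity, and the layers are stand-ins rather than real diffusion-model blocks.

```python
import numpy as np

def spatial_layer(x):
    # Stand-in for a pretrained, frozen image-LDM layer acting per frame.
    # x has shape (batch * time, features); real layers act on latent maps.
    return x * 2.0

def temporal_layer(x, batch, time):
    # Stand-in for a newly added temporal layer: reshape so the frames of
    # each video sit along a time axis, then mix information across time
    # (here a simple mean; the paper uses temporal attention/convolutions).
    xt = x.reshape(batch, time, -1)
    mixed = xt.mean(axis=1, keepdims=True) * np.ones_like(xt)
    return mixed.reshape(batch * time, -1)

def video_block(x, batch, time, alpha=0.7):
    # A learnable alpha blends per-frame spatial features with temporal
    # features; alpha = 1 recovers the original image model exactly.
    s = spatial_layer(x)
    t = temporal_layer(s, batch, time)
    return alpha * s + (1.0 - alpha) * t

# Two videos, three frames each, with flattened 4-dim latents per frame.
z = np.arange(24, dtype=np.float64).reshape(6, 4)
out = video_block(z, batch=2, time=3)
print(out.shape)  # (6, 4)
```

Setting `alpha=1.0` bypasses the temporal path entirely, which is why the pretrained image backbone stays intact while only the temporal layers are trained on video data.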

The team focused on two real-world applications: simulating in-the-wild driving data and creative content creation with text-to-video modeling. Their Video LDM achieved state-of-the-art performance in generating high-resolution driving videos (512 x 1024) and text-to-video synthesis at resolutions up to 1280 x 2048, using Stable Diffusion as the backbone image LDM.

The researchers also demonstrated personalized text-to-video generation by transferring the learned temporal layers to other image LDM backbones, such as those fine-tuned with DreamBooth. Additionally, they explored convolutional-in-time synthesis to create slightly longer videos without compromising quality significantly.
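Transferring the learned temporal layers to a different image backbone can be pictured as a checkpoint merge: keep the new backbone's spatial weights and overwrite only the temporal-layer entries with those trained on video. The sketch below is hypothetical, not the paper's code; it models checkpoints as plain dicts, and the `"temporal"` name marker is an assumption about how such layers might be tagged.

```python
def transfer_temporal_layers(video_ckpt, image_ckpt, marker="temporal"):
    # Start from the new image backbone (e.g. a DreamBooth fine-tune),
    # then copy over only the parameters belonging to temporal layers.
    merged = dict(image_ckpt)
    for name, weights in video_ckpt.items():
        if marker in name:
            merged[name] = weights
    return merged

# Toy checkpoints: spatial weights differ between backbones, while the
# temporal weights exist only in the trained video model.
video_ckpt = {"block1.spatial.w": 0.0, "block1.temporal.w": 5.0}
image_ckpt = {"block1.spatial.w": 9.0}

merged = transfer_temporal_layers(video_ckpt, image_ckpt)
print(merged)  # {'block1.spatial.w': 9.0, 'block1.temporal.w': 5.0}
```

Because the temporal layers were trained with the spatial backbone frozen, swapping backbones this way can personalize video generation without retraining the temporal layers.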

These advancements in Video LDMs open up exciting possibilities for future content creation, making high-resolution video generation more efficient and accessible.
